
Infrastructure Driven Development - Shifting Cloud Complexity Left

by Sathiesh Veera, December 30th, 2024

Too Long; Didn't Read

While cloud providers promise simplicity, they have merely transformed infrastructure complexity rather than eliminated it. The solution isn't more tooling or dedicated cloud platform teams - it's shifting infrastructure considerations to the start of application development. Success in the cloud requires treating infrastructure choices as fundamental architectural decisions that shape how applications are built, not as deployment-time concerns.



Recently, a team I know deployed their new code to production using all the right tools. Built on a microservice architecture, the project had four services deployed on AWS Lambda, each with its own database and the right IAM policies, everything automated with CloudFormation templates (CFT) and CI/CD pipelines, and a dedicated platform team to support them. Yet in less than three months they were struggling with a string of issues: Lambda cold starts, account-level concurrency limits, latency from network calls, and so on. When I thought back on why it happened, it became clear that this isn’t a story about better tooling, the latest technology, or even the cloud team’s expertise - it’s about the need to think about infrastructure during the initial application design, not after.


Over the past decade, adoption of cloud infrastructure has increased drastically across companies of all sizes, because the promise is compelling. Cloud providers market “pay per use” and emphasize ease of use. AWS lists “Easy to Use” as the very first benefit of using AWS, and Google Cloud says it “helps developers build quickly…”. While these claims are largely true compared to on-premises hosting, the complexity hasn’t disappeared; it has just changed form.

The Hidden Complexity Behind Cloud Simplicity

Setting up and managing cloud infrastructure is complex by design. While cloud providers have abstracted away many low-level details, the complexity has not been erased; it has only shifted to engineers in different forms. Engineers today face several key challenges:

Overwhelming Choices:

Cloud platforms have opened the floodgates to an abundance of choices. Even a seemingly simple decision like choosing a compute service becomes complex because it fundamentally shapes how the whole application is built. For instance:


  • Choosing Lambda means designing with stateless functions in mind, chunking workloads to fit within the execution time limit, and managing database connections carefully. Each Lambda execution environment handles one invocation at a time, which makes thread safety less critical, but developers still need to consider connection reuse and how static variables persist across invocations, because the execution environment may be reused.


  • Opting for ECS requires thinking about container lifecycle management, service discovery, how the application handles container failures and restarts, persistent storage, and so on.


  • Selecting EC2 means that everything from the load-balancing algorithm to sticky sessions and automatic scaling becomes part of the design.
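To make the point about static variables and reused execution environments concrete, here is a minimal Python sketch. It assumes a handler in the usual `handler(event, context)` shape; `open_connection` is a hypothetical stand-in for a real database connect call, and the `order_id` field is made up for illustration.

```python
import itertools

# Module-level state survives between invocations when the execution
# environment is reused; it is rebuilt only on a cold start.
_connection = None
_connection_ids = itertools.count(1)


def open_connection():
    """Hypothetical stand-in for an expensive database connect call."""
    return {"id": next(_connection_ids), "open": True}


def get_connection():
    """Create the connection lazily and reuse it while it is still open."""
    global _connection
    if _connection is None or not _connection["open"]:
        _connection = open_connection()
    return _connection


def handler(event, context):
    """Lambda-style entry point; per-request data stays in local variables."""
    conn = get_connection()
    return {"connection_id": conn["id"], "order": event["order_id"]}
```

Two warm invocations in a row share one connection, while a dropped connection is replaced on the next call. The same persistence is why request-specific data should never be parked in module-level variables.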

Fragmented Services and Hidden Dependencies

To host even a simple application, we need to orchestrate a number of interconnected services. A basic web application on EC2 needs network configuration such as a VPC, subnets, route tables, network ACLs, and security groups, plus IAM roles, Auto Scaling groups, load balancers, and more. Each of these components needs to work in harmony, and a misconfiguration in any one area can impact the entire system. For instance, a team deployed a few REST APIs as Lambda functions in one account but reused the API Gateway authorizer from a different AWS account. Skipping one step in setting up the VPC tunnel left the Lambdas in the new account outside a VPC, so every call between the Lambdas, the gateway authorizer, DynamoDB, and ElastiCache went over the internet, putting the entire application at a security risk while also increasing request latency.

Configuration Complexity:

Cloud’s promise of abstraction often leads to a false sense of simplicity. In reality, engineers must now understand not just traditional infrastructure concepts but also how cloud providers have wrapped them in layers of web services. For instance, setting up a production-ready RDS instance requires understanding:


  • Instance classes and their performance characteristics
  • Storage types and their implications
  • Backup windows and retention policies
  • Parameter groups and their impact
  • Read replicas and the cost associated with them
  • Security group configurations and access patterns
  • Multi-AZ configurations and standby instances
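As a rough illustration of how many of these decisions surface in a single resource definition, the sketch below collects them in a Python dictionary. The keys mirror keyword arguments of boto3's RDS `create_db_instance` call; every value is a placeholder, and `production_gaps` encodes a hypothetical team policy rather than any AWS rule. Read replicas are created through a separate call and are left out here.

```python
# Placeholder settings; the names and IDs are invented for illustration.
rds_settings = {
    "DBInstanceIdentifier": "orders-db",        # hypothetical instance name
    "Engine": "postgres",
    "DBInstanceClass": "db.m6g.large",          # instance class
    "StorageType": "gp3",                       # storage type
    "AllocatedStorage": 100,                    # GiB
    "BackupRetentionPeriod": 7,                 # days; 0 disables automated backups
    "PreferredBackupWindow": "03:00-04:00",     # daily window, UTC
    "DBParameterGroupName": "orders-postgres",  # hypothetical parameter group
    "VpcSecurityGroupIds": ["sg-0123456789abcdef0"],  # placeholder ID
    "MultiAZ": True,                            # standby in a second AZ
}


def production_gaps(settings):
    """Return the ways a settings dict falls short of our assumed policy."""
    gaps = []
    if not settings.get("MultiAZ"):
        gaps.append("no Multi-AZ standby")
    if settings.get("BackupRetentionPeriod", 0) == 0:
        gaps.append("automated backups disabled")
    if not settings.get("VpcSecurityGroupIds"):
        gaps.append("no security groups attached")
    return gaps
```

A check like this could run in code review or CI so that the trade-offs are discussed alongside the application code instead of at deployment time.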

The Platform Team Paradox:

Many companies respond to this complexity by creating dedicated cloud platform teams. While these teams excel at providing standardized infrastructure patterns and self-service tools, they cannot abstract away the fundamental need for application developers to understand the infrastructure implications. For example:


  • A platform team can provide a standardized EKS cluster, but the application developers still need to understand the pod lifecycle to properly handle application state.


  • They can set up robust monitoring, but developers need to understand the service behavior to create useful alerts and thresholds.


  • They can provide service templates, but developers need to understand the implications of the service to use them effectively. Consider DynamoDB: the platform team provides templates and configurations to create tables and DAX clusters, but the application team should know that DAX is a regional cache and is not kept consistent with global tables. Writes replicated from another region go straight to the table and bypass DAX, leaving stale data in the cache.
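The DAX behavior described in the last bullet can be reproduced with a toy model. The sketch below is plain Python and does not use any DynamoDB or DAX API: `Region`, its methods, and the region names are invented for illustration, and the cache never expires entries, whereas a real cache would eventually evict them by TTL.

```python
class Region:
    """One region: a table replica plus a regional write-through cache."""

    def __init__(self, name):
        self.name = name
        self.table = {}
        self.cache = {}
        self.peers = []

    def write(self, key, value):
        """Application write: update local cache and table, then replicate."""
        self.cache[key] = value
        self.table[key] = value
        for peer in self.peers:
            peer.apply_replication(key, value)

    def apply_replication(self, key, value):
        """Replicated write: lands in the table only, bypassing the cache."""
        self.table[key] = value

    def cached_read(self, key):
        """Read through the cache, filling it from the table on a miss."""
        if key not in self.cache:
            self.cache[key] = self.table.get(key)
        return self.cache[key]


def stale_read_demo():
    """Return (table value, cached value) seen in the second region."""
    us, eu = Region("us-east-1"), Region("eu-west-1")
    us.peers, eu.peers = [eu], [us]
    us.write("price", 10)
    eu.cached_read("price")   # warms the eu cache with 10
    us.write("price", 12)     # replicates to the eu table, not the eu cache
    return eu.table["price"], eu.cached_read("price")
```

In this model the second region's table already holds the new value while its cache keeps serving the old one, which is the kind of behavior an application team has to design around rather than discover in production.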

The Solution: A True Left Shift in Infrastructure Thinking

The solution is not more tools or another layer of abstraction. Instead, we need to fundamentally shift how we think about infrastructure in the application development process.

Infrastructure-Aware Development

Consider AWS Lambda cold starts. Rather than treating them as an infrastructure problem to be solved later, teams that understand this limitation upfront make fundamentally different architectural decisions, such as:


  • Lambda might not be the first choice for systems like authentication, where a sporadic 5-10 second delay could disrupt time-sensitive flows such as JWT validation.


  • Code packaging can be deliberate from the start, avoiding unnecessary libraries and dependencies that directly increase the package size and cold start times.


  • Even processing pipelines can be designed with Lambda concurrency limits and connection time in mind.


This shift in thinking from “how to solve cold starts?” to “how to design a solution with cold starts in mind?” exemplifies the left shift in infrastructure thinking.
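Designing a pipeline with concurrency limits in mind can start with simple arithmetic. The sketch below applies Little's law (in-flight executions equal arrival rate times average duration) to estimate how many concurrent executions a workload needs. The function names are made up here, and the limit passed in is an assumption to be replaced with the actual quota of the account.

```python
import math


def required_concurrency(requests_per_second, avg_duration_seconds):
    """Little's law: in-flight executions = arrival rate x average duration."""
    return math.ceil(requests_per_second * avg_duration_seconds)


def fits_limit(requests_per_second, avg_duration_seconds, concurrency_limit):
    """True when the steady-state load stays within the concurrency limit."""
    needed = required_concurrency(requests_per_second, avg_duration_seconds)
    return needed <= concurrency_limit


def max_sustainable_rate(concurrency_limit, avg_duration_seconds):
    """Highest request rate the limit can absorb at a given duration."""
    return concurrency_limit / avg_duration_seconds
```

For example, 200 requests per second at 0.5 seconds each needs about 100 concurrent executions, while 500 requests per second at 2.5 seconds each needs 1,250 and would not fit under an assumed limit of 1,000. Running the numbers at design time shows whether to batch, shorten the function, or pick a different compute model.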

Cost-Aware Architecture Decisions

Infrastructure choices made early in development have lasting implications on operational costs. For example:


  • Choosing between provisioned concurrency and on-demand isn't just a technical decision. While provisioned concurrency can eliminate cold starts, it requires accurate capacity planning. A function with steady traffic might cost 40% more with on-demand pricing, but a function with sporadic usage could cost 3x more with poorly planned provisioned concurrency.


  • A DynamoDB table designed without considering read/write capacity units might need significant over-provisioning, leading to unnecessary costs. Teams who understand DynamoDB's pricing model might choose to structure their access patterns differently, perhaps using sparse indexes or designing around eventual consistency to reduce provisioned capacity needs.
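The effect of eventual consistency on provisioned capacity can be estimated with a few lines of arithmetic. The sketch below relies on the standard definitions: one read capacity unit covers one strongly consistent read per second of an item up to 4 KB (or two eventually consistent reads), and one write capacity unit covers one write per second of an item up to 1 KB. The function names and sample workload are illustrative only.

```python
import math


def read_capacity_units(reads_per_second, item_size_kb, strongly_consistent=True):
    """Estimate RCUs: reads are billed in 4 KB steps per item."""
    units_per_read = math.ceil(item_size_kb / 4)
    total = reads_per_second * units_per_read
    # Eventually consistent reads cost half as much.
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(writes_per_second, item_size_kb):
    """Estimate WCUs: writes are billed in 1 KB steps per item."""
    return writes_per_second * math.ceil(item_size_kb)
```

With 100 reads per second of 6 KB items, strongly consistent reads need 200 RCUs and eventually consistent reads need 100, so the consistency requirement is a cost decision as much as a correctness one.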


These cost implications aren't just infrastructure details to be worked out later. They should influence core architectural decisions from the start. A system designed with infrastructure costs in mind often looks very different from one where cost optimization is treated as a post-deployment concern.
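A back-of-the-envelope comparison of the two Lambda pricing modes can be sketched as follows. The model assumes provisioned concurrency is billed as a keep-warm charge for the configured capacity over the whole period plus a reduced duration charge for actual usage, and it ignores per-request fees and any spillover to on-demand. All prices are parameters in arbitrary units, not real AWS rates.

```python
def on_demand_cost(busy_seconds, memory_gb, price_per_gb_second):
    """Pay only for the time the function is actually running."""
    return busy_seconds * memory_gb * price_per_gb_second


def provisioned_cost(period_seconds, instances, busy_seconds, memory_gb,
                     keep_warm_price, duration_price):
    """Pay to keep capacity warm for the whole period, plus reduced duration."""
    keep_warm = period_seconds * instances * memory_gb * keep_warm_price
    duration = busy_seconds * memory_gb * duration_price
    return keep_warm + duration


def break_even_utilization(on_demand_price, keep_warm_price, duration_price):
    """Fraction of the period the function must be busy for costs to match."""
    return keep_warm_price / (on_demand_price - duration_price)
```

With placeholder prices of 10 (on-demand), 3 (keep-warm), and 6 (provisioned duration) per GB-second, the break-even point is 75% utilization: a function busy 900 of 1,000 seconds is cheaper provisioned, while one busy for only 100 seconds pays several times more for capacity it never uses.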

Continual Evaluation:

Infrastructure choices shouldn’t be one-time decisions. Teams should regularly evaluate if their infrastructure choices remain valid as their application evolves. This includes regular review of service limits, scaling patterns, monitoring code implications, and assessment of new service offerings that might better suit the evolved requirements.

Avoid Over-Engineering:

The instinct to over-engineer for edge cases often adds unnecessary complexity. Teams that develop applications should simplify the solution in both the application architecture and the infrastructure architecture. For example, we can use DynamoDB Streams to capture change data, or we can over-engineer it for a specific use case by building a separate event pipeline with a batch job to do the same work, which complicates the system with unnecessary failure points.
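To show how little code the simpler option needs, here is a minimal sketch of a Lambda-style handler that consumes DynamoDB Streams records. It assumes the stream is configured to include both new and old item images; `summarize_change` and the returned summary shape are invented for illustration.

```python
def summarize_change(record):
    """Turn one DynamoDB Streams record into a small change summary."""
    data = record["dynamodb"]
    return {
        "action": record["eventName"],   # INSERT, MODIFY or REMOVE
        "keys": data["Keys"],
        "new": data.get("NewImage"),     # absent for REMOVE
        "old": data.get("OldImage"),     # absent for INSERT
    }


def handler(event, context):
    """Process a batch of stream records delivered in one invocation."""
    changes = [summarize_change(record) for record in event["Records"]]
    return {"processed": len(changes), "changes": changes}
```

Each batch arrives as `event["Records"]`, so change capture needs no scheduler, no extra queue, and no batch job to keep in sync with the table.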

Embracing Opinionated Architecture

Rather than trying to keep all options open, teams can embrace opinionated architectural patterns that align with their chosen infrastructure. For example, adopting established serverless patterns for event-driven architectures reduces decision fatigue and enforces best practices. While this might feel like infrastructure dictating the development model, in reality it speeds up development and sets a clear path for future architectural decisions. With standard patterns like these established, teams building new applications naturally start solving problems in a way that fits the framework and build things the right way from the start, reducing both the complexity of adopting new infrastructure and the maintenance overhead.

Investing in Training and Knowledge

Infrastructure knowledge is a must for all engineers, not just those on a platform team. For an application developer, choosing infrastructure should be no different from choosing a programming language or a database. While understanding every configuration detail and parameter of a cloud service might seem daunting, every engineer should invest time in learning what a service offers and what its pros and cons are, so they can make informed choices and proactively embrace them.

Final Thoughts: Infrastructure Driven Development

While cloud services have abstracted many low-level details, they remain what their name suggests, i.e. infrastructure as a service. The complexity still exists, but we can manage it better by shifting infrastructure considerations upstream in the development process. This is not just about better tools or deeper knowledge and expertise, but about fundamentally changing the way we think about and approach application development in the cloud era.


The most successful teams aren't those with the best infrastructure tools or the largest platform teams - they're the ones who have learned to think about infrastructure as an integral part of their application design process. This is what I call Infrastructure-Driven Development - where infrastructure considerations shape and guide application architecture from day one, rather than being an afterthought to be dealt with during deployment.