Modern cloud architectures are built on several key concepts that address the challenges of building large-scale, distributed, and reliable systems. This note provides an overview of the architectural approaches used in modern cloud systems.
Architectural Foundations
Modern cloud architectures are founded on two fundamental pillars:
- Vertical integration - Enhancing capabilities within individual tiers/services
- Horizontal scaling - Using multiple commodity computers working together
These pillars have led to significant shifts away from monolithic application architectures toward more distributed approaches.
Architectural Concepts
Layering
- Definition: Partitioning services vertically into layers
- Lower layers provide services to higher ones
- Higher layers unaware of underlying implementation details
- Low inter-layer dependency
- Examples:
- Network protocol stacks (OSI model)
- Operating systems (kernel, drivers, libraries, GUI)
- Games (engine, logic, AI, UI)
- Advantages:
- Abstraction
- Reusability
- Loose coupling
- Isolated management and testing
- Supports software evolution
Tiering
- Definition: Mapping layers, and the components within them, onto physical or virtual devices
- Implies physical location considerations
- Complements layering
- Classic Architectures:
- 2-tier (client-server): Split layers between client and server
- 3-tier: User Interface, Application Logic, Data tiers
- n-tier/multi-tier: Further division (e.g., microservices)
- Advantages:
- Scalability
- Availability
- Flexibility
- Easier management
Monolith vs. Distributed Architecture
Monolithic Architecture
- Definition: A single, tightly coupled block of code with all application components
- Advantages:
- Simple to develop and deploy
- Easy to test and debug in early stages
- Disadvantages:
- Increasing complexity as application grows
- Difficult to scale individual components
- Limited agility with slow and risky deployments
- Technology lock-in
Distributed Architecture
- Definition: Application divided into loosely coupled components running on separate servers
- Advantages:
- Independent scaling of components
- Fault isolation
- Technology diversity
- Better maintainability
- Disadvantages:
- Network communication overhead
- More complex to manage
- Distributed debugging challenges
Practical Application Guidelines
When designing cloud architectures:
- Foundation matters: Just as buildings need proper foundations, cloud architectures require robust infrastructure layers
- Consider scalability & modularity: Employ modular techniques for easier expansion and modification
- Focus on resource efficiency: Implement auto-scaling, serverless approaches, and efficient resource allocation
- Plan for evolution: Design systems that can adapt to new technologies while maintaining stability
Modern Cloud Architectures - Redundancy
Redundancy is a key design principle in modern cloud architectures that improves fault tolerance, availability, and performance.
Why Use Redundancy?
- Performance: Distribute workload across multiple replicas to improve response time
- Error Detection: Compare results when replicas disagree
- Error Recovery: Switch to backup resources when primary fails
- Fault Tolerance: System continues functioning despite component failures
Importance of Fault Models
The effectiveness of redundancy depends on how individual replicas fail:
For independent crash faults, the availability of a system with n replicas is:
Availability = 1 - p^n, where p is the probability of individual failure
Example: 5 servers each with 90% uptime → overall availability = 1-(0.10)^5 = 99.999%
This only holds if failures are truly independent, which requires consideration of common failure modes.
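As a quick sanity check, the availability formula can be evaluated directly. A minimal sketch (the function name is ours):

```python
def availability(p_fail: float, n: int) -> float:
    """Availability of a system of n replicas under independent
    crash faults: the system is down only when all n replicas fail
    at once, which happens with probability p_fail ** n."""
    return 1 - p_fail ** n

# 5 servers, each with 90% uptime (10% failure probability):
print(f"{availability(0.10, 5):.3%}")  # 99.999%
```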
Redundancy by Replication
Replication involves maintaining multiple copies of:
- Data
- Services
- Infrastructure components
Data Replication
Synchronous Replication: Write operations complete only after all replicas are updated
- Ensures consistency but increases latency
- Used for critical data where consistency is paramount
Asynchronous Replication: Primary replica acknowledges writes before secondaries are updated
- Better performance but may lose data if primary fails before replication
- Used when performance is prioritized over consistency
Quorum-based Replication: Write operations complete when a majority of replicas acknowledge
- Balances availability and consistency
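The quorum idea reduces to a simple arithmetic condition. A sketch assuming the classic read/write quorum intersection rules (names are ours):

```python
def quorum_ok(n: int, w: int, r: int) -> bool:
    """Classic quorum intersection rules for n replicas:
    every read quorum must overlap every write quorum (r + w > n),
    and any two write quorums must overlap (2w > n), so a read
    always sees the latest acknowledged write."""
    return r + w > n and 2 * w > n

print(quorum_ok(n=5, w=3, r=3))  # True  (majority quorums)
print(quorum_ok(n=5, w=2, r=2))  # False (a read can miss the latest write)
```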
Service Replication
Active-Passive Replication:
- One active instance handles all requests
- Passive instances ready to take over if active fails
- Lower resource utilization but potential downtime during failover
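A toy illustration of active-passive failover; every name and behavior here is invented for this note:

```python
class FailoverPair:
    """Active-passive sketch: the active instance handles every
    request; when it fails, the passive one is promoted and the
    request is retried."""

    def __init__(self, active, passive):
        self.active, self.passive = active, passive

    def handle(self, request):
        try:
            return self.active(request)
        except Exception:
            # Failover: promote the passive replica and retry once.
            self.active, self.passive = self.passive, self.active
            return self.active(request)

def primary(request):
    raise RuntimeError("primary down")   # simulate a crashed active node

def backup(request):
    return f"served {request} from backup"

pair = FailoverPair(primary, backup)
print(pair.handle("GET /"))  # served GET / from backup
```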
Active-Active Replication:
- Multiple active instances handle requests simultaneously
- No downtime during instance failure
- Requires more complex state management
Infrastructure Redundancy
Modern cloud data centers implement redundancy at multiple levels:
Hardware Redundancy
Geographic Redundancy:
- Data centers distributed across multiple regions
- Mitigates regional outages from natural disasters, power grid failures
- Data typically replicated across regions
Server Redundancy:
- Servers deployed in clusters with automatic failover
- If one server fails, another takes over seamlessly
Storage Redundancy:
- Data replicated across multiple devices and technologies
- RAID configurations protect against disk failures
Network Redundancy
Server-level Redundancy:
- Redundant Network Interface Cards (NICs)
- Dual or more power supplies
Network-level Redundancy:
- Redundant switches, routers, firewalls, load balancers
Link and Path-level Redundancy:
- Link aggregation (multiple links between devices)
- Spanning Tree Protocol to prevent network loops
- Load balancing across multiple paths
Network topologies designed for redundancy:
- Hierarchical/3-tier topology
- Fat-tree/Clos topology
Power Redundancy
- Multiple power feeds from different utility substations
- Uninterruptible Power Supplies (UPS) for temporary outages
- Backup generators for medium/long-term outages
- Power Distribution Units with dual inputs
Cooling Redundancy
- N+1 configuration (one more cooling unit than required)
- Multiple cooling technologies
- Redundant cooling loops (pipes, heat exchangers, pumps)
- Hot/cold aisle containment
Redundancy Challenges
- Cost: Redundant systems require additional hardware and management
- Complexity: More components mean more potential failure points
- Consistency: Maintaining consistent state across replicas
- Testing: Verifying redundancy actually works as expected
Modern Cloud Architectures - Scalability
Scaling Fundamentals
Scaling is the process of adding or removing resources to match workload demand. In cloud architectures, two primary scaling approaches are used:
Vertical Scaling (Scaling Up)
- Definition: Increasing the performance of a single node by adding more resources (CPU cores, memory, etc.)
- Advantages:
- Good speedup up to a particular point
- No application architecture changes required
- Simpler to implement
- Disadvantages:
- Beyond a certain point, speedup becomes very expensive
- Limited by hardware capabilities
- Single point of failure remains
- Potential downtime during scaling operations
Horizontal Scaling (Scaling Out)
- Definition: Increasing the number of nodes in the system
- Advantages:
- Cost-effective way to grow total resources
- Better fault tolerance through redundancy
- Virtually unlimited scaling potential
- Disadvantages:
- Requires coordination systems and load balancing
- Application must be designed for distributed operation
- More complex to efficiently utilize resources
Why Horizontal Scaling Dominates Cloud Architectures
- Hardware Trend: CPUs are no longer getting substantially faster year over year
- Economic Factor: Large sets of inexpensive commodity servers are more cost-effective
- Failure Reality: All hardware eventually fails
- Virtualization Advantage: VMs and containers make it easy to replicate services across nodes
Dynamic Scaling Architecture
Modern cloud systems implement dynamic scaling to automatically adjust resources:
- Monitoring: Track metrics like CPU usage, memory usage, request rates
- Thresholds: Define conditions that trigger scaling actions
- Scaling Actions: Add/remove resources when thresholds are crossed
- Stabilization: Implement cooldown periods to prevent oscillation
Example Process Flow:
- Consumers send more requests to a service
- Existing resources become overloaded, timeouts occur
- Auto-scaling detects the condition and deploys additional resources
- Traffic is redistributed across all available resources
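The monitor/threshold/cooldown loop above can be sketched in miniature; the class name, thresholds, and cooldown value below are invented for illustration:

```python
class AutoScaler:
    """Threshold-based scaler with a cooldown (stabilization)
    period to prevent oscillation between scale-up and scale-down."""

    def __init__(self, scale_up_at=0.8, scale_down_at=0.3, cooldown_s=300):
        self.scale_up_at = scale_up_at
        self.scale_down_at = scale_down_at
        self.cooldown_s = cooldown_s
        self.last_action_at = float("-inf")

    def decide(self, cpu_utilization: float, now: float) -> int:
        """Return +1 (add an instance), -1 (remove one), or 0."""
        # Stabilization: ignore metrics inside the cooldown window.
        if now - self.last_action_at < self.cooldown_s:
            return 0
        if cpu_utilization > self.scale_up_at:
            self.last_action_at = now
            return +1
        if cpu_utilization < self.scale_down_at:
            self.last_action_at = now
            return -1
        return 0

scaler = AutoScaler()
print(scaler.decide(0.95, now=0))   # +1: overload triggers scale-up
print(scaler.decide(0.95, now=60))  # 0: still inside the cooldown window
```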
Scaling and State
Scaling approaches differ based on whether components are stateless or stateful:
Stateless Components
- Definition: Maintain no internal state beyond processing a single request
- Examples: Web servers with static content, DNS servers, mathematical calculation services
- Scaling Approach: Simply create more instances and distribute requests via load balancing
Stateful Components
- Definition: Maintain state beyond a single request (prior state is required to process future requests)
- Examples: Database servers, mail servers, stateful web servers, session management
- Scaling Approach: More complex, typically requires partitioning and/or replication
Stateless Load Balancing
DNS-Level Load Balancing
- Implementation: DNS servers resolve domain names to different IP addresses
- Advantages: Simple, cost-effective, can use geographical location
- Disadvantages: Slow to react to failures due to DNS caching, limited health checks
IP-Level Load Balancing
- Implementation: Routers direct clients to different locations using IP anycast
- Advantages: Relatively simple, faster response to failures
- Disadvantages: Less granular, assumes all requests create equal load
Application-Level Load Balancing
- Implementation: Dedicated load balancer acting as a front end
- Advantages: Granular control, content-based routing, SSL offloading
- Disadvantages: Increased complexity, performance overhead, higher latency
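Because stateless replicas are interchangeable, the simplest application-level policy is round-robin. A minimal sketch (class and backend names are invented):

```python
import itertools

class RoundRobinBalancer:
    """Cycle requests over a fixed set of stateless backends:
    since no request depends on prior state, any replica can
    serve any request."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(list(backends))

    def next_backend(self) -> str:
        return next(self._cycle)

lb = RoundRobinBalancer(["app-1:8080", "app-2:8080", "app-3:8080"])
print([lb.next_backend() for _ in range(4)])
# ['app-1:8080', 'app-2:8080', 'app-3:8080', 'app-1:8080']
```

Real load balancers add health checks and weighting on top of this; the round-robin core stays the same.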
Stateful Scaling
Scaling stateful services presents unique challenges:
Partitioning (Sharding)
- Definition: Dividing data into distinct, independent parts
- Purpose: Improves scalability (performance), but not availability
- Key Consideration: Each data item is stored in only one partition
Partitioning Schemes:
Per-Tenant Partitioning
- Put different tenants on different machines
- Good isolation and scalability
- Challenging when a tenant grows beyond one machine
Horizontal Sharding
- Split table by rows across different servers
- Each shard has same schema but contains subset of rows
- Easy to scale out; each shard maintains smaller indices
- Examples: Google BigTable, MongoDB
Vertical Partitioning
- Split table by columns, grouping related columns
- Improves performance for specific queries
- Doesn’t inherently support scaling across multiple servers
Distribution Strategies:
Distribution Strategies:
Range Partitioning
- Related data stored together
- Efficient for range queries
- Poor load balancing, requires manual adjustment
Hash Partitioning
- Uniform distribution
- Good load balancing
- Inefficient for range queries
- Requires reorganization when number of partitions changes
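The trade-off between the two strategies is easy to see in code. A sketch (function names and boundary values are ours):

```python
import hashlib

def hash_partition(key: str, num_partitions: int) -> int:
    """Hash partitioning: a stable hash spreads keys uniformly,
    but changing num_partitions remaps almost every key."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_partitions

def range_partition(key: str, boundaries: list) -> int:
    """Range partitioning: a key goes to the first range whose
    upper boundary is >= the key, so related keys stay together."""
    for i, upper in enumerate(boundaries):
        if key <= upper:
            return i
    return len(boundaries)  # final, open-ended partition

# A range query such as "names from g to l" touches one range
# partition, but potentially every hash partition.
print(range_partition("harris", ["f", "m", "s"]))  # 1
```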
Modern Cloud Architectures - Microservices
Evolution from Monolith to Microservices
Traditional monolithic applications face challenges as they grow:
- Increasingly difficult to maintain
- Hard to scale specific components
- Complex to evolve with changing requirements
- Technology lock-in
Microservices architecture emerged as a solution to these challenges.
What Are Microservices?
Microservices architecture is an approach to developing a single application as a suite of small services, each:
- Running in its own process
- Communicating through lightweight mechanisms (often HTTP/REST APIs)
- Independently deployable
- Built around business capabilities
- Potentially implemented using different technologies
Key Characteristics of Microservices
- Loose coupling: Services interact through well-defined interfaces
- Independent deployment: Each service can be deployed without affecting others
- Technology diversity: Different services can use different technologies
- Focused on business capabilities: Services aligned with business domains
- Small size: Each service focuses on doing one thing well
- Decentralized data management: Each service manages its own data
- Automated deployment: CI/CD pipelines for each service
- Designed for failure: Resilience built in through isolation
Microservices Architecture Components
A typical microservices architecture includes:
- Core Services: Implement business functionality
- API Gateway: Provides a single entry point for clients
- Service Registry: Keeps track of service instances and locations
- Config Server: Centralized configuration management
- Monitoring and Tracing: Distributed system observability
- Load Balancer: Distributes traffic among service instances
Advantages of Microservices
Independent Development:
- Teams can work on different services simultaneously
- Faster development cycles
- Smaller codebases are easier to understand
Technology Flexibility:
- Each service can use the most appropriate tech stack
- Easier to adopt new technologies incrementally
Scalability:
- Services can be scaled independently based on demand
- More efficient resource utilization
Fault Isolation:
- Failures in one service don’t necessarily affect others
- Easier to implement resilience patterns
Maintainability:
- Smaller codebases are less complex
- Easier to understand and debug
- New team members can become productive faster
Reusability:
- Services can be reused in different contexts
- Example: Netflix Asgard, Eureka services used in multiple projects
Disadvantages of Microservices
Complexity:
- Increased operational overhead with more services to manage and monitor
- Distributed debugging challenges - tracing issues across multiple services
- Complexity of service interactions and dependencies
Performance Overhead:
- Latency due to network communication between services
- Serialization/deserialization costs
- Network bandwidth consumption
Operational Challenges:
- Microservice sprawl - could expand to hundreds or thousands of services
- Managing CI/CD pipelines for multiple services
- End-to-end testing becomes more difficult
Failure Patterns:
- Interdependency chains can cause cascading failures
- Death spirals (remaining replicas of a service overloaded as others fail)
- Retry storms (wasted resources on failed calls)
- Cascading QoS violations due to bottleneck services
- Failure recovery potentially slower than in monoliths
Microservice Communication
Synchronous Communication
- REST APIs (HTTP/HTTPS): Simple request-response pattern
- gRPC: Efficient binary protocol with bidirectional streaming
- GraphQL: Query-based, client specifies exactly what data it needs
Pros:
- Immediate response
- Simpler to implement
- Easier to debug
Cons:
- Tight coupling
- Higher latency
- Lower fault tolerance
Asynchronous Communication
- Message queues: RabbitMQ, ActiveMQ
- Event streaming: Apache Kafka, AWS Kinesis
- Pub/Sub pattern: Google Cloud Pub/Sub
Pros:
- Loose coupling
- Better scalability
- Higher fault tolerance
Cons:
- More complex to implement
- Harder to debug
- Eventually consistent
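The decoupling that asynchronous messaging provides can be illustrated with an in-process queue standing in for a real broker (RabbitMQ, Kafka, etc.); message contents and names below are invented:

```python
import queue
import threading

# In-process stand-in for a message broker: the producer returns as
# soon as the message is enqueued, and the consumer processes it on
# its own schedule (loose coupling, eventual consistency).
broker = queue.Queue()
results = []

def consumer():
    while True:
        msg = broker.get()
        if msg is None:          # sentinel: shut down the consumer
            break
        results.append(f"processed {msg}")

t = threading.Thread(target=consumer)
t.start()

broker.put("order-created")      # producer does not wait for processing
broker.put("payment-received")
broker.put(None)
t.join()
print(results)  # ['processed order-created', 'processed payment-received']
```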
Glueware and Support Infrastructure
Microservices require substantial supporting infrastructure (“glueware”) that often outweighs the core services:
- Monitoring and logging systems
- Service discovery mechanisms
- Load balancing services
- API gateways
- Message brokers
- Circuit breakers for resilience
- Distributed tracing tools
- Configuration management
According to the Cloud Native Computing Foundation’s 2022 survey, glueware now outweighs core microservices in most deployments.
Avoiding Microservice Sprawl
To prevent excessive complexity with microservices:
Start with a monolith design
- Gradually break it down into microservices as needed
- Identify natural boundaries and avoid over-decomposition
Focus on business capabilities
- Design around clear business purposes rather than technical functions
Establish clear governance
- Define guidelines and best practices for microservice development
- Create standards for naming conventions, communication protocols, etc.
Implement fault-tolerant design patterns
- Timeouts, bounded retries, circuit breakers
- Graceful degradation
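A minimal circuit-breaker sketch tying these patterns together; real breakers also add a half-open state with a recovery timeout, which is omitted here, and all names are invented:

```python
class CircuitBreaker:
    """After max_failures consecutive failures the circuit opens:
    further calls fail fast to a fallback (graceful degradation)
    instead of waiting on a service that is already down."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    def call(self, fn, fallback):
        if self.failures >= self.max_failures:   # circuit open
            return fallback()                    # fail fast
        try:
            result = fn()
            self.failures = 0                    # success closes the circuit
            return result
        except Exception:
            self.failures += 1                   # count the failure
            return fallback()

breaker = CircuitBreaker(max_failures=2)

def flaky():
    raise RuntimeError("service unavailable")

for _ in range(4):
    print(breaker.call(flaky, fallback=lambda: "cached response"))
```

After the second failure the breaker stops calling `flaky` at all, so no resources are wasted on retry storms against a dead dependency.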