Modern cloud architectures are built on several key concepts that address the challenges of building large-scale, distributed, and reliable systems. This note provides an overview of the architectural approaches used in modern cloud systems.

Architectural Foundations

Modern cloud architectures are founded on two fundamental pillars:

  1. Vertical integration - Enhancing capabilities within individual tiers/services
  2. Horizontal scaling - Using multiple commodity computers working together

These pillars have led to significant shifts away from monolithic application architectures toward more distributed approaches.

Architectural Concepts

Layering

  • Definition: Partitioning services vertically into layers

    • Lower layers provide services to higher ones
    • Higher layers unaware of underlying implementation details
    • Low inter-layer dependency
  • Examples:

    • Network protocol stacks (OSI model)
    • Operating systems (kernel, drivers, libraries, GUI)
    • Games (engine, logic, AI, UI)
  • Advantages:

    • Abstraction
    • Reusability
    • Loose coupling
    • Isolated management and testing
    • Supports software evolution

Tiering

  • Definition: Mapping layers, and the components within them, onto physical or virtual devices

    • Implies physical location considerations
    • Complements layering
  • Classic Architectures:

    1. 2-tier (client-server): Split layers between client and server
    2. 3-tier: User Interface, Application Logic, Data tiers
    3. n-tier/multi-tier: Further division (e.g., microservices)
  • Advantages:

    • Scalability
    • Availability
    • Flexibility
    • Easier management

Monolith vs. Distributed Architecture

Monolithic Architecture

  • Definition: A single, tightly coupled block of code with all application components
  • Advantages:
    • Simple to develop and deploy
    • Easy to test and debug in early stages
  • Disadvantages:
    • Increasing complexity as application grows
    • Difficult to scale individual components
    • Limited agility with slow and risky deployments
    • Technology lock-in

Distributed Architecture

  • Definition: Application divided into loosely coupled components running on separate servers
  • Advantages:
    • Independent scaling of components
    • Fault isolation
    • Technology diversity
    • Better maintainability
  • Disadvantages:
    • Network communication overhead
    • More complex to manage
    • Distributed debugging challenges

Practical Application Guidelines

When designing cloud architectures:

  1. Foundation matters: Just as buildings need proper foundations, cloud architectures require robust infrastructure layers

  2. Consider scalability & modularity: Employ modular techniques for easier expansion and modification

  3. Focus on resource efficiency: Implement auto-scaling, serverless approaches, and efficient resource allocation

  4. Plan for evolution: Design systems that can adapt to new technologies while maintaining stability

Modern Cloud Architectures - Redundancy

Redundancy is a key design principle in modern cloud architectures that improves fault tolerance, availability, and performance.

Why Use Redundancy?

  • Performance: Distribute workload across multiple replicas to improve response time
  • Error Detection: Compare results when replicas disagree
  • Error Recovery: Switch to backup resources when primary fails
  • Fault Tolerance: System continues functioning despite component failures

Importance of Fault Models

The effectiveness of redundancy depends on how individual replicas fail:

  • For independent crash faults, the availability of a system with n replicas is:

    Availability = 1-p^n
    

    Where p is the probability that an individual replica has failed

  • Example: 5 servers each with 90% uptime → overall availability = 1-(0.10)^5 = 99.999%

This only holds if failures are truly independent, which requires consideration of common failure modes.
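Under that independence assumption, the formula is easy to check numerically; a minimal sketch (the function name is illustrative):

```python
def availability(p: float, n: int) -> float:
    """Availability of n replicas that fail independently, each with probability p."""
    return 1 - p ** n

# 5 servers, each with 90% uptime (failure probability p = 0.10)
print(round(availability(0.10, 5), 5))  # 0.99999, i.e. "five nines"
```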

Redundancy by Replication

Replication involves maintaining multiple copies of:

  • Data
  • Services
  • Infrastructure components

Data Replication

  • Synchronous Replication: Write operations complete only after all replicas are updated

    • Ensures consistency but increases latency
    • Used for critical data where consistency is paramount
  • Asynchronous Replication: Primary replica acknowledges writes before secondaries are updated

    • Better performance but may lose data if primary fails before replication
    • Used when performance is prioritized over consistency
  • Quorum-based Replication: Write operations complete when a majority of replicas acknowledge

    • Balances availability and consistency
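A common formalization of the quorum idea (not spelled out above) is that a read quorum of size R and a write quorum of size W among N replicas must overlap in at least one replica, which holds exactly when W + R > N. A one-line sketch:

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True when every read quorum of size r intersects every write quorum
    of size w among n replicas, so a read always sees the latest write."""
    return w + r > n

# Majority quorums for n = 5 replicas:
assert quorums_overlap(5, 3, 3)
# A single-replica read (r = 1) only works if writes hit every replica:
assert not quorums_overlap(5, 3, 1)
assert quorums_overlap(5, 5, 1)
```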

Service Replication

  • Active-Passive Replication:

    • One active instance handles all requests
    • Passive instances ready to take over if active fails
    • Lower resource utilization but potential downtime during failover
  • Active-Active Replication:

    • Multiple active instances handle requests simultaneously
    • No downtime during instance failure
    • Requires more complex state management
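The active-passive failover idea can be sketched in a few lines (node names are made up; real failure detection via heartbeats and state synchronization are omitted):

```python
class ActivePassivePair:
    """Active-passive sketch: the active node serves every request; on a
    detected failure the standby is promoted (heartbeat detection omitted)."""

    def __init__(self, active: str, standby: str):
        self.active = active
        self.standby = standby

    def handle(self, request: str) -> str:
        return f"{self.active} served {request}"

    def failover(self) -> None:
        # Promote the standby; a brief service gap occurs during this switch.
        self.active, self.standby = self.standby, self.active

pair = ActivePassivePair("node-a", "node-b")
assert pair.handle("req-1") == "node-a served req-1"
pair.failover()   # node-a crashed: promote node-b
assert pair.handle("req-2") == "node-b served req-2"
```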

Infrastructure Redundancy

Modern cloud data centers implement redundancy at multiple levels:

Hardware Redundancy

  • Geographic Redundancy:

    • Data centers distributed across multiple regions
    • Mitigates regional outages from natural disasters, power grid failures
    • Data typically replicated across regions
  • Server Redundancy:

    • Servers deployed in clusters with automatic failover
    • If one server fails, another takes over seamlessly
  • Storage Redundancy:

    • Data replicated across multiple devices and technologies
    • RAID configurations protect against disk failures

Network Redundancy

  1. Server-level Redundancy:

    • Redundant Network Interface Cards (NICs)
    • Dual or more power supplies
  2. Network-level Redundancy:

    • Redundant switches, routers, firewalls, load balancers
  3. Link and Path-level Redundancy:

    • Link aggregation (multiple links between devices)
    • Spanning Tree Protocol to prevent network loops
    • Load balancing across multiple paths

Network topologies designed for redundancy:

  • Hierarchical/3-tier topology
  • Fat-tree/Clos topology

Power Redundancy

  • Multiple power feeds from different utility substations
  • Uninterruptible Power Supplies (UPS) for temporary outages
  • Backup generators for medium/long-term outages
  • Power Distribution Units with dual inputs

Cooling Redundancy

  • N+1 configuration (one more cooling unit than required)
  • Multiple cooling technologies
  • Redundant cooling loops (pipes, heat exchangers, pumps)
  • Hot/cold aisle containment

Redundancy Challenges

  • Cost: Redundant systems require additional hardware and management
  • Complexity: More components mean more potential failure points
  • Consistency: Maintaining consistent state across replicas
  • Testing: Verifying redundancy actually works as expected

Modern Cloud Architectures - Scalability

Scaling Fundamentals

Scaling is the process of adding or removing resources to match workload demand. In cloud architectures, two primary scaling approaches are used:

Vertical Scaling (Scaling Up)

  • Definition: Increasing the performance of a single node by adding more resources (CPU cores, memory, etc.)
  • Advantages:
    • Good speedup up to a particular point
    • No application architecture changes required
    • Simpler to implement
  • Disadvantages:
    • Beyond a certain point, speedup becomes very expensive
    • Limited by hardware capabilities
    • Single point of failure remains
    • Potential downtime during scaling operations

Horizontal Scaling (Scaling Out)

  • Definition: Increasing the number of nodes in the system
  • Advantages:
    • Cost-effective way to grow total resources
    • Better fault tolerance through redundancy
    • Virtually unlimited scaling potential
  • Disadvantages:
    • Requires coordination systems and load balancing
    • Application must be designed for distributed operation
    • More complex to efficiently utilize resources

Why Horizontal Scaling Dominates Cloud Architectures

  • Hardware Trend: CPUs are no longer getting substantially faster the way they once did
  • Economic Factor: Large sets of inexpensive commodity servers are more cost-effective
  • Failure Reality: All hardware eventually fails
  • Virtualization Advantage: VMs and containers make it easy to replicate services across nodes

Dynamic Scaling Architecture

Modern cloud systems implement dynamic scaling to automatically adjust resources:

  1. Monitoring: Track metrics like CPU usage, memory usage, request rates
  2. Thresholds: Define conditions that trigger scaling actions
  3. Scaling Actions: Add/remove resources when thresholds are crossed
  4. Stabilization: Implement cooldown periods to prevent oscillation

Example Process Flow:

  1. Consumers send more requests to a service
  2. Existing resources become overloaded, timeouts occur
  3. Auto-scaling detects the condition and deploys additional resources
  4. Traffic is redistributed across all available resources
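The threshold-based loop in steps 1–4 can be sketched as a pure decision function (the thresholds, limits, and names here are illustrative assumptions, not any particular cloud provider's API):

```python
def desired_instances(cpu_avg: float, current: int,
                      scale_out_at: float = 0.8, scale_in_at: float = 0.3,
                      min_n: int = 1, max_n: int = 10) -> int:
    """Threshold-based scaling decision: one instance out when average CPU
    crosses the high threshold, one in when it drops below the low one."""
    if cpu_avg > scale_out_at:
        return min(current + 1, max_n)
    if cpu_avg < scale_in_at:
        return max(current - 1, min_n)
    return current  # stable band between thresholds: no action

assert desired_instances(0.9, 3) == 4   # overloaded: scale out
assert desired_instances(0.2, 3) == 2   # underused: scale in
assert desired_instances(0.5, 3) == 3   # stable: hold (cooldown would apply here)
```

In a real autoscaler this decision would run once per evaluation period, with a cooldown after each action to prevent oscillation (step 4 above).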

Scaling and State

Scaling approaches differ based on whether components are stateless or stateful:

Stateless Components

  • Definition: Maintain no internal state beyond processing a single request
  • Examples: Web servers with static content, DNS servers, mathematical calculation services
  • Scaling Approach: Simply create more instances and distribute requests via load balancing

Stateful Components

  • Definition: Maintain state beyond a single request (prior state is required to process future requests)
  • Examples: Database servers, mail servers, stateful web servers, session management
  • Scaling Approach: More complex, typically requires partitioning and/or replication

Stateless Load Balancing

DNS-Level Load Balancing

  • Implementation: DNS servers resolve domain names to different IP addresses
  • Advantages: Simple, cost-effective, can use geographical location
  • Disadvantages: Slow to react to failures due to DNS caching, limited health checks

IP-Level Load Balancing

  • Implementation: Routers direct clients to different locations using IP anycast
  • Advantages: Relatively simple, faster response to failures
  • Disadvantages: Less granular, assumes all requests create equal load

Application-Level Load Balancing

  • Implementation: Dedicated load balancer acting as a front end
  • Advantages: Granular control, content-based routing, SSL offloading
  • Disadvantages: Increased complexity, performance overhead, higher latency
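As a toy illustration of application-level load balancing, a round-robin front end can be sketched as follows (backend addresses are made up; real load balancers add health checks and content-based routing):

```python
import itertools

class RoundRobinBalancer:
    """Toy application-level load balancer: cycles requests over backends."""

    def __init__(self, backends: list[str]):
        self._cycle = itertools.cycle(backends)

    def pick(self) -> str:
        """Return the backend that should serve the next request."""
        return next(self._cycle)

lb = RoundRobinBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
picks = [lb.pick() for _ in range(6)]
# each backend receives an equal share of the six requests
```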

Stateful Scaling

Scaling stateful services presents unique challenges:

Partitioning (Sharding)

  • Definition: Dividing data into distinct, independent parts
  • Purpose: Improves scalability (performance), but not availability
  • Key Consideration: Each data item is stored in only one partition

Partitioning Schemes:

  1. Per-Tenant Partitioning

    • Put different tenants on different machines
    • Good isolation and scalability
    • Challenging when a tenant grows beyond one machine
  2. Horizontal Sharding

    • Split table by rows across different servers
    • Each shard has same schema but contains subset of rows
    • Easy to scale out; each shard has smaller tables and indexes
    • Examples: Google BigTable, MongoDB
  3. Vertical Partitioning

    • Split table by columns, grouping related columns
    • Improves performance for specific queries
    • Doesn’t inherently support scaling across multiple servers

Distribution Strategies:

  • Range Partitioning

    • Related data stored together
    • Efficient for range queries
    • Poor load balancing, requires manual adjustment
  • Hash Partitioning

    • Uniform distribution
    • Good load balancing
    • Inefficient for range queries
    • Requires reorganization when number of partitions changes
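The trade-off between the two strategies can be sketched in a few lines (partition counts and boundaries are illustrative):

```python
import hashlib

def hash_partition(key: str, n_partitions: int) -> int:
    """Hash partitioning: keys spread uniformly, but a range query must
    touch every partition."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % n_partitions

def range_partition(key: str, boundaries: list[str]) -> int:
    """Range partitioning: keys stay sorted, so a range query hits only the
    partitions covering that range; hotspots need manual rebalancing."""
    for i, upper in enumerate(boundaries):
        if key < upper:
            return i
    return len(boundaries)

assert range_partition("alice", ["m"]) == 0   # a–l on partition 0
assert range_partition("zoe", ["m"]) == 1     # m–z on partition 1
assert 0 <= hash_partition("alice", 4) < 4
```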

Modern Cloud Architectures - Microservices

Evolution from Monolith to Microservices

Traditional monolithic applications face challenges as they grow:

  • Increasingly difficult to maintain
  • Hard to scale specific components
  • Complex to evolve with changing requirements
  • Technology lock-in

Microservices architecture emerged as a solution to these challenges.

What Are Microservices?

Microservices architecture is an approach to developing a single application as a suite of small services, each:

  • Running in its own process
  • Communicating through lightweight mechanisms (often HTTP/REST APIs)
  • Independently deployable
  • Built around business capabilities
  • Potentially implemented using different technologies

Key Characteristics of Microservices

  • Loose coupling: Services interact through well-defined interfaces
  • Independent deployment: Each service can be deployed without affecting others
  • Technology diversity: Different services can use different technologies
  • Focused on business capabilities: Services aligned with business domains
  • Small size: Each service focuses on doing one thing well
  • Decentralized data management: Each service manages its own data
  • Automated deployment: CI/CD pipelines for each service
  • Designed for failure: Resilience built in through isolation

Microservices Architecture Components

A typical microservices architecture includes:

  1. Core Services: Implement business functionality
  2. API Gateway: Provides a single entry point for clients
  3. Service Registry: Keeps track of service instances and locations
  4. Config Server: Centralized configuration management
  5. Monitoring and Tracing: Distributed system observability
  6. Load Balancer: Distributes traffic among service instances
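The core interface of a service registry (component 3 above) can be sketched in a few lines; service names and endpoints are made up, and real registries such as Eureka add health checks and lease expiry:

```python
class ServiceRegistry:
    """Toy registry: instances register endpoints under a service name and
    clients look them up; health checks and leases are omitted."""

    def __init__(self) -> None:
        self._services: dict[str, list[str]] = {}

    def register(self, name: str, endpoint: str) -> None:
        self._services.setdefault(name, []).append(endpoint)

    def deregister(self, name: str, endpoint: str) -> None:
        self._services[name].remove(endpoint)

    def lookup(self, name: str) -> list[str]:
        return list(self._services.get(name, []))

reg = ServiceRegistry()
reg.register("orders", "10.0.1.5:8080")
reg.register("orders", "10.0.1.6:8080")
assert reg.lookup("orders") == ["10.0.1.5:8080", "10.0.1.6:8080"]
reg.deregister("orders", "10.0.1.5:8080")
assert reg.lookup("orders") == ["10.0.1.6:8080"]
```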

Advantages of Microservices

  1. Independent Development:

    • Teams can work on different services simultaneously
    • Faster development cycles
    • Smaller codebases are easier to understand
  2. Technology Flexibility:

    • Each service can use the most appropriate tech stack
    • Easier to adopt new technologies incrementally
  3. Scalability:

    • Services can be scaled independently based on demand
    • More efficient resource utilization
  4. Fault Isolation:

    • Failures in one service don’t necessarily affect others
    • Easier to implement resilience patterns
  5. Maintainability:

    • Smaller codebases are less complex
    • Easier to understand and debug
    • New team members can become productive faster
  6. Reusability:

    • Services can be reused in different contexts
    • Example: Netflix’s Asgard and Eureka services are reused across multiple projects

Disadvantages of Microservices

  1. Complexity:

    • Increased operational overhead with more services to manage and monitor
    • Distributed debugging challenges - tracing issues across multiple services
    • Complexity of service interactions and dependencies
  2. Performance Overhead:

    • Latency due to network communication between services
    • Serialization/deserialization costs
    • Network bandwidth consumption
  3. Operational Challenges:

    • Microservice sprawl - deployments can grow to hundreds or thousands of services
    • Managing CI/CD pipelines for multiple services
    • End-to-end testing becomes more difficult
  4. Failure Patterns:

    • Interdependency chains can cause cascading failures
    • Death spirals (cascading failures among containers of the same service)
    • Retry storms (wasted resources on failed calls)
    • Cascading QoS violations due to bottleneck services
    • Failure recovery potentially slower than in monoliths

Microservice Communication

Synchronous Communication

  • REST APIs (HTTP/HTTPS): Simple request-response pattern
  • gRPC: Efficient binary protocol with bidirectional streaming
  • GraphQL: Query-based, client specifies exactly what data it needs

Pros:

  • Immediate response
  • Simpler to implement
  • Easier to debug

Cons:

  • Tight coupling
  • Higher latency
  • Lower fault tolerance

Asynchronous Communication

  • Message queues: RabbitMQ, ActiveMQ
  • Event streaming: Apache Kafka, AWS Kinesis
  • Pub/Sub pattern: Google Cloud Pub/Sub

Pros:

  • Loose coupling
  • Better scalability
  • Higher fault tolerance

Cons:

  • More complex to implement
  • Harder to debug
  • Eventually consistent
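The decoupling a broker provides can be illustrated in-process with a plain queue. This is only a sketch standing in for a real broker such as RabbitMQ or Kafka: the producer returns immediately, and a consumer drains the queue on its own schedule.

```python
import queue
import threading

broker: queue.Queue = queue.Queue()   # in-process stand-in for a message broker
processed: list[str] = []

def publish(order_id: str) -> None:
    broker.put(order_id)              # fire-and-forget: no waiting on the consumer

def consume() -> None:
    while True:
        msg = broker.get()
        if msg is None:               # sentinel value used to shut the worker down
            break
        processed.append(f"handled {msg}")

worker = threading.Thread(target=consume)
worker.start()
for oid in ("o1", "o2", "o3"):
    publish(oid)                      # producer keeps going without blocking
broker.put(None)
worker.join()
assert processed == ["handled o1", "handled o2", "handled o3"]
```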

Glueware and Support Infrastructure

Microservices require substantial supporting infrastructure (“glueware”) that often outweighs the core services:

  • Monitoring and logging systems
  • Service discovery mechanisms
  • Load balancing services
  • API gateways
  • Message brokers
  • Circuit breakers for resilience
  • Distributed tracing tools
  • Configuration management

According to the Cloud Native Computing Foundation’s 2022 survey, glueware now outweighs core microservices in most deployments.

Avoiding Microservice Sprawl

To prevent excessive complexity with microservices:

  1. Start with a monolithic design

    • Gradually break it down into microservices as needed
    • Identify natural boundaries and avoid over-decomposition
  2. Focus on business capabilities

    • Design around clear business purposes rather than technical functions
  3. Establish clear governance

    • Define guidelines and best practices for microservice development
    • Create standards for naming conventions, communication protocols, etc.
  4. Implement fault-tolerant design patterns

    • Timeouts, bounded retries, circuit breakers
    • Graceful degradation
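A minimal circuit breaker, one of the patterns listed above, might look like the following; the thresholds and behavior are a simplified sketch, not a production implementation:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after max_failures consecutive errors,
    fails fast while open, and allows a trial call after reset_after seconds."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None      # monotonic timestamp when the circuit opened

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0          # any success closes the circuit again
        return result
```

Failing fast while the circuit is open is what prevents the retry storms and cascading failures described earlier: callers stop wasting resources on a dependency that is known to be down.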