Redundancy is a key design principle in modern cloud architectures that improves fault tolerance, availability, and performance.

Why Use Redundancy?

  • Performance: Distribute workload across multiple replicas to improve response time
  • Error Detection: Compare results across replicas; a disagreement signals that at least one replica is faulty
  • Error Recovery: Switch to backup resources when primary fails
  • Fault Tolerance: System continues functioning despite component failures

Importance of Fault Models

The effectiveness of redundancy depends on how individual replicas fail:

  • For independent crash faults, the availability of a system with n replicas is:

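    Availability = 1 - p^n

    where p is the probability that an individual replica is unavailable, so p^n is the probability that all n replicas are down at once

  • Example: 5 servers, each with 90% uptime (p = 0.10) → overall availability = 1 - (0.10)^5 = 99.999%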

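This holds only if failures are truly independent. Common-mode failures, such as a shared power feed, a shared network switch, or the same software bug on every replica, can take down several replicas at once and must be considered separately.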
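
The availability formula for independent crash faults can be checked numerically. The following is a minimal sketch; the function name availability and its parameter names are chosen here for illustration and do not come from any library.

```python
def availability(p_fail: float, n: int) -> float:
    """Availability of n replicas that fail independently,
    each being unavailable with probability p_fail."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    # The system is down only when all n replicas are down at once.
    return 1.0 - p_fail ** n


# Replicas with 90% uptime each (p_fail = 0.10), as in the example above.
for n in range(1, 6):
    print(f"{n} replica(s): {availability(0.10, n):.5%}")
```

Each added replica multiplies the remaining downtime by p, which is why the gain is large under independence and much smaller when replicas share a common failure cause.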

Redundancy by Replication

Replication involves maintaining multiple copies of:

  • Data
  • Services
  • Infrastructure components

Data Replication

  • Synchronous Replication: Write operations complete only after all replicas are updated

    • Ensures consistency but increases latency
    • Used for critical data where consistency is paramount
  • Asynchronous Replication: Primary replica acknowledges writes before secondaries are updated

    • Better performance but may lose data if primary fails before replication
    • Used when performance is prioritized over consistency
  • Quorum-based Replication: Write operations complete when a majority of replicas acknowledge

    • Balances availability and consistency
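
The quorum rule can be sketched in a few lines. This is an illustrative in-memory model, not a real storage system: the Replica class, its up flag, and quorum_write are hypothetical names, and the "network" is just a method call.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical in-memory replica; `up` simulates reachability."""
    name: str
    up: bool = True
    store: dict = field(default_factory=dict)

    def write(self, key: str, value: str) -> bool:
        # A replica that is down cannot acknowledge the write.
        if not self.up:
            return False
        self.store[key] = value
        return True


def quorum_write(replicas: list[Replica], key: str, value: str) -> bool:
    """Return True if a strict majority of replicas acknowledged the write."""
    needed = len(replicas) // 2 + 1
    acks = sum(1 for r in replicas if r.write(key, value))
    return acks >= needed


replicas = [Replica("a"), Replica("b"), Replica("c", up=False)]
ok_two_of_three = quorum_write(replicas, "x", "1")  # 2 of 3 acknowledge
replicas[1].up = False
ok_one_of_three = quorum_write(replicas, "x", "2")  # only 1 of 3 acknowledges
print(ok_two_of_three, ok_one_of_three)
```

With three replicas, the write succeeds with one replica down and is rejected with two down. Note that in this sketch a rejected write still leaves the value on the replicas that did respond; a real system would need to roll back or repair such partial writes.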

Service Replication

  • Active-Passive Replication:

    • One active instance handles all requests
    • Passive instances ready to take over if active fails
    • Lower resource utilization but potential downtime during failover
  • Active-Active Replication:

    • Multiple active instances handle requests simultaneously
    • Remaining instances keep serving requests when one fails, so there is no failover delay
    • Requires more complex state management
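
The active-passive pattern can be illustrated with a small simulation. The sketch below assumes failures surface as a ConnectionError on the request path; Instance and ActivePassiveGroup are hypothetical names, and real deployments usually rely on health checks or heartbeats rather than waiting for a request to fail.

```python
class Instance:
    """Hypothetical service instance with a simulated health flag."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class ActivePassiveGroup:
    """Route every request to one active instance; promote a standby on failure."""

    def __init__(self, instances: list[Instance]) -> None:
        if not instances:
            raise ValueError("need at least one instance")
        self.active = instances[0]
        self.standbys = list(instances[1:])

    def handle(self, request: str) -> str:
        while True:
            try:
                return self.active.handle(request)
            except ConnectionError:
                if not self.standbys:
                    raise  # no standby left to promote
                # Failover: the next standby becomes the active instance.
                self.active = self.standbys.pop(0)


primary, backup = Instance("primary"), Instance("backup")
group = ActivePassiveGroup([primary, backup])
first = group.handle("req-1")
primary.healthy = False  # simulate a crash of the active instance
second = group.handle("req-2")
print(first)
print(second)
```

The retry loop inside handle is where the failover delay appears: the request that hits the failed instance pays the cost of detecting the failure before the standby takes over.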

Infrastructure Redundancy

Modern cloud data centers implement redundancy at multiple levels:

Hardware Redundancy

  • Geographic Redundancy:

    • Data centers distributed across multiple regions
    • Mitigates regional outages caused by natural disasters or power grid failures
    • Data typically replicated across regions
  • Server Redundancy:

    • Servers deployed in clusters with automatic failover
    • If one server fails, another takes over seamlessly
  • Storage Redundancy:

    • Data replicated across multiple devices and technologies
    • RAID configurations protect against disk failures
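
To see how a parity-based RAID level can survive a disk failure, consider single XOR parity in the style of RAID 5. The sketch below is a simplified illustration with one tiny block per "disk"; it does not model striping or the layout used by any particular controller, and xor_blocks is a name chosen here.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("blocks must have equal length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Three data "disks" and one parity "disk".
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing disk 1 and rebuilding it from the survivors plus parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])
```

Because XOR is its own inverse, any single missing block equals the XOR of all the others. Losing two blocks at once is not recoverable with single parity.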

Network Redundancy

  1. Server-level Redundancy:

    • Redundant Network Interface Cards (NICs)
    • Two or more power supplies
  2. Network-level Redundancy:

    • Redundant switches, routers, firewalls, load balancers
  3. Link and Path-level Redundancy:

    • Link aggregation (multiple links between devices)
    • Spanning Tree Protocol to prevent loops when redundant links are present
    • Load balancing across multiple paths

Network topologies designed for redundancy:

  • Hierarchical/3-tier topology
  • Fat-tree/Clos topology

Power Redundancy

  • Multiple power feeds from different utility substations
  • Uninterruptible Power Supplies (UPS) for temporary outages
  • Backup generators for medium/long-term outages
  • Power Distribution Units with dual inputs

Cooling Redundancy

  • N+1 configuration (one more cooling unit than required)
  • Multiple cooling technologies
  • Redundant cooling loops (pipes, heat exchangers, pumps)
  • Hot/cold aisle containment
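
The benefit of an N+1 configuration can be estimated under the same independent-failure assumption used for replicas earlier: cooling is adequate as long as at least N of the installed units are working. The sketch below is a rough illustration; the unit count and the per-unit failure probability are made-up values, and k_of_n_availability is a name chosen here.

```python
from math import comb


def k_of_n_availability(k: int, n: int, p_fail: float) -> float:
    """Probability that at least k of n independent units are working."""
    p_ok = 1.0 - p_fail
    # Sum the binomial probabilities of exactly i working units, i = k..n.
    return sum(comb(n, i) * p_ok**i * p_fail ** (n - i) for i in range(k, n + 1))


required = 4    # hypothetical: the cooling load needs 4 units
p_fail = 0.05   # hypothetical per-unit failure probability

no_spare = k_of_n_availability(required, required, p_fail)
n_plus_1 = k_of_n_availability(required, required + 1, p_fail)
print(f"N   units: {no_spare:.4f}")
print(f"N+1 units: {n_plus_1:.4f}")
```

Without a spare, every unit is a single point of failure, so availability drops as N grows. One spare unit lets the system tolerate any single failure, although units that share a power feed or cooling loop do not fail independently and will do worse than this estimate.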

Redundancy Challenges

  • Cost: Redundant systems require additional hardware and management
  • Complexity: More components mean more potential failure points
  • Consistency: Maintaining consistent state across replicas
  • Testing: Verifying redundancy actually works as expected