Modern Cloud Architectures

Redundancy is a key design principle in modern cloud architectures that improves fault tolerance, availability, and performance.

Why Use Redundancy?

Performance: Distribute workload across multiple replicas to improve response time
Error Detection: Compare results across replicas; a disagreement signals a fault
Error Recovery: Switch to backup resources when primary fails
Fault Tolerance: System continues functioning despite component failures

The effectiveness of redundancy depends on how individual replicas fail:

For independent crash faults, the availability of a system with n replicas is:
```
Availability = 1-p^n
```
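where p is the probability that an individual replica is unavailable. The system counts as available as long as at least one replica is up.
Example: 5 servers, each with 90% uptime (p = 0.10) → overall availability = 1-(0.10)^5 = 99.999%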

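This holds only if failures are truly independent. Common failure modes, such as a shared power feed, network link, or software bug, can take down several replicas at once and must be accounted for separately.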
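The formula can be checked numerically. The sketch below is illustrative only: the function names are made up for this example, and `p_common` is an assumed probability of a single shared fault that disables every replica at once, a deliberately simple way to model a common failure mode.

```python
def availability(p_fail: float, n: int) -> float:
    """Probability that at least one of n independent replicas is up."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - p_fail ** n


def availability_with_common_mode(p_fail: float, n: int, p_common: float) -> float:
    """Assumed model: the system is up only if the shared fault does not
    occur AND at least one replica is up."""
    return (1.0 - p_common) * availability(p_fail, n)


# Availability for 1 to 5 replicas, each with 90% uptime.
for n in range(1, 6):
    print(n, f"{availability(0.10, n):.5%}")

# A 0.1% common-mode fault caps availability below 99.9%,
# no matter how many replicas are added.
print(f"{availability_with_common_mode(0.10, 5, 0.001):.5%}")
```

Under this assumed model, availability can never exceed 1 - p_common, which is why adding replicas stops helping once shared dependencies dominate.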

Replication involves maintaining multiple copies of:

Synchronous Replication: Write operations complete only after all replicas are updated
- Ensures consistency but increases latency
- Used for critical data where consistency is paramount
Asynchronous Replication: Primary replica acknowledges writes before secondaries are updated
- Better performance but may lose data if primary fails before replication
- Used when performance is prioritized over consistency
Quorum-based Replication: Write operations complete when a majority of replicas acknowledge
- Balances availability and consistency
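The three strategies differ mainly in how many acknowledgements a write needs before it is reported as complete. The following is a minimal in-memory sketch of that difference; `Replica`, `required_acks`, and `write` are hypothetical names for illustration, and real systems also handle rollback, retries, and background propagation, none of which is modeled here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Replica:
    name: str
    healthy: bool = True
    data: Dict[str, str] = field(default_factory=dict)

    def apply(self, key: str, value: str) -> bool:
        """Store the value and acknowledge; an unhealthy replica never acks."""
        if not self.healthy:
            return False
        self.data[key] = value
        return True


def required_acks(mode: str, n: int) -> int:
    """Acknowledgements needed before the write is reported as complete."""
    if mode == "sync":
        return n            # every replica
    if mode == "quorum":
        return n // 2 + 1   # strict majority
    if mode == "async":
        return 1            # the primary alone
    raise ValueError(f"unknown mode: {mode}")


def write(replicas: List[Replica], key: str, value: str, mode: str) -> bool:
    """Return True if enough replicas acknowledged for the chosen mode."""
    if mode == "async":
        # Only the primary (first replica) is written before acknowledging;
        # secondaries would be updated later, which is not modeled here.
        acks = int(replicas[0].apply(key, value))
    else:
        acks = sum(r.apply(key, value) for r in replicas)
    return acks >= required_acks(mode, len(replicas))


replicas = [Replica("a"), Replica("b"), Replica("c", healthy=False)]
print(write(replicas, "k", "v1", "sync"))    # False: replica c did not ack
print(write(replicas, "k", "v1", "quorum"))  # True: 2 of 3 acked
print(write(replicas, "k", "v1", "async"))   # True: primary acked
```

With one of three replicas down, the synchronous write fails while the quorum and asynchronous writes succeed, which is the availability versus consistency trade-off described above.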

Active-Passive Replication:
- One active instance handles all requests
- Passive instances ready to take over if active fails
- Passive instances sit mostly idle, so resource utilization is lower, and failover may cause brief downtime
Active-Active Replication:
- Multiple active instances handle requests simultaneously
- Remaining instances keep serving requests when one fails, so no failover step is needed
- Requires more complex state management
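The active-passive failover behaviour can be sketched as follows. This is a simplified illustration with hypothetical class names (`Instance`, `ActivePassiveGroup`); it detects failure only when a request fails, whereas production systems typically rely on health checks or heartbeats.

```python
from typing import List


class Instance:
    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class ActivePassiveGroup:
    """Routes every request to one active instance; promotes a standby on failure."""

    def __init__(self, instances: List[Instance]) -> None:
        if not instances:
            raise ValueError("need at least one instance")
        self.instances = instances
        self.active = 0  # index of the current active instance

    def handle(self, request: str) -> str:
        # Try the active instance first, then each standby in order.
        for _ in range(len(self.instances)):
            try:
                return self.instances[self.active].handle(request)
            except ConnectionError:
                # Failover: promote the next instance and retry the request.
                self.active = (self.active + 1) % len(self.instances)
        raise RuntimeError("no healthy instance available")


primary, standby = Instance("primary"), Instance("standby")
group = ActivePassiveGroup([primary, standby])
print(group.handle("r1"))  # primary handled r1
primary.healthy = False
print(group.handle("r2"))  # standby handled r2
```

An active-active group would instead spread requests over all healthy instances at once, which removes the promotion step but requires the instances to share or reconcile state.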

Modern cloud data centers implement redundancy at multiple levels:

Geographic Redundancy:
- Data centers distributed across multiple regions
- Mitigates regional outages caused by natural disasters or power grid failures
- Data typically replicated across regions
Server Redundancy:
- Servers deployed in clusters with automatic failover
- If one server fails, another takes over seamlessly
Storage Redundancy:
- Data replicated across multiple devices and technologies
- RAID configurations protect against disk failures
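As one illustration of how RAID can survive a disk failure, parity-based levels such as RAID 5 store the XOR of the data blocks so that any single lost block can be recomputed from the rest. The sketch below shows only that XOR idea on in-memory byte strings; the helper name `xor_blocks` and the sample data are assumptions for this example, and striping and parity rotation are left out.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("blocks must have equal length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Three data blocks, one per (simulated) disk, plus a parity block.
data = [b"disk", b"fail", b"safe"]
parity = xor_blocks(data)

# Simulate losing the second disk and rebuild it from the survivors.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])  # True
```

Because XOR is its own inverse, the same function both computes the parity and reconstructs the missing block.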

Server-level Redundancy:
- Redundant Network Interface Cards (NICs)
- Two or more power supplies
Network-level Redundancy:
- Redundant switches, routers, firewalls, load balancers
Link and Path-level Redundancy:
- Link aggregation (multiple links between devices)
- Spanning Tree Protocol (STP) to prevent loops when redundant links exist
- Load balancing across multiple paths

Network topologies designed for redundancy: