Importance of High Availability
Business Impact
- Downtime can be extremely costly in today’s interconnected world
- High availability minimizes business disruptions, maintains customer satisfaction, and protects revenue
User Expectations
- Users expect 24/7 service availability
- Poor availability damages reputation and user trust
Critical Systems
- Essential for healthcare, finance, emergency services, and other critical infrastructure
- Directly impacts safety and well-being
Availability Levels (The “9’s”)
| Availability | Downtime per Year | Downtime per Month | Downtime per Week |
|---|---|---|---|
| 90% (one nine) | 36.5 days | 73 hours | 16.8 hours |
| 99% (two nines) | 3.65 days | 7.3 hours | 1.68 hours |
| 99.9% (three nines) | 8.76 hours | 43.8 min | 10.1 min |
| 99.99% (four nines) | 52.6 min | 4.38 min | 1.01 min |
| 99.999% (five nines) | 5.26 min | 26.3 s | 6.05 s |
| 99.9999% (six nines) | 31.5 s | 2.63 s | 0.60 s |
| 99.99999% (seven nines) | 3.15 s | 263 ms | 60 ms |
- Figures assume a 365-day year, a month of one-twelfth of a year (730 hours), and a 7-day week
- Each additional “9” represents an order-of-magnitude reduction in downtime
- Each additional “9” typically demands disproportionately more effort and resources than the previous one
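The downtime budgets for each availability level follow from simple arithmetic: allowed downtime = period length × (1 − availability). The sketch below shows that calculation. The function name `allowed_downtime` and the period definitions are illustrative assumptions; in particular, the month is taken as one-twelfth of a 365-day year, and other conventions shift the monthly figures slightly.

```python
# Period lengths in seconds. Assumption: a 365-day year and a month
# defined as one-twelfth of that year.
PERIOD_SECONDS = {
    "year": 365 * 24 * 3600,
    "month": 365 * 24 * 3600 / 12,
    "week": 7 * 24 * 3600,
}


def allowed_downtime(availability_percent: float) -> dict[str, float]:
    """Return the permitted downtime in seconds for each period."""
    unavailable_fraction = 1 - availability_percent / 100
    return {
        period: seconds * unavailable_fraction
        for period, seconds in PERIOD_SECONDS.items()
    }


if __name__ == "__main__":
    for level in (90, 99, 99.9, 99.99, 99.999):
        budget = allowed_downtime(level)
        print(f"{level}%: {budget['year'] / 3600:.2f} h/year, "
              f"{budget['week'] / 60:.2f} min/week")
```

Moving from one level to the next divides every budget by ten, which is why each extra nine leaves so little room for manual recovery.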
Means to Achieve Dependability
Fault Prevention
- Approach: Prevent occurrence of faults proactively
- Techniques:
  - Suitable design patterns
  - Rigorous requirements analysis
  - Formal verification methods
  - Code reviews and static analysis
Fault Tolerance
- Approach: Design systems to continue operation despite faults
- Techniques:
  - Redundancy in components and systems
  - Error detection mechanisms
  - Recovery mechanisms
Fault Removal
- Approach: Identify and reduce existing faults
- Techniques:
  - Early prototyping
  - Thorough testing
  - Static code analysis
  - Debugging
Fault Forecasting
- Approach: Predict future fault occurrence and consequences
- Techniques:
  - Performance monitoring
  - Incident report analysis
  - Vulnerability auditing
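One common way to turn incident reports into a forecast is to estimate mean time between failures (MTBF) and mean time to repair (MTTR), then derive steady-state availability as MTBF / (MTBF + MTTR). The sketch below assumes a hypothetical `Incident` record with outage start and end times in hours; real incident data would need cleaning and a longer observation window before the estimate means much.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """One outage from a (hypothetical) incident report."""
    start_hour: float  # hours since the start of the observation window
    end_hour: float    # hours since the start of the observation window


def availability_from_incidents(incidents: list[Incident],
                                window_hours: float) -> dict[str, float]:
    """Estimate MTBF, MTTR, and steady-state availability from outages."""
    if not incidents:
        raise ValueError("need at least one incident to estimate MTBF/MTTR")
    downtime = sum(i.end_hour - i.start_hour for i in incidents)
    uptime = window_hours - downtime
    mtbf = uptime / len(incidents)    # mean operating time per failure
    mttr = downtime / len(incidents)  # mean outage duration
    return {
        "mtbf": mtbf,
        "mttr": mttr,
        "availability": mtbf / (mtbf + mttr),
    }
```

For example, two outages of 2 hours and 1 hour in a 1000-hour window give an MTBF of 498.5 hours, an MTTR of 1.5 hours, and an estimated availability of 99.7%.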
Foundations of High Availability
Fault Tolerance
Key strategies for fault tolerance:
- Error detection
- Failover mechanisms (error recovery)
- Load balancing
- Redundancy/replication
- Auto-scaling
- Graceful degradation
- Fault isolation
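The effect of redundancy on availability can be estimated with two standard formulas: components that must all work multiply their availabilities, while a group of replicas is up as long as at least one replica is up. The sketch below assumes failures are independent, which real systems often violate (shared power, shared software bugs), so treat the results as an upper bound. Function names are illustrative.

```python
def serial_availability(*component_availabilities: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in component_availabilities:
        result *= availability
    return result


def parallel_availability(replica_availability: float, replicas: int) -> float:
    """Availability of a replica group that needs only one replica up.

    Assumption: replicas fail independently of each other.
    """
    return 1 - (1 - replica_availability) ** replicas


if __name__ == "__main__":
    # Two 99% components in series are worse than either one alone.
    print(f"{serial_availability(0.99, 0.99):.4f}")
    # Two independent 99% replicas in parallel reach four nines.
    print(f"{parallel_availability(0.99, 2):.4f}")
```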
Error Detection in Data Centers
- Monitoring: Collecting metrics such as CPU, memory, and disk I/O usage
  - Heartbeats for basic health indication
  - Threshold monitoring for overload detection
- Telemetry: Analyzing metrics across servers
  - Identifying patterns and anomalies
  - Detecting potential security threats
- Observability: Understanding internal state through outputs
  - Log analysis
  - Tracing communications through the system
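Heartbeats and threshold checks are the simplest of these detection mechanisms. A minimal sketch follows, assuming nodes report heartbeats to a central monitor and metrics arrive as plain dictionaries. The class and function names, the timeout, and the limits are all hypothetical; the clock is injectable so the logic can be tested without waiting.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and flags nodes that went silent."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_seen: dict[str, float] = {}

    def beat(self, node: str) -> None:
        """Record a heartbeat from the given node."""
        self._last_seen[node] = self._clock()

    def silent_nodes(self) -> list[str]:
        """Return nodes whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(
            node for node, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )


def over_threshold(metrics: dict[str, float],
                   limits: dict[str, float]) -> list[str]:
    """Return the names of metrics that exceed their configured limit."""
    return sorted(
        name for name, value in metrics.items()
        if name in limits and value > limits[name]
    )
```

A missing heartbeat only says that a node is unreachable from the monitor, not why, which is where telemetry and tracing take over.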
Circuit Breaker Pattern
- Inspired by electrical circuit breakers
- States: Closed (normal), Open (after failures), Half-open (testing recovery)
- Prevents overload of failing services
- Fails fast rather than degrading under stress
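The three states can be sketched as a small wrapper around any callable. This is a minimal illustration, not a production implementation: the class names, the failure threshold, and the reset timeout are assumptions, and there is no thread safety.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3,
                 reset_timeout_s: float = 30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout_s:
                # Fail fast instead of adding load to a failing service.
                raise CircuitOpenError("circuit is open")
            self.state = "half-open"  # timeout elapsed: allow a trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self) -> None:
        self._failures = 0
        self.state = "closed"

    def _on_failure(self) -> None:
        self._failures += 1
        # A failed trial call in half-open reopens the circuit immediately.
        if self.state == "half-open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()
```

Callers wrap each remote call as `breaker.call(fetch, url)` (with `fetch` standing in for any function) and treat `CircuitOpenError` as a signal to use a fallback.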
Hardware Error Detection
- ECC Memory: Corrects single-bit errors; common SECDED codes also detect (but cannot correct) double-bit errors
- Redundant components: Multiple power supplies, network interfaces
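How a code can locate and fix a single flipped bit is easiest to see on a small example. The sketch below uses the Hamming(7,4) code purely as an illustration of the principle; it is an assumption made for brevity, since real ECC memory protects whole memory words with wider codes. Function names are illustrative.

```python
def hamming74_encode(data_bits: list[int]) -> list[int]:
    """Encode 4 data bits into 7 bits (positions 1..7, parity at 1, 2, 4)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword: list[int]) -> tuple[list[int], int]:
    """Return (data_bits, error_position); position 0 means no error seen."""
    c = list(codeword)
    # Each syndrome bit re-checks one parity group, parity bit included.
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # Read as a binary number, the syndrome is the 1-based error position.
    error_position = s1 + 2 * s2 + 4 * s4
    if error_position:
        c[error_position - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], error_position


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1  # simulate a single-bit fault at position 5
    print(hamming74_decode(word))
```

With two flipped bits this code miscorrects, which is why SECDED schemes add an extra overall parity bit to at least detect that case.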
Real-world Examples
- Uber’s M3: Platform for storing and querying time-series metrics
- Netflix’s Mantis: Stream processing of real-time data for monitoring
Failover Strategies
Active-Passive Failover
- Active: Primary system handling all workload
- Passive: Idle standby system synchronized with active
- Failover: When active fails, passive becomes active
- Variations:
  - Cold Standby: Powered off or unconfigured; must be booted and configured before taking over
  - Warm Standby: Running, but synchronized only periodically
  - Hot Standby: Running, fully synchronized, and ready to take over
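The promotion step in active-passive failover can be sketched as a controller that watches the active node's heartbeat and swaps roles when a deadline is missed. Everything here is a simplified assumption: the `Node` and `FailoverController` names are hypothetical, state is an in-memory dictionary, and real systems also need fencing to stop a failed-but-alive primary from continuing to write.

```python
import time


class Node:
    def __init__(self, name: str):
        self.name = name
        self.role = "passive"
        self.state: dict = {}


class FailoverController:
    """Promotes the standby when the active node misses its heartbeat."""

    def __init__(self, active: Node, standby: Node,
                 timeout_s: float, clock=time.monotonic):
        active.role, standby.role = "active", "passive"
        self.active, self.standby = active, standby
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_heartbeat = clock()

    def heartbeat(self) -> None:
        """Called whenever the active node reports that it is alive."""
        self._last_heartbeat = self._clock()

    def replicate(self) -> None:
        # A hot standby would run this on every write; a warm standby
        # would run it periodically and may lose the latest changes.
        self.standby.state = dict(self.active.state)

    def check(self) -> Node:
        """Fail over if the heartbeat is overdue; return the active node."""
        if self._clock() - self._last_heartbeat > self.timeout_s:
            self.active.role = "failed"
            self.standby.role = "active"
            self.active, self.standby = self.standby, self.active
            self._last_heartbeat = self._clock()
        return self.active
```

The heartbeat timeout is a lower bound on recovery time: a cold standby adds boot and configuration time on top of it, a hot standby adds almost nothing.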
Active-Active Failover
- Multiple systems simultaneously handling workload
- Load balancer distributes traffic
- When one system fails, others take over
- Provides near-immediate recovery, provided the remaining systems have spare capacity for the failed system's load
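In an active-active setup the load balancer does the failover work by skipping backends that are marked unhealthy. A minimal round-robin sketch follows; the `RoundRobinBalancer` class and the backend names are hypothetical, and health marking is assumed to come from a separate health check.

```python
class RoundRobinBalancer:
    """Round-robin selection over backends, skipping unhealthy ones."""

    def __init__(self, backends: list[str]):
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._healthy = set(backends)
        self._next = 0

    def mark_down(self, backend: str) -> None:
        self._healthy.discard(backend)

    def mark_up(self, backend: str) -> None:
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self) -> str:
        """Return the next healthy backend in rotation."""
        for _ in range(len(self._backends)):
            candidate = self._backends[self._next]
            self._next = (self._next + 1) % len(self._backends)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["a", "b", "c"])
    lb.mark_down("b")
    print([lb.pick() for _ in range(4)])
```

After `b` is marked down, its share of traffic is spread over `a` and `c`, so each surviving backend must be sized to absorb that extra load.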
Decision Factors for Failover Strategy
- State management and consistency requirements
- Recovery Time Objective (RTO)
- Cost constraints
- Operational complexity