High Availability

Importance of High Availability

Business Impact

Downtime can be extremely costly in today’s interconnected world
High availability minimizes business disruptions, maintains customer satisfaction, and protects revenue

User Expectations

Users expect 24/7 service availability
Poor availability damages reputation and user trust

Critical Systems

Essential for healthcare, finance, emergency services, and other critical infrastructure
Outages in these systems directly affect safety and well-being

Availability Levels (The “Nines”)

Availability	Downtime per Year	Downtime per Month	Downtime per Week
90% (one nine)	36.5 days	72 hours	16.8 hours
99% (two nines)	3.65 days	7.2 hours	1.68 hours
99.9% (three nines)	8.76 hours	43.2 min	10.1 min
99.99% (four nines)	52.6 min	4.32 min	1.01 min
99.999% (five nines)	5.26 min	25.9 s	6.05 s
99.9999% (six nines)	31.54 s	2.59 s	0.60 s
99.99999% (seven nines)	3.15 s	259 ms	60 ms

Figures assume a 365-day year, a 30-day month, and a 7-day week

Each additional “9” represents an order-of-magnitude reduction in downtime
Each additional “9” typically demands disproportionately more effort and resources (more redundancy, more automation, stricter operations)
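The downtime budget for a given availability level is simple arithmetic: the unavailable fraction multiplied by the length of the period. A minimal sketch follows. The function name `allowed_downtime` and the period lengths (365-day year, 30-day month, 7-day week) are assumptions made for illustration, and results shift slightly if other period lengths are used.

```python
from datetime import timedelta

# Assumed period lengths; other conventions (e.g. a 365.25-day year) give
# slightly different figures.
PERIODS = {
    "year": timedelta(days=365),
    "month": timedelta(days=30),
    "week": timedelta(days=7),
}


def allowed_downtime(availability_percent: float, period: str) -> timedelta:
    """Return the maximum downtime permitted in one period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return PERIODS[period] * unavailable_fraction


if __name__ == "__main__":
    for level in (99.0, 99.9, 99.99, 99.999):
        budget = allowed_downtime(level, "year")
        print(f"{level}% -> {budget.total_seconds() / 60:.1f} minutes per year")
```

For example, 99.9% over a 365-day year leaves a budget of roughly 8.76 hours.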

Means to Achieve Dependability

Fault Prevention

Approach: Prevent occurrence of faults proactively
Techniques:
- Suitable design patterns
- Rigorous requirements analysis
- Formal verification methods
- Code reviews and static analysis

Fault Tolerance

Approach: Design systems to continue operation despite faults
Techniques:
- Redundancy in components and systems
- Error detection mechanisms
- Recovery mechanisms

Fault Removal

Approach: Identify and reduce existing faults
Techniques:
- Early prototyping
- Thorough testing
- Static code analysis
- Debugging

Fault Forecasting

Approach: Predict future fault occurrence and consequences
Techniques:
- Performance monitoring
- Incident report analysis
- Vulnerability auditing

Foundations of High Availability

Fault Tolerance

Key strategies for fault tolerance:

Error detection
Failover mechanisms (error recovery)
Load balancing
Redundancy/replication
Auto-scaling
Graceful degradation
Fault isolation

Error Detection in Data Centers

Monitoring: Collecting metrics like CPU, memory, disk I/O
- Heartbeats for basic health indication
- Threshold monitoring for overload detection
Telemetry: Analyzing metrics across servers
- Identifying patterns and anomalies
- Detecting potential security threats
Observability: Understanding internal state through outputs
- Log analysis
- Tracing communications through the system
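
Heartbeats and threshold monitoring can be sketched in a few lines. The class `HeartbeatMonitor`, the function `overloaded`, and the metric names are hypothetical and chosen for illustration; a production monitoring stack would add persistence, alerting, and network transport.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per server and flags servers that go silent."""

    def __init__(self, timeout_seconds: float, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock  # injectable so the monitor can be tested
        self._last_seen = {}

    def record_heartbeat(self, server: str) -> None:
        """Remember when this server last reported in."""
        self._last_seen[server] = self._clock()

    def silent_servers(self) -> list[str]:
        """Return servers whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(
            server
            for server, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )


def overloaded(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the names of metrics that exceed their configured threshold."""
    return sorted(
        name
        for name, value in metrics.items()
        if name in thresholds and value > thresholds[name]
    )
```

A heartbeat only shows that a server is alive; the threshold check is what catches a server that is alive but overloaded.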

Circuit Breaker Pattern

Inspired by electrical circuit breakers
States: Closed (normal), Open (after failures), Half-open (testing recovery)
Prevents overload of failing services
Fails fast rather than degrading under stress
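
The three states can be illustrated with a small sketch. The class names, the failure threshold, and the reset timeout are assumptions chosen for illustration; real libraries differ in details such as counting failures over a time window.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # timeout elapsed: allow a trial call
        return "open"

    def call(self, func, *args, **kwargs):
        state = self.state
        if state == "open":
            # Fail fast instead of sending more load to a failing service.
            raise CircuitOpenError("circuit is open; call rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures while
            # closed, opens the circuit and restarts the timeout.
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers wrap each remote call as `breaker.call(fetch, arg)`, where `fetch` stands for any function that may raise, and treat `CircuitOpenError` as a signal to use a fallback.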

Hardware Error Detection

ECC Memory: Corrects single-bit errors and, in common implementations, detects double-bit errors
Redundant components: Multiple power supplies, network interfaces
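
The benefit of redundant components can be estimated with a short calculation. The sketch below assumes that copies fail independently and that one working copy is enough, which is an idealization: shared causes such as a common power feed make real gains smaller. The function name `parallel_availability` is hypothetical.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components when one is sufficient.

    Assumes independent failures: the group is down only when every copy
    is down at the same time.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("at least one component is required")
    return 1.0 - (1.0 - component_availability) ** copies
```

Under these assumptions, two power supplies that are each 99% available give a combined availability of about 99.99%.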

Real-world Examples

Uber’s M3: Platform for storing and querying time-series metrics
Netflix’s Mantis: Stream processing of real-time data for monitoring

Failover Strategies

Active-Passive Failover

Active: Primary system handling all workload
Passive: Standby system that handles no workload until failover; how closely it tracks the active system depends on the variation
Failover: When active fails, passive becomes active
Variations:
- Cold Standby: Not running; must be booted and configured before it can take over
- Warm Standby: Running, but synchronized only periodically, so recent changes may be missing at failover
- Hot Standby: Fully synchronized and ready to take over
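
A hot-standby pair can be sketched as follows. The classes `Node` and `ActivePassivePair` are hypothetical, health is a plain flag rather than a real health check, and replication is a synchronous in-memory copy; all of these are simplifying assumptions.

```python
class Node:
    def __init__(self, name: str):
        self.name = name
        self.healthy = True
        self.state = {}


class ActivePassivePair:
    """Hot-standby sketch: every write is copied to the standby immediately."""

    def __init__(self, active: Node, passive: Node):
        self.active = active
        self.passive = passive

    def write(self, key, value) -> None:
        self._failover_if_needed()
        self.active.state[key] = value
        if self.passive.healthy:
            self.passive.state[key] = value  # keep the standby in sync

    def read(self, key):
        self._failover_if_needed()
        return self.active.state[key]

    def _failover_if_needed(self) -> None:
        if self.active.healthy:
            return
        if not self.passive.healthy:
            raise RuntimeError("no healthy node available")
        # Promote the standby. The failed node becomes the standby and would
        # need a full resynchronization before it is trusted again (not shown).
        self.active, self.passive = self.passive, self.active
```

A warm standby would replace the per-write copy with a periodic one, and a cold standby would start the passive node only during failover.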

Active-Active Failover

Multiple systems simultaneously handling workload
Load balancer distributes traffic
When one system fails, others take over
Recovery is near-immediate because the remaining systems are already serving traffic, provided they have spare capacity for the extra load
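
The load balancer's role in an active-active setup can be sketched with round-robin selection that skips unhealthy systems. The class `RoundRobinBalancer` and the backend names are hypothetical; health status is set by hand here, whereas a real balancer would derive it from health checks.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Distributes requests over all healthy backends in turn."""

    def __init__(self, backends: list[str]):
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._rotation = cycle(self._backends)

    def mark_down(self, backend: str) -> None:
        self._healthy.discard(backend)

    def mark_up(self, backend: str) -> None:
        if backend in self._backends:
            self._healthy.add(backend)

    def next_backend(self) -> str:
        """Return the next healthy backend, skipping any that are down."""
        if not self._healthy:
            raise RuntimeError("no healthy backends")
        # One full pass over the rotation is enough to find a healthy backend.
        for _ in range(len(self._backends)):
            candidate = next(self._rotation)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backends")
```

When one backend is marked down, its share of requests moves to the others without any promotion step, which is why no separate failover phase is needed.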

Decision Factors for Failover Strategy

State management and consistency requirements
Recovery Time Objective (RTO)
Cost constraints
Operational complexity