Importance of High Availability

Business Impact

  • Downtime can be extremely costly in today’s interconnected world
  • High availability minimizes business disruptions, maintains customer satisfaction, and protects revenue

User Expectations

  • Users expect 24/7 service availability
  • Poor availability damages reputation and user trust

Critical Systems

  • Essential for healthcare, finance, emergency services, and other critical infrastructure
  • Directly impacts safety and well-being

Availability Levels (The “9’s”)

Availability               Downtime per Year   Downtime per Month   Downtime per Week
90% (one nine)             36.5 days           72 hours             16.8 hours
99% (two nines)            3.65 days           7.2 hours            1.68 hours
99.9% (three nines)        8.76 hours          43.8 min             10.1 min
99.99% (four nines)        52.6 min            4.38 min             1.01 min
99.999% (five nines)       5.26 min            25.9 s               6.06 s
99.9999% (six nines)       31.56 s             2.59 s               0.61 s
99.99999% (seven nines)    3.16 s              259 ms               61 ms

  • Each additional “9” represents an order-of-magnitude reduction in downtime
  • Each additional “9” typically requires disproportionately more engineering effort and resources than the one before it
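
The downtime figures above follow from simple arithmetic: allowed downtime = (1 - availability) × period length. The sketch below reproduces that calculation. The function name `allowed_downtime` and the period lengths (365-day year, 30-day month, 7-day week) are assumptions made for illustration; figures computed with an average month of 365/12 days will differ slightly.

```python
from datetime import timedelta

# Period lengths are assumptions: a 365-day year, a 30-day month, a 7-day week.
PERIODS = {
    "year": timedelta(days=365),
    "month": timedelta(days=30),
    "week": timedelta(days=7),
}


def allowed_downtime(availability_percent: float, period: str) -> timedelta:
    """Return the maximum downtime within `period` at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = 1 - availability_percent / 100
    return PERIODS[period] * unavailable_fraction


# Each extra nine divides the allowed downtime by ten.
for percent in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime(percent, "year").total_seconds() / 60
    print(f"{percent}% -> {minutes:.1f} minutes of downtime per year")
```

Working backwards is often more useful in practice: starting from how much downtime the business can tolerate per month gives the availability target the system has to meet.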

Means to Achieve Dependability

Fault Prevention

  • Approach: Prevent occurrence of faults proactively
  • Techniques:
    • Suitable design patterns
    • Rigorous requirements analysis
    • Formal verification methods
    • Code reviews and static analysis

Fault Tolerance

  • Approach: Design systems to continue operation despite faults
  • Techniques:
    • Redundancy in components and systems
    • Error detection mechanisms
    • Recovery mechanisms

Fault Removal

  • Approach: Identify and remove faults that already exist in the system
  • Techniques:
    • Early prototyping
    • Thorough testing
    • Static code analysis
    • Debugging

Fault Forecasting

  • Approach: Predict future fault occurrence and consequences
  • Techniques:
    • Performance monitoring
    • Incident report analysis
    • Vulnerability auditing

Foundations of High Availability

Fault Tolerance

Key strategies for fault tolerance:

  • Error detection
  • Failover mechanisms (error recovery)
  • Load balancing
  • Redundancy/replication
  • Auto-scaling
  • Graceful degradation
  • Fault isolation

Error Detection in Data Centers

  • Monitoring: Collecting metrics like CPU, memory, disk I/O
    • Heartbeats for basic health indication
    • Threshold monitoring for overload detection
  • Telemetry: Analyzing metrics across servers
    • Identifying patterns and anomalies
    • Detecting potential security threats
  • Observability: Understanding internal state through outputs
    • Log analysis
    • Tracing communications through the system
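
Heartbeats and threshold monitoring, the two basic monitoring techniques listed above, can be sketched in a few lines. This is a minimal illustration, not a production monitoring system: the class `HeartbeatMonitor`, the function `over_threshold`, the node names, and the limits are all hypothetical. The clock is injectable so the example runs deterministically.

```python
import time


class HeartbeatMonitor:
    """Flags nodes whose most recent heartbeat is older than a timeout."""

    def __init__(self, timeout_seconds: float, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        self._last_seen = {}  # node name -> time of last heartbeat

    def record_heartbeat(self, node: str) -> None:
        self._last_seen[node] = self._clock()

    def failed_nodes(self) -> list:
        now = self._clock()
        return sorted(
            node
            for node, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )


def over_threshold(metrics: dict, limits: dict) -> list:
    """Return the names of metrics that exceed their configured limit."""
    return sorted(
        name
        for name, value in metrics.items()
        if name in limits and value > limits[name]
    )


# Usage with a fake clock (a one-element list read by a lambda).
fake_time = [0.0]
monitor = HeartbeatMonitor(timeout_seconds=5.0, clock=lambda: fake_time[0])
monitor.record_heartbeat("node-a")
monitor.record_heartbeat("node-b")
fake_time[0] = 4.0
monitor.record_heartbeat("node-b")  # node-b reports again, node-a stays silent
fake_time[0] = 7.0
print(monitor.failed_nodes())  # ['node-a']
print(over_threshold({"cpu": 0.95, "memory": 0.40}, {"cpu": 0.90, "memory": 0.85}))  # ['cpu']
```

A missing heartbeat only shows that a node is unreachable from the monitor. It cannot distinguish a crashed node from a network partition, which is one reason heartbeats are combined with the richer signals listed above.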

Circuit Breaker Pattern

  • Inspired by electrical circuit breakers
  • States: Closed (normal operation, calls pass through), Open (calls are rejected after repeated failures), Half-open (a trial call tests whether the service has recovered)
  • Prevents overload of failing services
  • Fails fast rather than degrading under stress
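
The three states can be sketched as a small state machine. This is one possible minimal implementation under stated assumptions, not a reference design: the names `CircuitBreaker` and `CircuitOpenError`, the consecutive-failure counter, and the default threshold and timeout are all illustrative choices.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                # Fail fast: do not touch the failing service at all.
                raise CircuitOpenError("circuit is open")
            self.state = "half-open"  # timeout elapsed, allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self.state = "closed"
        return result

    def _record_failure(self):
        self._failures += 1
        # A failed trial call reopens immediately; otherwise wait for the threshold.
        if self.state == "half-open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()
```

A caller wraps each remote call, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is a hypothetical client function, and treats `CircuitOpenError` as a signal to use a fallback or cached response.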

Hardware Error Detection

  • ECC Memory: Stores extra check bits with each memory word; typically corrects single-bit errors and detects (but cannot correct) double-bit errors
  • Redundant components: Multiple power supplies, network interfaces
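
To illustrate how check bits allow a single-bit error to be located and corrected, the sketch below uses the textbook Hamming(7,4) code. This is a toy stand-in chosen for brevity and is an assumption of this example: real ECC memory typically protects much wider words and adds a further parity bit for double-error detection. The function names are hypothetical.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); error_position is 0 if no error was seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome, read as a binary number, is the 1-based position of the flipped bit.
    error_position = s1 + 2 * s2 + 4 * s3
    if error_position:
        c[error_position - 1] ^= 1  # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], error_position


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1  # simulate a bit flip at position 5
recovered, position = hamming74_decode(stored)
print(recovered, position)  # [1, 0, 1, 1] 5
```

With only these three parity bits, two simultaneous bit flips are silently miscorrected, which is why practical schemes add an overall parity bit to detect that case.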

Real-world Examples

  • Uber’s M3: Platform for storing and querying time-series metrics
  • Netflix’s Mantis: Stream processing of real-time data for monitoring

Failover Strategies

Active-Passive Failover

  • Active: Primary system handling all workload
  • Passive: Idle standby system synchronized with active
  • Failover: When active fails, passive becomes active
  • Variations:
    • Cold Standby: Not running; must be booted and configured before it can take over
    • Warm Standby: Running, but synchronized only periodically, so it may lag behind the active system
    • Hot Standby: Fully synchronized and ready to take over
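
The promotion step in active-passive failover can be sketched as follows. This is a simplified single-process model under stated assumptions: the class `ActivePassivePair`, its method names, and the heartbeat timeout are hypothetical, and real deployments also need state synchronization and protection against both nodes believing they are active (split brain), which is not shown here.

```python
import time


class ActivePassivePair:
    """Promotes the standby when the active node stops sending heartbeats."""

    def __init__(self, active, passive, heartbeat_timeout=3.0, clock=time.monotonic):
        self.active = active
        self.passive = passive
        self.heartbeat_timeout = heartbeat_timeout
        self._clock = clock
        self._last_heartbeat = clock()

    def heartbeat_from_active(self) -> None:
        self._last_heartbeat = self._clock()

    def check(self):
        """Run periodically; returns the node that should currently serve traffic."""
        silent_for = self._clock() - self._last_heartbeat
        if silent_for > self.heartbeat_timeout and self.passive is not None:
            # Failover: the standby becomes active. The failed node must be
            # repaired and re-added before it can act as a standby again.
            self.active, self.passive = self.passive, None
            self._last_heartbeat = self._clock()
        return self.active


# Usage with a fake clock so the example is deterministic.
fake_time = [0.0]
pair = ActivePassivePair("db-primary", "db-standby", clock=lambda: fake_time[0])
fake_time[0] = 2.0
print(pair.check())  # db-primary (still within the timeout)
fake_time[0] = 6.0
print(pair.check())  # db-standby (primary was silent for more than 3 s)
```

How long the promotion takes in practice depends on the standby variation: a hot standby can serve immediately, while a cold standby adds boot and configuration time to the outage.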

Active-Active Failover

  • Multiple systems simultaneously handling workload
  • Load balancer distributes traffic
  • When one system fails, others take over
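  • Recovery is near-immediate, provided the remaining systems have enough spare capacity; requests in flight on the failed system may still be lost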
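
A load balancer that skips failed systems can be sketched with a round-robin rotation over the healthy nodes. The class `RoundRobinBalancer` and the node names are hypothetical, and health information is assumed to come from a separate health check that calls `mark_down` and `mark_up`.

```python
import itertools


class RoundRobinBalancer:
    """Distributes requests across nodes in turn, skipping unhealthy ones."""

    def __init__(self, nodes):
        self._nodes = list(nodes)
        self._healthy = set(self._nodes)
        self._cycle = itertools.cycle(self._nodes)

    def mark_down(self, node) -> None:
        self._healthy.discard(node)

    def mark_up(self, node) -> None:
        if node in self._nodes:
            self._healthy.add(node)

    def next_node(self):
        if not self._healthy:
            raise RuntimeError("no healthy nodes available")
        # At least one node is healthy, so this loop always terminates.
        for node in self._cycle:
            if node in self._healthy:
                return node


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
print([balancer.next_node() for _ in range(3)])  # ['app-1', 'app-2', 'app-3']
balancer.mark_down("app-2")
print([balancer.next_node() for _ in range(4)])  # ['app-1', 'app-3', 'app-1', 'app-3']
```

After `app-2` is marked down, the two remaining nodes each carry half of the traffic instead of a third, so active-active designs need enough headroom to absorb the load of a failed peer.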

Decision Factors for Failover Strategy

  • State management and consistency requirements
  • Recovery Time Objective (RTO)
  • Cost constraints
  • Operational complexity