• Distributed System Fundamentals

    What Is a Distributed System?

    A distributed system can be defined in several ways:

    • Tanenbaum and van Steen: “A collection of independent computers that appears to its users as a single coherent system”

    • Coulouris, Dollimore and Kindberg: “One in which hardware or software components located at networked computers communicate and coordinate their actions only by passing messages”

    • Lamport: “One that stops you getting work done when a machine you’ve never even heard of crashes”

    Motivations for Distributed Systems

    1. Geographic Distribution: Resources and users are naturally distributed
      • Example: Banking services accessible from different locations while data is centrally stored
    2. Fault Tolerance: Problems rarely affect multiple locations simultaneously
      • Multiple database servers in different rooms provide better reliability
    3. Performance and Scalability: Combining resources for enhanced capabilities
      • High Performance Computing, replicated web servers, etc.

    Examples of Distributed Systems

    • Financial trading platforms
    • Web search engines (processing 50+ billion web pages)
    • Social media platforms supporting billions of users
    • Large Language Models (trained across clusters)
    • Scientific research (e.g., CERN with over 1 Exabyte of data)
    • Content Delivery Networks (CDNs)
    • Online multiplayer games

    Fallacies of Distributed Computing

    Eight classic assumptions that often lead to problematic distributed systems designs (identified at Sun Microsystems):

    1. The network is reliable
    2. Latency is zero
    3. Bandwidth is infinite
    4. The network is secure
    5. There is one administrator
    6. Transport cost is zero
    7. The network is homogeneous
    8. Topology doesn’t change

    Key Aspects of Distributed System Design

    • System Function: The intended purpose (features and capabilities)
    • System Behavior: How the system performs its functions
    • Quality Attributes: Core qualities determining success:
      • Performance
      • Cost
      • Security
      • Dependability

    Challenges in Distributed Systems

    Distributed systems introduce complexity in:

    • Coordination
    • Consistency
    • Fault detection and recovery
    • Security
    • Performance optimization
    Link to original
  • Cloud Systems Quality Attributes

    Quality attributes are non-functional requirements that determine the success of a cloud system beyond its basic functionality.

    Core Quality Attributes

    1. Performance

    • Workload handling: Capacity to process the required volume of operations
    • Efficiency: Resource usage in relation to output
    • Responsiveness: Speed of response to user requests or events
    • Throughput: Total amount of work accomplished in a given time period
    • Latency: Time delay between action and response

    2. Cost

    • Build/deployment costs: Initial setup expenses
    • Operational costs: Ongoing expenses to run the system
    • Maintenance costs: Expenses for updates, fixes, and improvements
    • Resource optimization: Efficient use of hardware, software, and human resources
    • Scaling costs: Expenses related to growth or contraction

    3. Security

    • Access control: Prevention of unauthorized access
    • Data protection: Safeguarding sensitive information
    • Integrity: Ensuring data remains uncorrupted
    • Confidentiality: Keeping private information private
    • Compliance: Meeting regulatory requirements

    4. Dependability

    • Availability: Readiness for correct service
    • Reliability: Continuity of correct service
    • Safety: Freedom from catastrophic consequences
    • Integrity: Absence of improper system alterations
    • Maintainability: Ability to undergo repairs and modifications

    Service and Failure Concepts

    Correct Service vs. Failure

    • Correct service: System implements its function as specified
    • Failure: Deviation from the functional specification
      • Not binary but exists on a spectrum from optimal to complete failure

    Quality of Service (QoS)

    • A measure of how well a system performs
    • The ability to provide guaranteed performance levels
    • Multiple dimensions: latency, bandwidth, security, availability, etc.
    • Highly contextual and defined for specific applications
    • Goal: Highest QoS despite faults at the lowest cost

    Potential Failure Sources in Datacenters

    Hardware Failures

    • Node/server failures (crashes, timing issues, data corruption)
    • Power failures (crashes, possible data corruption)
    • Physical accidents (fire, flood, earthquakes)

    Network Failures

    • Router/gateway failures affecting entire subnets
    • Name server failures impacting name domains
    • Network congestion leading to dropped packets

    Software and Human Factors

    • Software complexity leading to bugs
    • Misconfiguration and human error
    • Security attacks (both external and internal)

    Real-world Datacenter Failures

    • 2008: Amazon S3 major outages affecting US & EU
    • 2011: Amazon EBS and RDS outage lasting 4 days
    • 2015: Apple service disruptions (iTunes, iCloud, Photos)
    • 2016: Google Cloud Platform significant outage
    • 2021: OVHcloud fire destroying datacenters in Strasbourg

    Datacenter Failure Statistics

    • 40% of servers experience crashes/unexpected restarts (Google)
    • 57% of failures lead to VM migrations (Google)
    • Hard drives cause 82% of hardware failures
    • Power & Cooling are the most common cause of outages (71%)
    • Over 60% of failures result in $100,000+ losses
    Link to original
  • Failures and Dependability

    Understanding Failures, Errors, and Faults

    The Fault-Error-Failure Chain

    • Fault: Hypothesized cause of an error
      • A defect in the system (e.g., bug in code, hardware defect)
      • Not all faults lead to errors
    • Error: Deviation from correct system state
      • Manifestation of a fault
      • May exist without causing a failure
      • Examples: erroneous data, inconsistent internal behavior
    • Failure: System service deviating from specification
      • Visible at the service interface
      • Caused by errors propagating to the service interface
      • Examples: crash, incorrect output, timing violation

    Fault Classification

    Faults can be classified along multiple dimensions:

    Phase of Creation or Occurrence

    • Development Faults: Introduced during system development
    • Operational Faults: Occurring during system operation

    System Boundaries

    • Internal Faults: Originating from within the system
    • External Faults: Originating from outside the system

    Phenomenological Cause

    • Natural Faults: Caused by natural phenomena
    • Human-made Faults: Resulting from human actions

    Intent

    • Non-malicious Faults: Without harmful intent
    • Malicious Faults: With harmful intent (attacks)

    Capability/Competence

    • Accidental Faults: Introduced inadvertently
    • Incompetence Faults: Due to lack of skills/knowledge

    Persistence

    • Permanent Faults: Persisting until repaired
    • Transient Faults: Appearing then disappearing

    Failure Spectrum

    Failure isn’t binary but exists on a spectrum:

    • Optimal Service: Meeting functional requirements and balancing all quality attributes
    • Partial Failure: Some parts of the system fail while others continue
    • Degraded Service: System functions but with reduced performance
    • Transient Failure: Temporary interruption with automatic recovery
    • Complete Failure: System becomes unresponsive or produces incorrect results

    Dependability Attributes

    Dependability Tree

    • Attributes

      • Availability: Readiness for correct service
      • Reliability: Continuity of correct service
      • Safety: Freedom from catastrophic consequences
      • Confidentiality: Absence of unauthorized disclosure
      • Integrity: Absence of improper system alterations
      • Maintainability: Ability to undergo repair and evolution
    • Threats

      • Faults
      • Errors
      • Failures
    • Means

      • Fault Prevention
      • Fault Tolerance
      • Fault Removal
      • Fault Forecasting

    Availability and Reliability

    Distinction

    • Availability: System readiness for service when needed
      • Measured as percentage of uptime
      • Focused on accessibility
    • Reliability: System’s ability to function without failure over time
      • Measured as Mean Time Between Failures (MTBF)
      • Focused on continuity

    Examples

    • System with 99.99% availability but produces incorrect results occasionally: High availability, low reliability
    • System that never crashes but shuts down for maintenance one week each year: High reliability, lower availability (98%)
    Link to original
  • High Availability

    Importance of High Availability

    Business Impact

    • Downtime can be extremely costly in today’s interconnected world
    • Minimizes business disruptions, maintains customer satisfaction, and protects revenue

    User Expectations

    • Users expect 24/7 service availability
    • Poor availability damages reputation and user trust

    Critical Systems

    • Essential for healthcare, finance, emergency services, and other critical infrastructure
    • Directly impacts safety and well-being

    Availability Levels (The “9’s”)

    AvailabilityDowntime per YearDowntime per MonthDowntime per Week
    90% (one nine)36.5 days72 hours16.8 hours
    99% (two nines)3.65 days7.2 hours1.68 hours
    99.9% (three nines)8.76 hours43.8 min10.1 min
    99.99% (four nines)52.6 min4.38 min1.01 min
    99.999% (five nines)5.26 min25.9 s6.06 s
    99.9999% (six nines)31.56 s2.59 s0.61 s
    99.99999% (seven nines)3.16 s259 ms61 ms
    • Each additional “9” represents an order-of-magnitude reduction in downtime
    • Higher availability systems require exponentially more effort and resources

    Means to Achieve Dependability

    Fault Prevention

    • Approach: Prevent occurrence of faults proactively
    • Techniques:
      • Suitable design patterns
      • Rigorous requirements analysis
      • Formal verification methods
      • Code reviews and static analysis

    Fault Tolerance

    • Approach: Design systems to continue operation despite faults
    • Techniques:
      • Redundancy in components and systems
      • Error detection mechanisms
      • Recovery mechanisms

    Fault Removal

    • Approach: Identify and reduce existing faults
    • Techniques:
      • Early prototyping
      • Thorough testing
      • Static code analysis
      • Debugging

    Fault Forecasting

    • Approach: Predict future fault occurrence and consequences
    • Techniques:
      • Performance monitoring
      • Incident report analysis
      • Vulnerability auditing

    Foundations of High Availability

    Fault Tolerance

    Key strategies for fault tolerance:

    • Error detection
    • Failover mechanisms (error recovery)
    • Load balancing
    • Redundancy/replication
    • Auto-scaling
    • Graceful degradation
    • Fault isolation

    Error Detection in Data Centers

    • Monitoring: Collecting metrics like CPU, memory, disk I/O
      • Heartbeats for basic health indication
      • Threshold monitoring for overload detection
    • Telemetry: Analyzing metrics across servers
      • Identifying patterns and anomalies
      • Detecting potential security threats
    • Observability: Understanding internal state through outputs
      • Log analysis
      • Tracing communications through the system

    Circuit Breaker Pattern

    • Inspired by electrical circuit breakers
    • States: Closed (normal), Open (after failures), Half-open (testing recovery)
    • Prevents overload of failing services
    • Fails fast rather than degrading under stress

    Hardware Error Detection

    • ECC Memory: Detects and corrects single-bit errors
    • Redundant components: Multiple power supplies, network interfaces

    Real-world Examples

    • Uber’s M3: Platform for storing and querying time-series metrics
    • Netflix’s Mantis: Stream processing of real-time data for monitoring

    Failover Strategies

    Active-Passive Failover

    • Active: Primary system handling all workload
    • Passive: Idle standby system synchronized with active
    • Failover: When active fails, passive becomes active
    • Variations:
      • Cold Standby: Needs booting and configuration
      • Warm Standby: Running but periodically synchronized
      • Hot Standby: Fully synchronized and ready to take over

    Active-Active Failover

    • Multiple systems simultaneously handling workload
    • Load balancer distributes traffic
    • When one system fails, others take over
    • Provides immediate recovery with no downtime

    Decision Factors for Failover Strategy

    • State management and consistency requirements
    • Recovery Time Objective (RTO)
    • Cost constraints
    • Operational complexity
    Link to original

    Modern Cloud Architectures - Microservices

    Evolution from Monolith to Microservices

    Traditional monolithic applications face challenges as they grow:

    • Increasingly difficult to maintain
    • Hard to scale specific components
    • Complex to evolve with changing requirements
    • Technology lock-in

    Microservices architecture emerged as a solution to these challenges.

    What Are Microservices?

    Microservices architecture is an approach to develop a single application as a suite of small services, each:

    • Running in its own process
    • Communicating through lightweight mechanisms (often HTTP/REST APIs)
    • Independently deployable
    • Built around business capabilities
    • Potentially implemented using different technologies

    Key Characteristics of Microservices

    • Loose coupling: Services interact through well-defined interfaces
    • Independent deployment: Each service can be deployed without affecting others
    • Technology diversity: Different services can use different technologies
    • Focused on business capabilities: Services aligned with business domains
    • Small size: Each service focuses on doing one thing well
    • Decentralized data management: Each service manages its own data
    • Automated deployment: CI/CD pipelines for each service
    • Designed for failure: Resilience built in through isolation

    Microservices Architecture Components

    A typical microservices architecture includes:

    1. Core Services: Implement business functionality
    2. API Gateway: Provides a single entry point for clients
    3. Service Registry: Keeps track of service instances and locations
    4. Config Server: Centralized configuration management
    5. Monitoring and Tracing: Distributed system observability
    6. Load Balancer: Distributes traffic among service instances

    Advantages of Microservices

    1. Independent Development:

      • Teams can work on different services simultaneously
      • Faster development cycles
      • Smaller codebases are easier to understand
    2. Technology Flexibility:

      • Each service can use the most appropriate tech stack
      • Easier to adopt new technologies incrementally
    3. Scalability:

      • Services can be scaled independently based on demand
      • More efficient resource utilization
    4. Fault Isolation:

      • Failures in one service don’t necessarily affect others
      • Easier to implement resilience patterns
    5. Maintainability:

      • Smaller codebases are less complex
      • Easier to understand and debug
      • New team members can become productive faster
    6. Reusability:

      • Services can be reused in different contexts
      • Example: Netflix Asgard, Eureka services used in multiple projects

    Disadvantages of Microservices

    1. Complexity:

      • Increased operational overhead with more services to manage and monitor
      • Distributed debugging challenges - tracing issues across multiple services
      • Complexity of service interactions and dependencies
    2. Performance Overhead:

      • Latency due to network communication between services
      • Serialization/deserialization costs
      • Network bandwidth consumption
    3. Operational Challenges:

      • Microservice sprawl - could expand to hundreds or thousands of services
      • Managing CI/CD pipelines for multiple services
      • End-to-end testing becomes more difficult
    4. Failure Patterns:

      • Interdependency chains can cause cascading failures
      • Death spirals (failures in containers of the same service)
      • Retry storms (wasted resources on failed calls)
      • Cascading QoS violations due to bottleneck services
      • Failure recovery potentially slower than in monoliths

    Microservice Communication

    Synchronous Communication

    • REST APIs (HTTP/HTTPS): Simple request-response pattern
    • gRPC: Efficient binary protocol with bidirectional streaming
    • GraphQL: Query-based, client specifies exactly what data it needs

    Pros:

    • Immediate response
    • Simpler to implement
    • Easier to debug

    Cons:

    • Tight coupling
    • Higher latency
    • Lower fault tolerance

    Asynchronous Communication

    • Message queues: RabbitMQ, ActiveMQ
    • Event streaming: Apache Kafka, AWS Kinesis
    • Pub/Sub pattern: Google Cloud Pub/Sub

    Pros:

    • Loose coupling
    • Better scalability
    • Higher fault tolerance

    Cons:

    • More complex to implement
    • Harder to debug
    • Eventually consistent

    Glueware and Support Infrastructure

    Microservices require substantial supporting infrastructure (“glueware”) that often outweighs the core services:

    • Monitoring and logging systems
    • Service discovery mechanisms
    • Load balancing services
    • API gateways
    • Message brokers
    • Circuit breakers for resilience
    • Distributed tracing tools
    • Configuration management

    According to the Cloud Native Computing Foundation’s 2022 survey, glueware now outweighs core microservices in most deployments.

    Avoiding Microservice Sprawl

    To prevent excessive complexity with microservices:

    1. Start with a monolith design

      • Gradually break it down into microservices as needed
      • Identify natural boundaries and avoid over-decomposition
    2. Focus on business capabilities

      • Design around clear business purposes rather than technical functions
    3. Establish clear governance

      • Define guidelines and best practices for microservice development
      • Create standards for naming conventions, communication protocols, etc.
    4. Implement fault-tolerant design patterns

      • Timeouts, bounded retries, circuit breakers
      • Graceful degradation
    Link to original