Distributed System Fundamentals
What Is a Distributed System?
A distributed system can be defined in several ways:
- Tanenbaum and van Steen: “A collection of independent computers that appears to its users as a single coherent system”
- Coulouris, Dollimore and Kindberg: “One in which hardware or software components located at networked computers communicate and coordinate their actions only by passing messages”
- Lamport: “One that stops you getting work done when a machine you’ve never even heard of crashes”
Motivations for Distributed Systems
- Geographic Distribution: Resources and users are naturally distributed
  - Example: Banking services accessible from different locations while data is centrally stored
- Fault Tolerance: Problems rarely affect multiple locations simultaneously
  - Multiple database servers in different rooms provide better reliability
- Performance and Scalability: Combining resources for enhanced capabilities
  - High-performance computing, replicated web servers, etc.
Examples of Distributed Systems
- Financial trading platforms
- Web search engines (processing 50+ billion web pages)
- Social media platforms supporting billions of users
- Large Language Models (trained across clusters)
- Scientific research (e.g., CERN with over 1 Exabyte of data)
- Content Delivery Networks (CDNs)
- Online multiplayer games
Fallacies of Distributed Computing
Eight classic false assumptions, identified at Sun Microsystems, that often lead to problematic distributed system designs:
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- There is one administrator
- Transport cost is zero
- The network is homogeneous
- Topology doesn’t change
Key Aspects of Distributed System Design
- System Function: The intended purpose (features and capabilities)
- System Behavior: How the system performs its functions
- Quality Attributes: Core qualities determining success:
  - Performance
  - Cost
  - Security
  - Dependability
Challenges in Distributed Systems
Distributed systems introduce complexity in:
- Coordination
- Consistency
- Fault detection and recovery
- Security
- Performance optimization
Cloud Systems Quality Attributes
Quality attributes are non-functional requirements that determine the success of a cloud system beyond its basic functionality.
Core Quality Attributes
1. Performance
- Workload handling: Capacity to process the required volume of operations
- Efficiency: Resource usage in relation to output
- Responsiveness: Speed of response to user requests or events
- Throughput: Total amount of work accomplished in a given time period
- Latency: Time delay between action and response
2. Cost
- Build/deployment costs: Initial setup expenses
- Operational costs: Ongoing expenses to run the system
- Maintenance costs: Expenses for updates, fixes, and improvements
- Resource optimization: Efficient use of hardware, software, and human resources
- Scaling costs: Expenses related to growth or contraction
3. Security
- Access control: Prevention of unauthorized access
- Data protection: Safeguarding sensitive information
- Integrity: Ensuring data remains uncorrupted
- Confidentiality: Keeping private information private
- Compliance: Meeting regulatory requirements
4. Dependability
- Availability: Readiness for correct service
- Reliability: Continuity of correct service
- Safety: Freedom from catastrophic consequences
- Integrity: Absence of improper system alterations
- Maintainability: Ability to undergo repairs and modifications
Service and Failure Concepts
Correct Service vs. Failure
- Correct service: System implements its function as specified
- Failure: Deviation from the functional specification
- Failure is not binary but exists on a spectrum from optimal service to complete failure
Quality of Service (QoS)
- A measure of how well a system performs
- The ability to provide guaranteed performance levels
- Multiple dimensions: latency, bandwidth, security, availability, etc.
- Highly contextual and defined for specific applications
- Goal: Highest QoS despite faults at the lowest cost
Potential Failure Sources in Datacenters
Hardware Failures
- Node/server failures (crashes, timing issues, data corruption)
- Power failures (crashes, possible data corruption)
- Physical accidents (fire, flood, earthquakes)
Network Failures
- Router/gateway failures affecting entire subnets
- Name server failures impacting name domains
- Network congestion leading to dropped packets
Software and Human Factors
- Software complexity leading to bugs
- Misconfiguration and human error
- Security attacks (both external and internal)
Real-world Datacenter Failures
- 2008: Amazon S3 major outages affecting US & EU
- 2011: Amazon EBS and RDS outage lasting 4 days
- 2015: Apple service disruptions (iTunes, iCloud, Photos)
- 2016: Google Cloud Platform significant outage
- 2021: OVHcloud fire destroying datacenters in Strasbourg
Datacenter Failure Statistics
- 40% of servers experience crashes/unexpected restarts (Google)
- 57% of failures lead to VM migrations (Google)
- Hard drives cause 82% of hardware failures
- Power & Cooling are the most common cause of outages (71%)
- Over 60% of failures result in $100,000+ losses
Failures and Dependability
Understanding Failures, Errors, and Faults
The Fault-Error-Failure Chain
- Fault: Hypothesized cause of an error
  - A defect in the system (e.g., a bug in code, a hardware defect)
  - Not all faults lead to errors
- Error: Deviation from the correct system state
  - The manifestation of a fault
  - May exist without causing a failure
  - Examples: erroneous data, inconsistent internal behavior
- Failure: System service deviating from its specification
  - Visible at the service interface
  - Caused by errors propagating to the service interface
  - Examples: crash, incorrect output, timing violation
Fault Classification
Faults can be classified along multiple dimensions:

Phase of Creation or Occurrence
- Development Faults: Introduced during system development
- Operational Faults: Occurring during system operation
System Boundaries
- Internal Faults: Originating from within the system
- External Faults: Originating from outside the system
Phenomenological Cause
- Natural Faults: Caused by natural phenomena
- Human-made Faults: Resulting from human actions
Intent
- Non-malicious Faults: Without harmful intent
- Malicious Faults: With harmful intent (attacks)
Capability/Competence
- Accidental Faults: Introduced inadvertently
- Incompetence Faults: Due to lack of skills/knowledge
Persistence
- Permanent Faults: Persisting until repaired
- Transient Faults: Appearing then disappearing
Failure Spectrum
Failure isn’t binary but exists on a spectrum:
- Optimal Service: Meeting functional requirements and balancing all quality attributes
- Partial Failure: Some parts of the system fail while others continue
- Degraded Service: System functions but with reduced performance
- Transient Failure: Temporary interruption with automatic recovery
- Complete Failure: System becomes unresponsive or produces incorrect results
Dependability Attributes
Dependability Tree

- Attributes
  - Availability: Readiness for correct service
  - Reliability: Continuity of correct service
  - Safety: Freedom from catastrophic consequences
  - Confidentiality: Absence of unauthorized disclosure
  - Integrity: Absence of improper system alterations
  - Maintainability: Ability to undergo repair and evolution
- Threats
  - Faults
  - Errors
  - Failures
- Means
  - Fault Prevention
  - Fault Tolerance
  - Fault Removal
  - Fault Forecasting
Availability and Reliability
Distinction
- Availability: System readiness for service when needed
  - Measured as a percentage of uptime
  - Focused on accessibility
- Reliability: System’s ability to function without failure over time
  - Measured as Mean Time Between Failures (MTBF)
  - Focused on continuity
Examples
- A system with 99.99% availability that occasionally produces incorrect results: high availability, low reliability
- A system that never crashes but shuts down for maintenance one week each year: high reliability, lower availability (about 98%)
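To make the distinction concrete, here is a small illustrative calculation relating failure and repair times to availability via the common steady-state relation Availability = MTBF / (MTBF + MTTR); the figures used are made up for illustration.

```python
# Illustrative only: relate reliability (MTBF) and repair time (MTTR)
# to steady-state availability: Availability = MTBF / (MTBF + MTTR).

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is ready for correct service."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A system that fails on average every 1,000 hours and takes 2 hours to repair:
print(f"{availability(1000, 2):.4%}")   # ~99.80% available

# The maintenance example above: up 51 weeks, down 1 week per year:
print(f"{51 / 52:.2%}")                 # ~98% available despite high reliability
```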
High Availability
Importance of High Availability
Business Impact
- Downtime can be extremely costly in today’s interconnected world
- Minimizes business disruptions, maintains customer satisfaction, and protects revenue
User Expectations
- Users expect 24/7 service availability
- Poor availability damages reputation and user trust
Critical Systems
- Essential for healthcare, finance, emergency services, and other critical infrastructure
- Directly impacts safety and well-being
Availability Levels (The “9’s”)
| Availability | Downtime per Year | Downtime per Month | Downtime per Week |
|---|---|---|---|
| 90% (one nine) | 36.5 days | 72 hours | 16.8 hours |
| 99% (two nines) | 3.65 days | 7.2 hours | 1.68 hours |
| 99.9% (three nines) | 8.76 hours | 43.8 min | 10.1 min |
| 99.99% (four nines) | 52.6 min | 4.38 min | 1.01 min |
| 99.999% (five nines) | 5.26 min | 25.9 s | 6.06 s |
| 99.9999% (six nines) | 31.56 s | 2.59 s | 0.61 s |
| 99.99999% (seven nines) | 3.16 s | 259 ms | 61 ms |

- Each additional “9” represents an order-of-magnitude reduction in downtime
- Higher availability systems require exponentially more effort and resources
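A quick way to sanity-check the table is to compute the allowed downtime directly from the availability percentage; the sketch below is illustrative only and reproduces the per-year figures.

```python
# Illustrative only: convert an availability percentage into allowed downtime,
# reproducing the per-year figures in the table above.
SECONDS_PER_YEAR = 365 * 24 * 3600

def downtime_per_year_seconds(availability_percent: float) -> float:
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR

for nines in (99.0, 99.9, 99.99, 99.999):
    s = downtime_per_year_seconds(nines)
    print(f"{nines}% -> {s / 3600:.2f} h/year ({s / 60:.1f} min)")
# e.g. 99.99% -> 0.88 h/year (52.6 min), matching the four-nines row
```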
Means to Achieve Dependability
Fault Prevention
- Approach: Prevent occurrence of faults proactively
- Techniques:
- Suitable design patterns
- Rigorous requirements analysis
- Formal verification methods
- Code reviews and static analysis
Fault Tolerance
- Approach: Design systems to continue operation despite faults
- Techniques:
- Redundancy in components and systems
- Error detection mechanisms
- Recovery mechanisms
Fault Removal
- Approach: Identify and reduce existing faults
- Techniques:
- Early prototyping
- Thorough testing
- Static code analysis
- Debugging
Fault Forecasting
- Approach: Predict future fault occurrence and consequences
- Techniques:
- Performance monitoring
- Incident report analysis
- Vulnerability auditing
Foundations of High Availability
Fault Tolerance
Key strategies for fault tolerance:
- Error detection
- Failover mechanisms (error recovery)
- Load balancing
- Redundancy/replication
- Auto-scaling
- Graceful degradation
- Fault isolation
Error Detection in Data Centers
- Monitoring: Collecting metrics like CPU, memory, disk I/O
  - Heartbeats for basic health indication
  - Threshold monitoring for overload detection
- Telemetry: Analyzing metrics across servers
  - Identifying patterns and anomalies
  - Detecting potential security threats
- Observability: Understanding internal state through outputs
  - Log analysis
  - Tracing communications through the system
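As a rough illustration of the heartbeat idea, the sketch below keeps an in-memory map of last-heartbeat times and flags nodes that have gone silent. The timeout value and the registry itself are assumptions; production datacenters use dedicated monitoring stacks rather than a loop like this.

```python
# A minimal sketch of heartbeat-based error detection (illustrative assumptions).
import time

HEARTBEAT_TIMEOUT_S = 10          # assumed threshold, tuned per deployment
last_heartbeat: dict[str, float] = {}

def record_heartbeat(node_id: str) -> None:
    """Called whenever a node reports that it is alive."""
    last_heartbeat[node_id] = time.monotonic()

def suspected_failures() -> list[str]:
    """Nodes whose last heartbeat is older than the timeout."""
    now = time.monotonic()
    return [node for node, ts in last_heartbeat.items()
            if now - ts > HEARTBEAT_TIMEOUT_S]
```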
Circuit Breaker Pattern
- Inspired by electrical circuit breakers
- States: Closed (normal), Open (after failures), Half-open (testing recovery)
- Prevents overload of failing services
- Fails fast rather than degrading under stress
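A minimal sketch of the three states is shown below; the failure threshold and reset timeout are illustrative assumptions rather than standard values.

```python
# A minimal circuit breaker sketch: Closed (normal), Open (fail fast),
# Half-open (allow one trial request). Thresholds are illustrative.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.state = "half-open"        # allow one trial request
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.state == "half-open":
                self.state = "open"         # stop hammering the failing service
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0               # success closes the breaker again
            self.state = "closed"
            return result
```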
Hardware Error Detection
- ECC Memory: Detects and corrects single-bit errors
- Redundant components: Multiple power supplies, network interfaces
Real-world Examples
- Uber’s M3: Platform for storing and querying time-series metrics
- Netflix’s Mantis: Stream processing of real-time data for monitoring
Failover Strategies
Active-Passive Failover
- Active: Primary system handling all workload
- Passive: Idle standby system synchronized with active
- Failover: When active fails, passive becomes active
- Variations:
  - Cold Standby: Needs booting and configuration
  - Warm Standby: Running but periodically synchronized
  - Hot Standby: Fully synchronized and ready to take over
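The sketch below illustrates the active-passive idea with a periodic health check that promotes the standby once the primary stops responding. The node names and the stub health check are assumptions; real deployments typically move a virtual IP or reconfigure a load balancer instead of running a loop like this.

```python
# A minimal sketch of active-passive failover driven by a periodic health check.
import time

PRIMARY, STANDBY = "db-primary", "db-standby"   # hypothetical node names
CHECK_INTERVAL_S = 5                            # assumed health-check interval

def is_healthy(node: str) -> bool:
    """Stub health check; in practice a ping/HTTP/heartbeat probe."""
    return True

def run_failover_monitor() -> None:
    active = PRIMARY
    while True:
        if active == PRIMARY and not is_healthy(PRIMARY):
            active = STANDBY                    # promote the standby (failover)
            print(f"failover: routing traffic to {active}")
        time.sleep(CHECK_INTERVAL_S)
```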
Active-Active Failover
- Multiple systems simultaneously handling workload
- Load balancer distributes traffic
- When one system fails, others take over
- Provides immediate recovery with no downtime
Decision Factors for Failover Strategy
- State management and consistency requirements
- Recovery Time Objective (RTO)
- Cost constraints
- Operational complexity
Modern Cloud Architectures - Microservices
Evolution from Monolith to Microservices
Traditional monolithic applications face challenges as they grow:
- Increasingly difficult to maintain
- Hard to scale specific components
- Complex to evolve with changing requirements
- Technology lock-in
Microservices architecture emerged as a solution to these challenges.
What Are Microservices?
Microservices architecture is an approach to developing a single application as a suite of small services, each:
- Running in its own process
- Communicating through lightweight mechanisms (often HTTP/REST APIs)
- Independently deployable
- Built around business capabilities
- Potentially implemented using different technologies
Key Characteristics of Microservices
- Loose coupling: Services interact through well-defined interfaces
- Independent deployment: Each service can be deployed without affecting others
- Technology diversity: Different services can use different technologies
- Focused on business capabilities: Services aligned with business domains
- Small size: Each service focuses on doing one thing well
- Decentralized data management: Each service manages its own data
- Automated deployment: CI/CD pipelines for each service
- Designed for failure: Resilience built in through isolation
Microservices Architecture Components
A typical microservices architecture includes:
- Core Services: Implement business functionality
- API Gateway: Provides a single entry point for clients
- Service Registry: Keeps track of service instances and locations
- Config Server: Centralized configuration management
- Monitoring and Tracing: Distributed system observability
- Load Balancer: Distributes traffic among service instances
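To illustrate the role of the service registry, here is a deliberately minimal register/resolve sketch; real registries such as Eureka or Consul add leases, health checks, and replication on top of this idea, and the service name and addresses below are hypothetical.

```python
# A minimal sketch of a service registry: register instances, resolve one to call.
import random

registry: dict[str, list[str]] = {}        # service name -> instance addresses

def register(service: str, address: str) -> None:
    registry.setdefault(service, []).append(address)

def resolve(service: str) -> str:
    """Pick one registered instance (random choice as a stand-in for load balancing)."""
    instances = registry.get(service)
    if not instances:
        raise LookupError(f"no instances registered for {service}")
    return random.choice(instances)

register("orders", "10.0.0.12:8080")       # hypothetical addresses
register("orders", "10.0.0.13:8080")
print(resolve("orders"))
```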
Advantages of Microservices
- Independent Development:
  - Teams can work on different services simultaneously
  - Faster development cycles
  - Smaller codebases are easier to understand
- Technology Flexibility:
  - Each service can use the most appropriate tech stack
  - Easier to adopt new technologies incrementally
- Scalability:
  - Services can be scaled independently based on demand
  - More efficient resource utilization
- Fault Isolation:
  - Failures in one service don’t necessarily affect others
  - Easier to implement resilience patterns
- Maintainability:
  - Smaller codebases are less complex
  - Easier to understand and debug
  - New team members can become productive faster
- Reusability:
  - Services can be reused in different contexts
  - Example: Netflix’s Asgard and Eureka services, used in multiple projects
Disadvantages of Microservices
- Complexity:
  - Increased operational overhead with more services to manage and monitor
  - Distributed debugging challenges: tracing issues across multiple services
  - Complexity of service interactions and dependencies
- Performance Overhead:
  - Latency due to network communication between services
  - Serialization/deserialization costs
  - Network bandwidth consumption
- Operational Challenges:
  - Microservice sprawl: deployments can expand to hundreds or thousands of services
  - Managing CI/CD pipelines for multiple services
  - End-to-end testing becomes more difficult
- Failure Patterns:
  - Interdependency chains can cause cascading failures
  - Death spirals (failures spreading across containers of the same service)
  - Retry storms (wasted resources on repeated failed calls)
  - Cascading QoS violations due to bottleneck services
  - Failure recovery can be slower than in monoliths
Microservice Communication
Synchronous Communication
- REST APIs (HTTP/HTTPS): Simple request-response pattern
- gRPC: Efficient binary protocol with bidirectional streaming
- GraphQL: Query-based, client specifies exactly what data it needs
Pros:
- Immediate response
- Simpler to implement
- Easier to debug
Cons:
- Tight coupling
- Higher latency
- Lower fault tolerance
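As a minimal illustration of the synchronous style, the sketch below performs a blocking HTTP request with an explicit timeout; the service URL is hypothetical, and the timeout guards against the latency and fault-tolerance drawbacks listed above.

```python
# A minimal sketch of synchronous request-response between services over HTTP/REST.
import requests   # third-party HTTP client, assumed available

def get_order(order_id: str) -> dict:
    resp = requests.get(
        f"http://orders-service:8080/orders/{order_id}",  # hypothetical endpoint
        timeout=2.0,               # never wait indefinitely on a slow dependency
    )
    resp.raise_for_status()        # surface 4xx/5xx responses as exceptions
    return resp.json()
```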
Asynchronous Communication
- Message queues: RabbitMQ, ActiveMQ
- Event streaming: Apache Kafka, AWS Kinesis
- Pub/Sub pattern: Google Cloud Pub/Sub
Pros:
- Loose coupling
- Better scalability
- Higher fault tolerance
Cons:
- More complex to implement
- Harder to debug
- Eventually consistent
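The sketch below shows the asynchronous idea in-process, using a plain queue as a stand-in for a broker such as RabbitMQ or Kafka; the service names and event format are assumptions.

```python
# A minimal in-process sketch of asynchronous, message-based communication.
# A real deployment would use a broker (RabbitMQ, Kafka, ...) instead of queue.Queue.
import json, queue, threading

order_events: queue.Queue[str] = queue.Queue()

def publish_order_created(order_id: str) -> None:
    # The producer returns immediately; it does not wait for consumers.
    order_events.put(json.dumps({"event": "order_created", "order_id": order_id}))

def billing_consumer() -> None:
    while True:
        event = json.loads(order_events.get())   # blocks until a message arrives
        print("billing service handling", event)
        order_events.task_done()

threading.Thread(target=billing_consumer, daemon=True).start()
publish_order_created("A-1001")      # hypothetical order id
order_events.join()                  # wait until the event has been processed
```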
Glueware and Support Infrastructure
Microservices require substantial supporting infrastructure (“glueware”) that often outweighs the core services:
- Monitoring and logging systems
- Service discovery mechanisms
- Load balancing services
- API gateways
- Message brokers
- Circuit breakers for resilience
- Distributed tracing tools
- Configuration management
According to the Cloud Native Computing Foundation’s 2022 survey, glueware now outweighs core microservices in most deployments.
Avoiding Microservice Sprawl
To prevent excessive complexity with microservices:
- Start with a monolith design
  - Gradually break it down into microservices as needed
  - Identify natural boundaries and avoid over-decomposition
- Focus on business capabilities
  - Design around clear business purposes rather than technical functions
- Establish clear governance
  - Define guidelines and best practices for microservice development
  - Create standards for naming conventions, communication protocols, etc.
- Implement fault-tolerant design patterns (see the sketch after this list)
  - Timeouts, bounded retries, circuit breakers
  - Graceful degradation
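As a minimal illustration of the last point, the sketch below wraps a call with bounded retries, exponential backoff, and a fallback for graceful degradation; the retry counts and delays are illustrative assumptions.

```python
# A minimal sketch of bounded retries with exponential backoff and a fallback.
import time

def call_with_retries(fn, *, attempts=3, base_delay_s=0.2, fallback=None):
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                break
            time.sleep(base_delay_s * (2 ** attempt))   # back off: 0.2s, 0.4s, ...
    return fallback        # degrade gracefully instead of cascading the failure
```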