Quality attributes are non-functional requirements that determine the success of a cloud system beyond its basic functionality.

Core Quality Attributes

1. Performance

  • Workload handling: Capacity to process the required volume of operations
  • Efficiency: Resource usage in relation to output
  • Responsiveness: Speed of response to user requests or events
  • Throughput: Total amount of work accomplished in a given time period
  • Latency: Time delay between action and response
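Throughput and latency from the list above can be made concrete with a small calculation. The following is a minimal sketch, not a prescribed measurement method: the sample latencies, the 2-second measurement window, and the function names are all hypothetical, and the nearest-rank percentile is just one common way to summarize latency.

```python
import math

def latency_percentile(latencies_ms, pct):
    """Nearest-rank percentile of observed request latencies (milliseconds)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Nearest-rank method: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def throughput_per_second(completed_requests, window_seconds):
    """Work accomplished per unit time: completed requests divided by the window length."""
    return completed_requests / window_seconds

# Hypothetical measurements: ten requests completed within a 2-second window.
samples_ms = [12, 15, 11, 240, 14, 13, 16, 12, 18, 15]
window_s = 2.0

print("median latency (ms):", latency_percentile(samples_ms, 50))
print("p99 latency (ms):", latency_percentile(samples_ms, 99))
print("throughput (req/s):", throughput_per_second(len(samples_ms), window_s))
```

In this made-up sample a single slow request (240 ms) leaves the median almost untouched but dominates the 99th percentile, which is why latency is often reported as a tail percentile rather than an average.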

2. Cost

  • Build/deployment costs: Initial setup expenses
  • Operational costs: Ongoing expenses to run the system
  • Maintenance costs: Expenses for updates, fixes, and improvements
  • Resource optimization: Efficient use of hardware, software, and human resources
  • Scaling costs: Expenses related to growth or contraction

3. Security

  • Access control: Prevention of unauthorized access
  • Data protection: Safeguarding sensitive information
  • Integrity: Ensuring data remains uncorrupted
  • Confidentiality: Keeping private information private
  • Compliance: Meeting regulatory requirements

4. Dependability

  • Availability: Readiness for correct service
  • Reliability: Continuity of correct service
  • Safety: Freedom from catastrophic consequences
  • Integrity: Absence of improper system alterations
  • Maintainability: Ability to undergo repairs and modifications
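Availability and reliability are often quantified with two standard textbook formulas that these notes do not state themselves: steady-state availability as MTTF / (MTTF + MTTR), and reliability over a mission time t as exp(-t / MTTF). The second formula assumes a constant failure rate (an exponential failure model), which is a simplification. The sketch below uses hypothetical numbers and function names purely for illustration.

```python
import math

HOURS_PER_YEAR = 365 * 24

def steady_state_availability(mttf_hours, mttr_hours):
    """Fraction of time the system is ready for correct service."""
    return mttf_hours / (mttf_hours + mttr_hours)

def expected_downtime_hours_per_year(availability):
    """Expected hours per year during which service is not available."""
    return (1.0 - availability) * HOURS_PER_YEAR

def reliability(mission_hours, mttf_hours):
    """Probability of uninterrupted correct service for mission_hours.

    Assumption: constant failure rate (exponential model).
    """
    return math.exp(-mission_hours / mttf_hours)

# Hypothetical system: fails on average every 999 hours, takes 1 hour to repair.
a = steady_state_availability(mttf_hours=999.0, mttr_hours=1.0)
print("availability:", a)
print("downtime (h/year):", round(expected_downtime_hours_per_year(a), 2))
print("P(no failure in 24 h):", round(reliability(24.0, 999.0), 4))
```

The two attributes can diverge: a system that fails often but recovers in seconds can have high availability and low reliability, while a system that rarely fails but takes days to repair shows the opposite pattern.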

Service and Failure Concepts

Correct Service vs. Failure

  • Correct service: System implements its function as specified
  • Failure: Deviation from the functional specification
    • Service quality is not binary: it ranges along a spectrum from optimal service to complete failure

Quality of Service (QoS)

  • A measure of how well a system performs
  • The ability to provide guaranteed performance levels
  • Multiple dimensions: latency, bandwidth, security, availability, etc.
  • Highly contextual and defined for specific applications
  • Goal: The highest QoS achievable despite faults, at the lowest cost
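One way to read the QoS goal is as a constrained choice: among the deployment options that satisfy the application's QoS targets, pick the cheapest. The sketch below illustrates that reading only. The configurations, their latency, availability, and cost figures, and the targets are all invented for the example, and only two QoS dimensions are checked.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    """A hypothetical deployment option with two QoS dimensions and a cost."""
    name: str
    p99_latency_ms: float
    availability: float
    monthly_cost: float

def meets_targets(cfg, max_latency_ms, min_availability):
    """True if the configuration satisfies every QoS target."""
    return cfg.p99_latency_ms <= max_latency_ms and cfg.availability >= min_availability

def cheapest_meeting_targets(configs, max_latency_ms, min_availability):
    """Return the lowest-cost configuration that meets the targets, or None."""
    candidates = [c for c in configs if meets_targets(c, max_latency_ms, min_availability)]
    return min(candidates, key=lambda c: c.monthly_cost) if candidates else None

# Invented options; real figures would come from measurement and pricing data.
options = [
    Config("single-node", p99_latency_ms=300.0, availability=0.99, monthly_cost=100.0),
    Config("two-replicas", p99_latency_ms=180.0, availability=0.999, monthly_cost=220.0),
    Config("three-replicas", p99_latency_ms=150.0, availability=0.9999, monthly_cost=400.0),
]

choice = cheapest_meeting_targets(options, max_latency_ms=200.0, min_availability=0.999)
print(choice.name if choice else "no configuration meets the targets")
```

Because QoS is contextual, the same options give a different answer under different targets: tightening the availability target to 0.9999 leaves only the most expensive option, and an unreachable target leaves none.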

Potential Failure Sources in Datacenters

Hardware Failures

  • Node/server failures (crashes, timing issues, data corruption)
  • Power failures (crashes, possible data corruption)
  • Physical accidents (fire, flood, earthquakes)

Network Failures

  • Router/gateway failures affecting entire subnets
  • Name server failures impacting name domains
  • Network congestion leading to dropped packets

Software and Human Factors

  • Software complexity leading to bugs
  • Misconfiguration and human error
  • Security attacks (both external and internal)

Real-world Datacenter Failures

  • 2008: Amazon S3 major outages affecting the US and EU
  • 2011: Amazon EBS and RDS outage lasting 4 days
  • 2015: Apple service disruptions (iTunes, iCloud, Photos)
  • 2016: Google Cloud Platform significant outage
  • 2021: OVHcloud fire destroying datacenters in Strasbourg

Datacenter Failure Statistics

  • 40% of servers experience crashes/unexpected restarts (Google)
  • 57% of failures lead to VM migrations (Google)
  • Hard drives cause 82% of hardware failures
  • Power and cooling problems are the most common cause of outages (71%)
  • Over 60% of failures result in losses of $100,000 or more