Quality attributes are non-functional requirements that determine the success of a cloud system beyond its basic functionality.
Core Quality Attributes
1. Performance
- Workload handling: Capacity to process the required volume of operations
- Efficiency: Resource usage in relation to output
- Responsiveness: Speed of response to user requests or events
- Throughput: Total amount of work accomplished in a given time period
- Latency: Time delay between action and response
2. Cost
- Build/deployment costs: Initial setup expenses
- Operational costs: Ongoing expenses to run the system
- Maintenance costs: Expenses for updates, fixes, and improvements
- Resource optimization: Efficient use of hardware, software, and human resources
- Scaling costs: Expenses incurred when capacity is scaled up or down
3. Security
- Access control: Prevention of unauthorized access
- Data protection: Safeguarding sensitive information
- Integrity: Ensuring data remains uncorrupted
- Confidentiality: Preventing disclosure of information to unauthorized parties
- Compliance: Meeting regulatory requirements
4. Dependability
- Availability: Readiness for correct service
- Reliability: Continuity of correct service
- Safety: Freedom from catastrophic consequences
- Integrity: Absence of improper system alterations
- Maintainability: Ability to undergo repairs and modifications
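To make the availability attribute above more concrete, here is a minimal sketch. It assumes the commonly used steady-state estimate availability = MTBF / (MTBF + MTTR), which is not stated in these notes. The function names and the sample figures are hypothetical and chosen only for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is ready.

    Assumes availability = MTBF / (MTBF + MTTR), where MTBF is the mean
    time between failures and MTTR is the mean time to repair.
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail: float) -> float:
    """Expected yearly downtime implied by an availability figure."""
    if not 0.0 <= avail <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - avail) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical system: fails every 999 hours on average, 1 hour to repair.
    a = availability(mtbf_hours=999.0, mttr_hours=1.0)
    print(f"availability: {a:.3%}")
    print(f"downtime per year: {downtime_hours_per_year(a):.2f} h")
```

With these sample values the sketch reports 99.900% availability, which corresponds to roughly 8.76 hours of downtime per year.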
Service and Failure Concepts
Correct Service vs. Failure
- Correct service: System implements its function as specified
- Failure: Deviation from the functional specification
- Service quality is not binary: it ranges from optimal service, through degraded service, to complete failure
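The idea that service is a spectrum, not a pass/fail state, can be sketched as a simple classifier. This is only an illustration: the three levels, the `degraded_below` threshold, and the use of "share of requests served as specified" as the measure are all assumptions, not definitions taken from these notes.

```python
from enum import Enum


class ServiceLevel(Enum):
    OPTIMAL = "optimal"
    DEGRADED = "degraded"
    FAILED = "complete failure"


def classify_service(correct: int, total: int,
                     degraded_below: float = 0.99) -> ServiceLevel:
    """Map the share of requests served as specified to a service level.

    `degraded_below` is a hypothetical threshold: any ratio under it
    (but above zero) counts as degraded service.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    ratio = correct / total
    if ratio == 0:
        return ServiceLevel.FAILED    # nothing is served correctly
    if ratio < degraded_below:
        return ServiceLevel.DEGRADED  # partial deviation from the spec
    return ServiceLevel.OPTIMAL


if __name__ == "__main__":
    for ok in (1000, 950, 0):
        print(ok, classify_service(ok, 1000).value)
```

A real system would likely use more levels and more than one signal, but even this coarse version shows why a single up/down flag loses information.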
Quality of Service (QoS)
- A measure of how well a system performs
- The ability to provide guaranteed performance levels
- Multiple dimensions: latency, bandwidth, security, availability, etc.
- Highly contextual and defined for specific applications
- Goal: Deliver the highest QoS at the lowest cost, even in the presence of faults
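Because QoS has several dimensions and is defined per application, one way to picture it is as a set of per-dimension targets checked against measurements. The sketch below assumes three of the dimensions listed above (latency, bandwidth, availability). The class names, field names, and numeric targets are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class QoSTarget:
    """Application-specific QoS targets (hypothetical fields)."""
    max_latency_ms: float
    min_bandwidth_mbps: float
    min_availability: float


@dataclass(frozen=True)
class QoSMeasurement:
    """Observed values for the same dimensions."""
    latency_ms: float
    bandwidth_mbps: float
    availability: float


def qos_violations(target: QoSTarget, measured: QoSMeasurement) -> List[str]:
    """Return the names of the dimensions that miss their target."""
    violations = []
    if measured.latency_ms > target.max_latency_ms:
        violations.append("latency")
    if measured.bandwidth_mbps < target.min_bandwidth_mbps:
        violations.append("bandwidth")
    if measured.availability < target.min_availability:
        violations.append("availability")
    return violations


if __name__ == "__main__":
    # Hypothetical targets for an interactive web application.
    target = QoSTarget(max_latency_ms=200.0, min_bandwidth_mbps=100.0,
                       min_availability=0.999)
    measured = QoSMeasurement(latency_ms=250.0, bandwidth_mbps=120.0,
                              availability=0.9995)
    print(qos_violations(target, measured))  # prints ['latency']
```

A different application, such as a batch pipeline, would plausibly set a loose latency target and a strict bandwidth one, which is what makes QoS contextual.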
Potential Failure Sources in Datacenters
Hardware Failures
- Node/server failures (crashes, timing issues, data corruption)
- Power failures (crashes, possible data corruption)
- Physical accidents (fire, flood, earthquakes)
Network Failures
- Router/gateway failures affecting entire subnets
- Name server failures impacting name domains
- Network congestion leading to dropped packets
Software and Human Factors
- Software complexity leading to bugs
- Misconfiguration and human error
- Security attacks (both external and internal)
Real-world Datacenter Failures
- 2008: Major Amazon S3 outages affecting the US and EU
- 2011: Amazon EBS and RDS outage lasting 4 days
- 2015: Apple service disruptions (iTunes, iCloud, Photos)
- 2016: Significant Google Cloud Platform outage
- 2021: OVHcloud fire in Strasbourg, destroying one datacenter (SBG2) and damaging another (SBG1)
Datacenter Failure Statistics
- 40% of servers experience crashes/unexpected restarts (Google)
- 57% of failures lead to VM migrations (Google)
- Hard drives cause 82% of hardware failures
- Power and cooling problems are the most common cause of outages (71%)
- Over 60% of failures result in losses of $100,000 or more