Understanding Failures, Errors, and Faults

The Fault-Error-Failure Chain

  • Fault: Adjudged or hypothesized cause of an error
    • A defect in the system (e.g., bug in code, hardware defect)
    • May stay dormant: only an activated fault produces an error
  • Error: Deviation from correct system state
    • Manifestation of a fault
    • May exist without causing a failure
    • Examples: erroneous data, inconsistent internal state
  • Failure: System service deviating from specification
    • Visible at the service interface
    • Caused by errors propagating to the service interface
    • Examples: crash, incorrect output, timing violation
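
The chain above can be illustrated with a small sketch. The functions `mean` and `is_overheated` and the sensor readings are hypothetical, invented only for this illustration: a single defect (fault) stays dormant for some inputs, corrupts internal state (error) for others, and only sometimes changes what the caller sees (failure).

```python
def mean(values):
    # FAULT: the divisor is hard-coded to 3 instead of len(values).
    # The defect stays dormant whenever exactly three values are passed.
    return sum(values) / 3


def is_overheated(readings, limit=100.0):
    """Service interface: report whether the mean reading exceeds the limit."""
    internal_mean = mean(readings)  # holds an ERROR once the fault is activated
    return internal_mean > limit


# 1. Fault dormant: three readings, internal state is correct, no error.
print(is_overheated([90, 95, 100]))      # False (mean is 95.0)

# 2. Fault activated, error masked: the internal mean is wrong (80.0, not 60.0),
#    but the answer at the service interface is still correct.
print(mean([60, 60, 60, 60]))            # 80.0
print(is_overheated([60, 60, 60, 60]))   # False

# 3. Error propagates to the interface: the true mean is 90.0, so the correct
#    answer is False, but the service returns True. This is a FAILURE.
print(is_overheated([90, 90, 90, 90]))   # True
```

Case 2 is the one that is easy to miss in practice: the system is already in an erroneous state, yet no test that only inspects the output would notice.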

Fault Classification

Faults can be classified along multiple dimensions:

Phase of Creation or Occurrence

  • Development Faults: Introduced during system development
  • Operational Faults: Occurring during system operation

System Boundaries

  • Internal Faults: Originating from within the system
  • External Faults: Originating from outside the system

Phenomenological Cause

  • Natural Faults: Caused by natural phenomena
  • Human-made Faults: Resulting from human actions

Intent

  • Non-malicious Faults: Without harmful intent
  • Malicious Faults: With harmful intent (attacks)

Capability/Competence

  • Accidental Faults: Introduced inadvertently
  • Incompetence Faults: Due to lack of skills/knowledge

Persistence

  • Permanent Faults: Persisting until repaired
  • Transient Faults: Appearing then disappearing
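
Because the dimensions are independent, a single fault gets one value per dimension. A minimal sketch of that idea as a data structure follows; the type names (`FaultClass` and the enums) and the two example classifications are assumptions made for illustration, not part of any standard library or of the notes above.

```python
from dataclasses import dataclass, fields
from enum import Enum


class Phase(Enum):
    DEVELOPMENT = "development"
    OPERATIONAL = "operational"


class Boundary(Enum):
    INTERNAL = "internal"
    EXTERNAL = "external"


class Cause(Enum):
    NATURAL = "natural"
    HUMAN_MADE = "human-made"


class Intent(Enum):
    NON_MALICIOUS = "non-malicious"
    MALICIOUS = "malicious"


class Capability(Enum):
    ACCIDENTAL = "accidental"
    INCOMPETENCE = "incompetence"


class Persistence(Enum):
    PERMANENT = "permanent"
    TRANSIENT = "transient"


@dataclass(frozen=True)
class FaultClass:
    """One value per classification dimension."""
    phase: Phase
    boundary: Boundary
    cause: Cause
    intent: Intent
    capability: Capability
    persistence: Persistence

    def describe(self) -> str:
        # Join the value chosen on each dimension, in declaration order.
        return ", ".join(getattr(self, f.name).value for f in fields(self))


# Illustrative classification of an ordinary coding bug.
software_bug = FaultClass(
    Phase.DEVELOPMENT, Boundary.INTERNAL, Cause.HUMAN_MADE,
    Intent.NON_MALICIOUS, Capability.ACCIDENTAL, Persistence.PERMANENT,
)

# Illustrative classification of a radiation-induced bit flip in memory.
bit_flip = FaultClass(
    Phase.OPERATIONAL, Boundary.EXTERNAL, Cause.NATURAL,
    Intent.NON_MALICIOUS, Capability.ACCIDENTAL, Persistence.TRANSIENT,
)

print(software_bug.describe())
print(bit_flip.describe())
```

Not every combination of values is meaningful (a natural fault cannot be malicious, for example), so a fuller model would validate combinations; this sketch does not.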

Failure Spectrum

Failure is not binary; delivered service falls on a spectrum from optimal service to complete failure:

  • Optimal Service: Meeting functional requirements and balancing all quality attributes
  • Partial Failure: Some parts of the system fail while others continue
  • Degraded Service: System functions but with reduced performance
  • Transient Failure: Temporary interruption with automatic recovery
  • Complete Failure: System becomes unresponsive or produces incorrect results

Dependability Attributes

Dependability Tree

  • Attributes
    • Availability: Readiness for correct service
    • Reliability: Continuity of correct service
    • Safety: Freedom from catastrophic consequences
    • Confidentiality: Absence of unauthorized disclosure
    • Integrity: Absence of improper system alterations
    • Maintainability: Ability to undergo repair and evolution
  • Threats
    • Faults
    • Errors
    • Failures
  • Means
    • Fault Prevention
    • Fault Tolerance
    • Fault Removal
    • Fault Forecasting

Availability and Reliability

Distinction

  • Availability: System readiness for service when needed
    • Typically measured as the percentage of time the system is up (uptime / total time)
    • Focused on accessibility
  • Reliability: System’s ability to function without failure over time
    • Commonly measured as Mean Time Between Failures (MTBF)
    • Focused on continuity

Examples

  • A system with 99.99% availability that occasionally produces incorrect results: high availability, low reliability
  • A system that never crashes but shuts down for maintenance one week each year: high reliability, lower availability (about 98%)
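
The figures in these examples can be checked with a short calculation. This is a minimal sketch under stated assumptions: a non-leap year of 8,760 hours, availability taken as uptime divided by total time, and MTBF taken as operating time divided by the number of failures. The function names and the four-failures-per-year input are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of total time the system is ready for service."""
    return uptime_hours / (uptime_hours + downtime_hours)


def allowed_downtime_minutes(target_availability: float) -> float:
    """Yearly downtime budget implied by an availability target."""
    return (1.0 - target_availability) * HOURS_PER_YEAR * 60


def mtbf(operating_hours: float, failure_count: int) -> float:
    """Mean Time Between Failures: operating time per observed failure."""
    return operating_hours / failure_count


# One week of maintenance per year: 168 hours down, 8592 hours up.
maintenance_hours = 7 * 24
yearly = availability(HOURS_PER_YEAR - maintenance_hours, maintenance_hours)
print(f"{yearly:.2%}")                              # 98.08%

# 99.99% availability leaves roughly 52.6 minutes of downtime per year.
print(f"{allowed_downtime_minutes(0.9999):.1f}")    # 52.6

# Hypothetical input: four failures in a year of continuous operation.
print(mtbf(HOURS_PER_YEAR, 4))                      # 2190.0
```

The two metrics move independently: the maintenance window lowers availability without adding a single failure, while frequent but very brief failures lower MTBF and barely affect availability.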