Data centres are the backbone of cloud computing, and their design plays a crucial role in ensuring sustainability, reliability, and efficiency. This note focuses on the infrastructure design aspects that enable dependable and sustainable data centre operations.

Data Centre Infrastructure Basics

A modern data centre consists of several key components:

  • Servers: Individual compute units, typically rack-mounted
  • Racks: Metal frames housing multiple servers
  • Cooling systems: Equipment to remove heat generated by servers
  • Power distribution systems: Deliver electricity to all equipment
  • Network infrastructure: Connects servers internally and to the outside world
  • Physical security systems: Control access to the facility

Designing for Hardware Redundancy

Geographic Redundancy

  • Definition: Distributing data centres across multiple geographic regions
  • Purpose: Mitigate impact of regional outages (natural disasters, power grid failures)
  • Implementation:
    • Multiple data centres in different regions
    • Data replication across regions
    • Load balancing between regions
  • Benefit: Ensures continued operation even if an entire region goes offline
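
As a rough illustration of this benefit, the sketch below estimates the chance that at least one region is available. It assumes region failures are independent and that failover is instant and perfect; both are simplifications, and correlated outages would lower the real figure. The function name and the 99.9% input are illustrative, not taken from any provider.

```python
def combined_availability(per_region: float, regions: int) -> float:
    """Probability that at least one region is up, assuming independent failures."""
    if not 0.0 <= per_region <= 1.0:
        raise ValueError("per_region must be between 0 and 1")
    if regions < 1:
        raise ValueError("regions must be at least 1")
    # All regions must be down at once for the service to be unavailable.
    return 1.0 - (1.0 - per_region) ** regions


if __name__ == "__main__":
    # Illustrative input: each region is assumed to be 99.9% available.
    for n in (1, 2, 3):
        print(f"{n} region(s): {combined_availability(0.999, n):.9f}")
```

Under these assumptions each added region multiplies the unavailability by the same small factor, which is why a second region usually gives the largest gain.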

Server Redundancy

  • Definition: Deploying servers in clusters with automatic failover mechanisms
  • Purpose: Ensure service availability despite individual server failures
  • Implementation:
    • Server clusters managed by virtualization technology
    • Automatic failover when hardware issues are detected
    • N+1 or N+2 redundancy (extra servers beyond minimum requirements)
  • Benefit: Seamless operation during hardware failures

Storage Redundancy

  • Definition: Replicating data across multiple storage devices and technologies
  • Purpose: Prevent data loss due to disk or storage system failures
  • Implementation:
    • RAID configurations to protect against disk failures
    • Replication within and across data centres
    • Multiple storage technologies (SSD, HDD, tape) for different tiers
  • Benefit: Data remains accessible and intact despite storage component failures
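
The parity idea behind RAID levels such as RAID 5 can be sketched in a few lines. This is a toy illustration rather than how a real controller works: the helper names are hypothetical, blocks are equal-length byte strings, and a single parity block protects one stripe against the loss of one disk.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def rebuild_missing(surviving: list[bytes], parity: bytes) -> bytes:
    """Recover the single lost data block from the survivors and the parity block."""
    return xor_blocks(surviving + [parity])


if __name__ == "__main__":
    data = [b"disk0dat", b"disk1dat", b"disk2dat"]
    parity = xor_blocks(data)
    # Simulate losing the second disk and rebuilding it from the rest.
    recovered = rebuild_missing([data[0], data[2]], parity)
    print(recovered == data[1])
```

Because XOR is its own inverse, combining the parity with the surviving blocks cancels them out and leaves the missing block.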

Network Redundancy

Reliable networking is critical for data centre operations. Redundancy is implemented at multiple levels:

Server-level Network Redundancy

  • Redundant Network Interface Cards (NICs) on each server
  • Two or more power supplies per server to eliminate single points of failure
  • Multiple network paths from each server

Network-level Redundancy

  • Redundant switches, routers, firewalls, and load balancers
  • Multiple connection paths between network devices
  • Diverse carrier connections for external connectivity
  • Link aggregation: Bundles multiple physical links between network devices into one logical link
  • Spanning Tree Protocol (STP): Prevents switching loops by blocking redundant links, then re-enables them if an active link fails
  • Equal-Cost Multi-Path (ECMP): Distributes traffic across multiple paths
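
ECMP is commonly realised by hashing a flow's header fields, so that all packets of one flow follow the same path while different flows spread across the available paths. The sketch below illustrates only that idea: real switches use vendor-specific hardware hash functions, and the `Flow` type, the `pick_path` function, and the uplink names are assumptions made for this example.

```python
import hashlib
from typing import NamedTuple, Sequence


class Flow(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


def pick_path(flow: Flow, paths: Sequence[str]) -> str:
    """Map a flow to one of the equal-cost paths using a stable hash."""
    if not paths:
        raise ValueError("no paths available")
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(paths)
    return paths[index]


if __name__ == "__main__":
    uplinks = ["spine-1", "spine-2", "spine-3", "spine-4"]  # hypothetical names
    flow = Flow("10.0.0.5", "10.0.1.9", 49152, 443, "tcp")
    print(pick_path(flow, uplinks))
    # If one uplink fails, the same flow is rehashed over the survivors.
    print(pick_path(flow, [p for p in uplinks if p != "spine-2"]))
```

Keeping each flow on one path avoids packet reordering; the trade-off is that a few large flows can still load paths unevenly.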

Network Topologies for Redundancy

  1. Hierarchical/3-tier topology:

    • Access layer (connects to servers)
    • Aggregation layer (connects access switches)
    • Core layer (high-speed backbone)
    • Redundant connections between layers
  2. Fat-tree/Clos topology:

    • Non-blocking architecture
    • Multiple equal-cost paths between any two servers
    • Better scalability and fault tolerance than traditional hierarchical designs
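
For a sense of scale, the sketch below computes the size of a k-ary fat-tree, a common construction built from identical k-port switches: k pods, each with k/2 edge and k/2 aggregation switches, plus (k/2)^2 core switches. Choosing this particular construction is an assumption, since the note does not specify one, and the function name is illustrative.

```python
def fat_tree_size(k: int) -> dict[str, int]:
    """Element counts for a k-ary fat-tree built from k-port switches."""
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even integer of at least 2")
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,
        "aggregation_switches": k * half,
        "core_switches": half * half,
        "hosts": k * half * half,           # k^3 / 4
        "paths_between_pods": half * half,  # equal-cost paths for hosts in different pods
    }


if __name__ == "__main__":
    for ports in (4, 48):
        print(ports, fat_tree_size(ports))
```

The number of equal-cost paths between pods grows with the square of k/2, which is where the fault tolerance of this topology comes from.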

Power Redundancy

Data centres require a constant, reliable power supply:

  • Multiple power feeds from different utility substations

  • Uninterruptible Power Supplies (UPS) for temporary outages

    • Battery systems that provide immediate power during utility failures
    • Typically sized to carry the load for minutes (longer in some designs), long enough for backup generators to start and take over
  • Backup generators for medium/long-term outages

    • Diesel or natural gas powered
    • Automatically start when utility power fails
    • Sized to power the entire facility for days
  • Power Distribution Units (PDUs) with dual power inputs

    • Ensure continuous rack power
    • Allow maintenance of one power path without downtime
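
How long a UPS can bridge an outage can be estimated roughly from battery energy and load. The sketch below assumes a constant load and uses illustrative default values for inverter efficiency and usable battery fraction; none of these figures come from the note or from a specific product, and real sizing also accounts for battery ageing and temperature.

```python
def ups_runtime_minutes(
    battery_kwh: float,
    load_kw: float,
    inverter_efficiency: float = 0.9,  # assumed value, for illustration only
    usable_fraction: float = 0.8,      # assumed share of capacity that may be drawn
) -> float:
    """Rough UPS runtime in minutes at a constant load."""
    if battery_kwh <= 0 or load_kw <= 0:
        raise ValueError("battery_kwh and load_kw must be positive")
    if not 0 < inverter_efficiency <= 1 or not 0 < usable_fraction <= 1:
        raise ValueError("efficiency and usable fraction must be in (0, 1]")
    usable_kwh = battery_kwh * usable_fraction * inverter_efficiency
    return usable_kwh / load_kw * 60.0


if __name__ == "__main__":
    # Hypothetical facility: 500 kWh of batteries carrying a 1000 kW load.
    print(f"{ups_runtime_minutes(500, 1000):.1f} minutes")
```

The result only needs to exceed the generator start and transfer time by a comfortable margin, since the generators carry the load for longer outages.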

Power Redundancy Configurations

  • N: Basic capacity with no redundancy
  • N+1: Basic capacity plus one additional component
  • 2N: Fully redundant, two complete power paths
  • 2N+1: Two complete power paths plus one additional backup component
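
To make the notation concrete, the sketch below counts how many equally sized power units (for example UPS modules) each scheme implies for a given load. The function name and the example figures are hypothetical, and 2N+1 is read here as two full sets plus one spare unit; some sources interpret it differently.

```python
import math


def units_required(load_kw: float, unit_kw: float, scheme: str) -> int:
    """Number of power units to install under a given redundancy scheme."""
    if load_kw <= 0 or unit_kw <= 0:
        raise ValueError("load_kw and unit_kw must be positive")
    n = math.ceil(load_kw / unit_kw)  # N: units needed just to carry the load
    counts = {"N": n, "N+1": n + 1, "2N": 2 * n, "2N+1": 2 * n + 1}
    if scheme not in counts:
        raise ValueError(f"unknown scheme: {scheme}")
    return counts[scheme]


if __name__ == "__main__":
    # Hypothetical example: a 900 kW load served by 250 kW units.
    for scheme in ("N", "N+1", "2N", "2N+1"):
        print(scheme, units_required(900, 250, scheme))
```

The jump from N+1 to 2N roughly doubles the installed equipment, which is the cost side of the reliability trade-off.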

Cooling Redundancy

Data centres generate significant heat that must be removed efficiently:

  • Heating, Ventilation, and Air Conditioning (HVAC) systems

    • Control temperature, humidity, and air quality
    • Critical for equipment longevity and reliability
  • Cooling redundancy measures:

    • N+1 cooling: One extra cooling unit beyond required capacity
    • Multiple cooling technologies, so that a single failure mode does not disable all cooling:
      • Computer Room Air Conditioning (CRAC) units
      • Free cooling (using outside air when temperature permits)
      • In-row cooling (targeted cooling closer to heat sources)
    • Redundant cooling loops – pipes, heat exchangers, pumps
    • Hot/Cold aisle containment – prevents hot and cold air mixing

Advanced Cooling Technologies

  • Free cooling: Using outside air when temperature permits
  • Liquid cooling: Direct liquid cooling of components
  • Immersion cooling: Servers submerged in non-conductive liquid
  • Evaporative cooling: Using water evaporation to reduce temperatures

Design Standards and Tiers

The Uptime Institute defines four tiers of data centre reliability:

  1. Tier I: Basic Capacity

    • Single path for power and cooling
    • No redundant components
    • 99.671% availability (28.8 hours downtime/year)
  2. Tier II: Redundant Components

    • Single path for power and cooling
    • Redundant components
    • 99.741% availability (22.7 hours downtime/year)
  3. Tier III: Concurrently Maintainable

    • Multiple paths for power and cooling, only one active
    • Redundant components
    • 99.982% availability (1.6 hours downtime/year)
  4. Tier IV: Fault Tolerant

    • Multiple active paths for power and cooling
    • Redundant components
    • 99.995% availability (0.4 hours downtime/year)
    • Can withstand any single equipment failure without impact
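
The downtime figures follow directly from the availability percentages. A minimal sketch of the conversion, assuming a 365-day year (8,760 hours); the function name is illustrative:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def annual_downtime_hours(availability_percent: float) -> float:
    """Expected downtime per year, in hours, for a given availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR


if __name__ == "__main__":
    tiers = {"Tier I": 99.671, "Tier II": 99.741, "Tier III": 99.982, "Tier IV": 99.995}
    for tier, availability in tiers.items():
        print(f"{tier}: {annual_downtime_hours(availability):.1f} hours/year")
```

The same function can be used to compare a target availability from a service agreement against these tier figures.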

Sustainable Design Considerations

Modern data centre design increasingly incorporates sustainability features:

  • Energy-efficient equipment selection
  • Renewable energy sources (solar, wind, hydroelectric)
  • Heat recovery systems to repurpose waste heat
  • Water-efficient cooling technologies
  • Modular designs for efficient expansion
  • Smart monitoring systems to optimize resource usage

Real-world Implementation Challenges

Designing highly redundant data centres faces several challenges:

  • Cost vs. reliability tradeoffs
  • Physical space constraints
  • Regulatory and compliance requirements
  • Upgrading existing facilities
  • Integrating new technologies with legacy systems
  • Balancing performance and sustainability goals

Related: Cloud Sustainability - Carbon Footprint Frameworks, Cloud Sustainability - Measurement Granularities, Cloud System Design - High Availability