Data Centre Design

Data centres are the backbone of cloud computing, and their design plays a crucial role in ensuring sustainability, reliability, and efficiency. This note focuses on the infrastructure design aspects that enable dependable and sustainable data centre operations.

Data Centre Infrastructure Basics

A modern data centre consists of several key components:

Servers: Individual compute units, typically rack-mounted
Racks: Metal frames housing multiple servers
Cooling systems: Equipment to remove heat generated by servers
Power distribution systems: Deliver electricity to all equipment
Network infrastructure: Connects servers internally and to the outside world
Physical security systems: Control access to the facility

Designing for Hardware Redundancy

Geographic Redundancy

Definition: Distributing data centres across multiple geographic regions
Purpose: Mitigate impact of regional outages (natural disasters, power grid failures)
Implementation:
- Multiple data centres in different regions
- Data replication across regions
- Load balancing between regions
Benefit: Ensures continued operation even if an entire region goes offline
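
As a rough illustration of why adding regions helps, the sketch below estimates the chance that at least one region is available. It assumes regions fail independently, which real deployments only approximate, since shared software or a common provider can correlate failures. The function name `combined_availability` and the 99% per-region figure are illustrative assumptions, not values from any provider.

```python
def combined_availability(region_availability: float, regions: int) -> float:
    """Probability that at least one of `regions` independent regions is up."""
    if not 0.0 <= region_availability <= 1.0:
        raise ValueError("region_availability must be between 0 and 1")
    if regions < 1:
        raise ValueError("regions must be at least 1")
    # All regions must be down at the same time for the service to be down.
    all_down = (1.0 - region_availability) ** regions
    return 1.0 - all_down


if __name__ == "__main__":
    for count in (1, 2, 3):
        print(count, f"{combined_availability(0.99, count):.4%}")
```

Under the independence assumption, each added region multiplies the probability of a total outage by the per-region failure probability.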

Server Redundancy

Definition: Deploying servers in clusters with automatic failover mechanisms
Purpose: Ensure service availability despite individual server failures
Implementation:
- Server clusters managed by virtualization technology
- Automatic failover when hardware issues are detected
- N+1 or N+2 redundancy (extra servers beyond minimum requirements)
Benefit: Seamless operation during hardware failures
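
The failover idea above can be sketched as a heartbeat check: a server counts as healthy only if it has reported recently, and traffic goes to the first healthy server in priority order. This is a minimal sketch, not how any specific cluster manager works; the `Server` class, the `pick_active` function, and the 5-second timeout are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Server:
    name: str
    last_heartbeat: float  # timestamp in seconds of the last heartbeat received


def is_healthy(server: Server, now: float, timeout: float = 5.0) -> bool:
    """A server is healthy if its last heartbeat is no older than `timeout`."""
    return now - server.last_heartbeat <= timeout


def pick_active(servers: Sequence[Server], now: float,
                timeout: float = 5.0) -> Optional[Server]:
    """Return the first healthy server in priority order, or None if all failed."""
    for server in servers:
        if is_healthy(server, now, timeout):
            return server
    return None


if __name__ == "__main__":
    cluster = [Server("primary", last_heartbeat=0.0),
               Server("standby", last_heartbeat=9.0)]
    # At t=10 the primary has been silent for 10 s, so the standby takes over.
    print(pick_active(cluster, now=10.0).name)
```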

Storage Redundancy

Definition: Replicating data across multiple storage devices and technologies
Purpose: Prevent data loss due to disk or storage system failures
Implementation:
- RAID configurations to protect against disk failures
- Replication within and across data centres
- Multiple storage technologies (SSD, HDD, tape) for different tiers
Benefit: Data remains accessible and intact despite storage component failures
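
To make the RAID point concrete, the sketch below shows XOR parity, the mechanism used by parity-based levels such as RAID 5 to rebuild one lost disk. It is heavily simplified: real arrays stripe data and rotate parity across disks, which is ignored here. The helper name `xor_blocks` and the sample byte strings are illustrative assumptions.

```python
from functools import reduce
from typing import Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    if not blocks:
        raise ValueError("at least one block is required")
    if len({len(block) for block in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


if __name__ == "__main__":
    data_disks = [b"abcd", b"efgh", b"ijkl"]
    parity = xor_blocks(data_disks)

    # Simulate losing the second disk and rebuild it from the survivors.
    survivors = [data_disks[0], data_disks[2], parity]
    rebuilt = xor_blocks(survivors)
    print(rebuilt == data_disks[1])
```

Because XOR is its own inverse, combining the parity block with the surviving data blocks yields the missing block. A single parity block can only recover from one lost disk at a time.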

Network Redundancy

Reliable networking is critical for data centre operations. Redundancy is implemented at multiple levels:

Server-level Network Redundancy

Redundant Network Interface Cards (NICs) on each server
Two or more power supplies per server, so that a single supply failure does not take the server offline
Multiple network paths from each server

Network-level Redundancy

Redundant switches, routers, firewalls, and load balancers
Multiple connection paths between network devices
Diverse carrier connections for external connectivity

Link and Path-level Redundancy

Link aggregation: Multiple physical links between network devices
Spanning Tree Protocol (STP): Prevents network loops while maintaining redundancy
Equal-Cost Multi-Path (ECMP): Distributes traffic across multiple paths
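
ECMP is usually described as hashing a flow's identifying fields and using the result to pick one of the equal-cost paths, so that packets of the same flow stay on the same path. The sketch below illustrates that idea only. Real switches use hardware hash functions rather than SHA-256, and the `ecmp_pick` name, the flow tuple layout, and the path names are assumptions made for this example.

```python
import hashlib
from typing import Sequence, Tuple

# (source IP, destination IP, source port, destination port, protocol)
Flow = Tuple[str, str, int, int, str]


def ecmp_pick(flow: Flow, paths: Sequence[str]) -> str:
    """Map a flow to one of several equal-cost paths, deterministically."""
    if not paths:
        raise ValueError("at least one path is required")
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an integer and reduce it
    # modulo the number of available paths.
    index = int.from_bytes(digest[:8], "big") % len(paths)
    return paths[index]


if __name__ == "__main__":
    paths = ["spine-1", "spine-2", "spine-3", "spine-4"]
    flow = ("10.0.0.1", "10.0.1.7", 40000, 443, "tcp")
    print(ecmp_pick(flow, paths))
```

Hashing per flow rather than per packet avoids reordering within a connection, at the cost of uneven load when a few flows carry most of the traffic.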

Network Topologies for Redundancy

Hierarchical/3-tier topology:
- Access layer (connects to servers)
- Aggregation layer (connects access switches)
- Core layer (high-speed backbone)
- Redundant connections between layers
Fat-tree/Clos topology:
- Non-blocking architecture
- Multiple equal-cost paths between any two servers
- Better scalability and fault tolerance than traditional hierarchical designs
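
The scalability claim can be illustrated with the commonly described k-ary fat-tree, built entirely from identical k-port switches. The sketch below computes its size under that construction; it is an assumption that this is the variant the note has in mind, and the function name `fat_tree_size` and the dictionary keys are hypothetical.

```python
def fat_tree_size(k: int) -> dict:
    """Element counts for a k-ary fat-tree built from k-port switches."""
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even number of at least 2")
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,         # k/2 edge switches per pod
        "aggregation_switches": k * half,  # k/2 aggregation switches per pod
        "core_switches": half * half,      # (k/2)^2 core switches
        "hosts": k * half * half,          # k^3/4 hosts in total
        # Each core switch provides one shortest path between two pods.
        "inter_pod_paths": half * half,
    }


if __name__ == "__main__":
    for ports in (4, 48):
        print(ports, fat_tree_size(ports))
```

Host count grows with the cube of the port count, so 48-port switches support 27,648 hosts, with 576 equal-cost paths between hosts in different pods.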

Power Redundancy

Data centres require a constant, reliable power supply to function:

Multiple power feeds from different utility substations
Uninterruptible Power Supplies (UPS) for temporary outages
- Battery systems that provide immediate power during utility failures
- Typically designed to carry the load for minutes (occasionally hours), bridging the gap until backup generators take over
Backup generators for medium/long-term outages
- Diesel or natural gas powered
- Automatically start when utility power fails
- Sized to power the entire facility for days
Power Distribution Units (PDUs) with dual power inputs
- Ensure continuous rack power
- Allow maintenance of one power path without downtime

Power Redundancy Configurations

N: Basic capacity with no redundancy
N+1: Basic capacity plus one additional component
2N: Fully redundant, two complete power paths
2N+1: Fully redundant with additional backup
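
As a small worked example of these configurations, the sketch below computes how many units (UPS modules, generators, and so on) each scheme installs when N units are needed to carry the load. The function name `units_installed` is a hypothetical helper, and the mapping treats 2N+1 as two full sets plus one spare unit.

```python
def units_installed(n: int, scheme: str) -> int:
    """Number of units installed when `n` units are needed to carry the load."""
    if n < 1:
        raise ValueError("n must be at least 1")
    schemes = {
        "N": n,             # no spare capacity
        "N+1": n + 1,       # one spare unit
        "2N": 2 * n,        # two complete, independent sets
        "2N+1": 2 * n + 1,  # two complete sets plus one spare
    }
    if scheme not in schemes:
        raise ValueError(f"unknown scheme: {scheme}")
    return schemes[scheme]


if __name__ == "__main__":
    for name in ("N", "N+1", "2N", "2N+1"):
        print(name, units_installed(4, name))
```

With N = 4, the schemes install 4, 5, 8, and 9 units respectively, which shows how quickly cost rises with the level of redundancy.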

Cooling Redundancy

Data centres generate significant heat that must be removed efficiently:

Heating, Ventilation, and Air Conditioning (HVAC) systems
- Control temperature, humidity, and air quality
- Critical for equipment longevity and reliability
Cooling redundancy measures:
- N+1 cooling: One extra cooling unit beyond required capacity
- Multiple cooling technologies to mitigate failure modes
  - Computer Room Air Conditioning (CRAC) units
  - Free cooling (using outside air when temperature permits)
  - In-row cooling (targeted cooling closer to heat sources)
- Redundant cooling loops – pipes, heat exchangers, pumps
- Hot/Cold aisle containment – prevents hot and cold air mixing
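
An N+1 cooling design can be checked with simple arithmetic: after the largest unit fails, the remaining units must still cover the heat load. The sketch below does that check. The function name `survives_single_failure` and the kW figures are illustrative assumptions, and the model ignores airflow and placement effects.

```python
from typing import Sequence


def survives_single_failure(unit_capacities_kw: Sequence[float],
                            heat_load_kw: float) -> bool:
    """True if the load is still covered after losing the largest cooling unit."""
    if not unit_capacities_kw:
        return False
    # Worst case for a single failure is losing the unit with the most capacity.
    remaining = sum(unit_capacities_kw) - max(unit_capacities_kw)
    return remaining >= heat_load_kw


if __name__ == "__main__":
    print(survives_single_failure([100, 100, 100, 100], heat_load_kw=300))
    print(survives_single_failure([100, 100, 100], heat_load_kw=300))
```

In the first case three units are required and four are installed, so the design meets N+1. In the second there is no spare unit, so any failure leaves the room under-cooled.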

Advanced Cooling Technologies

Free cooling: Using outside air when temperature permits
Liquid cooling: Direct liquid cooling of components
Immersion cooling: Servers submerged in non-conductive liquid
Evaporative cooling: Using water evaporation to reduce temperatures

Design Standards and Tiers

The Uptime Institute defines four tiers of data centre reliability. The availability percentages listed below are the figures commonly cited for each tier:

Tier I: Basic Capacity
- Single path for power and cooling
- No redundant components
- 99.671% availability (28.8 hours downtime/year)
Tier II: Redundant Components
- Single path for power and cooling
- Redundant components
- 99.741% availability (22.7 hours downtime/year)
Tier III: Concurrently Maintainable
- Multiple paths for power and cooling, only one active
- Redundant components
- 99.982% availability (1.6 hours downtime/year)
Tier IV: Fault Tolerant
- Multiple active paths for power and cooling
- Redundant components
- 99.995% availability (0.4 hours downtime/year)
- Can withstand any single equipment failure without impact
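
The downtime figures follow directly from the availability percentages. The sketch below shows the conversion, assuming a 365-day year (8,760 hours); the function name `annual_downtime_hours` is a hypothetical helper.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability_percent: float) -> float:
    """Expected hours of downtime per year for a given availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * HOURS_PER_YEAR


if __name__ == "__main__":
    for tier, availability in (("Tier I", 99.671), ("Tier III", 99.982),
                               ("Tier IV", 99.995)):
        print(tier, round(annual_downtime_hours(availability), 1))
```

For example, 99.982% availability leaves 0.018% of 8,760 hours, or roughly 1.6 hours of downtime per year.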

Sustainable Design Considerations

Modern data centre design increasingly incorporates sustainability features:

Energy-efficient equipment selection
Renewable energy sources (solar, wind, hydroelectric)
Heat recovery systems to repurpose waste heat
Water-efficient cooling technologies
Modular designs for efficient expansion
Smart monitoring systems to optimize resource usage

Real-world Implementation Challenges

Designing highly redundant data centres involves several challenges:

Cost vs. reliability tradeoffs
Physical space constraints
Regulatory and compliance requirements
Upgrading existing facilities
Integrating new technologies with legacy systems
Balancing performance and sustainability goals
