Course

Fundamentals

  • Cloud Computing Introduction

    Cloud computing represents a paradigm shift in how computing resources are delivered, managed, and consumed. It provides on-demand access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort.

    What is Cloud Computing?

    According to the NIST Cloud Definition, cloud computing is:

    “A model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.”

    Evolution of Distributed Computing

    The evolution of distributed computing can be traced through several major paradigms:

    1. Clusters - Locally connected homogeneous computers
    2. Grids - Loosely coupled, widely distributed heterogeneous resources
    3. Clouds - IT resources delivered as a utility
    4. Edge and Fog Computing - Cloud services in closer proximity to users and devices

    Key Enablers of Cloud Computing

    Virtualization

    Virtualization is the core technology that enables cloud computing by abstracting physical resources into logical units that can be provisioned on-demand. It allows:

    • Sharing of physical resources among multiple users
    • Isolation between different workloads
    • Rapid provisioning and deprovisioning of resources

    The main virtualization approaches in the cloud are full virtualization, OS-assisted (para)virtualization, hardware-assisted virtualization, and OS-level virtualization (containers); each is covered in detail in later sections.

    Resource Pooling and Multi-tenancy

    Cloud providers maintain large pools of resources that are dynamically allocated to customers, creating economies of scale and high utilization rates.

    Automation and Self-service

    Cloud systems provide automated interfaces (APIs and web portals) that allow users to provision and manage resources without human intervention from the provider.

    Elasticity and Scalability

    Cloud resources can scale up or down based on demand, creating the illusion of infinite resources while optimizing resource usage.

    Challenges for Cloud Providers

    Cloud providers face several key challenges:

    • Rapid provisioning of resources without human interaction
    • Creating the illusion of infinite resources while managing data centers efficiently
    • Maintaining isolation between different users
    • Delivering consistent performance despite resource sharing
  • NIST Cloud Definition

    The National Institute of Standards and Technology (NIST) has provided the most widely accepted definition of cloud computing, which has become the standard reference in both industry and academia.

    Definition

    According to NIST Special Publication 800-145:

    “Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.”

    The Three Dimensions of Cloud Computing

    NIST defines cloud computing along three major dimensions:

    1. Five Essential Characteristics
    2. Three Service Models
    3. Four Deployment Models

    Five Essential Characteristics

    1. On-demand self-service: Computing capabilities can be provisioned automatically without requiring human interaction with service providers.

    2. Broad network access: Capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous client platforms.

    3. Resource pooling: The provider’s computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand.

    4. Rapid elasticity: Capabilities can be elastically provisioned and released, in some cases automatically, to scale rapidly outward and inward commensurate with demand.

    5. Measured service: Cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service.

    Three Service Models

    1. Software as a Service (SaaS): The consumer uses the provider’s applications running on a cloud infrastructure. Applications are accessible from various client devices through either a thin client interface or a program interface.

    2. Platform as a Service (PaaS): The consumer deploys consumer-created or acquired applications onto the cloud infrastructure using programming languages, libraries, services, and tools supported by the provider.

    3. Infrastructure as a Service (IaaS): The provider provisions processing, storage, networks, and other fundamental computing resources where the consumer can deploy and run arbitrary software, including operating systems and applications.

    Four Deployment Models

    1. Private Cloud: The cloud infrastructure is provisioned for exclusive use by a single organization comprising multiple consumers.

    2. Community Cloud: The cloud infrastructure is provisioned for exclusive use by a specific community of consumers from organizations that have shared concerns.

    3. Public Cloud: The cloud infrastructure is provisioned for open use by the general public.

    4. Hybrid Cloud: The cloud infrastructure is a composition of two or more distinct cloud infrastructures (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology.

  • Clusters vs Grids vs Clouds

    The evolution of distributed computing systems has progressed through various paradigms, each building on the previous while addressing different needs and use cases.

    Clusters

    A cluster is a group of computers that work together as a unified computing resource.

    Key Characteristics:

    • Homogeneity: Clusters typically consist of similar or identical hardware and software systems
    • Network: Connected via high-speed, low-latency local area networks
    • Management: Centrally managed as a single system
    • Purpose: Improve availability, resource utilization, and price/performance ratio

    Examples:

    • HPC (High-Performance Computing) clusters used in scientific research
    • Analytics clusters at large tech companies (Google, Microsoft, Meta, Alibaba, Amazon)
    • Load-balanced web server clusters
    • Database clusters for high availability

    Use Cases:

    • Compute-intensive scientific simulations
    • Big data analytics
    • High-availability services

    Grids

    Grid computing connects distributed, heterogeneous computing resources across organizational boundaries to solve larger problems.

    Key Characteristics:

    • Heterogeneity: Diverse hardware and software resources across different administrative domains
    • Distribution: Resources are geographically distributed and connected via wide-area networks (internet)
    • Standardization: Middleware provides standardized interfaces to access diverse resources
    • Sharing: Resources are shared across organizations for common goals

    Examples:

    • Worldwide LHC (Large Hadron Collider) Computing Grid (WLCG)
    • Berkeley Open Infrastructure for Network Computing (BOINC)
    • Earth System Grid Federation (ESGF)

    Use Cases:

    • Large-scale scientific research
    • Distributed data analysis
    • Volunteer computing projects

    Clouds

    Cloud computing provides on-demand access to shared pools of configurable computing resources delivered as a service over a network.

    Key Characteristics:

    • On-Demand Self-Service: Users can provision resources without human interaction from providers
    • Utility Model: Pay-as-you-go pricing, similar to electricity or water utilities
    • Resource Pooling: Multi-tenancy with dynamic resource allocation
    • Elasticity: Ability to scale resources up or down rapidly
    • Measured Service: Resource usage is monitored, controlled, and reported

    Examples:

    • Amazon Web Services (AWS)
    • Microsoft Azure
    • Google Cloud Platform
    • IBM Cloud
    • Oracle Cloud

    Use Cases:

    • Web applications and services
    • Enterprise IT infrastructure
    • Development and testing environments
    • Data storage and backup
    • High-availability and disaster recovery

    Comparison

    | Feature | Clusters | Grids | Clouds |
    | --- | --- | --- | --- |
    | Ownership | Single organization | Multiple organizations | Service providers or organizations |
    | Hardware | Homogeneous | Heterogeneous | Heterogeneous (abstracted) |
    | Location | Co-located | Geographically distributed | Data centers (abstracted from users) |
    | Management | Centralized | Distributed | Centralized for each provider |
    | Scalability | Limited by physical resources | Limited by participating resources | Highly elastic (appears unlimited) |
    | Access | Local network, specific interfaces | Grid middleware, certificates | Standard web protocols, APIs |
    | Business Model | Capital expenditure | Collaborative | Operational expenditure (utility) |
    | Virtualization | Limited | Limited | Extensive |

    Evolution and Relationship

    These paradigms represent an evolution in distributed computing, with each building on concepts from previous approaches:

    • Clusters provided the foundation for resource pooling and unified management
    • Grids extended this to distributed resources across organizations
    • Clouds added virtualization, elasticity, and the utility model

    While clouds have become dominant for many use cases, clusters and grids continue to serve specific purposes, especially in scientific and research computing.


Virtualization

  • Virtualization Fundamentals

    Virtualization is the foundation that enables cloud computing by abstracting physical resources into logical units that can be provisioned on-demand.

    Definition

    According to NIST Special Publication 800-125:

    “Virtualization is the simulation of the software and/or hardware upon which other software runs. This simulated environment is called a virtual machine (VM).”

    In other words, virtualization creates an abstraction layer that transforms a real (physical) system so it appears as a different virtual system or as multiple virtual systems.

    Key Concepts

    • Host System: The physical hardware and software on which virtualization is implemented
    • Guest System: The virtual system that runs on the host
    • Hypervisor/VMM (Virtual Machine Monitor): Software that creates and manages virtual machines

    Formal Definition

    Virtualization can be formally defined through an isomorphism V that maps the guest state to the host state:

    • For each sequence of operations e that modifies the guest’s state from Si to Sj
    • There exists a corresponding sequence of operations e’ that performs the equivalent modification of the host’s state, from S’i = V(Si) to S’j = V(Sj)
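
    In symbols (a compact restatement of the definition above using the mapping V; this formulation is a sketch, not quoted from the source):

    % V maps guest states to host states; executing e in the guest and then mapping
    % with V gives the same host state as mapping first and then executing e'.
    e(S_i) = S_j \;\Longrightarrow\; e'(V(S_i)) = V(S_j), \qquad S'_i = V(S_i),\; S'_j = V(S_j)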

    Categories of Virtualization

    Virtualization technologies can be categorized into three main types:

    1. Process Virtualization

    • Creates a virtual environment for individual applications
    • Examples: Java Virtual Machine (JVM), Common Language Runtime (.NET/Mono)
    • Used for platform independence and sandboxing

    2. OS-Level Virtualization

    • Creates isolated environments (containers) within an operating system
    • Examples: Linux Containment Features, Docker, FreeBSD Jails
    • Used for application isolation and packaging

    3. System Virtualization

    Creates complete virtual machines with virtualized hardware

    • Emulation: Complete software emulation of hardware (e.g., QEMU, Bochs)
    • Full Virtualization: Virtualization where the guest OS is unmodified (e.g., VMware Workstation, VirtualBox)
    • OS-Assisted Virtualization: Virtualization where the guest OS is modified to cooperate with the hypervisor (e.g., Xen)
    • Hardware-Assisted Virtualization: Virtualization leveraging special CPU features (e.g., KVM, Hyper-V)

    Types of Hypervisors

    Type 1 (Bare-Metal Hypervisors)

    • Run directly on hardware
    • Examples: VMware ESXi, Xen, Microsoft Hyper-V, KVM
    • More efficient, better performance
    • Require special device drivers

    Type 2 (Hosted Hypervisors)

    • Run as an application on a host operating system
    • Examples: VMware Workstation, Oracle VirtualBox, QEMU
    • Less efficient but more flexible
    • Can use the host OS device drivers
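
    On a Linux host, a quick and purely illustrative way to check whether the CPU exposes hardware virtualization extensions and whether the KVM modules are loaded is:

    # Count CPU flags indicating Intel VT-x (vmx) or AMD-V (svm); non-zero means support is present
    grep -cE 'vmx|svm' /proc/cpuinfo

    # Check whether the KVM kernel modules are loaded
    lsmod | grep kvm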

    Importance in Cloud Computing

    Virtualization is critical for cloud computing because it enables:

    1. Resource pooling: Physical resources can be shared among multiple users
    2. Isolation: Different users’ workloads can run on the same hardware without interfering with each other
    3. Rapid provisioning: Virtual resources can be created, modified, or deleted quickly
    4. Elasticity: The ability to scale resources up or down based on demand
    5. Efficient resource utilization: Higher utilization rates of physical hardware

    Challenges

    • Performance overhead: Virtualization introduces some performance penalties
    • Security concerns: Potential for VM escape vulnerabilities
    • Resource management: Allocation and scheduling of resources among VMs
    • Complexity: Additional layer in the system architecture
  • Virtual Machines

    A Virtual Machine (VM) is a software-based emulation of a physical computer that can run an operating system and applications as if they were running on physical hardware.

    Definition

    A virtual machine provides an environment that is logically separated from the underlying physical hardware. The hardware elements (CPU, memory, storage, network) presented to the VM are abstract and virtualized, allowing multiple VMs to share physical resources while maintaining isolation.

    Key Components

    Hypervisor (Virtual Machine Monitor)

    The hypervisor is the software layer that enables the creation and management of virtual machines:

    • Type 1 Hypervisors: Run directly on hardware (bare-metal)
      • Examples: VMware ESXi, Microsoft Hyper-V, Xen, KVM
      • More efficient, commonly used in data centers and cloud environments
    • Type 2 Hypervisors: Run on top of a host operating system
      • Examples: VMware Workstation, Oracle VirtualBox, QEMU
      • Common for desktop virtualization and development environments

    Guest Operating System

    The operating system that runs inside the VM, which can be different from the host system.

    Virtual Hardware

    Virtualized components presented to the VM:

    • Virtual CPUs (vCPUs)
    • Virtual RAM
    • Virtual Disks
    • Virtual Network Interfaces
    • Virtual I/O devices

    VM Images

    Templates containing the VM configuration and virtual disk content:

    • Pre-configured operating systems and applications
    • Stored as files on the host system
    • Can be used to rapidly deploy new VMs
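
    As an illustrative sketch (assuming QEMU/KVM is installed; the file names are hypothetical), a disk image can be created and used to boot a new VM like this:

    # Create a 20 GB copy-on-write disk image in qcow2 format
    qemu-img create -f qcow2 ubuntu-vm.qcow2 20G

    # Boot a VM from an installation ISO, using the image as its virtual disk
    qemu-system-x86_64 -enable-kvm -m 2048 -smp 2 \
      -drive file=ubuntu-vm.qcow2,format=qcow2 \
      -cdrom ubuntu-20.04.iso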

    Virtualizability

    For a system to be efficiently virtualized, certain conditions must be met. Popek and Goldberg’s theorem states:

    “A virtual machine monitor may be constructed if the set of sensitive instructions for that computer is a subset of the set of privileged instructions.”

    Where:

    • Privileged instructions: Instructions that can only execute in system mode
    • Sensitive instructions: Instructions that could affect system resources or behave differently based on system state

    This theorem is the foundation for understanding the challenges in virtualizing architectures like x86.

    Virtualization Approaches

    Different approaches to virtualization have emerged to address architectural challenges:

    1. Full Virtualization: Guest OS runs unmodified, unaware it’s being virtualized

      • May require techniques like binary translation to handle non-virtualizable instructions
    2. OS-Assisted Virtualization: Guest OS is modified to cooperate with the hypervisor

      • Example: Xen paravirtualization
      • Better performance but requires modified guest OS
    3. Hardware-Assisted Virtualization: Uses CPU extensions that support virtualization

      • Examples: Intel VT-x, AMD-V
      • Enables efficient virtualization with unmodified guest OSes

    Use Cases for Virtual Machines

    1. Running different operating systems than the host system
    2. Operating multiple isolated environments on a single host
    3. Resource pooling for multiple users and applications in private clouds
    4. Infrastructure as a Service (IaaS) in public clouds like AWS EC2

    Performance Considerations

    Virtual machines introduce some overhead compared to bare-metal execution:

    • CPU virtualization overhead
    • Memory management overhead (especially with shadow page tables)
    • I/O virtualization overhead
    • Context switches between guest and hypervisor

    VM Pausing vs Suspending

    Suspending:

    • The full VM state is written to disk, so only disk resources (and any reserved networking resources) remain in use
    • Resuming takes little time (much less than booting)

    Pausing:

    • Only CPU activity is halted: the VM does not run, but it still occupies main memory (and other resources)
    • Resuming takes very little time (less than resuming a suspended VM)
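
    For example, libvirt’s virsh tool exposes both operations (sketch only; the domain name demo-vm is hypothetical). Note that virsh calls pausing “suspend” and suspending-to-disk “managedsave”:

    # Pause: halt the vCPUs; the VM state stays in main memory
    virsh suspend demo-vm
    virsh resume demo-vm

    # Suspend: save the full VM state to disk and release memory
    virsh managedsave demo-vm
    virsh start demo-vm      # restores the saved state instead of booting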
  • Full Virtualization

    Full virtualization is a virtualization technique where the virtual machine simulates enough hardware to allow an unmodified guest operating system to run in isolation. In full virtualization, the guest OS is completely unaware that it is being virtualized and requires no modifications.

    Key Characteristics

    • Guest operating system runs unmodified
    • No modifications to the guest OS source code or binaries
    • Complete isolation between guest and host
    • Higher resource overhead compared to other virtualization techniques

    Challenges with x86 Architecture

    The x86 architecture presented significant challenges for full virtualization because it doesn’t satisfy the requirements of Popek and Goldberg’s theorem:

    • Some sensitive instructions don’t trap when executed in user mode
    • These “critical instructions” prevent traditional trap-and-emulate virtualization

    Binary Translation

    To overcome these challenges, virtualization systems like VMware developed binary translation:

    How Binary Translation Works

    1. Dynamic Code Analysis:

      • The VMM analyzes the guest OS code at runtime
      • Identifies sequences of instructions (translation units)
      • Looks for critical instructions in these units
    2. Code Replacement:

      • Critical instructions are replaced with alternative code that:
        • Achieves the same functionality
        • Allows the VMM to maintain control
        • May include explicit calls to the VMM
    3. Translation Cache:

      • Modified code blocks are stored in a translation cache
      • Frequently executed code benefits from this caching
      • Translation is done lazily (only when needed)
    4. Direct Execution:

      • Non-critical, unprivileged instructions run directly on the CPU
      • This minimizes performance overhead for regular code

    Memory Management in Full Virtualization

    Shadow Page Tables

    To handle memory virtualization, full virtualization uses shadow page tables:

    1. Guest OS maintains its own page tables (logical to “physical” mapping)
    2. VMM maintains shadow page tables (logical to actual physical mapping)
    3. When guest modifies its page tables, operations trap to the VMM
    4. VMM updates shadow page tables accordingly
    5. The hardware MMU uses the shadow page tables for actual translation

    This creates two levels of address translation:

    • Guest virtual address → Guest physical address
    • Guest physical address → Host physical address

    Shadow page tables combine these translations for efficiency.
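
    Schematically (a sketch, with gPT denoting the guest’s page table mapping and p2m the VMM’s guest-physical-to-host-physical mapping):

    % The shadow page table caches the composition of the two translations:
    \mathrm{shadow}(va) \;=\; \mathrm{p2m}\bigl(\mathrm{gPT}(va)\bigr)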

    I/O Virtualization in Full Virtualization

    Several approaches exist for I/O virtualization:

    1. Device Emulation:

      • VMM presents virtual devices to the guest
      • Common devices emulated include disk controllers, network cards, etc.
      • Guest uses standard drivers for these virtual devices
    2. Device Driver Interception:

      • VMM intercepts calls to virtual device drivers
      • Redirects to corresponding physical devices
    3. Device Passthrough:

      • Direct assignment of physical devices to VMs
      • Requires hardware support (IOMMU)
      • Offers better performance but limits device sharing

    Performance Implications

    Full virtualization has performance implications:

    • CPU overhead for binary translation
    • Memory overhead for shadow page tables
    • I/O performance degradation due to interception and emulation
    • High context switching overhead for privileged operations

    Examples of Full Virtualization

    • VMware Workstation
    • Oracle VirtualBox
    • Microsoft Virtual PC
    • QEMU (when used without KVM)

    Advantages and Disadvantages

    Advantages

    • No modification to guest OS required
    • Can run any operating system designed for the same architecture
    • Complete isolation between VMs

    Disadvantages

    • Performance overhead, especially for I/O operations
    • Complex implementation (especially binary translation)
    • Higher memory usage due to shadow page tables
  • OS-Assisted Virtualization

    OS-assisted virtualization, also known as paravirtualization, is a virtualization technique where the guest operating system is modified to be aware that it is running in a virtualized environment. This approach allows the guest OS to cooperate with the hypervisor to achieve better performance than full virtualization, especially on architectures that don’t perfectly satisfy Popek and Goldberg’s Theorem.

    Key Concept

    The fundamental idea of OS-assisted virtualization is to:

    Make the guest OS aware that it is being virtualized and modify it to directly communicate with the hypervisor, avoiding the need for complex techniques like binary translation or hardware extensions.

    How OS-Assisted Virtualization Works

    1. The guest OS is modified to replace non-virtualizable instructions with explicit calls to the hypervisor (hypercalls)
    2. The guest OS is aware it doesn’t have direct access to physical hardware
    3. The hypervisor provides an API that the modified guest OS uses for privileged operations
    4. The guest still maintains its device drivers, memory management, and process scheduling, but in coordination with the hypervisor

    Xen: A Classic Example

    Xen is the most well-known example of OS-assisted virtualization:

    Xen Architecture

    • Hypervisor: A thin layer running directly on hardware (Type 1)
    • Domain 0 (dom0): Privileged guest for control and management
    • Domain U (domU): Unprivileged guest domains with Xen-aware OS

    Xen uses a ring structure for privileges:

    • Hypervisor runs in Ring 0 (most privileged)
    • Guest OS kernels run in Ring 1
    • Guest applications run in Ring 3 (least privileged)

    CPU Virtualization in Xen

    • Guest OS is modified to run in Ring 1 instead of Ring 0
    • Critical instructions are replaced with hypercalls
    • Hypercalls are explicit calls from the guest OS to the hypervisor
    • System calls from applications to the guest OS can sometimes bypass the hypervisor for better performance

    Memory Management in Xen

    Xen’s approach to memory management is distinctive:

    • Physical memory is statically partitioned among domains at creation time
    • Each domain is aware of its physical memory allocation
    • Domains maintain their own page tables, validated by the hypervisor
    • The guest page tables are used directly by the hardware MMU
    • Updates to page tables require hypervisor validation to ensure isolation
    • No shadow page tables are needed (unlike in Full Virtualization)

    I/O Virtualization in Xen

    Xen provides virtual devices through a split-driver model:

    • Front-end drivers in guest domains (domU)
    • Back-end drivers in the privileged domain (dom0)
    • Communication through shared memory and event channels
    • Physical device drivers reside in dom0

    Performance Advantages

    OS-assisted virtualization offers several performance advantages:

    1. No need for binary translation or instruction emulation
    2. Direct memory management without shadow page tables
    3. More efficient I/O through paravirtualized drivers
    4. Reduced context switching overhead
    5. Explicit cooperation between guest and hypervisor

    Limitations

    Despite its performance benefits, OS-assisted virtualization has limitations:

    1. Requires guest OS modifications: Source code access and modification is necessary
    2. Limited OS support: Only OSes that have been specifically modified can run
    3. Maintenance burden: Modified OSes must be maintained separately from mainline versions
    4. Porting effort: Each new OS version requires porting effort

    Comparison with Other Approaches

    When compared to other virtualization techniques, OS-assisted virtualization avoids the binary translation of Full Virtualization and does not depend on the CPU extensions used by Hardware-Assisted Virtualization, but it is the only approach that requires a modified guest OS.

  • Hardware-Assisted Virtualization

    Hardware-assisted virtualization refers to virtualization techniques that leverage special processor features designed specifically to support virtual machines. These hardware extensions were introduced to overcome the limitations of x86 architecture that made it difficult to efficiently virtualize according to Popek and Goldberg’s Theorem.

    Background

    The classic x86 architecture contained about 17 “critical instructions” (sensitive but not privileged) that prevented efficient virtualization. To address this issue, both Intel and AMD independently developed hardware virtualization extensions:

    • Intel VT-x (Intel Virtualization Technology for x86)
    • AMD-V (AMD Virtualization)

    These technologies were introduced in 2005-2006 and have since evolved to include more advanced features.

    Core Concepts

    CPU Virtualization Extensions

    The primary innovation in hardware-assisted virtualization is the introduction of new CPU modes:

    • Root Mode: Where the VMM/hypervisor runs
    • Non-root Mode: Where guest OSes run (called “guest mode”)

    This creates a higher privilege level for the hypervisor than even Ring 0, allowing guest OSes to run in their expected privilege rings while still being controlled by the hypervisor.

    The transitions between these modes are:

    • VM Entry: Transition from root mode to non-root mode
    • VM Exit: Transition from non-root mode to root mode

    VMM Control Structures

    The CPU maintains control structures for each virtual machine:

    • Intel VMCS (Virtual Machine Control Structure)
    • AMD VMCB (Virtual Machine Control Block)

    These structures contain:

    • Guest state (register values, control registers, etc.)
    • Host state (to be restored on VM Exit)
    • Execution controls (what events cause VM Exits)
    • Exit information (why a VM Exit occurred)

    Key Mechanisms

    1. Control Registers:

      • Special CPU registers that determine VM Exit conditions
      • Allow fine-grained control over which events trap to the hypervisor
    2. Extended Page Tables / Nested Page Tables:

      • Intel EPT / AMD NPT
      • Hardware support for two-level address translation
      • Eliminates shadow page table overhead
    3. Tagged TLBs:

      • Associate TLB entries with specific address spaces
      • Avoid TLB flushes on context switches between VMs
    4. IOMMU (I/O Memory Management Unit):

      • Intel VT-d / AMD-Vi
      • Provides DMA remapping and interrupt remapping
      • Enables safe direct device assignment to VMs

    Memory Virtualization Extensions

    One significant advancement in hardware-assisted virtualization is the support for nested paging:

    Extended Page Tables (EPT) / Nested Page Tables (NPT)

    • Hardware manages two levels of address translation:
      • Guest Virtual Address → Guest Physical Address
      • Guest Physical Address → Host Physical Address
    • Translation performed in hardware rather than software
    • Significantly reduces VMM interventions for memory operations
    • Eliminates the need for shadow page tables

    I/O Virtualization Extensions

    Hardware-assisted I/O virtualization focuses on enabling direct device assignment:

    IOMMU (I/O Memory Management Unit)

    • Allows VMs to directly access hardware devices
    • Provides memory protection from DMA operations
    • Handles interrupt routing to appropriate VMs
    • Enables SR-IOV (Single Root I/O Virtualization)
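
    On Linux, the presence of these extensions can be checked in a similarly illustrative way (flag names differ between vendors, and the output depends on the machine):

    # Second-level address translation: 'ept' (Intel) or 'npt' (AMD) appears among the CPU flags
    grep -m1 -oE 'ept|npt' /proc/cpuinfo

    # A populated iommu_groups directory indicates an active IOMMU (VT-d / AMD-Vi)
    ls /sys/kernel/iommu_groups/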

    Performance Benefits

    Hardware-assisted virtualization offers several performance advantages:

    1. Reduced VMM intervention:

      • Critical instructions automatically trap to the hypervisor
      • No need for binary translation
    2. Efficient memory management:

      • Hardware-accelerated address translation
      • No overhead of shadow page tables
    3. Direct I/O access:

      • Near-native I/O performance
      • Reduced overhead for I/O-intensive workloads
    4. Lower context switching cost:

      • Hardware-assisted state transitions between host and guest

    Examples of Hardware-Assisted Virtualization

    Several hypervisors leverage these hardware extensions:

    • KVM (Kernel-based Virtual Machine)
    • Microsoft Hyper-V
    • VMware ESXi (in addition to other techniques)
    • Xen (when running unmodified guests)

    Advantages and Disadvantages

    Advantages

    • Unmodified guest OSes can run efficiently
    • Significantly better performance than pure software virtualization
    • Near-native performance for many workloads
    • Simplified hypervisor implementation

    Disadvantages

    • Requires specific hardware support
    • Different implementations between CPU vendors
    • Older hardware lacks these extensions
    • Still some overhead compared to native execution
  • VMs vs Containers

    Virtual Machines (VMs) and containers are both virtualization technologies that enable software to run in isolated environments, but they differ significantly in their architecture, resource usage, performance characteristics, and use cases.

    Architectural Differences

    Virtual Machines

    • Level of Virtualization: Hardware-level virtualization
    • Components:
      • Hypervisor (VMM) running on physical hardware
      • Complete guest OS for each VM
      • Virtualized hardware for each VM
      • Applications running on the guest OS
    • Isolation: Strong isolation at the hardware level
    • Resource Allocation: Dedicated virtual hardware resources

    Containers

    • Level of Virtualization: OS-level virtualization
    • Components:
      • Host OS running on physical hardware
      • Container runtime (e.g., Docker)
      • Application and its dependencies
      • Shared OS kernel
    • Isolation: Process-level isolation using OS features (namespaces, cgroups)
    • Resource Allocation: Shared OS kernel, isolated user space

    Performance Comparison

    Based on benchmarking studies, containers and VMs show different performance characteristics across several dimensions:

    CPU Performance

    • Both VMs and containers show minimal overhead for CPU-intensive workloads (1-5%)
    • VMs may have slightly higher overhead due to virtualization layer

    Memory Access

    • Containers: Near-native memory access performance
    • VMs: Similar random access performance but slightly lower sequential access bandwidth
    • Memory management overhead is higher in VMs due to virtualized memory management units and shadow page tables

    Network Performance

    • Containers: Lower latency and higher throughput than VMs
    • VMs: Additional overhead due to virtual network devices
    • Docker NAT can increase latency for containers

    Disk I/O

    • Containers: Better I/O performance than VMs, especially for random I/O
    • VMs: Higher latency due to virtual I/O devices
    • Both have similar throughput for sequential operations

    Boot Time

    • Containers: Start in seconds (typically 1-5 seconds)
    • VMs: Start in tens of seconds to minutes (typically 30-60+ seconds)

    Resource Overhead

    • Containers: Minimal overhead (MBs)
    • VMs: Significant overhead (GBs for each VM)

    Image Size & Startup Time

    Image Size

    • VM Images:
      • Typically gigabytes in size (e.g., 5-20GB)
      • Contain entire operating system
      • Include all libraries and binaries
    • Container Images:
      • Typically megabytes in size (e.g., 10-300MB)
      • Only include application and dependencies
      • Share the host OS kernel

    Startup Time

    • VM Startup:
      • Operating system boot process
      • Initialization of all OS services
      • Typically takes 30+ seconds
    • Container Startup:
      • No OS boot required
      • Application process start only
      • Typically takes milliseconds to seconds
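
    Container startup latency is easy to observe directly (illustrative; assumes Docker is installed and the alpine image has already been pulled):

    # Time how long it takes to start a container, run a no-op command, and tear it down
    time docker run --rm alpine true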

    Isolation & Security

    Virtual Machines

    • Stronger Isolation: Complete separation at hardware level
    • Security Benefits:
      • Hardware-enforced boundaries
      • Separate kernel instances
      • Vulnerabilities in one VM don’t affect others
      • Hypervisor provides additional security layer
    • Attack Surface:
      • Smaller attack surface (hypervisor code is much smaller than OS kernel)
      • VM escape vulnerabilities are rare

    Containers

    • Weaker Isolation: Process-level isolation within same OS
    • Security Concerns:
      • Shared kernel between containers
      • Container escape risks
      • Root privileges in container could potentially affect host
    • Mitigation Techniques:
      • User namespaces
      • Seccomp profiles
      • AppArmor/SELinux policies
      • Non-root users in containers
      • Read-only filesystems

    Use Cases

    Virtual Machines Excel For

    • Running Different Operating Systems: e.g., Windows on Linux host
    • Strong Security Requirements: Regulatory compliance, multi-tenant environments
    • Traditional Monolithic Applications: Legacy applications
    • Kernel-Level Customization: Custom kernel modules or settings
    • Hardware-Level Features: Direct access to specialized hardware

    Containers Excel For

    • Microservices Architecture: Multiple small, independent services
    • DevOps Workflows: CI/CD pipelines, rapid deployment
    • Application Packaging: Consistent environments from dev to production
    • High-Density Applications: Maximizing resource utilization
    • Stateless Applications: Web servers, API endpoints
    • Short-Lived Processes: Batch jobs, serverless workloads

    Managing Both Technologies

    VM Management

    • Hypervisors: VMware ESXi, KVM, Hyper-V, Xen
    • Cloud Platforms: AWS EC2, Azure VMs, Google Compute Engine
    • Operations: VM migration, snapshots, templates

    Container Management

    • Container Runtimes: Docker, containerd, CRI-O
    • Orchestration: Kubernetes, Docker Swarm, Amazon ECS
    • Operations: Container lifecycle, image management, networking

    Comparison Table

    | Feature | Virtual Machines | Containers |
    | --- | --- | --- |
    | Virtualization Level | Hardware | Operating System |
    | Size | Gigabytes | Megabytes |
    | Boot Time | Minutes | Seconds |
    | Performance Overhead | Higher | Lower |
    | Isolation | Strong | Moderate |
    | Resource Efficiency | Lower | Higher |
    | OS Diversity | Any OS supported by hardware | Same OS kernel as host |
    | Security | Strong isolation | Process-level isolation |
    | Portability | Less portable (hypervisor-specific) | Highly portable |
    | Density | Dozens per host | Hundreds or thousands per host |
    | Persistent Data | Built-in storage | Requires volumes |
    | Maturity | Very mature | Rapidly maturing |

    Hybrid Approaches

    VM-based Containers

    • Container hosts running inside VMs
    • Benefits of both technologies
    • Common in cloud environments
    • Example: Kubernetes clusters on VMs in the cloud

    Kata Containers

    • Containers running in lightweight VMs
    • Container interface with VM isolation
    • Compatible with container ecosystems

    Firecracker

    • Lightweight VMM for serverless containers
    • Combines VM security with container startup time
    • Used in AWS Lambda and Fargate

    Making the Right Choice

    Consider these factors when choosing between VMs and containers:

    1. Security Requirements: Level of isolation needed
    2. Performance Needs: Resource overhead considerations
    3. Application Architecture: Monolithic vs. microservices
    4. Operational Complexity: Team expertise and tooling
    5. Portability Requirements: Cross-platform needs
    6. Resource Constraints: Available hardware resources
    7. Development Workflow: Integration with CI/CD

Containers

  • Container Fundamentals

    Containers are a lightweight form of virtualization that package an application and its dependencies into a standardized unit for software development and deployment. Unlike virtual machines, containers virtualize at the operating system level rather than at the hardware level.

    Definition

    Containers, also known as OS-level virtualization, provide isolated environments for running application processes within a shared operating system kernel. They encapsulate an application with its runtime, system tools, libraries, and settings needed to run, ensuring consistency across different environments.

    Key Concepts

    Container vs. Virtual Machine

    A container differs fundamentally from a virtual machine:

    • Resource Utilization: Containers share the host OS kernel, making them more lightweight
    • Isolation Level: Containers isolate at the process level; VMs isolate at the hardware level
    • Startup Time: Containers start in seconds; VMs typically take minutes
    • Image Size: Container images are typically megabytes; VM images are gigabytes
    • Portability: Containers provide consistent runtime regardless of underlying infrastructure

    Container Images

    A container image is a lightweight, standalone, executable package that includes everything needed to run an application:

    • Application code
    • Runtime environment
    • System libraries
    • Default settings

    Images are built in layers, which are cached and reused across containers to optimize storage and transfer efficiency.

    Container Instances

    A container instance is a running copy of a container image. Multiple instances can run from the same image simultaneously, each with its own isolated environment.

    Evolution of Containerization

    Early Isolation Mechanisms

    • chroot (1979): The first UNIX mechanism for isolating a process’s file system view
    • FreeBSD Jails (2000): Extended isolation to include processes, networking, and users
    • Solaris Zones (2004): Similar isolation capabilities for Solaris

    Modern Container Technologies

    • LXC (2008): Linux Containers using kernel containment features
    • Docker (2013): Made containers accessible with simplified tooling and images
    • rkt/Rocket (2014): Alternative container runtime with focus on security
    • Podman (2018): Daemonless container engine compatible with Docker

    Core Technologies Behind Containers

    Containers rely on several Linux kernel features for isolation:

    Namespaces

    Namespaces isolate a process’s view of the system, limiting what it can see and access:

    • PID Namespace: Process isolation (each container has its process tree)
    • NET Namespace: Network isolation (separate network interfaces)
    • MNT Namespace: Mount point isolation (separate file system view)
    • UTS Namespace: Hostname isolation
    • IPC Namespace: Inter-process communication isolation
    • USER Namespace: User and group ID isolation

    Control Groups (cgroups)

    Control groups limit and account for resource usage:

    • CPU allocation
    • Memory allocation
    • Block I/O bandwidth
    • Network bandwidth
    • Device access
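
    In practice these limits are usually set through the container engine; for example, Docker translates the following flags into cgroup settings (values are illustrative):

    # Limit a container to half a CPU core and 256 MB of RAM
    docker run -d --cpus=0.5 --memory=256m nginx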

    Union File Systems

    Layered file systems that enable efficient image building and sharing:

    • OverlayFS
    • AUFS (Advanced Multi-Layered Unification Filesystem)
    • Device Mapper
    • BTRFS

    Container Runtimes and Engines

    A container runtime is the software responsible for running containers:

    • Low-level runtimes: Execute containers (e.g., runc, crun)
    • High-level runtimes: Manage images and abstract low-level runtimes (e.g., containerd)
    • Container engines: Provide user interfaces for container management (e.g., Docker, Podman)

    Use Cases for Containers

    Containers are particularly well-suited for:

    1. Microservices Architecture: Deploying independent, loosely coupled services
    2. DevOps and CI/CD: Consistent environments across development, testing, and production
    3. Application Packaging: Bundling applications with dependencies
    4. Resource Efficiency: Running multiple workloads on the same host
    5. Cloud-Native Applications: Building scalable, resilient applications

    Benefits of Containers

    • Portability: Run anywhere the container runtime is available
    • Consistency: Same environment from development to production
    • Efficiency: Less overhead than VMs, better resource utilization
    • Speed: Fast startup and shutdown times
    • Scalability: Easy to scale up or down
    • Isolation: Application-level isolation without full virtualization overhead

    Limitations of Containers

    • Kernel Sharing: All containers share the host kernel
    • Security: Generally less isolated than VMs
    • Complex State Management: Stateful applications require additional considerations
    • Cross-Platform Compatibility: Limited across different OS kernels
  • Linux Containment Features

    The Linux kernel includes several mechanisms that enable process isolation and resource control, which collectively form the foundation for container technologies. These containment features allow for efficient OS-level virtualization without the overhead of full system virtualization.

    Core Containment Mechanisms

    1. chroot

    The chroot system call, introduced in 1979 in UNIX Version 7, is the oldest isolation mechanism and a precursor to modern containerization:

    • Changes the apparent root directory for a process and its children
    • Limits a process’s view of the file system
    • Isolates file system access but doesn’t provide complete isolation
    • Used primarily for security and creating isolated build environments
    # Example: Changing root directory for a process
    sudo chroot /path/to/new/root command

    2. Namespaces

    Namespaces partition kernel resources so that one set of processes sees one set of resources while another set of processes sees a different set. Linux includes several types of namespaces:

    PID Namespace

    • Isolates process IDs
    • Each namespace has its own process numbering, starting at PID 1
    • Processes in a namespace can only see other processes in the same namespace
    • Enables container restart without affecting other containers

    Network Namespace

    • Isolates network resources
    • Each namespace has its own:
      • Network interfaces
      • IP addresses
      • Routing tables
      • Firewall rules
      • Port numbers

    Mount Namespace

    • Isolates filesystem mount points
    • Each namespace has its own view of the filesystem hierarchy
    • Changes to mounts in one namespace don’t affect others
    • Fundamental for container filesystem isolation

    UTS Namespace

    • Isolates hostname and domain name
    • Allows each container to have its own hostname
    • Named after UNIX Time-sharing System

    IPC Namespace

    • Isolates Inter-Process Communication resources
    • Isolates System V IPC objects and POSIX message queues
    • Prevents processes in different namespaces from communicating via IPC

    User Namespace

    • Isolates user and group IDs
    • A process can have root privileges within its namespace while having non-root privileges outside
    • Enhances container security

    Time Namespace

    • Introduced in newer kernel versions
    • Allows containers to have their own system time
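
    The unshare utility makes the namespaces described above easy to experiment with; for example, the following illustrative command starts a shell in new PID and mount namespaces, where ps only sees the processes of that namespace:

    # New PID and mount namespaces; the shell becomes PID 1 in its own process tree
    sudo unshare --pid --fork --mount-proc /bin/bash
    # Inside the new namespace:
    ps aux    # shows only bash and ps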

    3. Control Groups (cgroups)

    Control groups, or cgroups, provide mechanisms for:

    • Limiting resource usage (CPU, memory, I/O, network, etc.)
    • Prioritizing resource allocation
    • Measuring resource usage
    • Controlling process lifecycle

    Cgroups organize processes hierarchically and distribute system resources along this hierarchy:

    Cgroup Subsystems (Controllers)

    • cpu: Limits CPU usage
    • memory: Limits memory usage and reports memory resource usage
    • blkio: Limits block device I/O
    • devices: Controls access to devices
    • net_cls: Tags network packets for traffic control
    • freezer: Suspends and resumes processes
    • pids: Limits process creation
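
    With the unified cgroup v2 hierarchy, the same controls are exposed as files (a minimal sketch, assuming cgroup v2 is mounted at /sys/fs/cgroup; the group name demo is arbitrary):

    # Create a cgroup and cap its memory at 100 MB
    sudo mkdir /sys/fs/cgroup/demo
    echo 100M | sudo tee /sys/fs/cgroup/demo/memory.max

    # Move the current shell into the cgroup
    echo $$ | sudo tee /sys/fs/cgroup/demo/cgroup.procs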

    4. Capabilities

    Linux capabilities divide the privileges traditionally associated with the root user into distinct units that can be independently enabled or disabled:

    • Allows for fine-grained control over privileged operations
    • Reduces the security risks of running processes as root
    • Examples of capabilities:
      • CAP_NET_ADMIN: Configure networks
      • CAP_SYS_ADMIN: Perform system administration operations
      • CAP_CHOWN: Change file ownership
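
    Container engines expose capabilities directly; for example, the following illustrative Docker invocation drops all capabilities and re-adds only the one needed to bind to privileged ports:

    # Drop every capability, then re-add only CAP_NET_BIND_SERVICE
    docker run -d --cap-drop ALL --cap-add NET_BIND_SERVICE nginx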

    5. Security Modules

    Linux includes several security modules that can enhance container isolation:

    SELinux (Security-Enhanced Linux)

    • Provides Mandatory Access Control (MAC)
    • Defines security policies that constrain processes
    • Labels files, processes, and resources, controlling interactions based on these labels

    AppArmor

    • Path-based access control
    • Restricts programs’ capabilities using profiles
    • Simpler to configure than SELinux, used by default in Ubuntu

    Seccomp (Secure Computing Mode)

    • Filters system calls available to a process
    • Prevents processes from making unauthorized system calls
    • Can be used with a whitelist or blacklist approach to control system call access
    # Example: Activating seccomp profile in Docker
    docker run --security-opt seccomp=/path/to/profile.json image_name

    Implementation in Container Technologies

    These Linux kernel features are used by container runtimes in various combinations:

    • LXC: Utilizes all these features directly with a focus on system containers
    • Docker: Builds upon these features with additional tooling and image management
    • Podman: Similar to Docker but with a focus on rootless containers using user namespaces
    • Kubernetes/CRI-O: Uses these features via container runtimes like containerd or CRI-O

    Limitations and Considerations

    Despite these isolation mechanisms, some limitations remain:

    1. Kernel Sharing: All containers share the host kernel, which means:

      • Kernel vulnerabilities affect all containers
      • Containers cannot run a different OS kernel than the host
    2. Resource Contention: Without proper cgroup configurations, noisy neighbors can still impact performance

    3. Security Concerns: Container escape vulnerabilities can potentially compromise the host

  • Docker

    Docker is a leading containerization platform that simplifies the process of creating, deploying, and running applications in containers. Released in 2013, Docker revolutionized application deployment by making container technology accessible and standardized.

    Core Concepts

    Docker Architecture

    Docker uses a client-server architecture consisting of:

    1. Docker Client: The primary user interface to Docker
    2. Docker Daemon (dockerd): A persistent process that manages Docker containers
    3. Docker Registry: A repository for Docker images (e.g., Docker Hub)

    Docker Components

    Docker Engine

    The Docker Engine is the core of Docker, comprising:

    • Docker daemon: Runs in the background and handles container operations
    • REST API: Provides an interface for the client to communicate with the daemon
    • Command-line interface (CLI): The user interface for Docker commands

    Docker Images

    A Docker image is a read-only template containing a set of instructions for creating a Docker container:

    • Built in layers, with each layer representing a set of filesystem changes
    • Defined in a Dockerfile
    • Stored in a registry (e.g., Docker Hub or private registry)
    • Immutable: once built, the image doesn’t change

    Docker Containers

    A container is a runnable instance of an image:

    • Isolated environment for running applications
    • Contains everything needed to run the application (code, runtime, libraries, etc.)
    • Shares the host OS kernel but is isolated at the process level

    Docker Image Format

    Docker images use a layered architecture that provides several benefits:

    • Efficient storage: Layers are cached and reused across images
    • Faster transfers: Only new or modified layers need to be transferred
    • Version control: Each layer represents a change, enabling versioning

    Image Layers

    An image consists of multiple read-only layers, each representing a set of filesystem changes:

    1. Base layer: Usually a minimal OS distribution
    2. Additional layers: Each layer adds, modifies, or removes files from the previous layer
    3. Container layer: When a container runs, a writable layer is added on top
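
    The layers of an existing image can be inspected with docker history (illustrative; any locally available image works):

    # List the layers of an image, one line per layer with the instruction that created it
    docker history ubuntu:20.04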

    Content Addressable Storage

    Docker uses content-addressable storage for images:

    • Each layer is identified by a hash of its contents
    • Ensures image integrity and enables deduplication
    • Allows deterministic builds and reproducibility

    Dockerfiles

    A Dockerfile is a text file containing instructions for building a Docker image:

    # Example Dockerfile
    FROM ubuntu:20.04
    RUN apt-get update && apt-get install -y nginx
    COPY ./my-nginx.conf /etc/nginx/nginx.conf
    EXPOSE 80
    CMD ["nginx", "-g", "daemon off;"]

    Common Dockerfile Instructions

    • FROM: Specifies the base image
    • RUN: Executes commands in a new layer
    • COPY/ADD: Copies files from the build context into the image
    • WORKDIR: Sets the working directory
    • ENV: Sets environment variables
    • EXPOSE: Documents the ports the container will listen on
    • VOLUME: Creates a mount point for external volumes
    • ENTRYPOINT: Configures the executable to run when the container starts
    • CMD: Provides default arguments for the ENTRYPOINT

    Docker Commands

    Basic Commands

    # Build an image
    docker build -t myapp:1.0 .
     
    # Run a container
    docker run -d -p 8080:80 myapp:1.0
     
    # List running containers
    docker ps
     
    # Stop a container
    docker stop container_id
     
    # Remove a container
    docker rm container_id
     
    # List images
    docker images
     
    # Remove an image
    docker rmi image_id

    Advanced Commands

    # Inspect a container
    docker inspect container_id
     
    # View container logs
    docker logs container_id
     
    # Execute a command in a running container
    docker exec -it container_id bash
     
    # Create a new image from a container
    docker commit container_id new_image_name:tag
     
    # Push an image to a registry
    docker push username/repository:tag

    Docker Compose

    Docker Compose is a tool for defining and running multi-container Docker applications:

    • Uses a YAML file to configure application services
    • Enables managing multiple containers as a single application
    • Simplifies development and testing workflows

    Example docker-compose.yml

    version: '3'
    services:
      web:
        build: ./web
        ports:
          - "8080:80"
        depends_on:
          - db
      db:
        image: postgres:13
        volumes:
          - postgres_data:/var/lib/postgresql/data
        environment:
          POSTGRES_PASSWORD: example
          POSTGRES_USER: user
          POSTGRES_DB: mydb
    volumes:
      postgres_data:
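
    With this file in the current directory, the whole application is started and stopped with two commands (illustrative; older installations use the standalone docker-compose binary instead of the plugin):

    # Start all services in the background, building images where needed
    docker compose up -d

    # Stop and remove the containers and network (named volumes are kept)
    docker compose down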

    Docker Networking

    Docker provides several network drivers for container communication:

    • bridge: Default network driver, allows containers on the same host to communicate
    • host: Removes network isolation, container uses host’s network
    • overlay: Connects multiple Docker daemons together
    • macvlan: Assigns a MAC address to containers, making them appear as physical devices
    • none: Disables all networking
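
    As a small illustration, a user-defined bridge network lets containers reach each other by name (the names here are arbitrary):

    # Create a user-defined bridge network and attach two containers to it
    docker network create mynet
    docker run -d --network mynet --name web nginx
    docker run -it --rm --network mynet alpine ping -c 3 web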

    Docker Volumes

    Volumes provide persistent storage for containers:

    • Bind mounts: Map a host directory to a container directory
    • Named volumes: Managed by Docker, more portable
    • tmpfs mounts: Stored in host memory, temporary storage
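
    For example, a named volume survives removal of the container that uses it (illustrative names, mirroring the Compose example above):

    # Create a named volume and mount it into a container
    docker volume create pgdata
    docker run -d -e POSTGRES_PASSWORD=example -v pgdata:/var/lib/postgresql/data postgres:13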

    Docker Security Considerations

    Docker containers provide some isolation, but security requires attention:

    • Running containers as non-root users
    • Using security profiles (e.g., seccomp, AppArmor)
    • Regularly updating base images
    • Using Docker Content Trust for image signing
    • Minimizing container capabilities
    • Scanning images for vulnerabilities

    Advantages of Docker

    • Consistency: Same environment from development to production
    • Isolation: Applications run in isolated environments
    • Portability: Run anywhere Docker is installed
    • Efficiency: Lightweight compared to VMs
    • Version Control: Image layers enable tracking changes
    • Scalability: Easy to scale containers horizontally

    Limitations of Docker

    • Stateless by design: Requires extra consideration for stateful applications
    • Kernel sharing: All containers share the host kernel
    • Security concerns: Container isolation is not as strong as VM isolation
    • Complexity: Container orchestration adds complexity
  • Container Orchestration

    Container orchestration automates the deployment, management, scaling, and networking of containers. As applications grow in complexity and scale, manually managing individual containers becomes impractical, making orchestration essential for production container deployments.

    What is Container Orchestration?

    Container orchestration refers to the automated arrangement, coordination, and management of containers. It handles:

    • Provisioning and deployment of containers
    • Resource allocation
    • Load balancing across multiple hosts
    • Health monitoring and automatic healing
    • Scaling containers up or down based on demand
    • Service discovery and networking
    • Rolling updates and rollbacks

    Why Container Orchestration is Needed

    Challenges of Manual Container Management

    • Scale: Managing hundreds or thousands of containers manually is impossible
    • Complexity: Multi-container applications have complex dependencies
    • Reliability: Manual intervention increases the risk of errors
    • Resource Utilization: Optimal placement of containers requires sophisticated algorithms
    • High Availability: Fault tolerance requires automated monitoring and recovery

    Benefits of Container Orchestration

    • Automated Operations: Reduces manual intervention and human error
    • Optimal Resource Usage: Intelligent scheduling of containers
    • Self-healing: Automatic recovery from failures
    • Scalability: Easy horizontal scaling
    • Declarative Configuration: Define desired state rather than imperative steps
    • Service Discovery: Automatic linking of interconnected components
    • Load Balancing: Distribution of traffic across container instances
    • Rolling Updates: Zero-downtime deployments

    Core Concepts in Container Orchestration

    Master

    • A collection of processes, running on a single node of the cluster, that manage the overall cluster state
    • Controllers, e.g. replication and scaling controllers
    • Scheduler: places pods based on resource requirements, hardware and software constraints, data locality, deadlines…
    • etcd: reliable distributed key-value store, used for the cluster state

    Cluster

    A collection of host machines (physical or virtual) that run containerized applications managed by the orchestration system.

    Node

    An individual machine (physical or virtual) in the cluster that can run containers.

    Container

    The smallest deployable unit, running a single application or process.

    Pod

    In Kubernetes, a group of one or more containers that share storage and network resources and a specification for how to run the containers.

    Service

    An abstraction that defines a logical set of pods and a policy to access them, often used for load balancing and service discovery.

    Desired State

    The specification of how many instances should be running, what version they should be, and how they should be configured.

    Reconciliation Loop

    The process by which the orchestration system continuously works to make the current state match the desired state.

    Key Features of Orchestration Platforms

    Scheduling

    • Placement Strategies: Determining which node should run each container
    • Affinity/Anti-affinity Rules: Controlling which containers should or shouldn’t run together
    • Resource Constraints: Considering CPU, memory, and storage requirements
    • Taints and Tolerations: Marking nodes so that they repel pods unless the pods explicitly tolerate the taint
    Link to original
  • Kubernetes

    Kubernetes (often abbreviated as K8s) is an open-source container orchestration platform designed to automate deploying, scaling, and managing containerized applications. Originally developed by Google and now maintained by the Cloud Native Computing Foundation (CNCF), Kubernetes has become the de facto standard for container orchestration.

    History and Background

    • Origin: Developed by Google based on their internal system called Borg
    • Release: Open-sourced in 2014
    • Name: Greek for “helmsman” or “pilot” (hence the ship’s wheel logo)
    • CNCF: Became the first graduated project of the Cloud Native Computing Foundation in 2018

    Core Concepts

    Kubernetes Architecture

    Kubernetes follows a master-worker (also called control plane and node) architecture:

    Control Plane Components

    • API Server: Front-end for the Kubernetes control plane, exposing the Kubernetes API
    • etcd: Consistent and highly-available key-value store for all cluster data
    • Scheduler: Watches for newly created pods with no assigned node and selects nodes for them to run on
    • Controller Manager: Runs controller processes that regulate the state of the cluster
    • Cloud Controller Manager: Links the cluster to cloud provider APIs

    Node Components

    • Kubelet: An agent that runs on each node, ensuring containers are running in a pod
    • Kube-proxy: Network proxy that maintains network rules on nodes
    • Container Runtime: Software responsible for running containers (e.g., Docker, containerd, CRI-O)

    Kubernetes Objects

    Pods

    The smallest deployable units in Kubernetes:

    • Group of one or more containers with shared storage/network resources
    • Ephemeral (not designed to survive failures)
    • Should be managed by higher-level controllers, not directly
    apiVersion: v1
    kind: Pod
    metadata:
      name: nginx-pod
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80

    Deployments

    Controllers for creating and updating instances of your applications:

    • Define desired state for your application
    • Handle rolling updates and rollbacks
    • Manage ReplicaSets
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx-deployment
      labels:
        app: nginx
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            app: nginx
        spec:
          containers:
          - name: nginx
            image: nginx:1.14.2
            ports:
            - containerPort: 80

    Services

    An abstraction to expose applications running on pods:

    • Provides stable network endpoint
    • Enables load balancing
    • Facilitates service discovery

    Types of services:

    • ClusterIP: Internal only (default)
    • NodePort: Exposes on each node’s IP at a static port
    • LoadBalancer: Exposes externally using cloud provider’s load balancer
    • ExternalName: Maps service to DNS name
    apiVersion: v1
    kind: Service
    metadata:
      name: nginx-service
    spec:
      selector:
        app: nginx
      ports:
      - port: 80
        targetPort: 80
      type: ClusterIP

    StatefulSets

    Manages the deployment and scaling of a set of pods with persistent identities:

    • Stable, unique network identifiers
    • Stable, persistent storage
    • Ordered, graceful deployment and scaling
    • Used for stateful applications (databases, etc.)
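
    For example, a minimal StatefulSet might look like the following sketch (the name, image, and storage size are illustrative); each replica keeps a stable identity and its own claim from the volumeClaimTemplates:
    # StatefulSet example (illustrative): a small replicated database
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: db
    spec:
      serviceName: "db"        # headless Service providing stable network identities
      replicas: 2
      selector:
        matchLabels:
          app: db
      template:
        metadata:
          labels:
            app: db
        spec:
          containers:
          - name: postgres
            image: postgres:16
            ports:
            - containerPort: 5432
            volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
      volumeClaimTemplates:    # each replica gets its own PersistentVolumeClaim
      - metadata:
          name: data
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 1Gi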

    DaemonSets

    Ensures all (or some) nodes run a copy of a pod:

    • Used for node monitoring, log collection
    • Useful for cluster-wide services (e.g., networking plugins)
    • Automatically adds pods to new nodes
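
    A sketch of a DaemonSet running a log-collection agent on every node (the name and image are illustrative):
    # DaemonSet example (illustrative): node-level log collection agent
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: log-agent
    spec:
      selector:
        matchLabels:
          app: log-agent
      template:
        metadata:
          labels:
            app: log-agent
        spec:
          containers:
          - name: agent
            image: fluent/fluentd:v1.16-1   # illustrative image
            resources:
              limits:
                memory: 200Mi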

    ConfigMaps and Secrets

    For configuration and sensitive data:

    • ConfigMaps: Store non-confidential configuration data
    • Secrets: Store sensitive information (passwords, tokens, keys)
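
    A minimal ConfigMap and Secret sketch (keys and values are illustrative); values supplied via stringData are stored base64-encoded by the API server:
    # ConfigMap and Secret example (illustrative values)
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: app-config
    data:
      LOG_LEVEL: "info"
      APP_MODE: "production"
    ---
    apiVersion: v1
    kind: Secret
    metadata:
      name: db-credentials
    type: Opaque
    stringData:              # plain text here; stored base64-encoded
      username: admin
      password: change-me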

    Namespaces

    Virtual clusters inside a physical cluster:

    • Provide scope for names
    • Allow resource quotas
    • Enable multi-tenant environments
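
    A sketch of a Namespace with an attached ResourceQuota (names and limits are illustrative), matching the quota capability noted above:
    # Namespace with a ResourceQuota (illustrative limits)
    apiVersion: v1
    kind: Namespace
    metadata:
      name: team-a
    ---
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: team-a-quota
      namespace: team-a
    spec:
      hard:
        pods: "20"
        requests.cpu: "8"
        requests.memory: 16Gi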

    Kubernetes Networking

    Kubernetes networking addresses four concerns:

    1. Container-to-container communication: Solved by pods and localhost communications
    2. Pod-to-pod communication: Flat network space where pods can communicate with all other pods
    3. Pod-to-service communication: Through kube-proxy and virtual IPs
    4. External-to-internal communication: Through services of type NodePort, LoadBalancer, or Ingress resources

    Network Policies

    Specifications of how groups of pods are allowed to communicate:

    • Similar to network firewalls
    • Restrict traffic to/from pods based on rules

    Storage in Kubernetes

    Kubernetes provides several abstractions for persistent storage:

    Volumes

    Basic building block for storage that outlives containers:

    • Many volume types (e.g., emptyDir, hostPath, nfs, cloud provider volumes)
    • Mounted into pods

    Persistent Volumes (PV) and Persistent Volume Claims (PVC)

    Decouple storage provisioning from usage:

    • PV: Cluster resource provisioned by administrator or dynamically
    • PVC: Request for storage by a user
    • Storage Classes: Define types of storage and provisioners
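
    A minimal PersistentVolumeClaim sketch (the storage class name and size are illustrative); with dynamic provisioning, a matching PersistentVolume is created automatically:
    # PVC example (illustrative class and size)
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: data-claim
    spec:
      accessModes:
      - ReadWriteOnce
      storageClassName: standard   # assumes a StorageClass named "standard" exists
      resources:
        requests:
          storage: 5Gi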

    Resource Management

    Kubernetes provides mechanisms for resource control:

    Resource Requests and Limits

    • Requests: Minimum resources guaranteed to the container
    • Limits: Maximum resources a container can use
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"

    Horizontal Pod Autoscaler (HPA)

    Automatically scales the number of pods based on observed metrics:

    • CPU utilization
    • Memory usage
    • Custom metrics
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: nginx-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: nginx-deployment
      minReplicas: 1
      maxReplicas: 10
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 50

    Vertical Pod Autoscaler (VPA)

    Automatically adjusts resource requests and limits for containers:

    • Recommends and can automatically update resource configurations
    • Helps right-size container resources

    Kubernetes Extensions and Ecosystem

    Helm

    The package manager for Kubernetes:

    • Templates for Kubernetes resources
    • Manages releases of applications
    • Facilitates sharing applications through Helm charts

    Operators

    Pattern for encoding domain knowledge into Kubernetes:

    • Custom controllers that extend Kubernetes API
    • Manage complex applications like databases, monitoring systems
    • Automate operational tasks

    Service Meshes

    Infrastructure layer for service-to-service communication:

    • Examples: Istio, Linkerd, Consul
    • Provide traffic management, security, observability
    • Decouple application code from network functionality

    Ingress Controllers

    Manage external access to services:

    • Examples: Nginx Ingress, Traefik, HAProxy
    • Implement HTTP routing rules
    • Often provide SSL termination
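
    A sketch of an Ingress resource routing HTTP traffic to the nginx-service defined earlier (the hostname is illustrative, and an ingress controller must be installed for the rule to take effect):
    # Ingress example (illustrative hostname): route traffic to nginx-service
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: web-ingress
    spec:
      rules:
      - host: app.example.com
        http:
          paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: nginx-service
                port:
                  number: 80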

    Kubernetes Deployment Options

    Self-Managed

    • Kubeadm: Tool for creating Kubernetes clusters
    • kubespray: Ansible playbooks for deploying Kubernetes
    • kOps: Kubernetes Operations, production-grade tooling
    • Minikube: Local Kubernetes for development

    Managed Services

    • Amazon EKS: Elastic Kubernetes Service
    • Google GKE: Google Kubernetes Engine
    • Azure AKS: Azure Kubernetes Service
    • DigitalOcean DOKS: DigitalOcean Kubernetes
    • IBM Cloud Kubernetes Service
    • Oracle Container Engine for Kubernetes

    Advantages of Kubernetes

    • Portability: Run applications consistently across environments
    • Scalability: Automatic scaling based on demand
    • High Availability: Self-healing, automatic placement
    • Extensibility: API-driven, customizable with CRDs
    • Service Discovery: Built-in DNS and load balancing
    • Rolling Updates: Zero-downtime deployments
    • Secret Management: Secure handling of sensitive data

    Challenges and Considerations

    • Complexity: Steep learning curve
    • Resource Overhead: Control plane requires resources
    • Stateful Applications: More complex to manage
    • Security: Requires careful configuration
    • Observability: Needs additional tooling for monitoring
    Link to original

Cloud Infrastructure Management

  • Cloud Operating Systems

    Cloud operating systems are software platforms that manage large pools of compute, storage, and networking resources in a data center, providing interfaces for both administrators and users. They serve as the foundation for Infrastructure as a Service (IaaS) cloud offerings, abstracting underlying hardware complexities and enabling the provisioning of virtual resources.

    Purpose and Function

    Cloud operating systems serve several key functions:

    1. Resource Virtualization: Abstract physical hardware into virtual resources
    2. Resource Management: Allocate and track usage of compute, storage, and networking resources
    3. Multi-tenancy: Enable secure sharing of physical infrastructure among multiple users
    4. User Interface: Provide dashboards and APIs for cloud administrators and end users
    5. Automation: Enable programmatic control over infrastructure components

    Key Components and Features

    Core Functionality

    • Compute Management: Creation and management of virtual machines
    • Storage Management: Provisioning of virtual disks and object storage
    • Network Management: Virtual networks, subnets, firewalls, load balancers
    • Image Management: Storage and versioning of VM and container images
    • User Management: Authentication, authorization, and accounting (AAA)
    • Metering and Billing: Resource usage tracking and chargeback
    • Monitoring and Logging: Health monitoring and performance metrics

    Advanced Functionality

    • Orchestration: Coordinating the deployment of complex multi-component applications
    • Auto-scaling: Dynamically adjusting resource allocations based on load
    • High Availability: Ensuring service continuity during hardware failures
    • Load Balancing: Distributing workloads across resources
    • Service Catalog: Self-service portal for provisioning standardized resources
    • Workflow Automation: Defining and executing operational procedures

    Architecture of Cloud Operating Systems

    Most cloud operating systems follow a modular architecture with several specialized components:

    Control Plane

    • API Server: Provides programmable interface for resource management
    • Authentication Service: Handles user identity and access control
    • Scheduler: Determines optimal placement of workloads
    • Resource Manager: Tracks available and allocated resources
    • Monitoring System: Collects performance metrics and health data
    • Database: Stores system state and configuration

    Data Plane

    • Compute Hosts: Physical servers running hypervisors or container runtimes
    • Storage Hosts: Servers providing block, file, or object storage
    • Network Hosts: Servers handling network functions (routing, firewalls)
    • Controller Host: Centralized management system

    OpenStack: A Leading Open Source Cloud OS

    OpenStack is one of the most widely deployed open-source cloud operating systems:

    Core OpenStack Components

    1. Nova (Compute Service):

      • Creates and manages virtual machines
      • Defines drivers to interact with hypervisors (KVM, XEN, VMware, etc.)
      • Schedules VMs across physical hosts
    2. Neutron (Network Service):

      • Provides API for networking between VMs
      • Manages virtual networks, subnets, routers
      • Handles security groups and firewalls
      • Supports Software-Defined Networking (SDN)
    3. Cinder (Block Storage Service):

      • Provides persistent block storage for VMs
      • Supports snapshots and replication
      • Enables live migration
    4. Glance (Image Service):

      • Registry for virtual disk images
      • Supports multiple formats (raw, qcow2, vmdk, etc.)
      • Enables users to create VM templates
    5. Keystone (Identity Service):

      • Authentication and authorization
      • User and tenant management
      • Service catalog
    6. Horizon (Dashboard):

      • Web-based user interface
      • Self-service portal for users
      • Administrative interface
    7. Swift (Object Storage):

      • Scalable, redundant object storage
      • REST API for accessing stored objects
      • Similar to Amazon S3

    OpenStack Architecture

    OpenStack is designed with a distributed architecture:

    • Controller Node: Runs API services, database, messaging queue
    • Compute Nodes: Run hypervisors that host VMs
    • Storage Nodes: Provide block or object storage
    • Network Nodes: Handle routing and advanced networking functions

    Virtual Networking in Cloud Operating Systems

    Virtual networking is a critical component that enables communication between virtual machines and with external networks:

    Key Concepts

    • Virtual Switches: Software-based switching between VMs on the same host
    • Overlay Networks: Encapsulation techniques to create virtual networks over physical infrastructure
    • Software-Defined Networking (SDN): Separation of control plane from data plane
    • Network Functions Virtualization (NFV): Virtualizing network services like firewalls, load balancers

    Network Components

    • Virtual NICs: Network interfaces attached to VMs
    • Virtual Switches: Connect VMs within a host
    • Virtual Routers: Connect different virtual networks
    • Security Groups: VM-level firewall rules
    • Network Address Translation (NAT): Mapping between private and public IP addresses

    Commercial Cloud Platforms

    Commercial public clouds use proprietary cloud operating systems:

    • Amazon Web Services (AWS): EC2, S3, VPC, etc.
    • Microsoft Azure: Azure Compute, Storage, Virtual Network
    • Google Cloud Platform (GCP): Compute Engine, Cloud Storage, VPC
    • IBM Cloud: Virtual Servers, Object Storage, VPC
    • Oracle Cloud: Compute, Block Volume, Virtual Cloud Network

    Challenges and Considerations

    Operational Challenges

    • Complexity: Large-scale distributed systems with many components
    • Upgrades: Maintaining service availability during upgrades
    • Interoperability: Compatibility between different versions and implementations
    • Performance: Ensuring consistent performance with multi-tenancy
    • Security: Protecting against virtualization vulnerabilities

    Design Considerations

    • Scalability: Handling growth from small deployments to thousands of nodes
    • Resilience: Continuing operation despite hardware failures
    • Efficiency: Maximizing resource utilization
    • Compatibility: Supporting different hypervisors and hardware
    • Extensibility: Customization and integration with other systems
    Link to original
  • Infrastructure as Code

    Infrastructure as Code (IaC) is the practice of managing and provisioning computing infrastructure through machine-readable definition files rather than physical hardware configuration or interactive configuration tools. It enables infrastructure to be defined, versioned, and deployed in a repeatable, consistent manner.

    Core Concepts

    Definition and Principles

    Infrastructure as Code treats infrastructure configuration as software code that can be:

    • Written: Defined in text files with specific syntax or domain-specific languages
    • Versioned: Tracked in version control systems (Git, SVN, etc.)
    • Tested: Validated through automated testing
    • Deployed: Applied automatically to create or modify infrastructure
    • Reused: Shared and composed to build complex environments

    Key Benefits

    1. Consistency: Eliminates configuration drift and “snowflake servers”
    2. Speed: Enables rapid provisioning and deployment
    3. Scalability: Facilitates managing large-scale infrastructures
    4. Version Control: Tracks changes and enables rollbacks
    5. Documentation: Self-documenting infrastructure through code
    6. Collaboration: Enables team-based infrastructure development
    7. Risk Reduction: Automated deployments reduce human error
    8. Cost Efficiency: Optimizes resource usage through precise specifications

    Challenges Addressed by IaC

    Configuration Drift

    Configuration drift occurs when systems’ actual configurations diverge from their documented or expected states due to manual changes, ad-hoc fixes, or inconsistent updates. IaC addresses this by:

    • Defining a single source of truth for infrastructure
    • Enabling detection of unauthorized changes
    • Facilitating reconciliation between actual and desired states

    Snowflake Servers

    Snowflake servers are unique, manually configured servers that:

    • Have undocumented configurations
    • Cannot be easily replicated
    • Represent significant operational risk
    • Are difficult to maintain and update

    IaC replaces snowflake servers with reproducible, consistent infrastructure.

    Manual Configuration Problems

    Manual configuration processes lead to:

    • Inconsistent environments
    • Error-prone deployments
    • Poor documentation
    • Slow provisioning times
    • Difficult recovery from failures

    Approaches to Infrastructure as Code

    Declarative vs. Imperative

    Declarative Approach

    • Describes the desired end state of the infrastructure
    • System determines how to achieve that state
    • Idempotent: repeated applications yield the same result
    • Examples: Terraform, AWS CloudFormation, Kubernetes manifests

    Imperative Approach

    • Specifies the exact commands to achieve the desired state
    • Focuses on the steps rather than the outcome
    • May not be idempotent without careful design
    • Examples: Scripts, some configuration management tools

    Mutable vs. Immutable Infrastructure

    Mutable Infrastructure

    • Infrastructure is updated in-place
    • Changes are applied to existing systems
    • Traditional approach to system management
    • Examples: Configuration management with Ansible, Chef, Puppet

    Immutable Infrastructure

    • Infrastructure is never modified after deployment
    • New versions replace old versions entirely
    • Enables easier rollbacks and consistent environments
    • Examples: Container deployments, VM images, serverless functions

    Provisioning Tools

    Focus on creating and managing infrastructure resources:

    Terraform

    • Open-source, declarative tool by HashiCorp
    • Cloud-agnostic with providers for various platforms
    • Uses HashiCorp Configuration Language (HCL)
    • Strong state management capabilities
    # Terraform example: Creating an AWS EC2 instance
    resource "aws_instance" "web_server" {
      ami           = "ami-0c55b159cbfafe1f0"
      instance_type = "t2.micro"
      tags = {
        Name = "WebServer"
      }
    }

    AWS CloudFormation

    • Native AWS service for resource provisioning
    • Uses JSON or YAML templates
    • Integrated with AWS services and permissions
    • Supports stack updates and rollbacks
    # CloudFormation example
    Resources:
      MyEC2Instance:
        Type: AWS::EC2::Instance
        Properties:
          InstanceType: t2.micro
          ImageId: ami-0c55b159cbfafe1f0
          Tags:
            - Key: Name
              Value: WebServer

    Azure Resource Manager (ARM)

    • Native Azure provisioning service
    • JSON-based templates
    • Integrated with Azure role-based access control
    • Resource grouping and dependency management

    Google Cloud Deployment Manager

    • Native GCP resource provisioning
    • Uses YAML and Python/Jinja2
    • Supports preview deployments

    Configuration Management Tools

    Focus on configuring the software and settings within provisioned resources:

    Ansible

    • Agent-less configuration management tool
    • Uses YAML for playbooks
    • Works over SSH
    • Relatively easy learning curve
    # Ansible example: Installing and configuring Nginx
    - name: Install and configure nginx
      hosts: web_servers
      become: yes
      tasks:
        - name: Install nginx
          apt:
            name: nginx
            state: present
        - name: Configure nginx
          template:
            src: nginx.conf.j2
            dest: /etc/nginx/nginx.conf
          notify:
            - restart nginx
      handlers:
        - name: restart nginx
          service:
            name: nginx
            state: restarted

    Puppet

    • Client-server architecture
    • Uses custom Puppet DSL
    • Mature ecosystem with modules
    • Strong reporting capabilities

    Chef

    • Ruby-based configuration management
    • Uses “recipes” and “cookbooks”
    • Highly customizable
    • Good integration with CI/CD

    SaltStack

    • Event-driven automation
    • Uses YAML and Jinja
    • High scalability
    • Both agent and agentless modes

    Container Orchestration

    Define infrastructure for containerized applications:

    Kubernetes Manifests

    • YAML-based definitions
    • Declarative resource management
    • Platform-agnostic container orchestration
    • Extensible with custom resources
    # Kubernetes example: Deploying a web application
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web-app
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: web
      template:
        metadata:
          labels:
            app: web
        spec:
          containers:
          - name: nginx
            image: nginx:1.19
            ports:
            - containerPort: 80

    Docker Compose

    • YAML definition for Docker multi-container applications
    • Simpler than Kubernetes
    • Good for development environments
    • Limited production capabilities
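
    A small Docker Compose sketch (service names, images, and credentials are illustrative) defining a two-container development environment:
    # Docker Compose example (illustrative): web server plus database
    services:
      web:
        image: nginx:1.25
        ports:
          - "8080:80"
        depends_on:
          - db
      db:
        image: postgres:16
        environment:
          POSTGRES_PASSWORD: example    # illustrative; use secrets in real setups
        volumes:
          - db-data:/var/lib/postgresql/data
    volumes:
      db-data: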

    Hybrid and Specialized Tools

    Pulumi

    • Uses general-purpose programming languages (TypeScript, Python, Go, C#)
    • Cloud-agnostic infrastructure definition
    • Enables more complex programming constructs

    AWS CDK (Cloud Development Kit)

    • Defines infrastructure using TypeScript, Python, Java, or C#
    • Synthesizes into CloudFormation templates
    • Enables reusable components and abstractions

    Best Practices for IaC

    Version Control

    • Store all infrastructure code in version control
    • Use branching strategies for changes
    • Conduct code reviews for infrastructure changes
    • Tag stable versions for production deployments

    Modularity and Reusability

    • Create reusable modules or components
    • Define standard patterns for common resources
    • Use parameters and variables for customization
    • Implement consistent naming conventions

    Testing

    • Validate syntax and structure
    • Perform static code analysis
    • Conduct unit testing for modules
    • Implement integration testing in staging environments
    • Use policy-as-code tools like OPA, Checkov, or Terraform Sentinel

    Security

    • Implement least-privilege access for deployment
    • Scan IaC definitions for security vulnerabilities
    • Encrypt sensitive values and use secret management
    • Implement compliance checks in the deployment pipeline

    CI/CD Integration

    • Automate infrastructure deployments
    • Implement multi-environment pipelines
    • Use automated testing in the pipeline
    • Ensure approvals for production changes

    Case Study: Immutable Infrastructure with IaC

    Approach

    1. Define base infrastructure using Terraform
    2. Build standardized VM images with Packer
    3. Deploy applications using container orchestration
    4. Implement blue-green deployment for updates
    5. Version all definitions in Git

    Benefits

    • Consistent environments from development to production
    • Rapid recovery from failures
    • Complete change history and audit trail
    • Predictable deployments
    Link to original
  • VM Management and Migration

    Virtual machine (VM) management encompasses various operations for creating, monitoring, maintaining, and migrating virtual machines in cloud environments. Effective VM management is crucial for optimizing resource usage, ensuring high availability, and maintaining operational efficiency in cloud infrastructures.

    VM Lifecycle Management

    VM Creation and Deployment

    The process of creating and deploying VMs involves:

    1. VM Image Selection: Choosing a base image with the required OS and software
    2. Resource Allocation: Assigning CPU, memory, storage, and network resources
    3. Configuration: Setting VM parameters (name, network, storage paths)
    4. Provisioning: Creating the VM instance from the configuration
    5. Post-deployment Configuration: Additional setup after VM is running

    VM Maintenance Operations

    Common VM maintenance operations include:

    • Starting/Stopping: Powering VMs on or off
    • Pausing/Resuming: Temporarily suspending VM execution
    • Resizing: Adjusting allocated resources (vertical scaling)
    • Patching/Updating: Applying OS or software updates
    • Backup/Restore: Creating and using VM backups
    • Monitoring: Tracking performance and health metrics

    VM Snapshots

    VM snapshots capture the state of a virtual machine at a specific point in time:

    • Full Snapshots: Capture entire VM state, including memory
    • Disk-only Snapshots: Capture only disk state
    • Virtual Snapshots: Use copy-on-write to reduce storage overhead
    • Snapshot Trees: Create hierarchical relationships between snapshots

    Use Cases for Snapshots:

    • Creating system restore points before major changes
    • Testing software updates with easy rollback
    • Backup and recovery
    • VM cloning and templating

    Snapshot Limitations:

    • Performance impact during creation and while active
    • Storage space consumption
    • Not a substitute for proper backup strategies
    • Potential consistency issues for applications

    VM Migration

    VM migration is the process of moving a virtual machine from one physical host to another or from one storage location to another. This capability is essential for resource optimization, hardware maintenance, and fault tolerance.

    Types of VM Migration

    Based on VM State:

    1. Cold Migration

      • VM is powered off before migration
      • Complete VM files are copied to the destination
      • VM is started on the destination host
      • No strict downtime constraint to meet, but the service is fully interrupted for the duration of the move
    2. Warm Migration

      • VM is suspended (state saved to disk)
      • VM files and state are copied to the destination
      • VM is resumed on the destination
      • Brief service interruption
    3. Live Migration (Hot Migration)

      • VM continues running during migration
      • State is iteratively copied while tracking changes
      • Final brief switchover when difference is minimal
      • Minimal or no perceptible downtime

    Based on Migration Scope:

    1. Compute Migration: Moving VM execution
    2. Storage Migration: Moving VM disk files
    3. Combined Migration: Moving both compute and storage

    Live Migration Process

    Live migration typically follows these steps:

    1. Pre-migration:

      • Select source and destination hosts
      • Verify compatibility and resource availability
      • Establish migration channel
    2. Reservation:

      • Reserve resources on the destination host
      • Create container for the VM on destination
    3. Iterative Pre-copy:

      • Initial copy of memory pages
      • Iterative copying of modified (dirty) pages
      • Continue until rate of page changes stabilizes or threshold reached
    4. Stop-and-Copy Phase:

      • Brief suspension of VM on source
      • Copy remaining dirty pages
      • Synchronize final state
    5. Commitment:

      • Confirm successful copy to destination
      • Release resources on source
    6. Activation:

      • Resume VM execution on destination
      • Update network routing/addressing
      • Resume normal operation

    Live Migration Techniques and Technologies

    Memory Migration Strategies

    1. Pre-copy Approach (most common):

      • VM continues running on source during initial copying
      • Memory pages modified during copy are tracked and re-copied
      • Multiple rounds of copying dirty pages
      • VM paused briefly for final synchronization
    2. Post-copy Approach:

      • Minimal VM state transferred initially
      • VM starts running on destination immediately
      • Memory pages fetched from source on demand
      • Background process copies remaining pages
    3. Hybrid Approaches:

      • Combine pre-copy and post-copy techniques
      • Adaptively choose strategy based on workload

    Network Migration

    For successful VM migration, network connections must be preserved:

    1. Shared Subnet Approach:

      • Source and destination on same subnet
      • VM retains IP address
      • ARP updates redirect traffic to new location
    2. Network Virtualization:

      • Software-defined networking (SDN) abstracts physical network
      • Virtual networks follow VMs during migration
      • Tunnel endpoints updated during migration
    3. Mobile IP:

      • Home and foreign agents route traffic to VM’s current location
      • Used for migrations across different subnets

    Storage Migration

    Approaches for handling VM disk storage during migration:

    1. Shared Storage:

      • Source and destination access the same storage (SAN, NAS)
      • Only VM execution state needs to be migrated
      • Fast migration with minimal data transfer
    2. Storage Migration:

      • VM disk files copied to destination storage
      • Can be performed separately or with compute migration
      • Significantly increases migration time and network usage
    3. Storage Live Migration:

      • Similar to memory live migration
      • Iterative copying while tracking block changes
      • Final synchronization of changed blocks

    Case Study: Xen Live Migration

    Xen’s live migration implementation illustrates a practical approach:

    1. Components:

      • Dom0: Privileged domain controlling migration
      • DomU: User domains (VMs) being migrated
    2. Memory Migration:

      • Uses the pre-copy approach
      • Typically achieves 100-300 ms of downtime for common workloads
      • Adaptively determines when to switch to stop-and-copy phase
    3. Network Handling:

      • After memory transfer, source host sends unsolicited ARP reply
      • Updates IP → MAC mapping in network
      • Destination VM responds to new ARP requests
    4. Performance Metrics:

      • Total migration time: Depends on VM memory size and workload
      • Downtime: Typically <300ms for most workloads
      • Network usage: Typically 1.2-1.5× VM RAM size

    Advanced VM Management Techniques

    Dynamic Resource Allocation

    Modern hypervisors support adjusting resources without VM restart:

    • CPU Hot Add/Remove: Dynamically change vCPU count
    • Memory Ballooning: Reclaim or add memory dynamically
    • Storage Live Extension: Expand virtual disks while in use

    VM High Availability

    Techniques to ensure VM continuity during host failures:

    • Automated Restart: Restart failed VMs on available hosts
    • VM Clustering: Active-passive or active-active VM arrangements
    • Fault Tolerance: Primary-secondary VMs in lockstep execution

    VM Placement Optimization

    Intelligent placement of VMs across hosts for:

    • Load Balancing: Even distribution of workloads
    • Power Efficiency: Consolidation for minimal power usage
    • Thermal Management: Distribution to manage heating
    • Affinity/Anti-affinity Rules: Control VM co-location

    Challenges in VM Management and Migration

    Performance Considerations

    • Migration Overhead: Network and CPU resources consumed
    • Application Performance: Impact during migration
    • Downtime Sensitivity: Some applications cannot tolerate any disruption

    Compatibility Issues

    • Hardware Compatibility: CPU feature differences between hosts
    • Hypervisor Compatibility: Migration between different hypervisor versions or types
    • Storage Compatibility: Different storage architectures or protocols

    Complex Environments

    • Large Memory VMs: Longer migration times and higher failure risk
    • High Change Rate Workloads: Memory pages changing faster than they can be copied
    • Specialized Hardware Dependencies: GPUs, FPGAs, or other attached devices
    Link to original
  • DevOps and CI-CD

    DevOps is a set of practices that combines software development (Dev) and IT operations (Ops) with the goal of shortening the development lifecycle and delivering high-quality software continuously. Continuous Integration and Continuous Delivery/Deployment (CI/CD) are core practices within the DevOps methodology, providing automation for building, testing, and deploying software.

    DevOps Overview

    Definition and Philosophy

    DevOps represents a cultural shift in how software development and operations teams collaborate:

    • Cultural Integration: Breaking down silos between development and operations teams
    • Automation: Automating manual, repetitive processes
    • Measurement: Continuous monitoring and collection of metrics
    • Sharing: Knowledge sharing and collaborative problem-solving
    • Improvement: Iterative enhancement of processes and systems

    Key Principles

    1. Collaboration: Close interaction between development and operations teams
    2. Automation: Automating repetitive tasks to reduce errors and improve efficiency
    3. Continuous Improvement: Iterative refinement of processes and tooling
    4. Customer-Centric Action: Focus on delivering value to end users
    5. End-to-End Responsibility: Teams responsible for the entire application lifecycle
    6. Monitoring and Feedback: Continuous monitoring and gathering feedback

    Benefits of DevOps

    • Faster Time to Market: Quicker delivery of features and fixes
    • Improved Quality: Automated testing and continuous integration catch issues earlier
    • Increased Stability: Smaller, more frequent updates reduce deployment risks
    • Better Collaboration: Shared ownership and improved communication
    • Efficiency Gains: Automation of routine tasks frees up resources
    • Enhanced Security: Security integrated throughout the development lifecycle (DevSecOps)

    Continuous Integration (CI)

    Continuous Integration is the practice of regularly merging developer work into a shared repository, with automated testing to verify the changes.

    Core Concepts

    • Frequent Code Integration: Developers commit code frequently (daily or more often)
    • Automated Building: Code changes automatically trigger a build process
    • Automated Testing: Builds undergo automated testing to verify functionality
    • Immediate Feedback: Developers receive quick feedback on their changes
    • Shared Repository: Single source of truth for the codebase

    CI Process Flow

    1. Developer commits code to a shared repository
    2. CI server detects the change and triggers a build
    3. Code is compiled and built (if applicable)
    4. Automated tests are executed (unit, integration, etc.)
    5. Test results and build artifacts are reported
    6. Feedback is provided to the development team
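
    As a concrete illustration of this flow, a minimal pipeline definition using GitHub Actions (one of the platforms listed later); the file would live at .github/workflows/ci.yml, and the project layout and test command are assumptions:
    # Illustrative GitHub Actions workflow: build and test on every push
    name: ci
    on: [push]
    jobs:
      build-and-test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4          # fetch the repository
          - uses: actions/setup-python@v5
            with:
              python-version: "3.12"
          - run: pip install -r requirements.txt   # assumes a Python project
          - run: pytest                            # run the automated test suite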

    CI Best Practices

    1. Maintain a Single Source Repository: Use version control for all code and configurations
    2. Automate the Build Process: Make builds self-testing and reproducible
    3. Make Builds Fast: Keep build times short for quick feedback
    4. Test in a Clone of Production: Ensure tests run in an environment similar to production
    5. Make Results Visible: Ensure build results are easily accessible to all team members
    6. Fix Broken Builds Immediately: Prioritize fixing failed builds over new development

    Continuous Delivery and Deployment (CD)

    Continuous Delivery

    Continuous Delivery extends CI by automatically preparing code for release to production.

    • Release-Ready Code: Every build passing CI could potentially be deployed
    • Automated Release Process: Standardized, automated preparation for deployment
    • Manual Approval: Final deployment decision made by humans

    Continuous Deployment

    Continuous Deployment takes CD further by automatically deploying every change that passes all tests.

    • Fully Automated Pipeline: Changes are automatically deployed to production
    • No Human Intervention: Deployment occurs without manual approval
    • Rapid Feedback Cycle: Changes reach users quickly

    CD Process Flow

    1. Code passes CI testing
    2. Artifacts are prepared for deployment
    3. Deployment to staging/pre-production environment
    4. Automated acceptance and performance testing
    5. Deployment to production (automated or manual approval)
    6. Post-deployment verification and monitoring

    Deployment Strategies in DevOps

    Blue/Green Deployment

    A technique that reduces downtime and risk by running two identical production environments:

    1. Blue Environment: Current production environment
    2. Green Environment: New version is deployed here
    3. Testing: Complete testing in the green environment
    4. Switch: Traffic is switched from blue to green
    5. Rollback: If issues occur, traffic can be directed back to blue

    Canary Deployment

    Gradually rolling out changes to a small subset of users before full deployment:

    1. Deploy new version to a small subset of servers/users
    2. Monitor performance and errors
    3. Gradually increase the percentage of traffic to new version
    4. If issues occur, roll back with minimal impact
    5. Complete the rollout once confidence is high

    Rolling Updates

    Updating instances of an application incrementally:

    1. Take a subset of servers out of the load balancer pool
    2. Update them with the new version
    3. Verify they’re working correctly
    4. Return them to the pool and move to the next subset
    5. Continue until all servers are updated
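
    In Kubernetes, this incremental replacement is expressed declaratively in a Deployment's update strategy; a sketch of the relevant fields (fragment, illustrative values):
    # Rolling update settings on a Kubernetes Deployment (fragment)
    spec:
      replicas: 4
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 1          # at most one extra pod during the update
          maxUnavailable: 1    # at most one pod unavailable at a time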

    CI/CD Tools and Technologies

    CI/CD Platforms

    • Jenkins: Open-source automation server with extensive plugin ecosystem
    • GitLab CI/CD: Integrated CI/CD within the GitLab platform
    • GitHub Actions: CI/CD capabilities integrated with GitHub
    • CircleCI: Cloud-based CI/CD service
    • Travis CI: CI service often used with open-source projects
    • Azure DevOps: Microsoft’s suite of DevOps services

    Build and Dependency Management

    • Maven/Gradle: Build automation for Java
    • npm/Yarn: Package management for JavaScript
    • Pip/Poetry: Package management for Python
    • Docker: Container platform for consistent environments

    Testing Tools

    • JUnit/TestNG: Unit testing for Java
    • Selenium: Browser automation for web testing
    • Cypress: End-to-end testing for web applications
    • Jest: JavaScript testing framework
    • PyTest: Python testing framework
    • SonarQube: Static code analysis

    Configuration Management

    • Ansible: Agentless configuration management
    • Puppet: Configuration management with client-server model
    • Chef: Ruby-based configuration management
    • Terraform: Infrastructure as code for provisioning

    Continuous Deployment

    • Spinnaker: Multi-cloud continuous delivery platform
    • ArgoCD: GitOps continuous delivery for Kubernetes
    • Flux CD: GitOps operator for Kubernetes
    • Octopus Deploy: Deployment automation server

    Monitoring and Feedback

    • Prometheus: Monitoring and alerting toolkit
    • Grafana: Metrics visualization and dashboards
    • ELK Stack: Elasticsearch, Logstash, Kibana for log management
    • New Relic/Datadog: Application performance monitoring

    CI/CD in Cloud Environments

    Cloud-Native CI/CD

    CI/CD pipelines designed specifically for cloud environments:

    • Infrastructure as Code: Using templates for infrastructure provisioning
    • Containers and Orchestration: Docker and Kubernetes for consistent environments
    • Serverless Build Processes: Using functions as a service for pipeline stages
    • Cloud Provider Services: AWS CodePipeline, Google Cloud Build, Azure Pipelines

    CI/CD for Microservices

    Adapting CI/CD for microservices architectures:

    • Independent Pipelines: Separate pipelines for each microservice
    • Service Mesh Integration: Using service meshes for traffic management
    • Contract Testing: Ensuring services work together correctly
    • Feature Flags: Enabling/disabling features without deployment

    Security in CI/CD (DevSecOps)

    Integrating security into CI/CD pipelines:

    • Static Application Security Testing (SAST): Analyzing source code for vulnerabilities
    • Dynamic Application Security Testing (DAST): Testing running applications
    • Dependency Scanning: Checking for vulnerabilities in dependencies
    • Container Scanning: Analyzing container images for security issues
    • Compliance as Code: Automating compliance checks

    Case Study: Spinnaker

    Spinnaker is a continuous delivery platform developed by Netflix, now maintained as an open-source project:

    Key Features

    • Multi-Cloud Deployments: Support for AWS, GCP, Azure, Kubernetes, etc.
    • Deployment Strategies: Support for various deployment methods
    • Pipeline Management: Visual interface for creating and managing pipelines
    • Integration: Works with CI systems like Jenkins, Travis, etc.

    Spinnaker Pipelines

    Spinnaker uses pipelines as the core concept for deployment automation:

    1. Triggers: Events that start the pipeline (e.g., git commit, Jenkins build)
    2. Stages: Individual steps in the pipeline (e.g., deploy, manual judgment)
    3. Server Groups: Sets of identical instances
    4. Deployment Strategies: Blue/green, canary, rolling updates

    Best Practices for DevOps and CI/CD

    Process and Culture

    • Start Small: Begin with simple pipelines and iteratively improve
    • Embrace Failure: Learn from failures and improve processes
    • Document Everything: Maintain documentation for processes and tools
    • Measure Improvement: Track metrics to demonstrate value
    • Cross-Functional Teams: Include all necessary skills in teams

    Technical Practices

    • Infrastructure as Code: Manage infrastructure using code
    • Immutable Infrastructure: Replace servers instead of changing them
    • Comprehensive Testing: Include various testing types (unit, integration, security)
    • Monitoring and Observability: Implement robust monitoring and logging
    • Security Automation: Include security checks throughout the pipeline

    Challenges and Considerations

    • Legacy Systems: Adapting DevOps practices for older systems
    • Organizational Resistance: Overcoming cultural barriers to adoption
    • Skill Gaps: Training teams on new tools and practices
    • Tool Proliferation: Managing the growing ecosystem of tools
    • Balancing Speed and Quality: Maintaining quality while moving quickly
    • Cloud Costs: Managing expenses from automated cloud resource usage
    Link to original

Cloud Architectures

  • Cloud System Design

    • Distributed System Fundamentals

      What Is a Distributed System?

      A distributed system can be defined in several ways:

      • Tanenbaum and van Steen: “A collection of independent computers that appears to its users as a single coherent system”

      • Coulouris, Dollimore and Kindberg: “One in which hardware or software components located at networked computers communicate and coordinate their actions only by passing messages”

      • Lamport: “One that stops you getting work done when a machine you’ve never even heard of crashes”

      Motivations for Distributed Systems

      1. Geographic Distribution: Resources and users are naturally distributed
        • Example: Banking services accessible from different locations while data is centrally stored
      2. Fault Tolerance: Problems rarely affect multiple locations simultaneously
        • Multiple database servers in different rooms provide better reliability
      3. Performance and Scalability: Combining resources for enhanced capabilities
        • High Performance Computing, replicated web servers, etc.

      Examples of Distributed Systems

      • Financial trading platforms
      • Web search engines (processing 50+ billion web pages)
      • Social media platforms supporting billions of users
      • Large Language Models (trained across clusters)
      • Scientific research (e.g., CERN with over 1 Exabyte of data)
      • Content Delivery Networks (CDNs)
      • Online multiplayer games

      Fallacies of Distributed Computing

      Eight classic assumptions that often lead to problematic distributed systems designs (identified at Sun Microsystems):

      1. The network is reliable
      2. Latency is zero
      3. Bandwidth is infinite
      4. The network is secure
      5. There is one administrator
      6. Transport cost is zero
      7. The network is homogeneous
      8. Topology doesn’t change

      Key Aspects of Distributed System Design

      • System Function: The intended purpose (features and capabilities)
      • System Behavior: How the system performs its functions
      • Quality Attributes: Core qualities determining success:
        • Performance
        • Cost
        • Security
        • Dependability

      Challenges in Distributed Systems

      Distributed systems introduce complexity in:

      • Coordination
      • Consistency
      • Fault detection and recovery
      • Security
      • Performance optimization
      Link to original
    • Cloud Systems Quality Attributes

      Quality attributes are non-functional requirements that determine the success of a cloud system beyond its basic functionality.

      Core Quality Attributes

      1. Performance

      • Workload handling: Capacity to process the required volume of operations
      • Efficiency: Resource usage in relation to output
      • Responsiveness: Speed of response to user requests or events
      • Throughput: Total amount of work accomplished in a given time period
      • Latency: Time delay between action and response

      2. Cost

      • Build/deployment costs: Initial setup expenses
      • Operational costs: Ongoing expenses to run the system
      • Maintenance costs: Expenses for updates, fixes, and improvements
      • Resource optimization: Efficient use of hardware, software, and human resources
      • Scaling costs: Expenses related to growth or contraction

      3. Security

      • Access control: Prevention of unauthorized access
      • Data protection: Safeguarding sensitive information
      • Integrity: Ensuring data remains uncorrupted
      • Confidentiality: Keeping private information private
      • Compliance: Meeting regulatory requirements

      4. Dependability

      • Availability: Readiness for correct service
      • Reliability: Continuity of correct service
      • Safety: Freedom from catastrophic consequences
      • Integrity: Absence of improper system alterations
      • Maintainability: Ability to undergo repairs and modifications

      Service and Failure Concepts

      Correct Service vs. Failure

      • Correct service: System implements its function as specified
      • Failure: Deviation from the functional specification
        • Not binary but exists on a spectrum from optimal to complete failure

      Quality of Service (QoS)

      • A measure of how well a system performs
      • The ability to provide guaranteed performance levels
      • Multiple dimensions: latency, bandwidth, security, availability, etc.
      • Highly contextual and defined for specific applications
      • Goal: Highest QoS despite faults at the lowest cost

      Potential Failure Sources in Datacenters

      Hardware Failures

      • Node/server failures (crashes, timing issues, data corruption)
      • Power failures (crashes, possible data corruption)
      • Physical accidents (fire, flood, earthquakes)

      Network Failures

      • Router/gateway failures affecting entire subnets
      • Name server failures impacting name domains
      • Network congestion leading to dropped packets

      Software and Human Factors

      • Software complexity leading to bugs
      • Misconfiguration and human error
      • Security attacks (both external and internal)

      Real-world Datacenter Failures

      • 2008: Amazon S3 major outages affecting US & EU
      • 2011: Amazon EBS and RDS outage lasting 4 days
      • 2015: Apple service disruptions (iTunes, iCloud, Photos)
      • 2016: Google Cloud Platform significant outage
      • 2021: OVHcloud fire destroying datacenters in Strasbourg

      Datacenter Failure Statistics

      • 40% of servers experience crashes/unexpected restarts (Google)
      • 57% of failures lead to VM migrations (Google)
      • Hard drives cause 82% of hardware failures
      • Power & Cooling are the most common cause of outages (71%)
      • Over 60% of failures result in $100,000+ losses
      Link to original
    • Failures and Dependability

      Understanding Failures, Errors, and Faults

      The Fault-Error-Failure Chain

      • Fault: Hypothesized cause of an error
        • A defect in the system (e.g., bug in code, hardware defect)
        • Not all faults lead to errors
      • Error: Deviation from correct system state
        • Manifestation of a fault
        • May exist without causing a failure
        • Examples: erroneous data, inconsistent internal behavior
      • Failure: System service deviating from specification
        • Visible at the service interface
        • Caused by errors propagating to the service interface
        • Examples: crash, incorrect output, timing violation

      Fault Classification

      Faults can be classified along multiple dimensions:

      Phase of Creation or Occurrence

      • Development Faults: Introduced during system development
      • Operational Faults: Occurring during system operation

      System Boundaries

      • Internal Faults: Originating from within the system
      • External Faults: Originating from outside the system

      Phenomenological Cause

      • Natural Faults: Caused by natural phenomena
      • Human-made Faults: Resulting from human actions

      Intent

      • Non-malicious Faults: Without harmful intent
      • Malicious Faults: With harmful intent (attacks)

      Capability/Competence

      • Accidental Faults: Introduced inadvertently
      • Incompetence Faults: Due to lack of skills/knowledge

      Persistence

      • Permanent Faults: Persisting until repaired
      • Transient Faults: Appearing then disappearing

      Failure Spectrum

      Failure isn’t binary but exists on a spectrum:

      • Optimal Service: Meeting functional requirements and balancing all quality attributes
      • Partial Failure: Some parts of the system fail while others continue
      • Degraded Service: System functions but with reduced performance
      • Transient Failure: Temporary interruption with automatic recovery
      • Complete Failure: System becomes unresponsive or produces incorrect results

      Dependability Attributes

      Dependability Tree

      • Attributes

        • Availability: Readiness for correct service
        • Reliability: Continuity of correct service
        • Safety: Freedom from catastrophic consequences
        • Confidentiality: Absence of unauthorized disclosure
        • Integrity: Absence of improper system alterations
        • Maintainability: Ability to undergo repair and evolution
      • Threats

        • Faults
        • Errors
        • Failures
      • Means

        • Fault Prevention
        • Fault Tolerance
        • Fault Removal
        • Fault Forecasting

      Availability and Reliability

      Distinction

      • Availability: System readiness for service when needed
        • Measured as percentage of uptime
        • Focused on accessibility
      • Reliability: System’s ability to function without failure over time
        • Measured as Mean Time Between Failures (MTBF)
        • Focused on continuity

      Examples

      • System with 99.99% availability but produces incorrect results occasionally: High availability, low reliability
      • System that never crashes but shuts down for maintenance one week each year: High reliability, lower availability (98%)
      Link to original
    • High Availability

      Importance of High Availability

      Business Impact

      • Downtime can be extremely costly in today’s interconnected world
      • Minimizes business disruptions, maintains customer satisfaction, and protects revenue

      User Expectations

      • Users expect 24/7 service availability
      • Poor availability damages reputation and user trust

      Critical Systems

      • Essential for healthcare, finance, emergency services, and other critical infrastructure
      • Directly impacts safety and well-being

      Availability Levels (The “9’s”)

      | Availability | Downtime per Year | Downtime per Month | Downtime per Week |
      |---|---|---|---|
      | 90% (one nine) | 36.5 days | 72 hours | 16.8 hours |
      | 99% (two nines) | 3.65 days | 7.2 hours | 1.68 hours |
      | 99.9% (three nines) | 8.76 hours | 43.8 min | 10.1 min |
      | 99.99% (four nines) | 52.6 min | 4.38 min | 1.01 min |
      | 99.999% (five nines) | 5.26 min | 25.9 s | 6.06 s |
      | 99.9999% (six nines) | 31.56 s | 2.59 s | 0.61 s |
      | 99.99999% (seven nines) | 3.16 s | 259 ms | 61 ms |
      • Each additional “9” represents an order-of-magnitude reduction in downtime
      • Higher availability systems require exponentially more effort and resources
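
      As a quick check of the table, yearly downtime is simply the unavailable fraction of a year; for example, for three nines:

      \text{Downtime per year} = (1 - A) \times 365 \times 24\,\text{h} = (1 - 0.999) \times 8760\,\text{h} \approx 8.76\,\text{h}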

      Means to Achieve Dependability

      Fault Prevention

      • Approach: Prevent occurrence of faults proactively
      • Techniques:
        • Suitable design patterns
        • Rigorous requirements analysis
        • Formal verification methods
        • Code reviews and static analysis

      Fault Tolerance

      • Approach: Design systems to continue operation despite faults
      • Techniques:
        • Redundancy in components and systems
        • Error detection mechanisms
        • Recovery mechanisms

      Fault Removal

      • Approach: Identify and reduce existing faults
      • Techniques:
        • Early prototyping
        • Thorough testing
        • Static code analysis
        • Debugging

      Fault Forecasting

      • Approach: Predict future fault occurrence and consequences
      • Techniques:
        • Performance monitoring
        • Incident report analysis
        • Vulnerability auditing

      Foundations of High Availability

      Fault Tolerance

      Key strategies for fault tolerance:

      • Error detection
      • Failover mechanisms (error recovery)
      • Load balancing
      • Redundancy/replication
      • Auto-scaling
      • Graceful degradation
      • Fault isolation

      Error Detection in Data Centers

      • Monitoring: Collecting metrics like CPU, memory, disk I/O
        • Heartbeats for basic health indication
        • Threshold monitoring for overload detection
      • Telemetry: Analyzing metrics across servers
        • Identifying patterns and anomalies
        • Detecting potential security threats
      • Observability: Understanding internal state through outputs
        • Log analysis
        • Tracing communications through the system

      Circuit Breaker Pattern

      • Inspired by electrical circuit breakers
      • States: Closed (normal), Open (after failures), Half-open (testing recovery)
      • Prevents overload of failing services
      • Fails fast rather than degrading under stress

      Hardware Error Detection

      • ECC Memory: Detects and corrects single-bit errors
      • Redundant components: Multiple power supplies, network interfaces

      Real-world Examples

      • Uber’s M3: Platform for storing and querying time-series metrics
      • Netflix’s Mantis: Stream processing of real-time data for monitoring

      Failover Strategies

      Active-Passive Failover

      • Active: Primary system handling all workload
      • Passive: Idle standby system synchronized with active
      • Failover: When active fails, passive becomes active
      • Variations:
        • Cold Standby: Needs booting and configuration
        • Warm Standby: Running but periodically synchronized
        • Hot Standby: Fully synchronized and ready to take over

      Active-Active Failover

      • Multiple systems simultaneously handling workload
      • Load balancer distributes traffic
      • When one system fails, others take over
      • Provides immediate recovery with no downtime

      Decision Factors for Failover Strategy

      • State management and consistency requirements
      • Recovery Time Objective (RTO)
      • Cost constraints
      • Operational complexity
      Link to original

      Modern Cloud Architectures - Microservices

      Evolution from Monolith to Microservices

      Traditional monolithic applications face challenges as they grow:

      • Increasingly difficult to maintain
      • Hard to scale specific components
      • Complex to evolve with changing requirements
      • Technology lock-in

      Microservices architecture emerged as a solution to these challenges.

      What Are Microservices?

      Microservices architecture is an approach to develop a single application as a suite of small services, each:

      • Running in its own process
      • Communicating through lightweight mechanisms (often HTTP/REST APIs)
      • Independently deployable
      • Built around business capabilities
      • Potentially implemented using different technologies

      Key Characteristics of Microservices

      • Loose coupling: Services interact through well-defined interfaces
      • Independent deployment: Each service can be deployed without affecting others
      • Technology diversity: Different services can use different technologies
      • Focused on business capabilities: Services aligned with business domains
      • Small size: Each service focuses on doing one thing well
      • Decentralized data management: Each service manages its own data
      • Automated deployment: CI/CD pipelines for each service
      • Designed for failure: Resilience built in through isolation

      Microservices Architecture Components

      A typical microservices architecture includes:

      1. Core Services: Implement business functionality
      2. API Gateway: Provides a single entry point for clients
      3. Service Registry: Keeps track of service instances and locations
      4. Config Server: Centralized configuration management
      5. Monitoring and Tracing: Distributed system observability
      6. Load Balancer: Distributes traffic among service instances

      Advantages of Microservices

      1. Independent Development:

        • Teams can work on different services simultaneously
        • Faster development cycles
        • Smaller codebases are easier to understand
      2. Technology Flexibility:

        • Each service can use the most appropriate tech stack
        • Easier to adopt new technologies incrementally
      3. Scalability:

        • Services can be scaled independently based on demand
        • More efficient resource utilization
      4. Fault Isolation:

        • Failures in one service don’t necessarily affect others
        • Easier to implement resilience patterns
      5. Maintainability:

        • Smaller codebases are less complex
        • Easier to understand and debug
        • New team members can become productive faster
      6. Reusability:

        • Services can be reused in different contexts
        • Example: Netflix Asgard, Eureka services used in multiple projects

      Disadvantages of Microservices

      1. Complexity:

        • Increased operational overhead with more services to manage and monitor
        • Distributed debugging challenges - tracing issues across multiple services
        • Complexity of service interactions and dependencies
      2. Performance Overhead:

        • Latency due to network communication between services
        • Serialization/deserialization costs
        • Network bandwidth consumption
      3. Operational Challenges:

        • Microservice sprawl - could expand to hundreds or thousands of services
        • Managing CI/CD pipelines for multiple services
        • End-to-end testing becomes more difficult
      4. Failure Patterns:

        • Interdependency chains can cause cascading failures
        • Death spirals (failures in containers of the same service)
        • Retry storms (wasted resources on failed calls)
        • Cascading QoS violations due to bottleneck services
        • Failure recovery potentially slower than in monoliths

      Microservice Communication

      Synchronous Communication

      • REST APIs (HTTP/HTTPS): Simple request-response pattern
      • gRPC: Efficient binary protocol with bidirectional streaming
      • GraphQL: Query-based, client specifies exactly what data it needs

      Pros:

      • Immediate response
      • Simpler to implement
      • Easier to debug

      Cons:

      • Tight coupling
      • Higher latency
      • Lower fault tolerance

      Asynchronous Communication

      • Message queues: RabbitMQ, ActiveMQ
      • Event streaming: Apache Kafka, AWS Kinesis
      • Pub/Sub pattern: Google Cloud Pub/Sub

      Pros:

      • Loose coupling
      • Better scalability
      • Higher fault tolerance

      Cons:

      • More complex to implement
      • Harder to debug
      • Eventually consistent

      Glueware and Support Infrastructure

      Microservices require substantial supporting infrastructure (“glueware”) that often outweighs the core services:

      • Monitoring and logging systems
      • Service discovery mechanisms
      • Load balancing services
      • API gateways
      • Message brokers
      • Circuit breakers for resilience
      • Distributed tracing tools
      • Configuration management

      According to the Cloud Native Computing Foundation’s 2022 survey, glueware now outweighs core microservices in most deployments.

      Avoiding Microservice Sprawl

      To prevent excessive complexity with microservices:

      1. Start with a monolith design

        • Gradually break it down into microservices as needed
        • Identify natural boundaries and avoid over-decomposition
      2. Focus on business capabilities

        • Design around clear business purposes rather than technical functions
      3. Establish clear governance

        • Define guidelines and best practices for microservice development
        • Create standards for naming conventions, communication protocols, etc.
      4. Implement fault-tolerant design patterns

        • Timeouts, bounded retries, circuit breakers
        • Graceful degradation
      Link to original
    Link to original
  • Modern Cloud Architectures

    Modern cloud architectures are built on several key concepts that address the challenges of building large-scale, distributed, and reliable systems. This note provides an overview of the architectural approaches used in modern cloud systems.

    Architectural Foundations

    Modern cloud architectures are founded on two fundamental pillars:

    1. Vertical integration - Enhancing capabilities within individual tiers/services
    2. Horizontal scaling - Using multiple commodity computers working together

    These pillars have led to significant shifts away from monolithic application architectures toward more distributed approaches.

    Architectural Concepts

    Layering

    • Definition: Partitioning services vertically into layers

      • Lower layers provide services to higher ones
      • Higher layers unaware of underlying implementation details
      • Low inter-layer dependency
    • Examples:

      • Network protocol stacks (OSI model)
      • Operating systems (kernel, drivers, libraries, GUI)
      • Games (engine, logic, AI, UI)
    • Advantages:

      • Abstraction
      • Reusability
      • Loose coupling
      • Isolated management and testing
      • Supports software evolution

    Tiering

    • Definition: Mapping layers, and the components within them, onto physical or virtual devices

      • Implies physical location considerations
      • Complements layering
    • Classic Architectures:

      1. 2-tier (client-server): Split layers between client and server
      2. 3-tier: User Interface, Application Logic, Data tiers
      3. n-tier/multi-tier: Further division (e.g., microservices)
    • Advantages:

      • Scalability
      • Availability
      • Flexibility
      • Easier management

    Monolith vs. Distributed Architecture

    Monolithic Architecture

    • Definition: A single, tightly coupled block of code with all application components
    • Advantages:
      • Simple to develop and deploy
      • Easy to test and debug in early stages
    • Disadvantages:
      • Increasing complexity as application grows
      • Difficult to scale individual components
      • Limited agility with slow and risky deployments
      • Technology lock-in

    Distributed Architecture

    • Definition: Application divided into loosely coupled components running on separate servers
    • Advantages:
      • Independent scaling of components
      • Fault isolation
      • Technology diversity
      • Better maintainability
    • Disadvantages:
      • Network communication overhead
      • More complex to manage
      • Distributed debugging challenges

    Practical Application Guidelines

    When designing cloud architectures:

    1. Foundation matters: Just as buildings need proper foundations, cloud architectures require robust infrastructure layers

    2. Consider scalability & modularity: Employ modular techniques for easier expansion and modification

    3. Focus on resource efficiency: Implement auto-scaling, serverless approaches, and efficient resource allocation

    4. Plan for evolution: Design systems that can adapt to new technologies while maintaining stability

    Modern Cloud Architectures - Redundancy

    Redundancy is a key design principle in modern cloud architectures that improves fault tolerance, availability, and performance.

    Why Use Redundancy?

    • Performance: Distribute workload across multiple replicas to improve response time
    • Error Detection: Compare results when replicas disagree
    • Error Recovery: Switch to backup resources when primary fails
    • Fault Tolerance: System continues functioning despite component failures

    Importance of Fault Models

    The effectiveness of redundancy depends on how individual replicas fail:

    • For independent crash faults, the availability of a system with n replicas is:

      Availability = 1-p^n
      

      Where p is the probability of individual failure

    • Example: 5 servers each with 90% uptime → overall availability = 1-(0.10)^5 = 99.999%

    This only holds if failures are truly independent, which requires consideration of common failure modes.
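
    As a quick check of the formula above, the following minimal sketch (not part of the original notes; the values of p and n are illustrative) computes the combined availability of n independently failing replicas:

    def combined_availability(p, n):
        """Availability of n replicas that fail independently, each with failure probability p."""
        return 1 - p ** n

    # The example above: 5 servers, each 90% available (p = 0.10)
    print(f"{combined_availability(0.10, 5):.5%}")  # -> 99.99900%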

    Redundancy by Replication

    Replication involves maintaining multiple copies of:

    • Data
    • Services
    • Infrastructure components

    Data Replication

    • Synchronous Replication: Write operations complete only after all replicas are updated

      • Ensures consistency but increases latency
      • Used for critical data where consistency is paramount
    • Asynchronous Replication: Primary replica acknowledges writes before secondaries are updated

      • Better performance but may lose data if primary fails before replication
      • Used when performance is prioritized over consistency
    • Quorum-based Replication: Write operations complete when a majority of replicas acknowledge

      • Balances availability and consistency

    Service Replication

    • Active-Passive Replication:

      • One active instance handles all requests
      • Passive instances ready to take over if active fails
      • Lower resource utilization but potential downtime during failover
    • Active-Active Replication:

      • Multiple active instances handle requests simultaneously
      • No downtime during instance failure
      • Requires more complex state management

    Infrastructure Redundancy

    Modern cloud data centers implement redundancy at multiple levels:

    Hardware Redundancy

    • Geographic Redundancy:

      • Data centers distributed across multiple regions
      • Mitigates regional outages from natural disasters, power grid failures
      • Data typically replicated across regions
    • Server Redundancy:

      • Servers deployed in clusters with automatic failover
      • If one server fails, another takes over seamlessly
    • Storage Redundancy:

      • Data replicated across multiple devices and technologies
      • RAID configurations protect against disk failures

    Network Redundancy

    1. Server-level Redundancy:

      • Redundant Network Interface Cards (NICs)
      • Dual or more power supplies
    2. Network-level Redundancy:

      • Redundant switches, routers, firewalls, load balancers
    3. Link and Path-level Redundancy:

      • Link aggregation (multiple links between devices)
      • Spanning Tree Protocol to prevent network loops
      • Load balancing across multiple paths

    Network topologies designed for redundancy:

    • Hierarchical/3-tier topology
    • Fat-tree/clos topology

    Power Redundancy

    • Multiple power feeds from different utility substations
    • Uninterruptible Power Supplies (UPS) for temporary outages
    • Backup generators for medium/long-term outages
    • Power Distribution Units with dual inputs

    Cooling Redundancy

    • N+1 configuration (one extra cooling unit than required)
    • Multiple cooling technologies
    • Redundant cooling loops (pipes, heat exchangers, pumps)
    • Hot/cold aisle containment

    Redundancy Challenges

    • Cost: Redundant systems require additional hardware and management
    • Complexity: More components mean more potential failure points
    • Consistency: Maintaining consistent state across replicas
    • Testing: Verifying redundancy actually works as expected
    Link to original

    Modern Cloud Architectures - Scalability

    Scaling Fundamentals

    Scaling is the process of adding or removing resources to match workload demand. In cloud architectures, two primary scaling approaches are used:

    Vertical Scaling (Scaling Up)

    • Definition: Increasing the performance of a single node by adding more resources (CPU cores, memory, etc.)
    • Advantages:
      • Good speedup up to a particular point
      • No application architecture changes required
      • Simpler to implement
    • Disadvantages:
      • Beyond a certain point, speedup becomes very expensive
      • Limited by hardware capabilities
      • Single point of failure remains
      • Potential downtime during scaling operations

    Horizontal Scaling (Scaling Out)

    • Definition: Increasing the number of nodes in the system
    • Advantages:
      • Cost-effective way to grow total resources
      • Better fault tolerance through redundancy
      • Virtually unlimited scaling potential
    • Disadvantages:
      • Requires coordination systems and load balancing
      • Application must be designed for distributed operation
      • More complex to efficiently utilize resources

    Why Horizontal Scaling Dominates Cloud Architectures

    • Hardware Trend: Single-core CPU performance is no longer improving as rapidly as it once did
    • Economic Factor: Large sets of inexpensive commodity servers are more cost-effective
    • Failure Reality: All hardware eventually fails
    • Virtualization Advantage: VMs and containers make it easy to replicate services across nodes

    Dynamic Scaling Architecture

    Modern cloud systems implement dynamic scaling to automatically adjust resources:

    1. Monitoring: Track metrics like CPU usage, memory usage, request rates
    2. Thresholds: Define conditions that trigger scaling actions
    3. Scaling Actions: Add/remove resources when thresholds are crossed
    4. Stabilization: Implement cooldown periods to prevent oscillation

    Example Process Flow:

    1. Consumers send more requests to a service
    2. Existing resources become overloaded, timeouts occur
    3. Auto-scaling detects the condition and deploys additional resources
    4. Traffic is redistributed across all available resources
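
    A minimal sketch of the monitor/threshold/cooldown loop described above is shown below; the thresholds, polling interval, and the get_avg_cpu/add_instance/remove_instance callbacks are placeholders, not a real cloud API:

    import time

    SCALE_UP_CPU, SCALE_DOWN_CPU = 0.80, 0.30   # illustrative thresholds
    COOLDOWN_SECONDS = 300                      # stabilization period to prevent oscillation

    def autoscale_loop(get_avg_cpu, add_instance, remove_instance):
        last_action = 0.0
        while True:
            cpu = get_avg_cpu()                              # 1. monitoring
            in_cooldown = time.time() - last_action < COOLDOWN_SECONDS
            if not in_cooldown:
                if cpu > SCALE_UP_CPU:                       # 2. threshold crossed
                    add_instance()                           # 3. scaling action
                    last_action = time.time()                # 4. start cooldown
                elif cpu < SCALE_DOWN_CPU:
                    remove_instance()
                    last_action = time.time()
            time.sleep(30)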

    Scaling and State

    Scaling approaches differ based on whether components are stateless or stateful:

    Stateless Components

    • Definition: Maintain no internal state beyond processing a single request
    • Examples: Web servers with static content, DNS servers, mathematical calculation services
    • Scaling Approach: Simply create more instances and distribute requests via load balancing

    Stateful Components

    • Definition: Maintain state beyond a single request (prior state is required to process future requests)
    • Examples: Database servers, mail servers, stateful web servers, session management
    • Scaling Approach: More complex, typically requires partitioning and/or replication

    Stateless Load Balancing

    DNS-Level Load Balancing

    • Implementation: DNS servers resolve domain names to different IP addresses
    • Advantages: Simple, cost-effective, can use geographical location
    • Disadvantages: Slow to react to failures due to DNS caching, limited health checks

    IP-Level Load Balancing

    • Implementation: Routers direct clients to different locations using IP anycast
    • Advantages: Relatively simple, faster response to failures
    • Disadvantages: Less granular, assumes all requests create equal load

    Application-Level Load Balancing

    • Implementation: Dedicated load balancer acting as a front end
    • Advantages: Granular control, content-based routing, SSL offloading
    • Disadvantages: Increased complexity, performance overhead, higher latency

    Stateful Scaling

    Scaling stateful services presents unique challenges:

    Partitioning (Sharding)

    • Definition: Dividing data into distinct, independent parts
    • Purpose: Improves scalability (performance), but not availability
    • Key Consideration: Each data item is stored in only one partition

    Partitioning Schemes:

    1. Per-Tenant Partitioning

      • Put different tenants on different machines
      • Good isolation and scalability
      • Challenging when a tenant grows beyond one machine
    2. Horizontal Sharding

      • Split table by rows across different servers
      • Each shard has same schema but contains subset of rows
      • Easy to scale out, reduces indices
      • Examples: Google BigTable, MongoDB
    3. Vertical Partitioning

      • Split table by columns, grouping related columns
      • Improves performance for specific queries
      • Doesn’t inherently support scaling across multiple servers

    Distribution Strategies:

    • Range Partitioning

      • Related data stored together
      • Efficient for range queries
      • Poor load balancing, requires manual adjustment
    • Hash Partitioning

      • Uniform distribution
      • Good load balancing
      • Inefficient for range queries
      • Requires reorganization when number of partitions changes
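
    To make the trade-off concrete, here is a small illustrative sketch (not from the notes) of hash partitioning: keys spread uniformly across shards, but changing the shard count forces most keys to move:

    import hashlib

    def shard_for(key: str, num_shards: int) -> int:
        """Assign a key to a shard by hashing (uniform distribution, but poor for range queries)."""
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % num_shards

    keys = [f"user-{i}" for i in range(1000)]
    before = {k: shard_for(k, 4) for k in keys}
    after = {k: shard_for(k, 5) for k in keys}   # add one shard
    moved = sum(before[k] != after[k] for k in keys)
    print(f"{moved} of {len(keys)} keys change shard after repartitioning")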
    Link to original

    Modern Cloud Architectures - Microservices

    Evolution from Monolith to Microservices

    Traditional monolithic applications face challenges as they grow:

    • Increasingly difficult to maintain
    • Hard to scale specific components
    • Complex to evolve with changing requirements
    • Technology lock-in

    Microservices architecture emerged as a solution to these challenges.

    What Are Microservices?

    Microservices architecture is an approach to developing a single application as a suite of small services, each:

    • Running in its own process
    • Communicating through lightweight mechanisms (often HTTP/REST APIs)
    • Independently deployable
    • Built around business capabilities
    • Potentially implemented using different technologies

    Key Characteristics of Microservices

    • Loose coupling: Services interact through well-defined interfaces
    • Independent deployment: Each service can be deployed without affecting others
    • Technology diversity: Different services can use different technologies
    • Focused on business capabilities: Services aligned with business domains
    • Small size: Each service focuses on doing one thing well
    • Decentralized data management: Each service manages its own data
    • Automated deployment: CI/CD pipelines for each service
    • Designed for failure: Resilience built in through isolation

    Microservices Architecture Components

    A typical microservices architecture includes:

    1. Core Services: Implement business functionality
    2. API Gateway: Provides a single entry point for clients
    3. Service Registry: Keeps track of service instances and locations
    4. Config Server: Centralized configuration management
    5. Monitoring and Tracing: Distributed system observability
    6. Load Balancer: Distributes traffic among service instances

    Advantages of Microservices

    1. Independent Development:

      • Teams can work on different services simultaneously
      • Faster development cycles
      • Smaller codebases are easier to understand
    2. Technology Flexibility:

      • Each service can use the most appropriate tech stack
      • Easier to adopt new technologies incrementally
    3. Scalability:

      • Services can be scaled independently based on demand
      • More efficient resource utilization
    4. Fault Isolation:

      • Failures in one service don’t necessarily affect others
      • Easier to implement resilience patterns
    5. Maintainability:

      • Smaller codebases are less complex
      • Easier to understand and debug
      • New team members can become productive faster
    6. Reusability:

      • Services can be reused in different contexts
      • Example: Netflix Asgard, Eureka services used in multiple projects

    Disadvantages of Microservices

    1. Complexity:

      • Increased operational overhead with more services to manage and monitor
      • Distributed debugging challenges - tracing issues across multiple services
      • Complexity of service interactions and dependencies
    2. Performance Overhead:

      • Latency due to network communication between services
      • Serialization/deserialization costs
      • Network bandwidth consumption
    3. Operational Challenges:

      • Microservice sprawl - could expand to hundreds or thousands of services
      • Managing CI/CD pipelines for multiple services
      • End-to-end testing becomes more difficult
    4. Failure Patterns:

      • Interdependency chains can cause cascading failures
      • Death spirals (load from failed instances overwhelming the remaining instances of the same service)
      • Retry storms (wasted resources on failed calls)
      • Cascading QoS violations due to bottleneck services
      • Failure recovery potentially slower than in monoliths

    Microservice Communication

    Synchronous Communication

    • REST APIs (HTTP/HTTPS): Simple request-response pattern
    • gRPC: Efficient binary protocol with bidirectional streaming
    • GraphQL: Query-based, client specifies exactly what data it needs

    Pros:

    • Immediate response
    • Simpler to implement
    • Easier to debug

    Cons:

    • Tight coupling
    • Higher latency
    • Lower fault tolerance

    Asynchronous Communication

    • Message queues: RabbitMQ, ActiveMQ
    • Event streaming: Apache Kafka, AWS Kinesis
    • Pub/Sub pattern: Google Cloud Pub/Sub

    Pros:

    • Loose coupling
    • Better scalability
    • Higher fault tolerance

    Cons:

    • More complex to implement
    • Harder to debug
    • Eventually consistent
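
    The difference between the two styles can be sketched as follows; the URL is a placeholder, and Python's in-process queue.Queue merely stands in for a real broker such as RabbitMQ or Kafka:

    import queue
    import requests

    # Synchronous: the caller blocks until the other service answers (tight coupling).
    def get_order_sync(order_id):
        resp = requests.get(f"http://order-service/orders/{order_id}", timeout=5)
        resp.raise_for_status()
        return resp.json()

    # Asynchronous: the caller publishes an event and moves on (loose coupling,
    # eventual consistency); a real system would publish to a message broker instead.
    order_events = queue.Queue()

    def publish_order_created(order_id):
        order_events.put({"event": "order_created", "order_id": order_id})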

    Glueware and Support Infrastructure

    Microservices require substantial supporting infrastructure (“glueware”) that often outweighs the core services:

    • Monitoring and logging systems
    • Service discovery mechanisms
    • Load balancing services
    • API gateways
    • Message brokers
    • Circuit breakers for resilience
    • Distributed tracing tools
    • Configuration management

    According to the Cloud Native Computing Foundation’s 2022 survey, glueware now outweighs core microservices in most deployments.

    Avoiding Microservice Sprawl

    To prevent excessive complexity with microservices:

    1. Start with a monolith design

      • Gradually break it down into microservices as needed
      • Identify natural boundaries and avoid over-decomposition
    2. Focus on business capabilities

      • Design around clear business purposes rather than technical functions
    3. Establish clear governance

      • Define guidelines and best practices for microservice development
      • Create standards for naming conventions, communication protocols, etc.
    4. Implement fault-tolerant design patterns

      • Timeouts, bounded retries, circuit breakers
      • Graceful degradation
    Link to original

    Link to original
  • High Availability

    Importance of High Availability

    Business Impact

    • Downtime can be extremely costly in today’s interconnected world
    • Minimizes business disruptions, maintains customer satisfaction, and protects revenue

    User Expectations

    • Users expect 24/7 service availability
    • Poor availability damages reputation and user trust

    Critical Systems

    • Essential for healthcare, finance, emergency services, and other critical infrastructure
    • Directly impacts safety and well-being

    Availability Levels (The “9’s”)

    Availability | Downtime per Year | Downtime per Month | Downtime per Week
    90% (one nine) | 36.5 days | 72 hours | 16.8 hours
    99% (two nines) | 3.65 days | 7.2 hours | 1.68 hours
    99.9% (three nines) | 8.76 hours | 43.8 min | 10.1 min
    99.99% (four nines) | 52.6 min | 4.38 min | 1.01 min
    99.999% (five nines) | 5.26 min | 25.9 s | 6.06 s
    99.9999% (six nines) | 31.56 s | 2.59 s | 0.61 s
    99.99999% (seven nines) | 3.16 s | 259 ms | 61 ms
    • Each additional “9” represents an order-of-magnitude reduction in downtime
    • Higher availability systems require exponentially more effort and resources
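
    The downtime figures in the table follow directly from the availability percentage; a short sketch for reproducing the yearly column:

    def downtime_per_year(availability_percent):
        """Hours of downtime per year implied by an availability percentage."""
        return (1 - availability_percent / 100) * 365 * 24

    for nines in [90, 99, 99.9, 99.99, 99.999]:
        print(f"{nines}% -> {downtime_per_year(nines):.2f} hours/year")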

    Means to Achieve Dependability

    Fault Prevention

    • Approach: Prevent occurrence of faults proactively
    • Techniques:
      • Suitable design patterns
      • Rigorous requirements analysis
      • Formal verification methods
      • Code reviews and static analysis

    Fault Tolerance

    • Approach: Design systems to continue operation despite faults
    • Techniques:
      • Redundancy in components and systems
      • Error detection mechanisms
      • Recovery mechanisms

    Fault Removal

    • Approach: Identify and reduce existing faults
    • Techniques:
      • Early prototyping
      • Thorough testing
      • Static code analysis
      • Debugging

    Fault Forecasting

    • Approach: Predict future fault occurrence and consequences
    • Techniques:
      • Performance monitoring
      • Incident report analysis
      • Vulnerability auditing

    Foundations of High Availability

    Fault Tolerance

    Key strategies for fault tolerance:

    • Error detection
    • Failover mechanisms (error recovery)
    • Load balancing
    • Redundancy/replication
    • Auto-scaling
    • Graceful degradation
    • Fault isolation

    Error Detection in Data Centers

    • Monitoring: Collecting metrics like CPU, memory, disk I/O
      • Heartbeats for basic health indication
      • Threshold monitoring for overload detection
    • Telemetry: Analyzing metrics across servers
      • Identifying patterns and anomalies
      • Detecting potential security threats
    • Observability: Understanding internal state through outputs
      • Log analysis
      • Tracing communications through the system
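
    A toy heartbeat-based failure detector, assuming each server periodically reports a timestamp (the server names and timeout value are illustrative):

    import time

    HEARTBEAT_TIMEOUT = 10  # seconds without a heartbeat before a node is suspected

    last_heartbeat = {}     # server name -> timestamp of the last heartbeat received

    def record_heartbeat(server):
        last_heartbeat[server] = time.time()

    def suspected_failures():
        now = time.time()
        return [s for s, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]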

    Circuit Breaker Pattern

    • Inspired by electrical circuit breakers
    • States: Closed (normal), Open (after failures), Half-open (testing recovery)
    • Prevents overload of failing services
    • Fails fast rather than degrading under stress

    Hardware Error Detection

    • ECC Memory: Detects and corrects single-bit errors
    • Redundant components: Multiple power supplies, network interfaces

    Real-world Examples

    • Uber’s M3: Platform for storing and querying time-series metrics
    • Netflix’s Mantis: Stream processing of real-time data for monitoring

    Failover Strategies

    Active-Passive Failover

    • Active: Primary system handling all workload
    • Passive: Idle standby system synchronized with active
    • Failover: When active fails, passive becomes active
    • Variations:
      • Cold Standby: Needs booting and configuration
      • Warm Standby: Running but periodically synchronized
      • Hot Standby: Fully synchronized and ready to take over

    Active-Active Failover

    • Multiple systems simultaneously handling workload
    • Load balancer distributes traffic
    • When one system fails, others take over
    • Provides immediate recovery with no downtime

    Decision Factors for Failover Strategy

    • State management and consistency requirements
    • Recovery Time Objective (RTO)
    • Cost constraints
    • Operational complexity
    Link to original
  • Fault Tolerance

    Fault tolerance is the ability of a system to continue operating properly in the event of the failure of one or more of its components. It’s a key attribute for achieving high availability and reliability in distributed systems, especially in cloud environments where component failures are expected rather than exceptional.

    Core Concepts

    Faults vs. Failures

    It’s important to distinguish between faults and failures:

    • Fault: A defect in a system component that can lead to an incorrect state
    • Error: The manifestation of a fault that causes a deviation from correctness
    • Failure: When a system deviates from its specified behavior due to errors

    Fault tolerance aims to prevent faults from becoming system failures.

    Types of Faults

    Faults can be categorized in several ways:

    By Duration

    • Transient Faults: Occur once and disappear (e.g., network packet loss)
    • Intermittent Faults: Occur occasionally and unpredictably (e.g., connection timeouts)
    • Permanent Faults: Persist until the faulty component is repaired (e.g., hardware failures)

    By Behavior

    • Crash Faults: Components stop functioning completely
    • Omission Faults: Components fail to respond to some requests
    • Timing Faults: Components respond too early or too late
    • Byzantine Faults: Components behave arbitrarily or maliciously

    By Source

    • Hardware Faults: Physical component failures
    • Software Faults: Bugs, memory leaks, resource exhaustion
    • Network Faults: Communication failures, partitions
    • Operational Faults: Human errors, configuration issues

    Fault Tolerance Mechanisms

    Error Detection

    Before handling faults, they must be detected:

    • Heartbeats: Regular signals exchanged between components to verify liveness
    • Watchdogs: Timers that trigger recovery if not reset within expected intervals
    • Checksums and CRCs: Detect data corruption
    • Consensus Protocols: Detect inconsistencies between distributed components
    • Health Checks: Active probing to verify component functionality

    Redundancy

    Redundancy is the foundation of most fault tolerance systems:

    Hardware Redundancy

    • Passive Redundancy: Standby components take over when primary ones fail
    • Active Redundancy: Multiple components perform the same function simultaneously
    • N-Modular Redundancy: System produces output based on majority voting among redundant components
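
    For example, triple modular redundancy (TMR) can be sketched as a majority vote over the outputs of three independent computations of the same function; this is only an illustration, not a production voter:

    from collections import Counter

    def majority_vote(replica_outputs):
        """Return the value produced by a majority of replicas, or None if no majority exists."""
        value, count = Counter(replica_outputs).most_common(1)[0]
        return value if count > len(replica_outputs) / 2 else None

    # One faulty replica is outvoted by the two correct ones
    print(majority_vote([42, 42, 17]))  # -> 42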

    Information Redundancy

    • Error-Correcting Codes: Add redundant data to detect and correct errors
    • Checksums: Allow detection of data corruption
    • Replication: Maintaining multiple copies of data across different locations

    Time Redundancy

    • Retry Logic: Repeating operations that fail
    • Idempotent Operations: Operations that can be safely repeated without additional effects

    Fault Isolation

    Containing faults to prevent their propagation through the system:

    • Bulkheads: Isolating components so failure in one doesn’t affect others
    • Circuit Breakers: Preventing cascading failures by stopping requests to failing components
    • Sandboxing: Running code in restricted environments
    • Process Isolation: Using separate processes with distinct memory spaces

    Recovery Techniques

    Techniques for returning to normal operation after a fault:

    • Rollback: Returning to a previous known-good state
    • Rollforward: Moving to a new state that bypasses the fault
    • Checkpointing: Periodically saving system state for recovery
    • Process Pairs: Primary process with a backup that can take over
    • Transactions: All-or-nothing operations that maintain consistency
    • Compensation: Executing operations that reverse the effects of failed operations

    Fault Tolerance Patterns

    Circuit Breaker Pattern

    The Circuit Breaker pattern is designed to detect failures and prevent cascade failures in distributed systems:

    • Closed State: Normal operation, requests pass through
    • Open State: After failures exceed a threshold, requests are rejected without attempting operation
    • Half-Open State: After a timeout, allows limited requests to test if the system has recovered
    ┌─────────────┐   ┌──────────────────┐   ┌─────────────┐
    │             │   │                  │   │             │
    │   Client    │──▶│  Circuit Breaker │──▶│   Service   │
    │             │   │                  │   │             │
    └─────────────┘   └──────────────────┘   └─────────────┘
    

    Bulkhead Pattern

    Based on ship compartmentalization, the Bulkhead pattern isolates elements of an application to prevent failures from cascading:

    • Thread Pool Isolation: Separate thread pools for different services
    • Process Isolation: Different services run in separate processes
    • Service Isolation: Different functionalities in different services

    Retry Pattern

    The Retry pattern handles transient failures by automatically retrying failed operations:

    • Simple Retry: Immediate retry after failure
    • Retry with Backoff: Increasing delays between retries
    • Exponential Backoff: Exponentially increasing delays
    • Jitter: Adding randomness to retry intervals to prevent thundering herd problems
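
    Building on the Lab 6 retry example shown later in this note, exponential backoff with full jitter could be sketched as follows (the delay parameters are illustrative):

    import random
    import time

    import requests

    def get_with_backoff(url, max_retries=5, base_delay=0.5, max_delay=30):
        for attempt in range(max_retries + 1):
            try:
                response = requests.get(url, timeout=5)
                response.raise_for_status()
                return response.json()
            except requests.exceptions.RequestException:
                if attempt == max_retries:
                    raise
                # Exponential backoff capped at max_delay, with full jitter
                delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
                time.sleep(delay)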

    Fallback Pattern

    When an operation fails, the Fallback pattern provides an alternative solution:

    • Graceful Degradation: Providing reduced functionality
    • Cache Fallback: Using cached data when live data is unavailable
    • Default Values: Substituting default values when actual values cannot be retrieved
    • Alternative Services: Using backup services when primary services fail

    Timeout Pattern

    The Timeout pattern sets time limits on operations to prevent indefinite waiting:

    • Connection Timeouts: Limit time spent establishing connections
    • Request Timeouts: Limit time waiting for responses
    • Resource Timeouts: Limit time waiting for resource acquisition
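
    With the requests library used in the labs, connection and read timeouts can be set per call; the URL and values below are illustrative:

    import requests

    try:
        # (connect timeout, read timeout) in seconds
        response = requests.get("http://example.com/api", timeout=(3.05, 10))
    except requests.exceptions.ConnectTimeout:
        print("Could not establish a connection in time")
    except requests.exceptions.ReadTimeout:
        print("Server accepted the connection but did not answer in time")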

    Practical Implementation

    Fault-Tolerant Microservices

    Microservices architectures implement fault tolerance through:

    • Service Independence: Isolating services to contain failures
    • API Gateways: Routing, load balancing, and failure handling
    • Service Discovery: Dynamically finding available service instances
    • Client-Side Load Balancing: Distributing requests across multiple instances

    Resilient Data Management

    Data systems achieve fault tolerance through:

    • Database Replication: Primary-secondary or multi-primary configurations
    • Partitioning/Sharding: Spreading data across multiple nodes
    • Consistent Hashing: Minimizing data redistribution when nodes change
    • Eventual Consistency: Tolerating temporary inconsistencies for higher availability
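
    Consistent hashing can be illustrated with a minimal hash ring; real systems add virtual nodes and replication, which this sketch deliberately omits:

    import bisect
    import hashlib

    class HashRing:
        def __init__(self, nodes):
            self.ring = sorted((self._hash(n), n) for n in nodes)

        @staticmethod
        def _hash(key):
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def node_for(self, key):
            """Walk clockwise from the key's position to the next node on the ring."""
            h = self._hash(key)
            idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
            return self.ring[idx][1]

    ring = HashRing(["node-a", "node-b", "node-c"])
    print(ring.node_for("user-42"))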

    Cloud-Specific Fault Tolerance

    Cloud platforms provide various fault tolerance features:

    • Auto-scaling Groups: Automatically replace failed instances
    • Multi-Zone Deployments: Spreading resources across failure domains
    • Managed Services: Abstracting fault tolerance complexity
    • Health Checks and Load Balancing: Routing traffic away from unhealthy instances

    Testing Fault Tolerance

    Chaos Engineering

    Systematically injecting failures to test resilience:

    • Principles: Build a hypothesis, define “normal,” inject failures, observe, improve
    • Failure Injection: Network delays, server failures, resource exhaustion
    • Game Days: Scheduled events to simulate failures and practice recovery
    • Tools: Chaos Monkey, Gremlin, Chaos Toolkit

    Fault Injection Testing

    Deliberately introducing faults to validate fault tolerance:

    • Unit Level: Testing individual components
    • Integration Level: Testing interactions between components
    • System Level: Testing entire system resilience
    • Production Testing: Carefully controlled testing in production environments

    Advanced Concepts

    Self-Healing Systems

    Systems that automatically detect and recover from failures:

    • Autonomous Agents: Components that monitor and heal the system
    • Control Loops: Continuous monitoring and adjustment
    • Emergent Behavior: System-level resilience from simple component-level rules

    Byzantine Fault Tolerance

    Handling arbitrary failures, including malicious behavior:

    • Byzantine Agreement: Protocols for reaching consensus despite malicious nodes
    • Practical Byzantine Fault Tolerance (PBFT): Algorithm for state machine replication
    • Blockchain Consensus: Mechanisms like Proof of Work and Proof of Stake

    Antifragility

    Systems that don’t just resist or tolerate stress but actually improve from it:

    • Learning from Failures: Automatically adapting based on failure patterns
    • Stress Testing: Deliberately applying stress to identify weaknesses
    • Overcompensation: Building stronger systems in response to failures

    Case Studies from Lab Exercises

    Retry and Fallback Implementation

    As practiced in Lab 6, a robust HTTP client implements fault tolerance through:

    import time

    import requests

    def make_request_with_retry(url, max_retries=3, retry_delay=1):
        for attempt in range(max_retries + 1):
            try:
                response = requests.get(url)
                response.raise_for_status()  # treat HTTP error status codes as failures
                return response.json()
            except requests.exceptions.RequestException as e:
                print(f"Attempt {attempt + 1} failed: {e}")
                if attempt < max_retries:
                    print(f"Retrying in {retry_delay} seconds...")
                    time.sleep(retry_delay)
                else:
                    # Fallback response once all retries are exhausted
                    return {"message": "Service unavailable (fallback)"}

    Circuit Breaker Implementation

    A simplified circuit breaker can be implemented as:

    import time

    class CircuitBreaker:
        CLOSED = 'CLOSED'
        OPEN = 'OPEN'
        HALF_OPEN = 'HALF_OPEN'
        
        def __init__(self, failure_threshold=3, recovery_timeout=10):
            self.state = self.CLOSED
            self.failure_count = 0
            self.failure_threshold = failure_threshold
            self.recovery_timeout = recovery_timeout
            self.last_failure_time = None
            
        def execute(self, function, *args, **kwargs):
            if self.state == self.OPEN:
                # Check if recovery timeout has elapsed
                if time.time() - self.last_failure_time > self.recovery_timeout:
                    self.state = self.HALF_OPEN
                    print("Circuit half-open, testing the service")
                else:
                    print("Circuit open, using fallback")
                    return self._get_fallback()
                    
            try:
                result = function(*args, **kwargs)
                # Success - reset circuit if in half-open state
                if self.state == self.HALF_OPEN:
                    self.state = self.CLOSED
                    self.failure_count = 0
                    print("Circuit closed")
                return result
            except Exception as e:
                # Failure - update circuit state
                self.last_failure_time = time.time()
                self.failure_count += 1
                if self.state == self.CLOSED and self.failure_count >= self.failure_threshold:
                    self.state = self.OPEN
                    print("Circuit opened due to failures")
                elif self.state == self.HALF_OPEN:
                    self.state = self.OPEN
                    print("Circuit opened again due to failure in half-open state")
                raise e
                
        def _get_fallback(self):
            # Return cached or default data
            return {"message": "Service unavailable (circuit breaker)", "data": [1, 2, 3]}
    Link to original
  • Load Balancing

    Load balancing is the process of distributing network traffic across multiple servers to ensure no single server bears too much demand. By spreading the workload, load balancing improves application responsiveness and availability, while preventing server overload.

    Core Concepts

    Purpose of Load Balancing

    Load balancing serves several critical functions:

    • Scalability: Handling growing workloads by adding more servers
    • Availability: Ensuring service continuity even if some servers fail
    • Reliability: Redirecting traffic away from failed or degraded servers
    • Performance: Optimizing response times and resource utilization
    • Efficiency: Maximizing throughput and minimizing latency

    Load Balancer Placement

    Load balancers can operate at various points in the infrastructure:

    • Client-Side: Load balancing decisions made by clients (e.g., DNS-based)
    • Server-Side: Dedicated load balancer in front of server pool
    • Network-Based: Load balancing within the network infrastructure
    • Global: Geographic distribution of traffic across multiple data centers

    Load Balancing Algorithms

    Static Algorithms

    Static algorithms don’t consider the real-time state of servers:

    Round Robin

    • Each request is assigned to servers in circular order
    • Simple and fair but doesn’t account for server capacity or load
    • Variants: Weighted Round Robin gives some servers higher priority

    IP Hash

    • Uses the client’s IP address to determine which server receives the request
    • Ensures the same client always reaches the same server (session affinity)
    • Useful for stateful applications where session persistence matters
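
    The two static algorithms above can be sketched in a few lines of Python; the server addresses are placeholders:

    import hashlib
    import itertools

    servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

    # Round robin: hand out servers in circular order
    rr = itertools.cycle(servers)
    print(next(rr), next(rr), next(rr), next(rr))   # wraps around after the third server

    # IP hash: the same client IP always maps to the same server (session affinity)
    def server_for(client_ip):
        digest = hashlib.sha1(client_ip.encode()).hexdigest()
        return servers[int(digest, 16) % len(servers)]

    print(server_for("203.0.113.7"))   # deterministic choice for this client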

    Dynamic Algorithms

    Dynamic algorithms adapt based on server conditions:

    Least Connections

    • Directs traffic to the server with the fewest active connections
    • Assumes connections require roughly equal processing time
    • Variants: Weighted Least Connections accounts for different server capacities

    Least Response Time

    • Sends requests to the server with the lowest response time
    • Better distributes load based on actual server performance
    • More CPU-intensive for the load balancer to implement

    Resource-Based

    • Distributes load based on CPU usage, memory, bandwidth, or other metrics
    • Requires monitoring agents on servers to report resource utilization
    • Most accurate but most complex to implement

    Types of Load Balancers

    Layer 4 Load Balancers (Transport Layer)

    • Operates at the network/transport layer (TCP/UDP)
    • Routes traffic based on IP address and port
    • Faster and less resource-intensive
    • Cannot see the content of the request
    • Examples: HAProxy (TCP mode), Nginx (stream module), AWS Network Load Balancer

    Layer 7 Load Balancers (Application Layer)

    • Operates at the application layer (HTTP/HTTPS)
    • Routes based on request content (URL, headers, cookies, etc.)
    • More intelligent routing decisions possible
    • Higher overhead and latency
    • Examples: Nginx, HAProxy (HTTP mode), AWS Application Load Balancer

    Global Server Load Balancing (GSLB)

    • Distributes traffic across multiple data centers
    • Uses DNS to direct clients to the optimal data center
    • Considers geographic proximity, data center health, and capacity
    • Examples: AWS Route 53, Cloudflare Load Balancing, Akamai Global Traffic Management

    Load Balancer Implementations

    Hardware Load Balancers

    • Purpose-built physical appliances
    • Examples: F5 BIG-IP, Citrix ADC, A10 Networks
    • Advantages: High performance, hardware acceleration
    • Disadvantages: Expensive, limited scalability, harder to automate

    Software Load Balancers

    • Software running on standard servers
    • Examples: Nginx, HAProxy, Traefik
    • Advantages: Flexibility, cost-effectiveness, programmability
    • Disadvantages: Potentially lower performance than hardware solutions

    Cloud Load Balancers

    • Managed load balancing services offered by cloud providers
    • Examples: AWS Elastic Load Balancing, Google Cloud Load Balancing, Azure Load Balancer
    • Advantages: Managed service, automatic scaling, high availability
    • Disadvantages: Vendor lock-in, less customization

    Configuration Example: Nginx as a Load Balancer

    Nginx is a popular web server that can also function as a load balancer. Here’s a basic configuration example:

    http {
        upstream backend {
            # Round-robin load balancing (default)
            server backend1.example.com:8080;
            server backend2.example.com:8080;
            
            # Weighted load balancing
            # server backend1.example.com:8080 weight=3;
            # server backend2.example.com:8080 weight=1;
            
            # Least connections
            # least_conn;
            
            # IP hash for session persistence
            # ip_hash;
        }
        
        server {
            listen 80;
            
            location / {
                proxy_pass http://backend;
                proxy_set_header Host $host;
                proxy_set_header X-Real-IP $remote_addr;
            }
        }
    }

    This configuration defines an “upstream” group of backend servers and sets up a proxy to distribute requests among them.

    Advanced Load Balancing Features

    Health Checks

    Health checks monitor server availability and readiness:

    • Passive: Monitoring real client connections for failures
    • Active: Sending test requests to verify server health
    • Deep: Checking application functionality, not just connectivity

    Example in Nginx:

    upstream backend {
        server backend1.example.com:8080 max_fails=3 fail_timeout=30s;
        server backend2.example.com:8080 max_fails=3 fail_timeout=30s;
    }

    Session Persistence

    Mechanisms to ensure a client’s requests are sent to the same server:

    • Cookie-Based: Load balancer inserts a cookie identifying the server
    • IP-Based: Uses client IP address to select server
    • SSL Session ID: Uses SSL session identifier

    SSL Termination

    Handling SSL/TLS encryption at the load balancer:

    • Decrypts incoming requests and encrypts outgoing responses
    • Reduces CPU load on backend servers
    • Centralizes certificate management
    • Potential security considerations for sensitive data

    Load Balancing in Practice

    Microservices Architecture

    In a Microservices Architecture, load balancers play crucial roles:

    • Service-to-service communication balancing
    • API gateway load balancing
    • Cross-service load distribution
    • Service discovery integration

    Containerized Environments

    Load balancing in container orchestration platforms:

    • Kubernetes: Service objects, Ingress controllers
    • Docker Swarm: Built-in routing mesh
    • Service Mesh: Advanced traffic management (e.g., Istio, Linkerd)

    Load Balancing Patterns

    Blue-Green Deployment

    Using load balancers to switch between two identical environments:

    1. Blue environment serves all traffic initially
    2. Green environment is prepared with a new version
    3. Load balancer switches traffic from blue to green when ready
    4. If issues occur, traffic can be switched back to blue

    Canary Deployment

    Gradually shifting traffic to a new version:

    1. Most traffic goes to stable version
    2. Small percentage routed to new version
    3. Monitor performance and errors
    4. Gradually increase traffic to new version if stable

    Monitoring and Metrics

    Key metrics to monitor for load balancers:

    • Request Rate: Number of requests per second
    • Error Rate: Percentage of requests resulting in errors
    • Response Time: Average and percentile response times
    • Connection Count: Active and idle connections
    • Backend Health: Status of backend servers
    • Resource Utilization: CPU, memory, network usage of the load balancer

    Case Study from Lab Exercises

    In Lab 7, we implemented a simple load balancing system using Nginx and Docker:

    Architecture

    • Two identical web services running in Docker containers
    • Nginx configured as a reverse proxy and load balancer
    • Docker networking for inter-container communication

    Implementation Highlights

    1. Web Services: Simple Flask applications that identify themselves

    import os
    from flask import Flask

    app = Flask(__name__)

    @app.route('/')
    def hello():
        if "service1" in os.environ.get("SERVER_NAME", ""):
            return "Hello from Service 1"
        else:
            return "Hello from Service 2"
    2. Nginx Configuration: Load balancer setup with round-robin algorithm
    upstream backend {
        server service1:5055;
        server service2:5055;
    }
     
    server {
        listen 80;
        location / {
            proxy_pass http://backend;
        }
    }
    3. Weighted Load Balancing: Configuring uneven traffic distribution
    upstream backend {
        server service1:5055 weight=3;
        server service2:5055 weight=1;
    }

    This lab demonstrates how load balancing distributes requests across multiple instances, providing redundancy and improved fault tolerance.

    Link to original

Cloud Service Models

  • Cloud Service Models

    Cloud Provisioning Models

    Cloud computing offers different service models, each providing a different level of abstraction and management. These models define what resources are managed by the provider versus the customer.

    Traditional Service Models

    Infrastructure as a Service (IaaS)

    Definition: Provider provisions processing, storage, network, and other fundamental computing resources where the customer can deploy and run arbitrary software, including operating systems and applications.

    Customer manages:

    • Operating systems
    • Middleware
    • Applications
    • Data
    • Runtime environments

    Provider manages:

    • Servers and storage
    • Networking
    • Virtualization
    • Data center infrastructure

    Key characteristics:

    • Most flexible cloud service model
    • Customer has maximum control over infrastructure configuration
    • Requires the most technical expertise to manage

    Examples:

    • Amazon EC2
    • Google Compute Engine
    • Microsoft Azure VMs
    • OpenStack

    Platform as a Service (PaaS)

    Definition: Customer deploys applications onto cloud infrastructure using programming languages, libraries, services, and tools supported by the provider.

    Customer manages:

    • Applications
    • Data
    • Some configuration settings

    Provider manages:

    • Operating systems
    • Middleware
    • Runtime
    • Servers and storage
    • Networking
    • Data center infrastructure

    Key characteristics:

    • Reduces complexity of infrastructure management
    • Accelerates application deployment
    • Often includes development tools and services
    • Less control compared to IaaS

    Examples:

    • Heroku
    • Google App Engine
    • Microsoft Azure App Service
    • AWS Elastic Beanstalk

    Software as a Service (SaaS)

    Definition: Provider delivers applications running on cloud infrastructure accessible through various client devices, typically via a web browser.

    Customer manages:

    • Minimal application configuration
    • Data (to some extent)

    Provider manages:

    • Everything including the application itself
    • All underlying infrastructure and software

    Key characteristics:

    • Minimal management required from customer
    • Typically subscription-based
    • Immediate usability
    • Limited customization

    Examples:

    • Microsoft Office 365
    • Google Workspace
    • Salesforce
    • Dropbox

    IaaS in Detail

    How IaaS Works

    1. Customer requests VMs with specific configurations (CPU, RAM, storage)
    2. Provider matches request against available data center machines
    3. VMs are provisioned on physical hosts with requested resources
    4. Customer accesses and manages VMs through provided interfaces

    Resource Allocation

    • CPU allocation: Either pinned to specific cores or scheduled by the hypervisor
    • Memory allocation: Usually strictly partitioned between VMs
    • Storage: Allocated based on requested volume sizes
    • Network resources: Shared among VMs with quality of service controls

    IaaS APIs

    IaaS providers offer APIs for programmatic control of resources:

    • Create, start, stop, clone operations
    • Monitoring capabilities
    • Pricing information access
    • Resource management

    Benefits:

    • Flexibility through code-based infrastructure control
    • Automation of provisioning and management
    • Integration with other tools and systems

    IaaS Pricing Models

    Typically based on a combination of:

    • VM instance type/size
    • Duration of usage (per hour/minute)
    • Storage consumption
    • Network traffic
    • Additional services used

    PaaS in Detail

    Advantages Over IaaS

    • Reduced development and maintenance effort
    • No OS patching or middleware configuration
    • Higher level of abstraction
    • Focus on application development rather than infrastructure

    PaaS Components

    • Development tools and environments
    • Database services
    • Integration services
    • Application runtimes
    • Monitoring and management tools

    PaaS Pricing Models

    More diverse than IaaS, potentially based on:

    • Time-based usage
    • Per query (database services)
    • Per message (queue services)
    • Per CPU usage (request-triggered applications)
    • Storage consumption

    Example: Amazon DynamoDB

    • Key-value store used inside Amazon (powers parts of AWS like S3)
    • Designed for high scalability (100-1000 servers)
    • Emphasizes availability over consistency
    • Uses peer-to-peer approach with no single point of failure
    • Nodes can be added/removed at runtime
    • Optimized for key-value operations rather than range queries

    SaaS in Detail

    Business Model

    • Provider develops and maintains the application
    • Offers it to customers for a subscription fee
    • Handles all updates, security, and infrastructure
    • Typically multi-tenant, serving many customers on shared infrastructure

    Typical SaaS Characteristics

    • Web-accessible applications
    • Usually based on monthly/annual subscription
    • Automatic updates and maintenance
    • Limited customization compared to self-hosted solutions
    • Reduced IT overhead for customers

    Example: Salesforce

    • Comprehensive customer relationship management platform
    • Replaces spreadsheets, to-do lists, and email with integrated platform
    • Backed by elastic cloud services that scale with company growth
    • Tiered pricing based on features and user count

    Choosing Between Service Models

    Factors to consider when selecting a service model:

    1. Core competency assessment: What skills exist in your organization?
    2. Cost considerations: How much can you spend on each layer?
    3. Flexibility requirements: How much control do you need?
    4. Regulatory and privacy concerns: Where does your data need to reside?

    This decision applies to both individuals and organizations and should align with strategic goals.

    Link to original

    Serverless Computing

    What Is Serverless Computing?

    Serverless computing (also known as Function-as-a-Service or FaaS) is a cloud execution model where the cloud provider dynamically manages the allocation and provisioning of servers. Despite the name “serverless,” servers are still used, but their management is abstracted away from the developer.

    Serverless represents an evolution in cloud computing models: IaaS → PaaS → FaaS

    Key Characteristics

    1. Event-driven architecture

      • Functions execute in response to specific triggers or events
      • No continuous running processes or infrastructure
    2. Ephemeral execution

      • Functions are created only when needed
      • No long-running instances waiting for requests
    3. Pay-per-execution model

      • Billing based only on actual function execution time and resources used
      • No charges when functions are idle (see the billing sketch after this list)
    4. Automatic scaling

      • Providers handle all scaling without developer intervention
      • Scale from zero to peak demand automatically
    5. Stateless execution

      • Functions don’t maintain state between invocations
      • External storage required for persistent data
    6. Time-limited execution

      • Typically limited to 5-15 minutes maximum execution time
      • Designed for short, focused operations
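
    To make the pay-per-execution model concrete, the sketch below estimates a monthly bill from GB-seconds of execution plus a per-request fee. The rates are illustrative placeholders rather than any provider's current price list.

    # Illustrative FaaS price card (placeholder values).
    PRICE_PER_GB_SECOND = 0.0000167      # $ per GB-second of execution time
    PRICE_PER_MILLION_REQUESTS = 0.20    # $ per million invocations

    def faas_monthly_cost(invocations, avg_duration_s, memory_mb):
        """Bill only for actual execution time and requests -- nothing while idle."""
        gb_seconds = invocations * avg_duration_s * (memory_mb / 1024)
        return (gb_seconds * PRICE_PER_GB_SECOND
                + invocations / 1_000_000 * PRICE_PER_MILLION_REQUESTS)

    # Two million invocations per month, 200 ms each, 512 MB of memory.
    print(f"${faas_monthly_cost(2_000_000, 0.2, 512):.2f}")   # roughly $3.74 for the month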

    Serverless Architecture Components

    A serverless architecture typically includes:

    Core Components

    1. Functions

      • Self-contained units of code that perform specific tasks
      • Usually single-purpose with limited scope
      • Can be written in various programming languages
    2. Event Sources

      • Triggers that initiate function execution:
        • HTTP requests via API Gateway
        • Database changes
        • File uploads
        • Message queue events
        • Scheduled events/timers
    3. Supporting Services

      • API Gateway: Handles HTTP requests, routing to appropriate functions
      • State Management: External databases, cache services, object storage
      • Identity and Access Management: Security and authentication controls

    Execution Environment

    • Functions deploy as standalone units of code
    • Cold starts occur when new container instances are initialized
    • Environment is ephemeral with no persistent local storage
    • Configuration managed through environment variables or parameter stores

    Major Serverless Platforms

    • AWS Lambda: Pioneer in serverless computing, integrated with AWS ecosystem
    • Azure Functions: Microsoft’s serverless offering with .NET integration
    • Google Cloud Functions: Integrated with Google Cloud services
    • Cloudflare Workers: Edge-focused serverless platform
    • IBM Cloud Functions: Based on Apache OpenWhisk
    • DigitalOcean Functions: Serverless offering for smaller deployments

    Use Cases for Serverless

    Ideal Use Cases:

    1. Event processing

      • Processing uploads, form submissions, or other user-triggered events
    2. Scheduled tasks

      • Running periodic jobs like cleanup, reports, or maintenance
    3. Asynchronous processing

      • Background tasks that don’t need immediate responses
    4. Webhooks and integrations

      • Handling requests from third-party services
    5. Microservices backends

      • Building lightweight APIs and service components
    6. IoT applications

      • Processing data from connected devices

    Example Serverless Workflow

    A blog post update scenario:

    1. User updates their blog with a new post
    2. Updating webpage content triggers a function
    3. Function logic:
      • Connect to database
      • Update database records
      • Update search index
      • Trigger other functions (e.g., for ads, analytics, notifications)
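
    A hedged sketch of what such a triggered function could look like as a Python handler. The event fields and helper functions are hypothetical stand-ins, and the handler signature follows the common FaaS convention of an event object plus a runtime context.

    def update_database(post_id, content):
        print(f"database record updated for {post_id}")    # stand-in for a real DB call

    def update_search_index(post_id, content):
        print(f"search index refreshed for {post_id}")     # stand-in for an indexing call

    def publish_event(topic, payload):
        print(f"published to {topic}: {payload}")          # stand-in for a queue/topic publish

    def handler(event, context):
        """Runs whenever a blog post is created or updated (hypothetical event shape)."""
        post_id = event["post_id"]
        content = event["content"]

        update_database(post_id, content)
        update_search_index(post_id, content)

        # Fan out to further functions asynchronously (ads, analytics, notifications).
        for topic in ("ads", "analytics", "notifications"):
            publish_event(topic, {"post_id": post_id})

        return {"status": "ok", "post_id": post_id}

    # Local test invocation with a fake event and no context.
    handler({"post_id": "42", "content": "Hello, cloud!"}, None)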

    Benefits of Serverless Computing

    1. Lower costs

      • Precise usage-based billing
      • No paying for idle resources
      • Reduced operational overhead
    2. Simplified operations

      • No server management
      • Provider handles patching, scaling, and availability
      • Focus on code rather than infrastructure
    3. Enhanced scalability

      • Automatic resource provisioning
      • Scale to zero when not in use
      • Handle unpredictable traffic spikes
    4. Faster time to market

      • Reduced deployment complexity
      • Focus on business logic rather than infrastructure
      • Built-in high availability

    Challenges of Serverless Computing

    1. Cold start latency

      • Initial function invocation can be slow
      • Particularly impacts rarely-used functions
    2. Vendor lock-in

      • Functions often rely on provider-specific services and APIs
      • Migration between providers can be difficult
    3. Limited execution duration

      • Not suitable for long-running processes
      • Maximum execution times enforced by providers
    4. Complex state management

      • No built-in state persistence between invocations
      • External services required for data storage
    5. Debugging difficulties

      • Limited visibility into execution environment
      • Complex distributed systems harder to troubleshoot
    6. Resource constraints

      • Memory limitations (typically 128MB - 10GB)
      • CPU allocation tied to memory configuration
      • Disk space restrictions

    Low/No Code Development

    Related to serverless is the emergence of low/no-code development platforms:

    • Definition: Visual environments to create applications with minimal or no coding

    • Features:

      • Drag-and-drop interfaces
      • Pre-built templates
      • Auto-deployment
      • Built-in integrations
    • Examples from major cloud providers:

      • Amazon Honeycode
      • Microsoft Power Apps
      • Google AppSheet
      • Azure Logic Apps
      • AWS App Runner
      • Google Vertex AI
    • Advantages:

      • Low technical barrier
      • Rapid development
      • Flexible control of data assets
    • Disadvantages:

      • Vendor lock-in
      • Limited customization options
      • Platform dependencies

    Serverless vs. Traditional Cloud Models

    Aspect          | Serverless             | Traditional (VMs/Containers)
    Provisioning    | Automatic              | Manual or automated scripts
    Scaling         | Automatic and instant  | Manual or auto-scaling groups
    State           | Stateless by default   | Can maintain state
    Pricing         | Pay per execution      | Pay per allocation
    Runtime         | Limited duration       | Indefinite
    Deployment      | Function-level         | Application/container level
    Cold starts     | Yes                    | No (for long-running instances)
    Resource limits | Fixed by provider      | Configurable
    Link to original

    Cloud Deployment Models

    Cloud deployment models define where cloud resources are located, who operates them, and how users access them. Each model offers different tradeoffs in terms of control, flexibility, cost, and security.

    Core Deployment Models

    Public Cloud

    Definition: Third-party service providers offer cloud services over the public internet to the general public or a large industry group.

    Characteristics:

    • Resources owned and operated by third-party providers
    • Multi-tenant environment (shared infrastructure)
    • Pay-as-you-go pricing model
    • Accessible via internet
    • Provider handles all infrastructure management

    Advantages:

    • Low initial investment
    • Rapid provisioning
    • No maintenance responsibilities
    • Nearly unlimited scalability
    • Geographic distribution

    Disadvantages:

    • Limited control over infrastructure
    • Potential security and compliance concerns
    • Possible performance variability
    • Potential for vendor lock-in

    Major providers:

    • AWS, Google Cloud Platform, Microsoft Azure
    • IBM Cloud, Oracle Cloud
    • DigitalOcean, Linode, Vultr

    Private Cloud

    Definition: Cloud infrastructure provisioned for exclusive use by a single organization, either on-premises or hosted by a third party.

    Characteristics:

    • Single-tenant environment
    • Greater control over resources
    • Can be managed internally or by third parties
    • Usually requires capital expenditure for on-premises solutions
    • Custom security policies and compliance measures

    Variations:

    • On-premises private cloud: Hosted within organization’s own data center
    • Outsourced private cloud: Hosted by third-party but dedicated to one organization

    Advantages:

    • Enhanced security and privacy
    • Greater control over infrastructure
    • Customization to specific needs
    • Potentially better performance and reliability
    • Compliance with strict regulatory requirements

    Disadvantages:

    • Higher initial investment
    • Responsibility for maintenance
    • Limited scalability compared to public cloud
    • Requires specialized staff expertise

    Technologies:

    • OpenStack, VMware vSphere/vCloud
    • Microsoft Azure Stack
    • OpenNebula, Eucalyptus, CloudStack

    Community Cloud

    Definition: Cloud infrastructure shared by several organizations with common concerns (e.g., mission, security requirements, policy, or compliance considerations).

    Characteristics:

    • Multi-tenant but limited to specific group
    • Shared costs among community members
    • Can be managed internally or by third-party
    • Designed for organizations with similar requirements

    Examples:

    • Government clouds
    • Healthcare clouds
    • Financial services clouds
    • Research/academic institutions

    Advantages:

    • Cost sharing among community members
    • Meets specific industry compliance needs
    • Collaborative environment for shared goals
    • More control than public cloud

    Disadvantages:

    • Limited to community specifications
    • Less flexible than public cloud
    • Costs higher than public cloud
    • Potential governance challenges

    Hybrid Cloud

    Definition: Composition of two or more distinct cloud infrastructures (private, community, or public) that remain unique entities but are bound together by technology enabling data and application portability.

    Characteristics:

    • Combination of public and private/community clouds
    • Data and applications move between environments
    • Requires connectivity and integration between clouds
    • Workloads distributed based on requirements

    Approaches:

    • Application-based: Different applications in different clouds
    • Workload-based: Same application, different workloads in different clouds
    • Data-based: Data storage in one cloud, processing in another

    Advantages:

    • Flexibility to run workloads in optimal environment
    • Cost optimization (use public cloud for variable loads)
    • Risk mitigation through distribution
    • Easier path to cloud migration
    • Balance between control and scalability

    Disadvantages:

    • Increased complexity of management
    • Integration challenges
    • Security concerns at connection points
    • Potential performance issues with data transfer
    • Requires more specialized expertise

    Cross-Cloud Computing

    Cross-cloud computing refers to the ability to operate seamlessly across multiple cloud environments.

    Types of Cross-Cloud Approaches

    1. Multi-clouds

      • Using multiple cloud providers independently
      • Different services from different providers
      • No integration between clouds
      • Translation libraries to abstract provider differences
    2. Hybrid clouds

      • Integration between private and public clouds
      • Data and applications span environments
      • Common programming models
    3. Federated clouds

      • Common APIs across multiple providers
      • Unified management layer
      • Consistent experience across providers
    4. Meta-clouds

      • Broker-based approach
      • Intermediary selects optimal cloud provider
      • Abstracts underlying cloud differences

    Motivations for Cross-Cloud Computing

    • Avoiding vendor lock-in: Independence and portability
    • Resilience: Protection against vendor-specific outages
    • Service diversity: Leveraging unique capabilities of different providers
    • Geographic presence: Using region-specific deployments
    • Regulatory compliance: Meeting data sovereignty requirements

    Implementation Tools

    • Infrastructure as Code tools: Terraform, OpenTofu, Pulumi
    • Cloud-agnostic libraries: Libcloud, jclouds
    • Multi-cloud platforms: Commercial and academic proposals
    • Cloud brokers: Services that manage workloads across clouds
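
    As a sketch of the cloud-agnostic library approach listed above, Apache Libcloud exposes the same node-management calls across providers; the credentials here are placeholders, and provider-specific constructor arguments can vary.

    from libcloud.compute.types import Provider
    from libcloud.compute.providers import get_driver

    # The same code shape works for other providers, e.g. Provider.GCE or Provider.AZURE_ARM.
    Driver = get_driver(Provider.EC2)
    conn = Driver("ACCESS_KEY_ID", "SECRET_KEY", region="eu-west-1")   # placeholder credentials

    # Provider-agnostic operations: enumerate nodes, instance sizes, and images.
    for node in conn.list_nodes():
        print(node.name, node.state)

    sizes = conn.list_sizes()     # available instance types
    images = conn.list_images()   # available OS images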

    Trade-offs in Cross-Cloud Computing

    • Complexity: Additional management overhead
    • Abstraction costs: Loss of provider-specific features
    • Security challenges: Managing identity across clouds
    • Performance implications: Data transfer between clouds
    • Cost management: Multiple billing relationships

    Deployment Model Selection Factors

    When choosing a deployment model, consider:

    Cost Factors

    • Upfront capital expenditure vs. operational expenses
    • Total cost of ownership including management costs
    • Skills required to operate the chosen model

    Time to Market

    • Public cloud offers fastest deployment
    • Private cloud requires more setup time
    • Hybrid approaches balance speed with control

    Security and Compliance

    • Regulatory requirements may dictate deployment model
    • Data sovereignty considerations
    • Industry-specific compliance frameworks

    Control Requirements

    • Need for physical access to hardware
    • Customization requirements
    • Performance guarantees

    Comparative Matrix

    Aspect        | Public Cloud | Private (Internally Managed) | Private (Outsourced)
    Upfront Cost  | Low          | High                         | Medium
    Time to Build | Low          | High                         | Medium
    Security Risk | Higher       | Lower                        | Medium
    Control       | Low          | High                         | Medium
    Link to original

    Data Centre Design

    Data centres are the backbone of cloud computing, and their design plays a crucial role in ensuring sustainability, reliability, and efficiency. This note focuses on the infrastructure design aspects that enable dependable and sustainable data centre operations.

    Data Centre Infrastructure Basics

    A modern data centre consists of several key components:

    • Servers: Individual compute units, typically rack-mounted
    • Racks: Metal frames housing multiple servers
    • Cooling systems: Equipment to remove heat generated by servers
    • Power distribution systems: Deliver electricity to all equipment
    • Network infrastructure: Connects servers internally and to the outside world
    • Physical security systems: Control access to the facility

    Designing for Hardware Redundancy

    Geographic Redundancy

    • Definition: Distributing data centres across multiple geographic regions
    • Purpose: Mitigate impact of regional outages (natural disasters, power grid failures)
    • Implementation:
      • Multiple data centres in different regions
      • Data replication across regions
      • Load balancing between regions
    • Benefit: Ensures continued operation even if an entire region goes offline

    Server Redundancy

    • Definition: Deploying servers in clusters with automatic failover mechanisms
    • Purpose: Ensure service availability despite individual server failures
    • Implementation:
      • Server clusters managed by virtualization technology
      • Automatic failover when hardware issues are detected
      • N+1 or N+2 redundancy (extra servers beyond minimum requirements)
    • Benefit: Seamless operation during hardware failures

    Storage Redundancy

    • Definition: Replicating data across multiple storage devices and technologies
    • Purpose: Prevent data loss due to disk or storage system failures
    • Implementation:
      • RAID configurations to protect against disk failures
      • Replication within and across data centres
      • Multiple storage technologies (SSD, HDD, tape) for different tiers
    • Benefit: Data remains accessible and intact despite storage component failures

    Network Redundancy

    Reliable networking is critical for data centre operations. Redundancy is implemented at multiple levels:

    Server-level Network Redundancy

    • Redundant Network Interface Cards (NICs) on each server
    • Dual or more power supplies to eliminate single points of failure
    • Multiple network paths from each server

    Network-level Redundancy

    • Redundant switches, routers, firewalls, and load balancers
    • Multiple connection paths between network devices
    • Diverse carrier connections for external connectivity
    • Link aggregation: Multiple physical links between network devices
    • Spanning Tree Protocol (STP): Prevents network loops while maintaining redundancy
    • Equal-Cost Multi-Path (ECMP): Distributes traffic across multiple paths

    Network Topologies for Redundancy

    1. Hierarchical/3-tier topology:

      • Access layer (connects to servers)
      • Aggregation layer (connects access switches)
      • Core layer (high-speed backbone)
      • Redundant connections between layers
    2. Fat-tree/Clos topology:

      • Non-blocking architecture
      • Multiple equal-cost paths between any two servers
      • Better scalability and fault tolerance than traditional hierarchical designs

    Power Redundancy

    Data centres require constant and reliable power supply to function:

    • Multiple power feeds from different utility substations

    • Uninterruptible Power Supplies (UPS) for temporary outages

      • Battery systems that provide immediate power during utility failures
      • Typically designed to support the data centre for minutes to hours
    • Backup generators for medium/long-term outages

      • Diesel or natural gas powered
      • Automatically start when utility power fails
      • Sized to power the entire facility for days
    • Power Distribution Units (PDUs) with dual power inputs

      • Ensure continuous rack power
      • Allow maintenance of one power path without downtime

    Power Redundancy Configurations

    • N: Basic capacity with no redundancy
    • N+1: Basic capacity plus one additional component
    • 2N: Fully redundant, two complete power paths
    • 2N+1: Fully redundant with additional backup

    Cooling Redundancy

    Data centres generate significant heat that must be removed efficiently:

    • Heating, Ventilation, and Air Conditioning (HVAC) systems

      • Control temperature, humidity, and air quality
      • Critical for equipment longevity and reliability
    • Cooling redundancy measures:

      • N+1 cooling: One extra cooling unit beyond required capacity
      • Multiple cooling technologies to mitigate failure modes
        • Computer Room Air Conditioning (CRAC) units
        • Free cooling (using outside air when temperature permits)
        • In-row cooling (targeted cooling closer to heat sources)
      • Redundant cooling loops – pipes, heat exchangers, pumps
      • Hot/Cold aisle containment – prevents hot and cold air mixing

    Advanced Cooling Technologies

    • Free cooling: Using outside air when temperature permits
    • Liquid cooling: Direct liquid cooling of components
    • Immersion cooling: Servers submerged in non-conductive liquid
    • Evaporative cooling: Using water evaporation to reduce temperatures

    Design Standards and Tiers

    The Uptime Institute defines four tiers of data centre reliability:

    1. Tier I: Basic Capacity

      • Single path for power and cooling
      • No redundant components
      • 99.671% availability (28.8 hours downtime/year)
    2. Tier II: Redundant Components

      • Single path for power and cooling
      • Redundant components
      • 99.741% availability (22.0 hours downtime/year)
    3. Tier III: Concurrently Maintainable

      • Multiple paths for power and cooling, only one active
      • Redundant components
      • 99.982% availability (1.6 hours downtime/year)
    4. Tier IV: Fault Tolerant

      • Multiple active paths for power and cooling
      • Redundant components
      • 99.995% availability (0.4 hours downtime/year)
      • Can withstand any single equipment failure without impact
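
    The downtime figures quoted above follow directly from the availability percentages over an 8,760-hour year (the commonly published tier figures are rounded slightly differently), as this quick check shows:

    HOURS_PER_YEAR = 365 * 24   # 8760

    tiers = {
        "Tier I": 0.99671,
        "Tier II": 0.99741,
        "Tier III": 0.99982,
        "Tier IV": 0.99995,
    }

    for name, availability in tiers.items():
        downtime_hours = (1 - availability) * HOURS_PER_YEAR
        print(f"{name}: about {downtime_hours:.1f} hours of downtime per year")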

    Sustainable Design Considerations

    Modern data centre design increasingly incorporates sustainability features:

    • Energy-efficient equipment selection
    • Renewable energy sources (solar, wind, hydroelectric)
    • Heat recovery systems to repurpose waste heat
    • Water-efficient cooling technologies
    • Modular designs for efficient expansion
    • Smart monitoring systems to optimize resource usage

    Real-world Implementation Challenges

    Designing highly redundant data centres faces several challenges:

    • Cost vs. reliability tradeoffs
    • Physical space constraints
    • Regulatory and compliance requirements
    • Upgrading existing facilities
    • Integrating new technologies with legacy systems
    • Balancing performance and sustainability goals

    Related: Cloud Sustainability - Carbon Footprint Frameworks, Cloud Sustainability - Measurement Granularities, Cloud System Design - High Availability

    Link to original

    Link to original
  • Infrastructure as a Service (IaaS)

    Infrastructure as a Service (IaaS)

    • Definition: Provider provisions processing, storage, networks, and resources
    • Customer manages: OS, storage, deployed applications
    • Provider manages: Underlying physical infrastructure
    • Key characteristics:
      • VMs with configurable CPU/RAM/storage
      • Pay-as-you-go model
      • Customer has maximum control over infrastructure
    • Operation:
      • Customer requests VM(s) with specific resource configuration
      • Provider matches against available physical machines
      • Resources are allocated when available
      • Dynamic scaling based on demand
    • Examples:
      • AWS EC2, Google Compute Engine
      • Azure VMs, OpenStack
    • API capabilities: Create, start, stop, clone, monitor VMs
    Link to original
  • Platform as a Service (PaaS)

    Platform as a Service (PaaS)

    • Definition: Customer deploys applications using languages, libraries, and tools supported by provider
    • Customer manages: Applications and data only
    • Provider manages: OS, middleware, runtime, infrastructure
    • Compared to IaaS:
      • Higher abstraction level
      • Less development/maintenance effort (no OS patching)
      • Less flexibility, higher provider dependence
    • Pricing models:
      • Time-based, per query, per message
      • CPU usage for request-triggered applications
    • Example: Amazon DynamoDB
      • Key-value store with high scalability
      • Highly available, peer-to-peer approach
      • No single point of failure
      • Optimized for key-value operations
    • Benefits:
      • Reduced development complexity
      • Automatic scaling
      • Focus on application code, not infrastructure
    Link to original
  • Software as a Service (SaaS)

    Software as a Service (SaaS)

    • Definition: Provider offers ready-made application for direct use
    • Customer manages: Minimal application settings only
    • Provider handles:
      • Code writing/maintenance
      • Updates
      • Platform integration
      • Automated scaling
    • Key aspects:
      • Business model based on subscription
      • Providers offer services cheaper than self-supported
      • Companies reduce IT overhead through outsourcing
      • Tradeoff: Companies no longer own their software
    • Example: Salesforce
      • Integrated platform for business operations
      • Replaces spreadsheets, to-do lists, email
      • Backed by elastic cloud services
      • Scales with company growth
      • Tiered pricing based on features
    Link to original
  • Function as a Service (FaaS)

    Function as a Service (FaaS)

    • Definition: Execution model where provider dynamically manages resources
    • Key characteristics:
      • Event-driven architecture
      • Ephemeral logic (functions) created only when needed
      • Pay-per-execution (“no idle resources”)
      • Stateless execution
      • Time-limited (typically 5-15 minutes maximum)
    • Components:
      • Function execution environment
      • API Gateway for HTTP requests
      • Event Sources (message queues, storage events, etc.)
      • State Management (external databases, caches)
    • Examples:
      • AWS Lambda
      • Azure Functions
      • Google Cloud Run
      • Cloudflare Workers
    • Benefits:
      • Lower costs (precise usage-based billing)
      • No servers to manage (reduced complexity)
      • Enhanced scalability
      • Faster deployment times
    • Challenges:
      • Cold start latency impacts
      • Vendor lock-in through platform services
      • Complex state management
      • Memory and time constraints
    Link to original

Cloud Sustainability

  • Cloud Carbon Footprint

    The carbon footprint of cloud computing refers to the greenhouse gas emissions associated with the deployment, operation, and use of cloud services. As cloud computing continues to grow, understanding and mitigating its environmental impact becomes increasingly important for sustainable IT practices.

    Understanding ICT and Cloud Emissions

    The Growing Footprint of ICT

    Information and Communication Technologies (ICT) are estimated to contribute significantly to global carbon emissions:

    • ICT was estimated to produce between 1.0 and 1.7 gigatons of CO₂e (carbon dioxide equivalent) in 2020
    • This represents approximately 1.8% to 2.8% of global greenhouse gas emissions
    • For comparison, commercial aviation accounts for around 2% of global emissions
    • If overall global emissions decrease while ICT emissions remain constant, ICT’s relative share could increase significantly

    Cloud Computing’s Contribution

    Within the ICT sector, data centers (including cloud infrastructure) are major contributors to emissions:

    • Data centers account for approximately one-third of ICT’s carbon footprint
    • Cloud computing has both positive and negative effects on overall emissions:
      • Positive: Consolidation, higher utilization, economies of scale
      • Negative: Increased demand, rebound effects, energy-intensive applications

    Drivers of Growth

    Several technology trends are driving increased emissions from cloud computing:

    1. Artificial Intelligence and Machine Learning: Training large models requires significant computational resources
    2. Big Data and Analytics: Processing and storing vast amounts of data
    3. Internet of Things (IoT): Generating and processing data from billions of connected devices
    4. High-Definition Media: Streaming and storing increasingly high-resolution content
    5. Blockchain and Cryptocurrencies: Energy-intensive consensus mechanisms

    Lifecycle Emissions in Cloud Computing

    Cloud carbon emissions can be categorized based on their source in the lifecycle:

    Embodied Emissions (Scope 3)

    Emissions from raw material sourcing, manufacturing, and transportation of hardware:

    • Represents approximately 20-25% of cloud infrastructure’s total emissions
    • Includes emissions from producing servers, networking equipment, cooling systems
    • Also includes emissions from constructing data centers
    • Example: The manufacturing of a server like the Dell PowerEdge R740 can account for nearly 50% of its lifetime carbon footprint

    Operational Emissions (Scope 2)

    Emissions from using electricity for powering computing and networking hardware:

    • Represents approximately 70-75% of cloud infrastructure’s total emissions
    • Primary source is electricity consumption for:
      • Server operation
      • Cooling systems
      • Network equipment
      • Power distribution and conversion losses

    End-of-Life Emissions (Scope 3)

    Emissions from recycling and disposal of e-waste:

    • Represents approximately 5% of total emissions
    • Includes emissions from transportation, processing, and disposal
    • Can be reduced through equipment refurbishment and proper recycling

    Measuring Cloud Carbon Footprint

    Challenges in Measurement

    Accurately measuring cloud carbon footprint faces several challenges:

    1. Lack of Transparency: Limited visibility into actual hardware and datacenter operations
    2. Methodological Differences: Varying approaches to calculation and reporting
    3. Data Availability: Limited access to real-time energy consumption data
    4. Shared Infrastructure: Difficulty in attribution for multi-tenant resources
    5. Complex Supply Chains: Tracking emissions across global supply chains

    Greenhouse Gas Protocol Scopes

    The Greenhouse Gas (GHG) Protocol defines three scopes for emissions reporting:

    1. Scope 1: Direct emissions from owned or controlled sources
      • For cloud providers: Emissions from backup generators, refrigerants
    2. Scope 2: Indirect emissions from purchased electricity
      • For cloud providers: Emissions from electricity powering data centers
      • For cloud users: Considered part of their Scope 3 emissions
    3. Scope 3: All other indirect emissions in the value chain
      • For cloud providers: Equipment manufacturing, employee travel, etc.
      • For cloud users: Emissions from using cloud services

    Estimation Methodologies

    Cloud Provider Reporting

    Major cloud providers (AWS, Google Cloud, Microsoft Azure) provide carbon emissions data:

    • Usually reported quarterly or annually
    • Often aggregated at the service level (e.g., EC2, S3, etc.)
    • May use market-based measures including renewable energy credits (RECs)
    • Typically not granular enough for detailed optimization

    Third-Party Estimation

    Tools and methodologies developed to estimate cloud carbon footprint:

    1. Cloud Carbon Footprint (CCF) Methodology:

      • Converts resource usage to energy consumption and then to carbon emissions
      • Uses energy conversion factors for different resource types
      • Accounts for PUE (Power Usage Effectiveness)
      • Applies regional grid emissions factors

      Formula:

      Operational emissions = cloud resource usage × energy conversion factor × PUE × grid emissions factor
      
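
    A minimal sketch of that formula in Python; the coefficient values in the example call are made-up stand-ins for the published CCF conversion and grid factors.

    def operational_emissions_kg(usage_hours, avg_power_watts, pue, grid_kg_per_kwh):
        """Operational emissions = usage x energy conversion factor x PUE x grid factor."""
        energy_kwh = usage_hours * avg_power_watts / 1000   # watts over hours -> kWh
        return energy_kwh * pue * grid_kg_per_kwh

    # Example: a VM drawing ~50 W for 720 hours, PUE 1.2, grid at 0.4 kg CO2e/kWh.
    print(f"{operational_emissions_kg(720, 50, 1.2, 0.4):.1f} kg CO2e")   # ~17.3 kg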

    Measurement Granularity Levels

    Cloud computing systems can be measured at multiple levels, from individual components to entire data centers. Each level provides different insights and presents unique measurement challenges.

    Software-level Measurement

    Software-level measurements focus on the energy and resource consumption of specific applications, processes, or code components.

    Tools and Approaches

    1. Intel RAPL (Running Average Power Limiting)

      • Previously exposed to end users through tools such as Intel Power Gadget and PowerLog
      • Measures power consumption of CPU cores, graphics, and memory
      • Compatible with modern Intel and AMD CPUs
      • Exposed on Linux through perf and the powercap sysfs interface (see the reading sketch after this list)
    2. NVIDIA SMI and NVML

      • SMI: Command-line tool for monitoring NVIDIA GPUs
      • NVML: C-based library for programmatic monitoring
      • Provides power, utilization, temperature, and memory metrics
    3. Linux Power Monitoring Tools

      • PowerTOP: Detailed power consumption analysis
      • powerstat: Statistics gathering daemon for power measurements
    4. Application-Specific Measurement Libraries

      • CodeCarbon: Estimates carbon emissions of compute
      • PowerAPI: API for building software-defined power meters
      • Scaphandre: Power consumption metrics collector focused on observability
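
    A hedged sketch of reading the RAPL package-energy counter through the Linux powercap interface. It assumes a Linux host with RAPL support and read permission on the sysfs file; the exact path can differ, and the counter wraps around periodically.

    import time
    from pathlib import Path

    # Package-level energy counter in microjoules (Linux powercap/RAPL).
    RAPL_ENERGY_FILE = Path("/sys/class/powercap/intel-rapl:0/energy_uj")

    def read_energy_uj():
        return int(RAPL_ENERGY_FILE.read_text())

    # Estimate average package power over a one-second window.
    before = read_energy_uj()
    time.sleep(1.0)
    after = read_energy_uj()

    watts = (after - before) / 1e6   # microjoules over one second -> watts
    print(f"Approximate CPU package power: {watts:.1f} W")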

    Measurement Methodology

    These tools typically use a combination of:

    • Hardware performance counters
    • Statistical models based on component utilization
    • Direct measurements from hardware sensors (where available)
    • Correlation with known power consumption patterns

    Limitations

    • Accuracy varies based on hardware support
    • Estimations rather than exact measurements in many cases
    • Overhead of measurement process itself
    • Limited visibility into hardware-level details

    Server-level Measurement

    Server-level measurements provide a more comprehensive view of resource consumption for entire physical or virtual machines.

    Component-level Monitoring

    • CPU power consumption: Per-socket and per-core measurements
    • Memory usage: Capacity and bandwidth utilization
    • Storage activity: Read/write operations, throughput
    • Network traffic: Packets, bandwidth, protocols

    Intelligent Platform Management Interface (IPMI)

    • Standardized hardware interface for “out-of-band” management
    • Functions independent of the server’s operating system
    • Uses a dedicated microcontroller called Baseboard Management Controller (BMC)
    • Capabilities:
      • Remote administration regardless of OS or power state
      • Monitoring of temperature, voltage, fan speed, power supply status
      • Control functions: power cycling, server restart, BIOS configuration
      • Logging system events and errors for troubleshooting

    Power Measurement Accuracy

    • Direct measurement via built-in sensors is most accurate
    • Some servers provide power data at subsystem level
    • Modern servers can report power consumption per component
    • Historical data can be logged for trend analysis

    Rack-level Measurement

    Rack-level measurements focus on the collective consumption of multiple servers and supporting infrastructure within a rack.

    Key Measurement Components

    • Intelligent Power Distribution Units (PDUs)

      • Provide per-outlet power metering
      • Real-time monitoring of current, voltage, power factor
      • Historical logging capabilities
      • Sometimes include environmental sensors
    • Rack Inlet/Outlet Temperature Monitoring

      • Temperature sensors at air intake and exhaust points
      • Used to calculate cooling efficiency
      • Helps identify hotspots and airflow issues
    • Per-rack Cooling Efficiency

      • Ratio of cooling power to computing power
      • Identification of over-cooled or under-cooled racks
      • Optimization of airflow and temperature setpoints

    Benefits of Rack-level Measurement

    • More granular than data center-wide metrics
    • Enables identification of inefficient racks
    • Supports targeted optimization efforts
    • Provides insights for rack placement and design

    Data Center-level Measurement

    Data center-level measurements provide a holistic view of facility-wide consumption and efficiency.

    Total Facility Power Measurement

    • IT Equipment Power

      • Servers, storage, and networking equipment
      • The productive power that delivers computing services
    • Infrastructure Power

      • HVAC Systems: Cooling, humidity control, air handling
      • Power Distribution: PDUs, UPSs, batteries, transformers
      • Auxiliary Systems: Lighting, security, fire suppression

    Environmental Monitoring

    • Temperature and humidity throughout the facility
    • Airflow patterns and pressure differentials
    • Particulate levels and air quality
    • Leak detection systems

    DC Manageability Interface (DCMI)

    • Standard built upon IPMI to address data center-wide manageability
    • Extended capabilities for large-scale deployments
    • Power management features:
      • Monitoring across multiple systems
      • Power capping to limit consumption during peak demand
      • Aggregated reporting for facility management

    Network-level Measurement

    Network infrastructure power consumption is often overlooked but forms a significant portion of IT energy use.

    Challenges in Network Measurement

    • Diverse equipment spanning multiple domains and locations
    • Different device models with varying efficiency characteristics
    • Dynamic routing and traffic patterns
    • Estimated to consume ~1% of global electricity

    Measurement Approaches

    • Device-level Monitoring: Power consumption per switch, router, firewall
    • Traffic-based Estimation: Models relating network traffic to energy use
    • Infrastructure Utilization: Correlation between link utilization and power
    • End-to-end Analysis: Energy consumed to transfer data between endpoints

    Factors Affecting Network Power Consumption

    • Hardware specifications and age
    • Utilization levels
    • Traffic patterns
    • Protocol efficiency
    • Network topology
    • Ambient conditions

    Practical Implementation Considerations

    Measurement Frequency

    • Real-time: Continuous monitoring for immediate action
    • Interval-based: Regular sampling (seconds, minutes, hours)
    • On-demand: Triggered measurements for specific analysis

    Data Storage and Analysis

    • Time-series databases for efficient storage of measurement data
    • Analytics platforms for trend analysis and anomaly detection
    • Visualization tools for dashboard creation and reporting
    • Machine learning for pattern recognition and prediction

    Integration with Management Systems

    • DCIM (Data Center Infrastructure Management) integration
    • Correlation with application performance metrics
    • Automated actions based on measurement thresholds
    • Capacity planning and forecasting

    Cost-Benefit Considerations

    • Instrumentation costs vs. potential savings
    • Additional power overhead of measurement systems
    • Staffing requirements for monitoring and analysis
    • ROI calculation for measurement initiatives

    Case Studies in Measurement Granularity

    Google’s Data Center Measurement Approach

    • Comprehensive instrumentation from component to facility level
    • Custom power monitoring devices for servers
    • Machine learning for predictive analytics
    • Integration with cooling control systems
    • Public reporting of fleet-wide PUE metrics

    Financial Services Sector Example

    • High-frequency measurements for trading platforms
    • Correlation of energy use with transaction volume
    • Workload-aware power management
    • Regulatory compliance reporting
    • Emissions allocation to business units

    Challenges and Future Directions

    Current Limitations

    • Gaps in measurement capability across the stack
    • Inconsistent methodologies between organizations
    • Limited standardization of metrics and reporting
    • Balancing measurement detail with system overhead

    Emerging Capabilities

    • Non-intrusive load monitoring techniques
    • Improved sensor technology with lower overhead
    • AI-driven analysis and optimization
    • Standardized reporting frameworks
    • Carbon-aware application development
    Link to original
  • Energy Efficiency in Cloud

    Energy efficiency in cloud computing refers to the optimization of energy consumption in data centers and cloud infrastructure while maintaining or improving performance. As data centers consume approximately 1-2% of global electricity, improving energy efficiency has become a critical focus for environmental sustainability, operational cost reduction, and meeting increasing computing demands.

    Evolution of Energy Efficiency

    Energy efficiency in computing has improved significantly over time:

    • Koomey’s Law: The number of computations per kilowatt-hour has doubled approximately every 1.57 years from the 1950s to 2000s
    • This efficiency improvement rate has slowed in recent years to about every 2.6 years
    • The slowdown aligns with broader challenges in Moore’s Law and the end of Dennard scaling
    • Despite slowing, significant efficiency improvements continue through specialized hardware and software optimizations

    Performance per Watt

    Performance per watt is a key metric for energy efficiency:

    • Measures computational output relative to energy consumption
    • Has increased by orders of magnitude since early computing
    • Varies significantly based on workload type and hardware generation
    • Continues to be a primary focus for hardware and data center design

    Energy Consumption Components

    Static vs. Dynamic Power Consumption

    Energy consumption in computing hardware can be categorized as:

    1. Static Power Consumption:

      • Power consumed when a device is powered on but idle
      • Leakage current in transistors
      • Increases with more advanced process nodes (smaller transistors)
      • Present even when no computation is occurring
    2. Dynamic Power Consumption:

      • Power consumed due to computational activity
      • Scales with workload intensity
      • Related to transistor switching activity
      • Can be managed through workload optimization and frequency scaling

    Hardware Components Energy Profile

    Different hardware components contribute to overall energy consumption:

    CPU

    • Traditionally the largest consumer (40-50% of server power)
    • Energy usage scales with utilization, clock frequency, and voltage
    • Modern CPUs have multiple power states for energy management
    • Advanced features like core parking and frequency scaling help reduce consumption

    Memory

    • Accounts for 20-30% of server power
    • DRAM refresh operations consume energy even when not in use
    • Memory bandwidth and capacity directly impact power consumption
    • New technologies like LPDDR and non-volatile memory improve efficiency

    Storage

    • SSDs typically consume less power than HDDs (no moving parts)
    • Power consumption scales with I/O operations per second
    • Idle state power can be significant for always-on storage
    • Storage tiering helps optimize between performance and power consumption

    Network

    • Accounts for 10-15% of data center energy
    • Energy consumption related to data transfer volume and rates
    • Network interface cards, switches, and routers all contribute
    • Energy-efficient Ethernet standards help reduce consumption

    Energy-Proportional Computing

    Concept and Importance

    Energy-proportional computing aims to make energy consumption proportional to workload:

    • Ideal: Energy usage scales linearly with utilization
    • Goal: Zero or minimal energy use at idle, proportional increase with load
    • Reality: Most systems consume significant power even when idle
    • Importance: Data center servers often operate at 10-50% utilization

    Measuring Energy Proportionality

    Energy proportionality can be measured using:

    • Dynamic Range: Ratio of peak power to idle power
    • Proportionality Score: How closely power consumption tracks utilization
    • Idle-to-Peak Power Ratio: Percentage of peak power consumed at idle
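
    A small sketch computing two of these metrics from idle and peak power readings (illustrative numbers); the proportionality score itself needs the full power-versus-utilization curve, which is omitted here.

    def proportionality_metrics(idle_watts, peak_watts):
        """Summarise how far a server is from ideal energy proportionality."""
        return {
            "dynamic_range": peak_watts / idle_watts,        # higher is better
            "idle_to_peak_ratio": idle_watts / peak_watts,   # ideally close to 0
        }

    # Illustrative server: 100 W at idle, 350 W at full load.
    print(proportionality_metrics(100, 350))   # dynamic range 3.5, idle-to-peak ratio ~0.29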

    Progress in Energy Proportionality

    Significant improvements have been made in energy proportionality:

    • First-generation servers (pre-2007): Poor energy proportionality, nearly constant power regardless of load
    • Modern servers (post-2015): Much better scaling, with power consumption more closely tracking utilization
    • Example: Google’s servers improved from using >80% of peak power at 10% utilization to <40% of peak power at the same utilization level
    • Continuing challenge: Further reducing idle power consumption while maintaining performance

    Server Utilization and Energy Efficiency

    Typical Utilization Patterns

    Server utilization in data centers follows specific patterns:

    • Most cloud servers operate between 10-50% utilization on average
    • Utilization varies by time of day, day of week, and seasonal factors
    • Many servers are provisioned for peak load but run at lower utilization most of the time
    • Google’s data shows that most servers in their clusters are below 50% utilization most of the time

    Strategies for Improved Utilization

    Higher utilization can significantly improve energy efficiency:

    1. Workload Consolidation:

      • Concentrating workloads on fewer servers
      • Allows powering down unused servers
      • Challenges: performance isolation, resource contention
    2. Virtualization and Containerization:

      • Multiple virtual machines or containers per physical server
      • Flexible resource allocation to match requirements
      • Enables higher average utilization
    3. Autoscaling:

      • Automatically adjusting resource allocation based on demand
      • Scaling up/down or in/out depending on workload
      • Minimizes over-provisioning while meeting performance targets
    4. Workload Scheduling:

      • Intelligent placement of workloads across servers
      • Considers energy efficiency alongside performance
      • Can consolidate workloads during low-demand periods
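
    A toy sketch of the threshold-based idea behind autoscaling (strategy 3 above); the thresholds, step size, and bounds are arbitrary placeholders, and real autoscalers add cooldown periods and sometimes predictive policies.

    def desired_replicas(current, cpu_utilisation,
                         scale_up_at=0.70, scale_down_at=0.30,
                         minimum=1, maximum=20):
        """Scale out when hot, scale in when cold, otherwise stay put."""
        if cpu_utilisation > scale_up_at:
            current += 1
        elif cpu_utilisation < scale_down_at:
            current -= 1
        return max(minimum, min(maximum, current))

    print(desired_replicas(4, 0.85))   # -> 5 (scale out)
    print(desired_replicas(4, 0.20))   # -> 3 (scale in)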

    Energy-Efficient Data Center Design

    Cooling Efficiency

    Cooling represents 30-40% of data center energy consumption:

    • Free Cooling: Using outside air when temperature and humidity are appropriate
    • Hot/Cold Aisle Containment: Preventing mixing of hot and cold air
    • Liquid Cooling: More efficient than air cooling, especially for high-density racks
    • Optimized Airflow: Reducing resistance and eliminating hotspots
    • Temperature Management: Running at higher temperatures where possible

    Power Distribution

    Power distribution efficiency affects overall energy consumption:

    • High-efficiency UPS Systems: Modern UPS systems with >95% efficiency
    • High-voltage Distribution: Reducing losses in power transmission
    • DC Power: Some data centers use DC power to eliminate AC-DC conversion losses
    • Power Monitoring: Granular monitoring to identify inefficiencies

    Renewable Energy Integration

    Cloud providers increasingly integrate renewable energy:

    • On-site Generation: Solar panels, wind turbines, or fuel cells
    • Power Purchase Agreements (PPAs): Long-term contracts for renewable energy
    • Location Selection: Building data centers near renewable energy sources
    • Battery Storage: Storing energy when renewable generation exceeds demand

    Measurement Metrics

    Power Usage Effectiveness (PUE)

    The most widely used metric for data center efficiency:

    PUE = Total Facility Energy / IT Equipment Energy
    
    • Ideal PUE: 1.0 (all energy goes to IT equipment)
    • Industry Average: Approximately 1.58 (2022 data)
    • Best Practice: 1.2 or lower
    • Hyperscale Facilities: Google, Microsoft, and Amazon achieve PUE values around 1.1-1.15
    • Limitations: Doesn’t account for IT equipment efficiency or energy source

    Other Efficiency Metrics

    Additional metrics provide more comprehensive efficiency measurement:

    • Carbon Usage Effectiveness (CUE): Emissions per unit of IT energy
    • Water Usage Effectiveness (WUE): Water consumption per unit of IT energy
    • Energy Reuse Effectiveness (ERE): Accounts for energy reuse (e.g., waste heat)
    • IT Equipment Efficiency (ITEE): Measures the efficiency of the IT equipment itself
    • Data Center Productivity (DCP): Relates useful work to energy consumption

    Challenges and Limitations

    Jevons Paradox and Rebound Effects

    Efficiency improvements can lead to increased overall consumption:

    • Jevons Paradox: As efficiency increases, overall consumption may rise due to increased use
    • Direct Rebound: Efficiency makes services cheaper, leading to higher consumption
    • Indirect Rebound: Money saved through efficiency is spent on other energy-consuming activities
    • Economy-wide Effects: Efficiency drives economic growth, potentially increasing overall energy use

    Trade-offs

    Energy efficiency often involves trade-offs:

    • Performance vs. Efficiency: Lower power may mean reduced performance
    • Reliability vs. Efficiency: Some redundancy creates inefficiency
    • Capital Expenses vs. Operating Expenses: Efficient equipment may cost more upfront
    • Complexity vs. Simplicity: Efficiency features add complexity to management

    Best Practices for Energy-Efficient Cloud Computing

    Provider-Level Practices

    Practices for cloud service providers:

    1. Hardware Selection:

      • Choose energy-efficient processors, storage, and networking
      • Consider TCO including energy costs
      • Update hardware on optimal refresh cycles
    2. Infrastructure Management:

      • Implement intelligent workload consolidation
      • Use advanced cooling technologies
      • Optimize power delivery systems
    3. Renewable Energy:

      • Invest in on-site renewable generation
      • Purchase renewable energy through PPAs
      • Locate data centers strategically for renewable access

    User-Level Practices

    Practices for cloud service users:

    1. Resource Optimization:

      • Right-size virtual machines and instances
      • Implement auto-scaling for variable workloads
      • Terminate unused resources (see the sketch after this list)
    2. Application Design:

      • Design applications for efficiency (reduced computation, storage, network)
      • Optimize algorithms and data structures
      • Consider serverless for appropriate workloads
    3. Workload Scheduling:

      • Run batch jobs during periods of renewable energy abundance
      • Choose regions with low-carbon electricity
      • Utilize spot instances for non-critical workloads
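
    As a minimal sketch of the right-sizing and cleanup practices above, the following snippet flags running instances whose average CPU utilization has stayed very low. It assumes an AWS environment with boto3 and CloudWatch metrics; the 14-day window and 5% threshold are illustrative choices, not recommendations, and memory or network metrics should also be checked before terminating anything.

    from datetime import datetime, timedelta, timezone

    import boto3

    ec2 = boto3.client("ec2")
    cloudwatch = boto3.client("cloudwatch")

    LOOKBACK = timedelta(days=14)      # observation window (illustrative)
    CPU_IDLE_THRESHOLD = 5.0           # average CPU % considered "idle" (illustrative)

    def average_cpu(instance_id: str) -> float:
        """Average CPUUtilization over the lookback window, in percent."""
        end = datetime.now(timezone.utc)
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
            StartTime=end - LOOKBACK,
            EndTime=end,
            Period=3600,               # hourly datapoints
            Statistics=["Average"],
        )
        points = stats["Datapoints"]
        return sum(p["Average"] for p in points) / len(points) if points else 0.0

    # Walk all running instances and report likely idle ones.
    reservations = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )["Reservations"]
    for reservation in reservations:
        for instance in reservation["Instances"]:
            cpu = average_cpu(instance["InstanceId"])
            if cpu < CPU_IDLE_THRESHOLD:
                print(f"{instance['InstanceId']} ({instance['InstanceType']}): "
                      f"avg CPU {cpu:.1f}% -> candidate for downsizing or termination")
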
    Link to original
  • Power Usage Effectiveness

    Power Usage Effectiveness (PUE) is a metric used to determine the energy efficiency of a data center. Developed by The Green Grid consortium in 2007, PUE has become the industry standard for measuring how efficiently a data center uses power: specifically, how much of the incoming power is used by the computing equipment rather than by cooling and other overhead.

    Definition and Calculation

    Basic Formula

    PUE is calculated using the following formula:

    PUE = Total Facility Energy / IT Equipment Energy
    

    Where:

    • Total Facility Energy: All energy used by the data center facility, including IT equipment, cooling, power distribution, lighting, and other infrastructure
    • IT Equipment Energy: Energy used by computing equipment (servers, storage, networking) for processing, storing, and transmitting data
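
    As a quick worked example with made-up readings: a facility drawing 1,500 kW in total while its IT equipment draws 1,000 kW has PUE = 1,500 / 1,000 = 1.5, meaning half a watt of cooling and other overhead is spent for every watt delivered to the IT load.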

    Interpretation

    The theoretical ideal PUE value is 1.0, which would mean all energy entering the data center is used by IT equipment with zero overhead:

    • PUE = 1.0: Perfect efficiency (theoretical only)
    • PUE < 1.5: Excellent efficiency
    • PUE = 1.5-2.0: Good efficiency
    • PUE = 2.0-2.5: Average efficiency
    • PUE > 2.5: Poor efficiency
    • Global Average PUE: Approximately 1.58 (as of 2022)
    • Hyperscale Cloud Providers: Best performers, with PUE values of 1.1-1.2
    • Older Data Centers: Often have PUE values of 2.0 or higher
    • Improvement Over Time: Global average has improved from about 2.5 in 2007 to 1.58 in 2022

    Components of Data Center Power

    Understanding the components that contribute to total facility energy helps identify opportunities for PUE improvement:

    IT Equipment Power (Denominator)

    The core computing resources:

    • Servers: Processing units that run applications and services
    • Storage: Devices that store data (SSDs, HDDs, etc.)
    • Network Equipment: Switches, routers, load balancers, etc.
    • Other IT Hardware: Security appliances, KVM switches, etc.

    Facility Overhead Power (Numerator minus Denominator)

    Non-computing power consumption:

    Cooling Systems (typically 30-40% of total power)

    • Air conditioning units
    • Chillers
    • Cooling towers
    • Computer Room Air Handlers (CRAHs) and Computer Room Air Conditioners (CRACs)
    • Pumps for water cooling systems
    • Fans and blowers

    Power Delivery (typically 10-15% of total power)

    • Uninterruptible Power Supplies (UPS)
    • Power Distribution Units (PDUs)
    • Transformers
    • Switchgear
    • Generators (during testing)

    Other Infrastructure

    • Lighting
    • Security systems
    • Fire suppression systems
    • Building Management Systems (BMS)
    • Office space within the data center building

    Measurement Methodology

    The Green Grid defines several levels of PUE measurement, each with increasing accuracy:

    Category 0: Annual Calculation

    • Based on utility bills or similar high-level measurements
    • Lowest accuracy, used for basic reporting
    • Single measurement for the entire year

    Category 1: Monthly Calculation

    • Based on monthly power readings at facility input and IT output
    • Moderate accuracy, captures seasonal variations
    • Twelve measurements per year

    Category 2: Daily Calculation

    • Based on daily power readings
    • Higher accuracy, captures weekly patterns
    • 365 measurements per year

    Category 3: Continuous Measurement

    • Based on continuous monitoring (15-minute intervals or better)
    • Highest accuracy, captures all operational variations
    • At least 35,040 measurements per year
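
    A minimal sketch of a Category 3-style calculation, assuming 15-minute interval power readings (average kW per interval) are already available; note that PUE over a period is the ratio of total energies, not the mean of the per-interval ratios.

    # Period PUE from 15-minute interval readings (hypothetical data).
    facility_kw = [1480.0, 1512.5, 1495.0, 1530.0]   # total facility power per interval
    it_kw = [1010.0, 1025.0, 1008.0, 1032.0]         # IT equipment power per interval

    INTERVAL_HOURS = 0.25                            # 15-minute intervals

    facility_kwh = sum(p * INTERVAL_HOURS for p in facility_kw)
    it_kwh = sum(p * INTERVAL_HOURS for p in it_kw)

    pue = facility_kwh / it_kwh                      # ratio of energies over the period
    print(f"Facility {facility_kwh:.1f} kWh, IT {it_kwh:.1f} kWh, PUE = {pue:.2f}")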

    Factors Affecting PUE

    Several factors influence a data center’s PUE value:

    Climate and Location

    • Ambient Temperature: Hotter climates require more cooling energy
    • Humidity: High humidity locations may need more dehumidification
    • Altitude: Affects cooling efficiency and equipment performance
    • Regional Weather Patterns: Seasonal variations impact cooling needs

    Data Center Design

    • Airflow Management: Hot/cold aisle containment, raised floors, rack arrangement
    • Building Envelope: Insulation, orientation, materials
    • Equipment Density: Higher density requires more focused cooling
    • Cooling System Design: Free cooling, liquid cooling, air-side economizers

    Operational Practices

    • Temperature Setpoints: Higher acceptable temperatures reduce cooling needs
    • Equipment Utilization: Higher utilization improves overall efficiency
    • Maintenance Practices: Regular maintenance ensures optimal performance
    • Power Management: Server power management features, UPS efficiency modes

    Scale

    • Size: Larger facilities often achieve better PUE due to economies of scale
    • Load Profile: Consistent high loads typically yield better PUE than variable loads

    Improving PUE

    Strategies to improve data center PUE:

    Cooling Optimization

    • Raise Temperature Setpoints: Operating at the upper end of ASHRAE recommendations
    • Hot/Cold Aisle Containment: Preventing mixing of hot and cold air
    • Free Cooling: Using outside air when temperature and humidity permit
    • Liquid Cooling: More efficient than air cooling, especially for high-density racks
    • Variable Speed Fans: Adjusting cooling capacity to match demand

    Power Infrastructure Efficiency

    • High-Efficiency UPS Systems: Modern UPS systems with 95%+ efficiency
    • Modular UPS: Right-sizing UPS capacity to match load
    • Power Distribution at Higher Voltages: Reducing conversion losses
    • DC Power Distribution: Eliminating AC-DC conversion losses

    IT Equipment Optimization

    • Server Consolidation: Higher utilization of fewer servers
    • Virtualization: Increasing utilization of physical hardware
    • Equipment Refresh: Newer equipment is typically more energy-efficient
    • Power Management Features: Enabling CPU power states, storage spin-down

    Facility Design Improvements

    • Airflow Optimization: Eliminating hotspots and recirculation
    • Building Management System Integration: Intelligent control of all building systems
    • Economizer Modes: Using outside air or water when conditions permit
    • On-site Generation: Solar, wind, or fuel cells to offset grid power

    Limitations and Criticisms of PUE

    Despite its widespread adoption, PUE has several limitations:

    Measurement Inconsistencies

    • Methodology Differences: Varying approaches to what’s included in measurements
    • Boundary Definition: Different interpretations of where the data center boundary lies
    • Timing of Measurements: Point-in-time vs. continuous measurement
    • Inclusion/Exclusion of Systems: Variations in what’s counted as IT load

    Incomplete Picture of Efficiency

    • IT Equipment Efficiency Not Addressed: A data center with inefficient servers can have a good PUE
    • Workload Efficiency Not Reflected: No indication of useful work per watt
    • Water Usage Not Considered: Some cooling techniques improve PUE but increase water consumption
    • Carbon Impact Not Included: No consideration of energy sources or carbon intensity

    System-Level Trade-offs Not Captured

    • Heat Reuse: Systems that capture and repurpose waste heat may have worse PUE but better overall efficiency
    • Climate Impact: Data centers in harsh climates face inherent challenges
    • Resilience Requirements: Redundancy needs may increase PUE

    Enhanced and Alternative Metrics

    To address PUE limitations, several complementary metrics have been developed:

    Water Usage Effectiveness (WUE)

    WUE = Annual Water Usage / IT Equipment Energy
    

    Measures water efficiency in data centers, particularly important where cooling techniques use significant water.

    Carbon Usage Effectiveness (CUE)

    CUE = Total CO₂ Emissions from Energy / IT Equipment Energy
    

    Addresses the carbon impact of the energy sources used.

    Energy Reuse Effectiveness (ERE)

    ERE = (Total Energy - Reused Energy) / IT Equipment Energy
    

    Accounts for energy reused outside the data center (e.g., waste heat used for building heating).

    Data Center Infrastructure Efficiency (DCiE)

    DCiE = (1 / PUE) × 100% = (IT Equipment Energy / Total Facility Energy) × 100%
    

    The inverse of PUE, expressed as a percentage.

    Green Energy Coefficient (GEC)

    GEC = Green Energy / Total Energy
    

    Measures the proportion of energy from renewable sources.

    IT Equipment Utilization (ITEU)

    Measures how efficiently the IT equipment uses the energy it consumes to perform useful work.
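
    A minimal sketch computing several of these metrics side by side from hypothetical annual figures (energy in kWh, water in litres, emissions in kg CO₂e), to show how they complement PUE rather than replace it.

    # Related data center metrics from hypothetical annual figures.
    total_energy_kwh = 10_000_000    # all facility energy
    it_energy_kwh = 7_000_000        # IT equipment energy
    reused_energy_kwh = 500_000      # waste heat exported, e.g. for building heating
    green_energy_kwh = 4_000_000     # energy sourced from renewables
    water_litres = 1_500_000         # annual water consumption
    co2_kg = 3_000_000               # emissions attributable to the energy used

    pue = total_energy_kwh / it_energy_kwh
    dcie = (it_energy_kwh / total_energy_kwh) * 100          # percent, i.e. (1 / PUE) * 100
    ere = (total_energy_kwh - reused_energy_kwh) / it_energy_kwh
    wue = water_litres / it_energy_kwh                       # L/kWh
    cue = co2_kg / it_energy_kwh                             # kg CO2e/kWh
    gec = green_energy_kwh / total_energy_kwh

    print(f"PUE={pue:.2f}  DCiE={dcie:.0f}%  ERE={ere:.2f}  "
          f"WUE={wue:.2f} L/kWh  CUE={cue:.2f} kgCO2e/kWh  GEC={gec:.0%}")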

    PUE in Cloud Provider Data Centers

    Major cloud providers have significantly invested in improving PUE:

    Google

    • Average PUE: ~1.10 across all data centers
    • PUE Tracking: Publishes trailing twelve-month average PUE for all data centers
    • Key Strategies: Machine learning for cooling optimization, custom server design, advanced building management

    Microsoft

    • Average PUE: ~1.12 for newer data centers
    • Innovations: Underwater data centers (Project Natick), hydrogen fuel cells
    • Approach: Standardized data center designs optimized for specific regions

    Amazon Web Services

    • Average PUE: Estimated at 1.15-1.20 (AWS publishes fewer exact figures than its peers)
    • Focus Areas: Renewable energy, custom cooling technologies
    • Scale Advantage: Large facilities with custom designs for efficiency

    Facebook (Meta)

    • Average PUE: 1.10
    • Open Source: Published designs through Open Compute Project
    • Locations: Strategic placement in cold climates where possible
    Link to original
  • Carbon-Aware Computing

    Carbon-aware computing is an approach to computing resource management that takes into account the carbon intensity of the electricity powering these resources, with the goal of reducing overall carbon emissions. This approach acknowledges that the same computation can have significantly different carbon impacts depending on when and where it is performed.

    Core Concepts

    Definition and Principles

    Carbon-aware computing is based on several key principles:

    1. Carbon Intensity Awareness: Recognizing that the carbon emissions per unit of electricity vary significantly based on:

      • Time (hour, day, season)
      • Location (region, country, grid)
      • Energy sources powering the grid
    2. Temporal and Spatial Flexibility: Leveraging the flexibility in when and where computing is performed to minimize carbon emissions

    3. Workload Classification: Identifying which workloads can be shifted in time or location without compromising functionality or performance

    4. Prioritization: Making carbon impact a primary consideration alongside traditional factors like cost, performance, and reliability

    Carbon Intensity of Electricity

    Carbon intensity is the amount of carbon dioxide equivalent (CO₂e) emitted per unit of electricity:

    • Measured in grams of CO₂e per kilowatt-hour (gCO₂e/kWh)
    • Varies dramatically by location: from ~10 gCO₂e/kWh (hydro/nuclear) to >800 gCO₂e/kWh (coal)
    • Changes throughout the day based on:
      • Renewable generation (e.g., solar during daytime)
      • Demand patterns
      • Grid management decisions

    Types of Carbon Intensity Signals

    Two main types of carbon intensity signals are used in carbon-aware computing:

    Average Carbon Intensity

    • Reflects the overall carbon emissions of the electricity mix
    • Based on the weighted average of all generation sources
    • Useful for reporting and long-term trend analysis
    • Limitations: May not reflect marginal impact of additional consumption

    Marginal Carbon Intensity

    • Reflects the emissions from the next unit of electricity to be generated
    • Indicates the actual impact of increasing or decreasing consumption
    • More relevant for real-time decision making
    • Challenges: More complex to calculate and forecast
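
    To make the average signal concrete, the sketch below computes a grid’s average carbon intensity as a generation-weighted mean of per-source emission factors; both the mix and the factors (rough lifecycle values in gCO₂e/kWh) are illustrative, not authoritative. The marginal signal cannot be computed this way, since it depends on which plant would ramp up or down in response to additional demand.

    # Average carbon intensity of a hypothetical generation mix.
    EMISSION_FACTORS = {                 # rough lifecycle estimates, gCO2e/kWh
        "coal": 820, "gas": 490, "solar": 45,
        "wind": 12, "nuclear": 12, "hydro": 24,
    }

    generation_mix = {                   # share of total output, must sum to 1.0
        "coal": 0.10, "gas": 0.35, "solar": 0.15,
        "wind": 0.25, "nuclear": 0.10, "hydro": 0.05,
    }

    average_intensity = sum(share * EMISSION_FACTORS[source]
                            for source, share in generation_mix.items())
    print(f"Average carbon intensity: {average_intensity:.0f} gCO2e/kWh")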

    Carbon-Aware Computing Strategies

    Temporal Shifting (Time-Shifting)

    Moving computing workloads to times when electricity has lower carbon intensity:

    Workload Types Suitable for Time-Shifting:

    • Batch Processing: ETL jobs, data analytics, scientific computing
    • ML Training: Non-urgent machine learning model training
    • Maintenance Operations: Backups, upgrades, indexing
    • Content Delivery: Pre-generating and caching content

    Implementation Approaches:

    • Delay Scheduling: Holding jobs until carbon intensity drops below a threshold
    • Carbon-Aware Windows: Defining preferred execution windows based on forecasted intensity
    • Opportunistic Computing: Dynamically scaling up when renewable generation is high
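
    A minimal sketch of the delay-scheduling approach above: hold a flexible job until the intensity reading drops below a threshold, or run it anyway once the deadline arrives. The threshold, flexibility window, and simulated intensity feed are all hypothetical placeholders.

    # Carbon-aware delay scheduling (thresholds and data source are hypothetical).
    import time
    from datetime import datetime, timedelta, timezone

    THRESHOLD_G_PER_KWH = 200        # run once grid intensity falls below this value
    CHECK_INTERVAL_S = 0             # set to e.g. 15 * 60 in a real deployment
    DEADLINE = datetime.now(timezone.utc) + timedelta(hours=8)   # flexibility window

    # Simulated intensity readings standing in for a live feed (gCO2e/kWh).
    _simulated_feed = iter([340, 310, 270, 230, 190])

    def current_carbon_intensity() -> float:
        """Replace with a real source (Electricity Maps, WattTime, a grid API, ...)."""
        return next(_simulated_feed, 150)

    def run_job() -> None:
        print("running batch job...")

    def delay_schedule() -> None:
        """Hold the job until intensity drops below the threshold or the deadline arrives."""
        while datetime.now(timezone.utc) < DEADLINE:
            intensity = current_carbon_intensity()
            if intensity < THRESHOLD_G_PER_KWH:
                print(f"intensity {intensity} gCO2e/kWh is below threshold, starting now")
                run_job()
                return
            time.sleep(CHECK_INTERVAL_S)
        run_job()   # deadline reached: run regardless of intensity

    delay_schedule()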

    Spatial Shifting (Location-Shifting)

    Moving workloads to locations with lower-carbon electricity:

    Workload Types Suitable for Location-Shifting:

    • Distributed Processing: Map-reduce jobs, data processing
    • Regional Services: Services with global redundancy
    • Content Delivery: Content with multiple hosting locations
    • Data Processing: Analysis that isn’t tied to data location

    Implementation Approaches:

    • Geographic Load Balancing: Directing traffic to regions with lower carbon intensity
    • Follow-the-Sun (or Wind): Moving compute loads to follow renewable generation
    • Carbon-Weighted Autoscaling: Preferentially scaling in regions with lower carbon intensity
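
    A minimal sketch of carbon-weighted placement: among the regions a workload is allowed to run in, pick the one with the lowest current carbon intensity. Region names and intensity values are illustrative; in practice they would come from one of the data sources listed under Implementation Mechanisms below.

    # Pick the permitted region with the lowest current carbon intensity.
    region_intensity = {                 # gCO2e/kWh, illustrative values
        "eu-north": 40,                  # largely hydro/nuclear
        "eu-west": 210,
        "us-east": 380,
        "ap-south": 650,
    }

    allowed_regions = {"eu-north", "eu-west", "us-east"}   # e.g. data-residency limits

    candidates = {r: g for r, g in region_intensity.items() if r in allowed_regions}
    best_region = min(candidates, key=candidates.get)
    print(f"Placing workload in {best_region} ({candidates[best_region]} gCO2e/kWh)")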

    Workload Efficiency Optimization

    Adapting workload execution based on carbon intensity:

    • Quality Adaptation: Adjusting quality/precision based on carbon intensity
    • Resource Allocation: Allocating more resources when carbon intensity is low
    • Execution Paths: Choosing different algorithms based on carbon availability
    • Service Levels: Varying service levels based on carbon intensity

    Implementation Mechanisms

    Carbon Intensity Data Sources

    Sources for carbon intensity information:

    • Electricity Maps: Real-time and forecast data for various regions
    • WattTime: Marginal carbon intensity data and forecasting
    • Grid Operators: Direct data from electricity system operators
    • Carbon Intensity API: UK’s National Grid ESO API (queried in the sketch after this list)
    • Cloud Provider Tools: Google Cloud Carbon Footprint, Microsoft Sustainability Calculator
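
    As a small example, the snippet below queries the UK Carbon Intensity API mentioned above for the current national figure. The endpoint and field names follow the public documentation at carbonintensity.org.uk, but should be verified against the current API reference before being relied on.

    # Query the UK Carbon Intensity API for the current half-hour period.
    import requests

    resp = requests.get("https://api.carbonintensity.org.uk/intensity", timeout=10)
    resp.raise_for_status()
    period = resp.json()["data"][0]       # field names per the public API docs

    print(f"{period['from']} -> {period['to']}: "
          f"forecast {period['intensity']['forecast']} gCO2/kWh, "
          f"index '{period['intensity']['index']}'")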

    Technical Approaches

    Methods for implementing carbon-aware computing:

    1. Carbon-Aware Schedulers:

      • Enhanced job schedulers that consider carbon intensity
      • Examples: Google Carbon-Intelligent Computing, Microsoft GEAR
    2. Carbon-Aware Middleware:

      • Software layers that make carbon-aware decisions transparent to applications
      • Examples: Carbon-Aware Kubernetes Scheduler, SLURM Sustainable Plugin
    3. Carbon-Aware Applications:

      • Applications directly integrating carbon awareness
      • Examples: Carbon-aware video streaming, adaptive ML training frameworks
    4. Carbon-Aware Infrastructure:

      • Infrastructure designed to operate preferentially on low-carbon electricity
      • Examples: Carbon-aware data center power management

    Real-World Applications and Results

    Case Studies

    Google’s Carbon-Intelligent Computing Platform

    Google implemented a carbon-aware computing system that:

    • Shifts non-urgent compute tasks to times of day with lower-carbon electricity
    • Achieved a roughly 50% increase in the use of lower-carbon energy for compute tasks
    • Required no user intervention or application changes
    • Continued to prioritize tasks with deadline requirements

    Microsoft’s Carbon-Aware Azure

    Microsoft’s approach involves:

    • Intelligent workload placement across regions
    • Time-shifting workloads within and between data centers
    • Matching renewable energy generation with cloud workloads
    • Reported 100,000 metric tons of CO₂ reduction in initial implementation

    Academic Research Projects

    Several research initiatives have demonstrated:

    • 10-30% carbon reductions through simple time-shifting strategies
    • Up to 45% reduction through combined time and location shifting
    • Minimal impact on performance for appropriate workloads

    Simulation Results

    Research simulations show significant potential carbon reductions:

    1. Periodic Jobs Scenario:

      • Time-shifting nightly builds, integration tests, and recurring business reports
      • Allowing flexible scheduling windows of ±8 hours
      • Results: 30-45% carbon reduction with minimal operational impact
    2. Ad Hoc Jobs Scenario:

      • Flexible scheduling of machine learning training jobs
      • Based on a dataset of 3,387 training jobs from an NVIDIA research project
      • Results: 15-20% carbon reduction with delay tolerance of only 3 hours
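
    As a sketch of how such scheduling-window flexibility can be exploited: given an hourly carbon-intensity forecast and a fixed job duration, pick the start hour inside the allowed window that minimizes the average intensity over the run. The forecast values below are made up.

    # Choose the lowest-carbon start hour for a fixed-length job (made-up forecast).
    forecast = [420, 390, 350, 300, 240, 180, 150, 160,   # gCO2e/kWh, hours 0..15
                200, 260, 320, 380, 410, 430, 400, 370]

    JOB_HOURS = 3          # job duration
    WINDOW_HOURS = 12      # job may start anywhere in the next 12 hours

    def best_start(forecast, job_hours, window_hours):
        """Return (start_hour, avg_intensity) minimizing average intensity over the job."""
        best = None
        for start in range(min(window_hours, len(forecast) - job_hours) + 1):
            avg = sum(forecast[start:start + job_hours]) / job_hours
            if best is None or avg < best[1]:
                best = (start, avg)
        return best

    start, avg = best_start(forecast, JOB_HOURS, WINDOW_HOURS)
    print(f"Start in {start} h; expected average intensity {avg:.0f} gCO2e/kWh")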

    Challenges and Considerations

    Technical Challenges

    1. Data Quality and Availability:

      • Carbon intensity data not available for all regions
      • Forecasting accuracy varies
      • Granularity issues with large geographic reporting areas
    2. Integration Complexity:

      • Legacy systems not designed for carbon awareness
      • Compatibility with existing schedulers and orchestrators
      • Network and infrastructure limitations
    3. Performance Trade-offs:

      • Balancing carbon reduction with performance requirements
      • Meeting service-level agreements while optimizing for carbon
      • User experience considerations

    Grid-Level Considerations

    1. Renewable Energy Curtailment:

      • Periods when renewable energy exceeds demand and must be curtailed
      • Carbon-aware computing could utilize this otherwise wasted energy
      • In 2022, California curtailed approximately 7% of its solar production
    2. Grid Stability:

      • Large-scale workload shifting could impact grid stability
      • Potential for “herding” behaviors if many systems respond to the same signals
      • Need for coordination with grid operators
    3. Grid-Aware Computing:

      • Evolution beyond carbon-aware to grid-aware computing
      • Understanding how computing decisions affect grid operations
      • Avoiding negative impacts of simultaneous load shifting

    Policy and Organizational Challenges

    1. Metrics and Reporting:

      • Standardizing how carbon savings are measured and reported
      • Integrating with existing sustainability reporting frameworks
      • Validating actual carbon impact
    2. Incentives and Priorities:

      • Aligning carbon reduction with business objectives
      • Developing internal carbon pricing mechanisms
      • Communicating trade-offs to stakeholders
    3. Organizational Boundaries:

      • Coordinating between IT, sustainability, and business units
      • Addressing data sovereignty and compliance requirements
      • Balancing carbon considerations with other organizational priorities

    Renewable Energy Integration

    Renewable Excess Energy Utilization

    Using cloud resources to consume excess renewable energy:

    • Curtailment Problem: When renewable generation exceeds demand, energy may be wasted
    • Opportunity: Compute resources can consume this otherwise curtailed energy
    • Datacenter Locations: Strategic placement near renewable generation sources
    • Dynamic Resource Allocation: Scaling up compute during periods of excess renewables

    Carbon-Aware vs. Energy-Efficient Computing

    Important distinctions between approaches:

    1. Energy Efficiency: Using less energy to perform the same computation

      • Focus: Reducing overall energy consumption
      • Metric: Performance per watt
    2. Carbon Awareness: Timing or relocating computation for lower emissions

      • Focus: Reducing carbon emissions per computation
      • Metric: Carbon per computation
    3. Complementary Approaches:

      • Energy efficiency reduces the baseline consumption
      • Carbon awareness optimizes the timing and location of that consumption
      • Both are necessary for comprehensive emissions reduction

    Future Directions

    Emerging Research Areas

    1. Machine Learning for Carbon Prediction:

      • Improved forecasting of carbon intensity
      • ML-based workload characterization for shifting potential
      • Predictive scheduling algorithms
    2. Carbon-Aware Edge Computing:

      • Distributing computation between cloud and edge based on carbon signals
      • Edge devices powered by local renewable generation
      • Location-specific carbon optimization
    3. Carbon-Aware Hardware:

      • Dynamic power scaling based on carbon intensity
      • Hardware-level support for workload shifting
      • Power-proportional computing with carbon awareness

    Integration with Broader Sustainability Initiatives

    Carbon-aware computing as part of holistic approaches:

    1. Circular Economy:

      • Integration with equipment lifecycle management
      • Carbon-aware decisions on hardware refresh cycles
      • Balancing embodied carbon with operational efficiency
    2. Green Software Engineering:

      • Designing software with carbon awareness from the beginning
      • Carbon metrics as first-class software design considerations
      • Standardized tools and frameworks for carbon-aware development
    3. Climate-Positive Computing:

      • Moving beyond carbon neutrality to climate positivity
      • Using computation to enable broader carbon reductions
      • Supporting climate science and mitigation technologies

    Jevons’ Paradox:

    As technology makes resource use more efficient, demand increases, so overall resource use often rises rather than falls.

    Link to original