Lab 6: Fault Tolerance

Objectives

  • Understand and implement retry mechanisms and fallback strategies
  • Improve application dependability in a cloud environment
  • Learn to handle temporary failures, prolonged outages, and degraded modes

Key Components and Tasks

1. Flaky Service Implementation

Creating a simple web service whose behavior (normal, fallback, or failure) is selected by an environment variable, with a simulated failure rate in normal mode:

from flask import Flask, jsonify
import random
import os
 
app = Flask(__name__)
 
# Environment variable to control service behavior: "normal", "fallback", "failure"
SERVICE_MODE = os.environ.get("SERVICE_MODE", "normal")
 
@app.route('/')
def flaky_endpoint():
    if SERVICE_MODE == "failure":
        # Prolonged outage: always respond with an error
        return jsonify({"message": "Service Unavailable"}), 503
    elif SERVICE_MODE == "fallback":
        # Degraded mode: respond successfully but with reduced data
        return jsonify({"message": "Service in degraded mode (fallback)", "data": [1, 2, 3]}), 200
    else:
        # Normal mode (default): succeed most of the time
        if random.random() < 0.3:  # Simulate a 30% failure rate
            return jsonify({"message": "Service Unavailable"}), 503
        return jsonify({"message": "Hello from the service!", "data": [1, 2, 3, 4, 5]}), 200
 
if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=5055)

2. Client with Retry Mechanism

Implementing a simple retry mechanism for handling temporary failures:

import requests
import time
 
def make_request_with_retry(url, max_retries=3, retry_delay=1):
    for attempt in range(max_retries + 1):
        try:
            response = requests.get(url, timeout=5)  # bound each attempt so a hung service does not block retries
            response.raise_for_status()  # raise HTTPError for 4xx/5xx responses
            return response.json()
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries:
                print(f"Retrying in {retry_delay} seconds...")
                time.sleep(retry_delay)
            else:
                return {"message": "Service unavailable (fallback)"}

3. Advanced Client with Fallback Strategy

  • Implementing graceful degradation when services fail
  • Detecting service in fallback mode and responding accordingly
  • Using cached/limited data as a fallback mechanism (a sketch follows this list)
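
A minimal sketch of such a client, building on the retry idea above; the module-level cache and the check on the "message" field are illustrative assumptions rather than the lab's reference solution:

import requests

# Last known-good payload, used when the service is completely unreachable
_cached_data = {"message": "Cached response (stale)", "data": [1, 2, 3]}

def get_data_with_fallback(url):
    global _cached_data
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()
        payload = response.json()
        if "fallback" in payload.get("message", "").lower():
            # Service is up but degraded: return its limited data without refreshing the cache
            print("Service reports degraded (fallback) mode")
            return payload
        # Healthy response: refresh the cache and return the full data
        _cached_data = payload
        return payload
    except requests.exceptions.RequestException as e:
        # Service unreachable or failing: degrade gracefully to cached/limited data
        print(f"Request failed ({e}); returning cached data")
        return _cached_data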

4. Circuit Breaker Pattern (Optional)

  • Implementing a circuit breaker to prevent overloading failing services (a sketch follows this list)
  • Managing circuit states: CLOSED, OPEN, HALF-OPEN
  • Implementing dynamic recovery behavior
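
A minimal in-process sketch of the pattern; the thresholds, timeout, and class interface are illustrative assumptions:

import time
import requests

class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=10):
        self.failure_threshold = failure_threshold  # consecutive failures before opening the circuit
        self.recovery_timeout = recovery_timeout    # seconds to wait before allowing a trial request
        self.failure_count = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, url):
        if self.state == "OPEN":
            if time.time() - self.opened_at < self.recovery_timeout:
                # Fail fast without hitting the struggling service
                return {"message": "Circuit open (fallback)"}
            self.state = "HALF-OPEN"  # timeout elapsed: allow a single trial request
        try:
            response = requests.get(url, timeout=5)
            response.raise_for_status()
            # Success: close the circuit and reset the failure count
            self.state = "CLOSED"
            self.failure_count = 0
            return response.json()
        except requests.exceptions.RequestException:
            self.failure_count += 1
            if self.state == "HALF-OPEN" or self.failure_count >= self.failure_threshold:
                # Trial request failed, or too many consecutive failures: open the circuit
                self.state = "OPEN"
                self.opened_at = time.time()
            return {"message": "Service unavailable (fallback)"}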

5. Testing and Observation

Testing the client-service interaction under different scenarios (a small test driver follows the list):

  • Normal operation with occasional failures
  • Complete service failure
  • Service in degraded (fallback) mode
  • Circuit breaker operation and recovery
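
One way to exercise the first three scenarios is to restart the service with SERVICE_MODE set to normal, fallback, or failure, then run a small driver such as the one below (the module name retry_client and the URL are assumptions):

from collections import Counter
from retry_client import make_request_with_retry  # hypothetical module holding the section 2 client

# Tally the messages returned over repeated calls to observe retry and fallback behavior
outcomes = Counter()
for _ in range(20):
    result = make_request_with_retry("http://localhost:5055/")
    outcomes[result.get("message", "unknown")] += 1
print(outcomes)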

Key Concepts Learned

  • Resilient service design
  • Fault tolerance patterns
  • Graceful degradation
  • Circuit breaker pattern
  • Service health monitoring

Lab 7: Load Balancing

Objectives

  • Understand the principles of load balancing
  • Configure and test load balancing with microservices
  • Set up Nginx as a reverse proxy and load balancer

Key Components and Tasks

1. Simple Service Implementation

Creating a service that identifies itself:

from flask import Flask
import os
 
app = Flask(__name__)
 
@app.route('/')
def hello():
    # SERVER_NAME is set per container so that each instance can identify itself
    if "service1" in os.environ.get("SERVER_NAME", ""):
        return "Hello from Service 1"
    else:
        return "Hello from Service 2"
 
if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=5055)

2. Docker Network and Service Deployment

  • Creating a Docker network for container communication
  • Running multiple instances of the service with different identities
  • Exposing services on different host ports

docker network create my-network
docker build -t hello-service .
docker run -d -p 5056:5055 --name service1 -e SERVER_NAME="service1" --network my-network --network-alias service1 hello-service
docker run -d -p 5057:5055 --name service2 -e SERVER_NAME="service2" --network my-network --network-alias service2 hello-service

3. Nginx Load Balancer Configuration

Setting up Nginx as a reverse proxy and load balancer; the Nginx container joins the same Docker network (my-network) so that the upstream names service1 and service2 resolve via their network aliases:

events {
    worker_connections 1024;
}
 
http {
    upstream backend {
        # round-robin load balancing
        server service1:5055;
        server service2:5055;
        
        # weighted load balancing
        # server service1:5055 weight=3;
        # server service2:5055 weight=1;
    }
    
    server {
        listen 80;
        location / {
            proxy_pass http://backend;
        }
    }
}

4. Testing and Observation

  • Testing direct access to each service
  • Testing access through the load balancer
  • Observing round-robin load balancing behavior (a request-counting sketch follows this list)
  • Testing service resilience by stopping one service
  • Testing weighted load balancing
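
A small driver for observing how requests are distributed; the URL assumes the Nginx container's port 80 is published on localhost:8080, so adjust it to match the actual port mapping:

import requests
from collections import Counter

# Count which backend answers each request made through the load balancer
counts = Counter()
for _ in range(20):
    try:
        counts[requests.get("http://localhost:8080/", timeout=5).text] += 1
    except requests.exceptions.RequestException:
        counts["request failed"] += 1
print(counts)  # expect a roughly even split for round-robin, about 3:1 for the weighted config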

Key Concepts Learned

  • Load balancing techniques
  • Reverse proxy configuration
  • Docker networking
  • Service discovery
  • High availability through redundancy
  • Load balancing algorithms (round-robin, weighted)

Common Lab Techniques

Docker and Containerization

  • Dockerfile creation and best practices
  • Container networking
  • Environment variable configuration
  • Container orchestration

API Design and Implementation

  • RESTful API principles
  • Flask for lightweight web services
  • JSON for data exchange
  • Status codes for error handling

Resilience Patterns

  • Retry mechanisms
  • Fallback strategies
  • Circuit breakers
  • Load balancing
  • Service discovery

Testing and Debugging

  • API testing with curl and browsers
  • Debugging distributed systems
  • Log analysis
  • Service monitoring

References:

  • COMPSCI4106/5118 Cloud Systems Lab Materials