Building distributed systems requires understanding fundamental challenges and patterns. This guide covers core concepts that every architect and engineer should know.
Why Distributed Systems?
Benefits:
┌─────────────────────────────────────────────────────────┐
│ ✓ Scalability - Add more machines to handle load │
│ ✓ Reliability - No single point of failure │
│ ✓ Performance - Parallel processing across machines │
│ ✓ Cost-effective - Use commodity hardware │
│ ✓ Availability - Continue running during failures │
└─────────────────────────────────────────────────────────┘
The CAP Theorem
Understanding CAP
CAP Theorem states that a distributed system can only guarantee 2 of 3:
Consistency Availability Partition Tolerance
│ │ │
▼ ▼ ▼
All nodes see Every request System works
same data gets response during network
partitions
In reality: Partitions are inevitable, so we choose between:
- CP (Consistency + Partition Tolerance): e.g., ZooKeeper, etcd
- AP (Availability + Partition Tolerance): e.g., Cassandra, DynamoDB
Consistency Models
Strong Consistency:
┌─────┐ ──────┐ ──────┐ ──────┐
│ A:0 │ Write │ │ │
└─────┘ ──────┘ ──────┘ ──────┘
Write to B
│ All reads see write immediately
Eventual Consistency:
┌─────┐ ──────┐ ──────┐ ──────┐
│ A:0 │ Write │ │ │
└─────┘ ──────┘ ──────┘ ──────┘
Async replication
│ Reads may see stale data briefly
Read-your-writes Consistency:
┌─────┐ ──────┐ ──────┐ ──────┐
│ A:0 │ Write │ │ │
└─────┘ ──────┘ ──────┘ ──────┘
│ After write, same node reads see update
Consensus Algorithms
Raft Algorithm
Raft is a consensus algorithm designed to be understandable. It uses:
┌─────────────────────────────────────────────────────────┐
│ Raft Log │
├─────────────────────────────────────────────────────────┤
│ Index │ Term │ Command │
│ ──────┼──────┼──────────────────────────────────── │
│ 0 │ 1 │ AddServer("A") │
│ 1 │ 1 │ AddServer("B") │
│ 2 │ 2 │ AddClient("X") │
│ 3 │ 2 │ RemoveServer("A") │
└─────────────────────────────────────────────────────────┘
Raft States
┌──────────────┐
│ │
┌─────│ Leader │◀─────┐
│ │ │ │
│ └──────────────┘ │
│ │
┌─────┴─────┐ ┌───────┴────┐
│ │ │ │
▼ ▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Candidate │ │ Follower │ │ Candidate │
│ │ │ │ │ │
└──────────────┘ └──────────────┘ └──────────────┘
(Election) (Normal ops) (Election)
Leader Election in Raft
## Simplified leader election
class Node:
def __init__(self, node_id):
self.node_id = node_id
self.current_term = 0
self.voted_for = None
self.state = "follower"
self.votes_received = 0
def start_election(self):
self.current_term += 1
self.state = "candidate"
self.voted_for = self.node_id
self.votes_received = 1
# Request votes from other nodes
for peer in self.peers:
if peer.request_vote(self.current_term, self.node_id):
self.votes_received += 1
# If majority, become leader
if self.votes_received > len(self.peers) / 2:
self.become_leader()
def become_leader(self):
self.state = "leader"
# Send heartbeats to prevent new elections
while True:
self.send_append_entries()
time.sleep(HEARTBEAT_INTERVAL)
Leader Election Patterns
Basic Leader Election
import etcd3
class LeaderElection:
def __init__(self, etcd_client, election_name, node_id):
self.client = etcd_client
self.name = election_name
self.node_id = node_id
self.is_leader = False
def elect(self):
# Try to create ephemeral node
try:
self.client.put(
f"/elections/{self.name}",
self.node_id,
prevExist=False
)
self.is_leader = True
print(f"Node {self.node_id} became leader")
return True
except Exception:
# Check who is leader
value, _ = self.client.get(f"/elections/{self.name}")
print(f"Current leader: {value}")
return False
def watch_for_election(self):
# Watch for leadership changes
watch_id = self.client.watch(
f"/elections/{self.name}",
self.on_leader_change
)
Redis-based Leader Election
import redis
class RedisLeaderElection:
def __init__(self, redis_client, election_name, node_id):
self.redis = redis_client
self.key = f"election:{election_name}"
self.node_id = node_id
def become_candidate(self):
# Use Redis SET NX with TTL
result = self.redis.set(
self.key,
self.node_id,
nx=True,
ex=10 # 10 second lease
)
if result:
return True
# Check current leader
leader = self.redis.get(self.key)
return leader.decode() if leader else None
def renew_lease(self):
# Extend our lease periodically
self.redis.expire(self.key, 10)
Fault Tolerance
Failure Detection
import asyncio
class FailureDetector:
def __init__(self, nodes, threshold=3):
self.nodes = {node: 0 for node in nodes}
self.threshold = threshold
def record_heartbeat(self, node):
self.nodes[node] = 0
def check_failures(self):
for node, misses in self.nodes.items():
self.nodes[node] += 1
if self.nodes[node] >= self.threshold:
self.mark_node_failed(node)
def mark_node_failed(self, node):
print(f"Node {node} is suspected to have failed")
# Trigger failover or alert
Circuit Breaker
import time
from enum import Enum
class CircuitState(Enum):
CLOSED = "closed"
OPEN = "open"
HALF_OPEN = "half_open"
class CircuitBreaker:
def __init__(self, failure_threshold=5, timeout=60):
self.failure_count = 0
self.success_count = 0
self.failure_threshold = failure_threshold
self.timeout = timeout
self.state = CircuitState.CLOSED
self.last_failure_time = None
def call(self, func, *args, **kwargs):
if self.state == CircuitState.OPEN:
if time.time() - self.last_failure_time > self.timeout:
self.state = CircuitState.HALF_OPEN
else:
raise Exception("Circuit breaker is OPEN")
try:
result = func(*args, **kwargs)
self.on_success()
return result
except Exception as e:
self.on_failure()
raise e
def on_success(self):
self.failure_count = 0
self.state = CircuitState.CLOSED
def on_failure(self):
self.failure_count += 1
self.last_failure_time = time.time()
if self.failure_count >= self.failure_threshold:
self.state = CircuitState.OPEN
Distributed Transactions
Two-Phase Commit (2PC)
Phase 1: Prepare
┌────────┐ ┌────────┐ ┌────────┐
│ Coor- │────▶│ Node 1 │ │ Node 2 │
│ dinator│◀────│ │◀────│ │
└────────┘ └────────┘ └────────┘
│
▼
┌───────────┐
│ Prepare │
│ (Lock) │
└───────────┘
Phase 2: Commit
┌────────┐ ┌────────┐ ┌────────┐
│ Coor- │────▶│ Node 1 │ │ Node 2 │
│ dinator│────▶│ │────▶│ │
└────────┘ └────────┘ └────────┘
│
▼
┌───────────┐
│ Commit/ │
│ Rollback │
└───────────┘
Saga Pattern
## Orchestration-based Saga
class OrderSaga:
def __init__(self):
self.steps = [
self.create_order,
self.reserve_inventory,
self.process_payment,
self.ship_order
]
async def execute(self, order_data):
completed_steps = []
try:
for step in self.steps:
await step(order_data)
completed_steps.append(step)
except Exception as e:
# Compensate in reverse order
for step in reversed(completed_steps):
await self.compensate(step, order_data)
raise
async def create_order(self, data):
order = await db.orders.create(data)
data['order_id'] = order.id
async def compensate(self, step, data):
compensations = {
self.create_order: lambda d: db.orders.cancel(d['order_id']),
self.reserve_inventory: lambda d: db.inventory.release(d['sku']),
self.process_payment: lambda d: db.payments.refund(d['payment_id']),
}
await compensations[step](/architecture/data)
Data Replication
Primary-Replica Replication
Synchronous Replication:
┌────────┐ Write ┌────────┐ Write ┌────────┐
│Client │───────▶│Primary │────────▶│Replica │
│ │◀───────│ │◀────────│ │
└────────┘ ACK └────────┘ ACK └────────┘
Wait for all replicas
Asynchronous Replication:
┌────────┐ Write ┌────────┐ Write ┌────────┐
│Client │───────▶│Primary │────────▶│Replica │
│ │◀───────│ │ │ │
└────────┘ ACK └────────┘ └────────┘
Immediate return
Conflict Resolution
## Last-Write-Wins
class LWWResolver:
def resolve(self, values):
return max(values, key=lambda v: v.timestamp)
## Vector Clocks
class VectorClock:
def __init__(self):
self.clock = {}
def increment(self, node):
self.clock[node] = self.clock.get(node, 0) + 1
def merge(self, other):
for node, time in other.clock.items():
self.clock[node] = max(self.clock.get(node, 0), time)
def happens_before(self, other):
# Check if self happened before other
return all(
self.clock.get(n, 0) <= other.clock.get(n, 0)
for n in set(self.clock.keys()) | set(other.clock.keys())
)
Conclusion
Distributed systems are built on foundational tradeoffs: consistency vs. availability (CAP), latency vs. throughput, and partition tolerance vs. coordination overhead. Understanding these tradeoffs is more important than memorizing specific algorithms. Choose the right consistency model for each data path—strong consistency where correctness matters, eventual consistency where availability and speed are paramount.
External Resources
- Designing Data-Intensive Applications
- Raft Paper
- Introduction to Reliable and Secure Distributed Systems
- Distributed Systems for Practitioners
Comments