Introduction
System design is the process of defining architecture, components, and data flow for a system. Whether you are building a startup MVP or an enterprise platform serving millions, understanding system design principles helps you create scalable, maintainable applications.
In 2026, system design sits at the intersection of mature cloud-native practices and AI-native workloads. LLM serving, RAG pipelines, and autonomous agents sit directly in the request path. The bar for designing resilient systems is higher than ever.
This guide covers the fundamentals — from scalability basics and capacity estimation to real-world system design examples.
Scalability Fundamentals
Horizontal vs Vertical Scaling
Vertical scaling (scaling up) adds more resources to an existing machine: more CPU cores, RAM, or faster storage. It is simple to implement but has hardware limits and creates a single point of failure.
Horizontal scaling (scaling out) adds more machines to distribute the load. It is more complex but provides better fault tolerance and is the preferred approach for large systems.
class ScalingStrategy:
def __init__(self, current_capacity):
self.capacity = current_capacity
def vertical(self, additional_ram_gb):
self.capacity += additional_ram_gb * 1000
return self.capacity
def horizontal(self, node_count):
self.capacity *= node_count
return self.capacity
Back-of-the-Envelope Estimation
Capacity estimation separates good designs from great ones. Before building, estimate whether the architecture can handle the expected load.
Latency numbers every developer should know (2026 benchmarks reflect modern hardware):
| Operation | Time |
|---|---|
| L1 cache reference | 0.5 ns |
| Branch mispredict | 3 ns |
| L2 cache reference | 7 ns |
| Mutex lock/unlock | 25 ns |
| Main memory reference | 100 ns |
| SSD random read | 16 μs |
| Network round trip (same DC) | 500 μs |
| Network round trip (cross-region) | 50-100 ms |
| Disk seek | 4 ms |
| S3 read (small object) | 10-50 ms |
Memory and storage sizes:
| Object | Size |
|---|---|
| UUID | 36 bytes |
| 1 text character | 1 byte |
| Typical HTML page | 50 KB |
| Typical image | 200 KB - 1 MB |
| Typical video (per minute) | 10 MB - 50 MB |
| Typical database row | 1 KB |
Capacity estimation example — design a URL shortener:
def estimate_url_shortener():
daily_active_users = 100_000_000
daily_writes = daily_active_users * 0.1 # 10M writes/day
daily_reads = daily_writes * 100 # 1B reads/day
write_qps = daily_writes / 86400 # ~115 QPS
read_qps = daily_reads / 86400 # ~11,574 QPS
peak_qps = read_qps * 2 # ~23,148 QPS
storage_per_url = 500 # bytes
yearly_storage = daily_writes * 365 * storage_per_url # ~1.8 TB
return {
"write_qps": write_qps,
"read_qps": read_qps,
"peak_qps": peak_qps,
"yearly_storage_gb": yearly_storage / 1e9
}
estimates = estimate_url_shortener()
print(estimates)
Use these estimates to decide on server count, cache sizes, and database sharding strategy. Add 30% headroom for traffic spikes and maintenance.
Load Balancing
Load balancers distribute incoming traffic across multiple servers to prevent any single server from becoming a bottleneck.
import random
from collections import defaultdict
class LoadBalancer:
def __init__(self, servers):
self.servers = servers
self.current = 0
self.connections = defaultdict(int)
def round_robin(self):
server = self.servers[self.current]
self.current = (self.current + 1) % len(self.servers)
return server
def least_connections(self):
return min(self.servers, key=lambda s: self.connections[s])
def weighted(self, weights):
total = sum(weights.values())
r = random.uniform(0, total)
upto = 0
for server, weight in weights.items():
upto += weight
if r <= upto:
return server
Load Balancing Algorithms
| Algorithm | Distribution | Use Case |
|---|---|---|
| Round Robin | Sequential | Equal-capacity servers |
| Least Connections | Fewest active requests | Variable request duration |
| IP Hash | Same client to same server | Session persistence |
| Weighted Round Robin | Performance-based distribution | Heterogeneous servers |
| Random | Randomized selection | Large server pools |
Health checks are critical — load balancers should remove unresponsive servers from rotation and reintroduce them only after they pass consecutive health checks.
Caching
Caching stores frequently accessed data in high-speed storage to reduce latency and decrease load on primary data sources.
import redis
import json
import time
class CacheAside:
"""Application manages cache — check first, fetch on miss."""
def __init__(self, redis_client, db_client):
self.cache = redis_client
self.db = db_client
def get_user(self, user_id):
cache_key = f"user:{user_id}"
cached = self.cache.get(cache_key)
if cached:
return json.loads(cached)
user = self.db.query("SELECT * FROM users WHERE id = ?", user_id)
if user:
self.cache.setex(cache_key, 3600, json.dumps(user))
return user
def invalidate(self, user_id):
self.cache.delete(f"user:{user_id}")
Cache Strategies
| Strategy | Behavior | Use Case |
|---|---|---|
| Cache-aside | App checks cache, writes on miss | Read-heavy workloads |
| Write-through | Write to cache and DB synchronously | Read-write balance |
| Write-behind | Write to cache, async DB write | High write throughput |
| Read-through | Cache fetches from DB on miss | Simplifies app logic |
| TTL | Time-based automatic eviction | Stale data is acceptable |
Cache eviction policies:
- LRU (Least Recently Used) — evicts items not accessed longest
- LFU (Least Frequently Used) — evicts items accessed least often
- FIFO (First In, First Out) — evicts oldest items regardless of access
- TTL (Time To Live) — evicts items after a fixed duration
When caching helps:
- Same data read repeatedly (user profiles, product catalogs)
- Computationally expensive results (aggregated analytics)
- API responses that change infrequently
When caching hurts:
- Rapidly changing data (stock prices, live scores)
- Data that requires strong consistency
- Cache stampedes — many clients requesting the same expired key simultaneously
Content Delivery Network (CDN)
A CDN is a globally distributed network of proxy servers that caches static (and sometimes dynamic) content at edge locations close to users. CDNs reduce latency, offload origin servers, and absorb traffic spikes.
class CDNEdge:
def __init__(self, origin_url):
self.origin = origin_url
self.cache = {}
def serve(self, path, user_region):
edge_key = f"{path}:{user_region}"
if edge_key in self.cache:
return self.cache[edge_key]
content = self.fetch_from_origin(path)
self.cache[edge_key] = content
return content
def prefetch(self, paths):
for path in paths:
content = self.fetch_from_origin(path)
for region in self.edge_locations:
self.cache[f"{path}:{region}"] = content
CDNs are essential for global-scale applications. Cloudflare, AWS CloudFront, and Fastly provide edge caching, DDoS protection, and origin shielding.
Rate Limiting
Rate limiting controls how many requests a client can make within a time window. It prevents abuse, ensures fair usage, and protects backend services from traffic spikes.
import time
from collections import defaultdict
class TokenBucket:
def __init__(self, capacity, refill_rate):
self.capacity = capacity
self.tokens = capacity
self.refill_rate = refill_rate
self.last_refill = time.time()
def allow_request(self):
now = time.time()
elapsed = now - self.last_refill
self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
self.last_refill = now
if self.tokens >= 1:
self.tokens -= 1
return True
return False
class SlidingWindow:
def __init__(self, limit, window_seconds):
self.limit = limit
self.window = window_seconds
self.requests = defaultdict(list)
def allow(self, client_id):
now = time.time()
window_start = now - self.window
self.requests[client_id] = [
t for t in self.requests[client_id] if t > window_start
]
if len(self.requests[client_id]) < self.limit:
self.requests[client_id].append(now)
return True
return False
Rate limiting algorithms:
| Algorithm | Behavior | Trade-offs |
|---|---|---|
| Token bucket | Tokens refill at a fixed rate | Allows bursts up to capacity |
| Leaky bucket | Requests processed at fixed rate | Smooths traffic, no bursts |
| Fixed window | Count per clock-aligned window | Traffic spike at window boundary |
| Sliding window | Count per rolling time window | More accurate, more memory |
Rate limiting is typically implemented at the API gateway or load balancer level.
Database Design
SQL vs NoSQL
| Dimension | SQL | NoSQL |
|---|---|---|
| Schema | Fixed, predefined | Flexible, dynamic |
| Consistency | Strong (ACID) | Varies (eventual to strong) |
| Queries | Complex joins, aggregations | Simple key-based lookups |
| Scaling | Vertical (primary) | Horizontal (sharding) |
| Best for | Transactions, relationships | High throughput, flexible data |
When to Use Each
Choose SQL (PostgreSQL, MySQL) for transactional data, complex relationships, and ACID requirements. Financial systems, inventory management, and user accounts are SQL domains.
Choose NoSQL (Cassandra, DynamoDB, MongoDB) for high write throughput, flexible schemas, and horizontal scaling. Time-series data, session stores, and product catalogs are NoSQL domains.
Database Patterns
Read replicas distribute read traffic across copies of the primary database. All writes go to the primary; reads can go to any replica. This pattern improves read throughput but introduces replication lag.
class ReadReplicaManager:
def __init__(self, primary, replicas):
self.primary = primary
self.replicas = replicas
def write(self, query, params):
return self.primary.execute(query, params)
def read(self, query, params):
replica = random.choice(self.replicas)
return replica.execute(query, params)
Sharding splits data across multiple database instances. Each shard holds a subset of data. The shard key must be chosen carefully to avoid hot spots.
import hashlib
class ShardManager:
def __init__(self, shards):
self.shards = shards
def get_shard(self, key):
shard_id = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(self.shards)
return self.shards[shard_id]
def write(self, key, data):
shard = self.get_shard(key)
shard.write(key, data)
Sharding strategies:
- Range-based: Split by key range (users A-M on shard 1, N-Z on shard 2). Simple but can cause hot spots.
- Hash-based: Apply hash function to shard key. Better distribution but breaks range queries.
- Directory-based: Lookup table maps keys to shards. Most flexible but adds a lookup hop.
Connection pooling reuses database connections instead of opening a new one per request. It reduces connection overhead and prevents database overload.
from queue import Queue
import psycopg2
class ConnectionPool:
def __init__(self, conn_string, pool_size=10):
self.pool = Queue(maxsize=pool_size)
for _ in range(pool_size):
conn = psycopg2.connect(conn_string)
self.pool.put(conn)
def acquire(self):
return self.pool.get()
def release(self, conn):
self.pool.put(conn)
Consistent Hashing
Consistent hashing distributes keys across nodes with minimal redistribution when nodes join or leave. Each node is assigned positions on a hash ring; each key is assigned to the nearest node clockwise.
import hashlib
import bisect
class ConsistentHashRing:
def __init__(self, virtual_nodes=150):
self.ring = {}
self.sorted_keys = []
self.virtual_nodes = virtual_nodes
def _hash(self, key):
return int(hashlib.md5(key.encode()).hexdigest(), 16)
def add_node(self, node_id):
for i in range(self.virtual_nodes):
vnode = f"{node_id}:{i}"
h = self._hash(vnode)
self.ring[h] = node_id
self.sorted_keys = sorted(self.ring.keys())
def remove_node(self, node_id):
for i in range(self.virtual_nodes):
vnode = f"{node_id}:{i}"
h = self._hash(vnode)
self.ring.pop(h, None)
self.sorted_keys = sorted(self.ring.keys())
def get_node(self, key):
if not self.ring:
return None
h = self._hash(key)
idx = bisect.bisect_right(self.sorted_keys, h)
if idx == len(self.sorted_keys):
idx = 0
return self.ring[self.sorted_keys[idx]]
Consistent hashing is used by Amazon DynamoDB, Cassandra, and Discord for data partitioning. Virtual nodes help balance load when physical nodes have different capacities.
Message Queues and Async Processing
Asynchronous processing decouples components and smooths traffic spikes. Instead of waiting for a task to complete, the system places the task into a queue and workers process it independently.
class MessageQueue:
def __init__(self):
self.queues = {}
self.dead_letter = []
def publish(self, topic, message):
if topic not in self.queues:
self.queues[topic] = []
self.queues[topic].append(message)
def subscribe(self, topic, worker):
messages = self.queues.get(topic, [])
for msg in messages:
try:
worker.process(msg)
self.queues[topic].remove(msg)
except Exception as e:
self.dead_letter.append({"topic": topic, "message": msg, "error": str(e)})
When to use async processing:
- Email or push notification delivery
- Image/video transcoding
- Report generation
- Payment processing confirmation
- Data synchronization between services
Message queue vs pub/sub:
| Pattern | Delivery | Use Case |
|---|---|---|
| Message queue | One consumer per message | Task distribution |
| Pub/Sub | All subscribers receive | Event broadcasting |
Common tools: RabbitMQ, Apache Kafka, Amazon SQS, Google Pub/Sub, Redis Streams.
API Design Patterns
REST
REST uses standard HTTP methods (GET, POST, PUT, DELETE) and stateless communication. It is simple, cacheable, and widely supported.
from flask import Flask, jsonify, request
app = Flask(__name__)
@app.route("/api/users/<user_id>", methods=["GET"])
def get_user(user_id):
user = db.query("SELECT * FROM users WHERE id = ?", user_id)
if not user:
return jsonify({"error": "not found"}), 404
return jsonify(user)
@app.route("/api/users", methods=["POST"])
def create_user():
data = request.get_json()
user_id = db.execute("INSERT INTO users (name, email) VALUES (?, ?)",
data["name"], data["email"])
return jsonify({"id": user_id}), 201
GraphQL
GraphQL lets clients request exactly the data they need. It reduces over-fetching and under-fetching but adds query complexity.
schema = """
type Query {
user(id: ID!): User
users(limit: Int): [User]
}
type User {
id: ID!
name: String
orders: [Order]
}
"""
gRPC
gRPC uses Protocol Buffers for efficient binary serialization. It supports streaming, bidirectional communication, and is ideal for high-throughput internal services.
service UserService {
rpc GetUser (GetUserRequest) returns (User);
rpc ListUsers (ListUsersRequest) returns (stream User);
}
message GetUserRequest {
string user_id = 1;
}
When to use each:
| Protocol | Strengths | Weaknesses | Best For |
|---|---|---|---|
| REST | Simple, cacheable, ubiquitous | Over-fetching, no real-time | Public APIs |
| GraphQL | Flexible queries, strong typing | Complex caching, N+1 problem | Complex UIs |
| gRPC | Fast, streaming, typed | Binary encoding, less browser support | Internal services |
Microservices Communication
Synchronous vs Asynchronous
Synchronous (REST/gRPC) — services call each other directly. Simple to implement but creates tight coupling and cascading failure risk.
Asynchronous (message queues/events) — services communicate through events. Loose coupling enables independent scaling and better fault isolation.
import asyncio
class OrderService:
def __init__(self, event_bus, payment_service, inventory_service):
self.events = event_bus
self.payments = payment_service
self.inventory = inventory_service
async def place_order(self, order_data):
order = await self.save_order(order_data)
self.events.publish("order.created", {
"order_id": order.id,
"user_id": order.user_id,
"total": order.total
})
print("Order placed — async processing will handle payment and inventory")
Service Discovery
In dynamic environments where services scale up and down, service discovery lets services find each other without hardcoded addresses.
# Kubernetes service definition
apiVersion: v1
kind: Service
metadata:
name: user-service
spec:
selector:
app: user-service
ports:
- port: 80
targetPort: 8080
Tools: Kubernetes DNS, Consul, ZooKeeper, Netflix Eureka.
Reliability Patterns
Circuit Breaker
A circuit breaker prevents cascading failures by failing fast when a downstream dependency is unhealthy.
from enum import Enum
import time
class CircuitState(Enum):
CLOSED = "closed"
OPEN = "open"
HALF_OPEN = "half_open"
class CircuitBreaker:
def __init__(self, failure_threshold=5, success_threshold=3, timeout=30):
self.failure_count = 0
self.success_count = 0
self.failure_threshold = failure_threshold
self.success_threshold = success_threshold
self.timeout = timeout
self.state = CircuitState.CLOSED
self.last_failure = None
def call(self, func, *args, **kwargs):
if self.state == CircuitState.OPEN:
if self.last_failure and time.time() - self.last_failure > self.timeout:
self.state = CircuitState.HALF_OPEN
else:
raise Exception("Circuit breaker is OPEN — request blocked")
try:
result = func(*args, **kwargs)
self.on_success()
return result
except Exception as e:
self.on_failure()
raise
def on_success(self):
if self.state == CircuitState.HALF_OPEN:
self.success_count += 1
if self.success_count >= self.success_threshold:
self.state = CircuitState.CLOSED
self.failure_count = 0
self.success_count = 0
else:
self.failure_count = 0
def on_failure(self):
self.failure_count += 1
self.last_failure = time.time()
if self.failure_count >= self.failure_threshold:
self.state = CircuitState.OPEN
Retry with Exponential Backoff
Transient failures (network timeouts, connection resets) can be handled with retries, but retrying without backoff can cause a retry storm.
import time
import random
def retry_with_backoff(func, max_retries=3, base_delay=0.1):
for attempt in range(max_retries):
try:
return func()
except Exception as e:
if attempt == max_retries - 1:
raise
delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
time.sleep(delay)
Timeouts
Every external call must have a timeout. Without one, a slow dependency can exhaust connection pools and cascade failures.
import asyncio
async def call_with_timeout(func, timeout_ms=500):
try:
return await asyncio.wait_for(func(), timeout=timeout_ms / 1000)
except asyncio.TimeoutError:
raise Exception("Request timed out")
Availability Numbers
| Uptime % | Downtime per year | Example |
|---|---|---|
| 99% | 3.65 days | Internal tools |
| 99.9% | 8.76 hours | Many SaaS products |
| 99.99% | 52.56 minutes | Enterprise systems |
| 99.999% | 5.26 minutes | Critical infrastructure |
System Design Process
Step 1: Requirements Clarification
Define the scope before drawing any boxes. Ask about:
- Functional requirements: what features does the system need?
- Non-functional requirements: what scale, latency, availability, and consistency?
- Constraints: time, budget, team size, existing infrastructure
Step 2: Capacity Estimation
Estimate traffic, storage, bandwidth, and server count. Use back-of-the-envelope calculations to validate assumptions before designing.
Step 3: High-Level Design
Draw the system architecture: clients, load balancers, API gateways, application servers, databases, caches, message queues, and CDNs. Explain data flow end to end.
Step 4: Deep Dive
Focus on one or two components. Discuss database schema, caching strategy, sharding approach, and consistency tradeoffs. Justify every decision.
Step 5: Bottlenecks and Scale
Identify single points of failure. Discuss how the system handles 10x the current load. Address data replication, partition tolerance, and disaster recovery.
Step 6: Wrap Up
Summarize the design, restate tradeoffs, and suggest future improvements.
Common System Designs
URL Shortener
flowchart LR
Client --> LB[Load Balancer]
LB --> API[API Server]
API --> Cache[Redis Cache]
API --> DB[(Database)]
API --> MQ[Message Queue]
MQ --> Analytics[Analytics Worker]
Components:
- Hash function (Base62 encoding of unique ID) generates short codes
- Database stores mapping of short code to long URL
- Cache serves popular URLs — 95%+ cache hit rate expected
- 301 redirect for permanent URLs, 302 for analytics tracking
- Analytics worker processes click data asynchronously
Capacity: 100M daily active users, 10M new URLs/day, 1B redirects/day. Database sharded by URL hash. Cache with 80% hit rate reduces DB load by 5x.
News Feed (Twitter-style)
flowchart LR
Client --> LB[Load Balancer]
LB --> API[API Server]
API --> Cache[Feed Cache]
API --> Graph[(Social Graph DB)]
API --> Timeline[(Timeline DB)]
API --> MQ[Fanout Queue]
MQ --> Worker[Fanout Worker]
Two approaches:
Pull — Generate feed on read. When a user opens the app, fetch recent posts from followed users and merge. Works well for users who follow few accounts or check infrequently.
Push — Pre-compute feed on write. When a user posts, fan out the post to all followers’ timelines. Works well for users with few followers but fan-out to celebrities with millions of followers requires a hybrid approach.
Fanout service:
class FanoutService:
def fanout_to_followers(self, post, author_id, followers):
for follower in followers:
timeline_key = f"timeline:{follower}"
self.cache.lpush(timeline_key, post)
self.cache.ltrim(timeline_key, 0, 500)
Cache performs better when feeds are pre-computed (fanout on write), but for users with millions of followers, fanout on read avoids the thundering herd problem of writing to millions of timeline caches simultaneously.
Modern Trends (2025-2026)
AI Infrastructure Design
LLM serving, embedding pipelines, and RAG architectures are now common system design topics. Key considerations:
- GPU resource management — batch inference, model parallelism, quantization
- Vector databases — embedding storage and similarity search (Pinecone, Weaviate, Milvus)
- RAG pipelines — indexing, chunking, retrieval, and augmentation
- Cost bounding — model inference often dwarfs other infrastructure costs
An architecture for an LLM-powered chatbot at scale:
flowchart LR
Client --> LB[Load Balancer]
LB --> Router[Request Router]
Router --> Cache[Response Cache]
Router --> RAG[RAG Pipeline]
RAG --> VectorDB[(Vector DB)]
RAG --> LLM[Inference Cluster]
LLM --> GPU[GPU Workers]
Object Storage as the Database
Systems like Amazon Aurora, Snowflake, and WarpStream use object storage (S3, Azure Blob) as a core architectural component. Compute is separated from storage, enabling independent scaling of each layer.
Decomposing Databases
Databases are disaggregating into separate engines for parsing, optimization, execution, and storage. This decomposition allows each component to scale independently and be optimized for specific workloads.
Performance Optimization Checklist
- Cache frequently accessed data at multiple layers (CDN, app, database)
- Use read replicas for read-heavy workloads
- Add connection pooling to manage database connections
- Set timeouts on every external call (start with 500ms)
- Implement circuit breakers for downstream dependencies
- Use exponential backoff with jitter for retries
- Optimize slow queries with proper indexing
- Use async processing for non-critical operations
- Compress responses (gzip/brotli)
- Monitor latency distributions, not just averages
Resources
- System Design Primer
- Designing Data-Intensive Applications
- High Scalability Blog
- 50 System Design Patterns (2026 Edition)
- ByteByteGo System Design Course
- The Complete Guide to System Design in 2026
Related Articles
- Distributed Systems Fundamentals
- System Design Interview Guide
- Load Balancing Strategies
- Microservices vs Monolith Architecture
Comments