Skip to main content

Understanding System Design: Scalable Applications

Created: March 9, 2026 Larry Qu 15 min read

Introduction

System design is the process of defining architecture, components, and data flow for a system. Whether you are building a startup MVP or an enterprise platform serving millions, understanding system design principles helps you create scalable, maintainable applications.

In 2026, system design sits at the intersection of mature cloud-native practices and AI-native workloads. LLM serving, RAG pipelines, and autonomous agents sit directly in the request path. The bar for designing resilient systems is higher than ever.

This guide covers the fundamentals — from scalability basics and capacity estimation to real-world system design examples.

Scalability Fundamentals

Horizontal vs Vertical Scaling

Vertical scaling (scaling up) adds more resources to an existing machine: more CPU cores, RAM, or faster storage. It is simple to implement but has hardware limits and creates a single point of failure.

Horizontal scaling (scaling out) adds more machines to distribute the load. It is more complex but provides better fault tolerance and is the preferred approach for large systems.

class ScalingStrategy:
    def __init__(self, current_capacity):
        self.capacity = current_capacity

    def vertical(self, additional_ram_gb):
        self.capacity += additional_ram_gb * 1000
        return self.capacity

    def horizontal(self, node_count):
        self.capacity *= node_count
        return self.capacity

Back-of-the-Envelope Estimation

Capacity estimation separates good designs from great ones. Before building, estimate whether the architecture can handle the expected load.

Latency numbers every developer should know (2026 benchmarks reflect modern hardware):

Operation Time
L1 cache reference 0.5 ns
Branch mispredict 3 ns
L2 cache reference 7 ns
Mutex lock/unlock 25 ns
Main memory reference 100 ns
SSD random read 16 μs
Network round trip (same DC) 500 μs
Network round trip (cross-region) 50-100 ms
Disk seek 4 ms
S3 read (small object) 10-50 ms

Memory and storage sizes:

Object Size
UUID 36 bytes
1 text character 1 byte
Typical HTML page 50 KB
Typical image 200 KB - 1 MB
Typical video (per minute) 10 MB - 50 MB
Typical database row 1 KB

Capacity estimation example — design a URL shortener:

def estimate_url_shortener():
    daily_active_users = 100_000_000
    daily_writes = daily_active_users * 0.1     # 10M writes/day
    daily_reads = daily_writes * 100               # 1B reads/day

    write_qps = daily_writes / 86400               # ~115 QPS
    read_qps = daily_reads / 86400                 # ~11,574 QPS
    peak_qps = read_qps * 2                        # ~23,148 QPS

    storage_per_url = 500                          # bytes
    yearly_storage = daily_writes * 365 * storage_per_url  # ~1.8 TB

    return {
        "write_qps": write_qps,
        "read_qps": read_qps,
        "peak_qps": peak_qps,
        "yearly_storage_gb": yearly_storage / 1e9
    }

estimates = estimate_url_shortener()
print(estimates)

Use these estimates to decide on server count, cache sizes, and database sharding strategy. Add 30% headroom for traffic spikes and maintenance.

Load Balancing

Load balancers distribute incoming traffic across multiple servers to prevent any single server from becoming a bottleneck.

import random
from collections import defaultdict

class LoadBalancer:
    def __init__(self, servers):
        self.servers = servers
        self.current = 0
        self.connections = defaultdict(int)

    def round_robin(self):
        server = self.servers[self.current]
        self.current = (self.current + 1) % len(self.servers)
        return server

    def least_connections(self):
        return min(self.servers, key=lambda s: self.connections[s])

    def weighted(self, weights):
        total = sum(weights.values())
        r = random.uniform(0, total)
        upto = 0
        for server, weight in weights.items():
            upto += weight
            if r <= upto:
                return server

Load Balancing Algorithms

Algorithm Distribution Use Case
Round Robin Sequential Equal-capacity servers
Least Connections Fewest active requests Variable request duration
IP Hash Same client to same server Session persistence
Weighted Round Robin Performance-based distribution Heterogeneous servers
Random Randomized selection Large server pools

Health checks are critical — load balancers should remove unresponsive servers from rotation and reintroduce them only after they pass consecutive health checks.

Caching

Caching stores frequently accessed data in high-speed storage to reduce latency and decrease load on primary data sources.

import redis
import json
import time

class CacheAside:
    """Application manages cache — check first, fetch on miss."""
    def __init__(self, redis_client, db_client):
        self.cache = redis_client
        self.db = db_client

    def get_user(self, user_id):
        cache_key = f"user:{user_id}"
        cached = self.cache.get(cache_key)
        if cached:
            return json.loads(cached)

        user = self.db.query("SELECT * FROM users WHERE id = ?", user_id)
        if user:
            self.cache.setex(cache_key, 3600, json.dumps(user))
        return user

    def invalidate(self, user_id):
        self.cache.delete(f"user:{user_id}")

Cache Strategies

Strategy Behavior Use Case
Cache-aside App checks cache, writes on miss Read-heavy workloads
Write-through Write to cache and DB synchronously Read-write balance
Write-behind Write to cache, async DB write High write throughput
Read-through Cache fetches from DB on miss Simplifies app logic
TTL Time-based automatic eviction Stale data is acceptable

Cache eviction policies:

  • LRU (Least Recently Used) — evicts items not accessed longest
  • LFU (Least Frequently Used) — evicts items accessed least often
  • FIFO (First In, First Out) — evicts oldest items regardless of access
  • TTL (Time To Live) — evicts items after a fixed duration

When caching helps:

  • Same data read repeatedly (user profiles, product catalogs)
  • Computationally expensive results (aggregated analytics)
  • API responses that change infrequently

When caching hurts:

  • Rapidly changing data (stock prices, live scores)
  • Data that requires strong consistency
  • Cache stampedes — many clients requesting the same expired key simultaneously

Content Delivery Network (CDN)

A CDN is a globally distributed network of proxy servers that caches static (and sometimes dynamic) content at edge locations close to users. CDNs reduce latency, offload origin servers, and absorb traffic spikes.

class CDNEdge:
    def __init__(self, origin_url):
        self.origin = origin_url
        self.cache = {}

    def serve(self, path, user_region):
        edge_key = f"{path}:{user_region}"
        if edge_key in self.cache:
            return self.cache[edge_key]

        content = self.fetch_from_origin(path)
        self.cache[edge_key] = content
        return content

    def prefetch(self, paths):
        for path in paths:
            content = self.fetch_from_origin(path)
            for region in self.edge_locations:
                self.cache[f"{path}:{region}"] = content

CDNs are essential for global-scale applications. Cloudflare, AWS CloudFront, and Fastly provide edge caching, DDoS protection, and origin shielding.

Rate Limiting

Rate limiting controls how many requests a client can make within a time window. It prevents abuse, ensures fair usage, and protects backend services from traffic spikes.

import time
from collections import defaultdict

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate
        self.last_refill = time.time()

    def allow_request(self):
        now = time.time()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

class SlidingWindow:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.requests = defaultdict(list)

    def allow(self, client_id):
        now = time.time()
        window_start = now - self.window

        self.requests[client_id] = [
            t for t in self.requests[client_id] if t > window_start
        ]

        if len(self.requests[client_id]) < self.limit:
            self.requests[client_id].append(now)
            return True
        return False

Rate limiting algorithms:

Algorithm Behavior Trade-offs
Token bucket Tokens refill at a fixed rate Allows bursts up to capacity
Leaky bucket Requests processed at fixed rate Smooths traffic, no bursts
Fixed window Count per clock-aligned window Traffic spike at window boundary
Sliding window Count per rolling time window More accurate, more memory

Rate limiting is typically implemented at the API gateway or load balancer level.

Database Design

SQL vs NoSQL

Dimension SQL NoSQL
Schema Fixed, predefined Flexible, dynamic
Consistency Strong (ACID) Varies (eventual to strong)
Queries Complex joins, aggregations Simple key-based lookups
Scaling Vertical (primary) Horizontal (sharding)
Best for Transactions, relationships High throughput, flexible data

When to Use Each

Choose SQL (PostgreSQL, MySQL) for transactional data, complex relationships, and ACID requirements. Financial systems, inventory management, and user accounts are SQL domains.

Choose NoSQL (Cassandra, DynamoDB, MongoDB) for high write throughput, flexible schemas, and horizontal scaling. Time-series data, session stores, and product catalogs are NoSQL domains.

Database Patterns

Read replicas distribute read traffic across copies of the primary database. All writes go to the primary; reads can go to any replica. This pattern improves read throughput but introduces replication lag.

class ReadReplicaManager:
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def write(self, query, params):
        return self.primary.execute(query, params)

    def read(self, query, params):
        replica = random.choice(self.replicas)
        return replica.execute(query, params)

Sharding splits data across multiple database instances. Each shard holds a subset of data. The shard key must be chosen carefully to avoid hot spots.

import hashlib

class ShardManager:
    def __init__(self, shards):
        self.shards = shards

    def get_shard(self, key):
        shard_id = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(self.shards)
        return self.shards[shard_id]

    def write(self, key, data):
        shard = self.get_shard(key)
        shard.write(key, data)

Sharding strategies:

  • Range-based: Split by key range (users A-M on shard 1, N-Z on shard 2). Simple but can cause hot spots.
  • Hash-based: Apply hash function to shard key. Better distribution but breaks range queries.
  • Directory-based: Lookup table maps keys to shards. Most flexible but adds a lookup hop.

Connection pooling reuses database connections instead of opening a new one per request. It reduces connection overhead and prevents database overload.

from queue import Queue
import psycopg2

class ConnectionPool:
    def __init__(self, conn_string, pool_size=10):
        self.pool = Queue(maxsize=pool_size)
        for _ in range(pool_size):
            conn = psycopg2.connect(conn_string)
            self.pool.put(conn)

    def acquire(self):
        return self.pool.get()

    def release(self, conn):
        self.pool.put(conn)

Consistent Hashing

Consistent hashing distributes keys across nodes with minimal redistribution when nodes join or leave. Each node is assigned positions on a hash ring; each key is assigned to the nearest node clockwise.

import hashlib
import bisect

class ConsistentHashRing:
    def __init__(self, virtual_nodes=150):
        self.ring = {}
        self.sorted_keys = []
        self.virtual_nodes = virtual_nodes

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node_id):
        for i in range(self.virtual_nodes):
            vnode = f"{node_id}:{i}"
            h = self._hash(vnode)
            self.ring[h] = node_id
        self.sorted_keys = sorted(self.ring.keys())

    def remove_node(self, node_id):
        for i in range(self.virtual_nodes):
            vnode = f"{node_id}:{i}"
            h = self._hash(vnode)
            self.ring.pop(h, None)
        self.sorted_keys = sorted(self.ring.keys())

    def get_node(self, key):
        if not self.ring:
            return None
        h = self._hash(key)
        idx = bisect.bisect_right(self.sorted_keys, h)
        if idx == len(self.sorted_keys):
            idx = 0
        return self.ring[self.sorted_keys[idx]]

Consistent hashing is used by Amazon DynamoDB, Cassandra, and Discord for data partitioning. Virtual nodes help balance load when physical nodes have different capacities.

Message Queues and Async Processing

Asynchronous processing decouples components and smooths traffic spikes. Instead of waiting for a task to complete, the system places the task into a queue and workers process it independently.

class MessageQueue:
    def __init__(self):
        self.queues = {}
        self.dead_letter = []

    def publish(self, topic, message):
        if topic not in self.queues:
            self.queues[topic] = []
        self.queues[topic].append(message)

    def subscribe(self, topic, worker):
        messages = self.queues.get(topic, [])
        for msg in messages:
            try:
                worker.process(msg)
                self.queues[topic].remove(msg)
            except Exception as e:
                self.dead_letter.append({"topic": topic, "message": msg, "error": str(e)})

When to use async processing:

  • Email or push notification delivery
  • Image/video transcoding
  • Report generation
  • Payment processing confirmation
  • Data synchronization between services

Message queue vs pub/sub:

Pattern Delivery Use Case
Message queue One consumer per message Task distribution
Pub/Sub All subscribers receive Event broadcasting

Common tools: RabbitMQ, Apache Kafka, Amazon SQS, Google Pub/Sub, Redis Streams.

API Design Patterns

REST

REST uses standard HTTP methods (GET, POST, PUT, DELETE) and stateless communication. It is simple, cacheable, and widely supported.

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/api/users/<user_id>", methods=["GET"])
def get_user(user_id):
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    if not user:
        return jsonify({"error": "not found"}), 404
    return jsonify(user)

@app.route("/api/users", methods=["POST"])
def create_user():
    data = request.get_json()
    user_id = db.execute("INSERT INTO users (name, email) VALUES (?, ?)",
                         data["name"], data["email"])
    return jsonify({"id": user_id}), 201

GraphQL

GraphQL lets clients request exactly the data they need. It reduces over-fetching and under-fetching but adds query complexity.

schema = """
type Query {
    user(id: ID!): User
    users(limit: Int): [User]
}
type User {
    id: ID!
    name: String
    orders: [Order]
}
"""

gRPC

gRPC uses Protocol Buffers for efficient binary serialization. It supports streaming, bidirectional communication, and is ideal for high-throughput internal services.

service UserService {
    rpc GetUser (GetUserRequest) returns (User);
    rpc ListUsers (ListUsersRequest) returns (stream User);
}

message GetUserRequest {
    string user_id = 1;
}

When to use each:

Protocol Strengths Weaknesses Best For
REST Simple, cacheable, ubiquitous Over-fetching, no real-time Public APIs
GraphQL Flexible queries, strong typing Complex caching, N+1 problem Complex UIs
gRPC Fast, streaming, typed Binary encoding, less browser support Internal services

Microservices Communication

Synchronous vs Asynchronous

Synchronous (REST/gRPC) — services call each other directly. Simple to implement but creates tight coupling and cascading failure risk.

Asynchronous (message queues/events) — services communicate through events. Loose coupling enables independent scaling and better fault isolation.

import asyncio

class OrderService:
    def __init__(self, event_bus, payment_service, inventory_service):
        self.events = event_bus
        self.payments = payment_service
        self.inventory = inventory_service

    async def place_order(self, order_data):
        order = await self.save_order(order_data)
        self.events.publish("order.created", {
            "order_id": order.id,
            "user_id": order.user_id,
            "total": order.total
        })
        print("Order placed — async processing will handle payment and inventory")

Service Discovery

In dynamic environments where services scale up and down, service discovery lets services find each other without hardcoded addresses.

# Kubernetes service definition
apiVersion: v1
kind: Service
metadata:
  name: user-service
spec:
  selector:
    app: user-service
  ports:
  - port: 80
    targetPort: 8080

Tools: Kubernetes DNS, Consul, ZooKeeper, Netflix Eureka.

Reliability Patterns

Circuit Breaker

A circuit breaker prevents cascading failures by failing fast when a downstream dependency is unhealthy.

from enum import Enum
import time

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(self, failure_threshold=5, success_threshold=3, timeout=30):
        self.failure_count = 0
        self.success_count = 0
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout = timeout
        self.state = CircuitState.CLOSED
        self.last_failure = None

    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if self.last_failure and time.time() - self.last_failure > self.timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                raise Exception("Circuit breaker is OPEN — request blocked")

        try:
            result = func(*args, **kwargs)
            self.on_success()
            return result
        except Exception as e:
            self.on_failure()
            raise

    def on_success(self):
        if self.state == CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
                self.success_count = 0
        else:
            self.failure_count = 0

    def on_failure(self):
        self.failure_count += 1
        self.last_failure = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

Retry with Exponential Backoff

Transient failures (network timeouts, connection resets) can be handled with retries, but retrying without backoff can cause a retry storm.

import time
import random

def retry_with_backoff(func, max_retries=3, base_delay=0.1):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

Timeouts

Every external call must have a timeout. Without one, a slow dependency can exhaust connection pools and cascade failures.

import asyncio

async def call_with_timeout(func, timeout_ms=500):
    try:
        return await asyncio.wait_for(func(), timeout=timeout_ms / 1000)
    except asyncio.TimeoutError:
        raise Exception("Request timed out")

Availability Numbers

Uptime % Downtime per year Example
99% 3.65 days Internal tools
99.9% 8.76 hours Many SaaS products
99.99% 52.56 minutes Enterprise systems
99.999% 5.26 minutes Critical infrastructure

System Design Process

Step 1: Requirements Clarification

Define the scope before drawing any boxes. Ask about:

  • Functional requirements: what features does the system need?
  • Non-functional requirements: what scale, latency, availability, and consistency?
  • Constraints: time, budget, team size, existing infrastructure

Step 2: Capacity Estimation

Estimate traffic, storage, bandwidth, and server count. Use back-of-the-envelope calculations to validate assumptions before designing.

Step 3: High-Level Design

Draw the system architecture: clients, load balancers, API gateways, application servers, databases, caches, message queues, and CDNs. Explain data flow end to end.

Step 4: Deep Dive

Focus on one or two components. Discuss database schema, caching strategy, sharding approach, and consistency tradeoffs. Justify every decision.

Step 5: Bottlenecks and Scale

Identify single points of failure. Discuss how the system handles 10x the current load. Address data replication, partition tolerance, and disaster recovery.

Step 6: Wrap Up

Summarize the design, restate tradeoffs, and suggest future improvements.

Common System Designs

URL Shortener

flowchart LR
    Client --> LB[Load Balancer]
    LB --> API[API Server]
    API --> Cache[Redis Cache]
    API --> DB[(Database)]
    API --> MQ[Message Queue]
    MQ --> Analytics[Analytics Worker]

Components:

  • Hash function (Base62 encoding of unique ID) generates short codes
  • Database stores mapping of short code to long URL
  • Cache serves popular URLs — 95%+ cache hit rate expected
  • 301 redirect for permanent URLs, 302 for analytics tracking
  • Analytics worker processes click data asynchronously

Capacity: 100M daily active users, 10M new URLs/day, 1B redirects/day. Database sharded by URL hash. Cache with 80% hit rate reduces DB load by 5x.

News Feed (Twitter-style)

flowchart LR
    Client --> LB[Load Balancer]
    LB --> API[API Server]
    API --> Cache[Feed Cache]
    API --> Graph[(Social Graph DB)]
    API --> Timeline[(Timeline DB)]
    API --> MQ[Fanout Queue]
    MQ --> Worker[Fanout Worker]

Two approaches:

Pull — Generate feed on read. When a user opens the app, fetch recent posts from followed users and merge. Works well for users who follow few accounts or check infrequently.

Push — Pre-compute feed on write. When a user posts, fan out the post to all followers’ timelines. Works well for users with few followers but fan-out to celebrities with millions of followers requires a hybrid approach.

Fanout service:

class FanoutService:
    def fanout_to_followers(self, post, author_id, followers):
        for follower in followers:
            timeline_key = f"timeline:{follower}"
            self.cache.lpush(timeline_key, post)
            self.cache.ltrim(timeline_key, 0, 500)

Cache performs better when feeds are pre-computed (fanout on write), but for users with millions of followers, fanout on read avoids the thundering herd problem of writing to millions of timeline caches simultaneously.

AI Infrastructure Design

LLM serving, embedding pipelines, and RAG architectures are now common system design topics. Key considerations:

  • GPU resource management — batch inference, model parallelism, quantization
  • Vector databases — embedding storage and similarity search (Pinecone, Weaviate, Milvus)
  • RAG pipelines — indexing, chunking, retrieval, and augmentation
  • Cost bounding — model inference often dwarfs other infrastructure costs

An architecture for an LLM-powered chatbot at scale:

flowchart LR
    Client --> LB[Load Balancer]
    LB --> Router[Request Router]
    Router --> Cache[Response Cache]
    Router --> RAG[RAG Pipeline]
    RAG --> VectorDB[(Vector DB)]
    RAG --> LLM[Inference Cluster]
    LLM --> GPU[GPU Workers]

Object Storage as the Database

Systems like Amazon Aurora, Snowflake, and WarpStream use object storage (S3, Azure Blob) as a core architectural component. Compute is separated from storage, enabling independent scaling of each layer.

Decomposing Databases

Databases are disaggregating into separate engines for parsing, optimization, execution, and storage. This decomposition allows each component to scale independently and be optimized for specific workloads.

Performance Optimization Checklist

  • Cache frequently accessed data at multiple layers (CDN, app, database)
  • Use read replicas for read-heavy workloads
  • Add connection pooling to manage database connections
  • Set timeouts on every external call (start with 500ms)
  • Implement circuit breakers for downstream dependencies
  • Use exponential backoff with jitter for retries
  • Optimize slow queries with proper indexing
  • Use async processing for non-critical operations
  • Compress responses (gzip/brotli)
  • Monitor latency distributions, not just averages

Resources

Comments

👍 Was this article helpful?