Understanding System Design: Scalable Applications

Introduction

System design is the process of defining architecture, components, and data flow for a system. Whether you are building a startup MVP or an enterprise platform serving millions, understanding system design principles helps you create scalable, maintainable applications.

In 2026, system design spans both mature cloud-native practices and AI-native workloads. LLM serving, RAG pipelines, and autonomous agents now run directly in the request path, so latency, cost, and failure handling for these components have to be designed in from the start.

This guide covers the fundamentals — from scalability basics and capacity estimation to real-world system design examples.

Scalability Fundamentals

Horizontal vs Vertical Scaling

Vertical scaling (scaling up) adds more resources to an existing machine: more CPU cores, RAM, or faster storage. It is simple to implement but has hardware limits and creates a single point of failure.

Horizontal scaling (scaling out) adds more machines to distribute the load. It is more complex but provides better fault tolerance and is the preferred approach for large systems.

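class ScalingStrategy:
    """Toy capacity model; the units and multipliers are illustrative only."""
    def __init__(self, current_capacity):
        self.capacity = current_capacity

    def vertical(self, additional_ram_gb):
        # Scaling up: a bigger machine adds capacity in fixed increments
        # (assumed here to be 1,000 units per GB of RAM).
        self.capacity += additional_ram_gb * 1000
        return self.capacity

    def horizontal(self, node_count):
        # Scaling out: capacity grows with the number of identical nodes.
        self.capacity *= node_count
        return self.capacity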

Back-of-the-Envelope Estimation

Capacity estimation separates good designs from great ones. Before building, estimate whether the architecture can handle the expected load.

Latency numbers every developer should know (approximate, order-of-magnitude figures; exact values vary by hardware and network):

Operation	Time
L1 cache reference	0.5 ns
Branch mispredict	3 ns
L2 cache reference	7 ns
Mutex lock/unlock	25 ns
Main memory reference	100 ns
SSD random read	16 μs
Network round trip (same DC)	500 μs
Disk seek	4 ms
S3 read (small object)	10-50 ms
Network round trip (cross-region)	50-100 ms

Memory and storage sizes:

Object	Size
UUID (text form)	36 bytes (16 bytes as binary)
1 text character (ASCII)	1 byte (1-4 bytes in UTF-8)
Typical HTML page	50 KB
Typical image	200 KB - 1 MB
Typical video (per minute)	10 MB - 50 MB
Typical database row	1 KB

Capacity estimation example — design a URL shortener:

def estimate_url_shortener():
    daily_active_users = 100_000_000
    daily_writes = daily_active_users * 0.1     # 10M writes/day
    daily_reads = daily_writes * 100               # 1B reads/day

    write_qps = daily_writes / 86400               # ~116 QPS
    read_qps = daily_reads / 86400                 # ~11,574 QPS
    peak_qps = read_qps * 2                        # ~23,148 QPS

    storage_per_url = 500                          # bytes
    yearly_storage = daily_writes * 365 * storage_per_url  # ~1.8 TB

    return {
        "write_qps": write_qps,
        "read_qps": read_qps,
        "peak_qps": peak_qps,
        "yearly_storage_gb": yearly_storage / 1e9
    }

estimates = estimate_url_shortener()
print(estimates)

Use these estimates to decide on server count, cache sizes, and database sharding strategy. Add 30% headroom for traffic spikes and maintenance.
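
As a rough sketch of how those numbers turn into a server count, the helper below divides peak QPS by a per-server capacity and applies the 30% headroom. The function name and the figure of 2,000 QPS per server are illustrative assumptions, not benchmarks; measure your own servers before relying on a number like this.

```python
import math

def servers_needed(peak_qps, qps_per_server, headroom=0.30):
    """Return the number of servers needed to serve peak_qps with spare capacity."""
    # Inflate the target load by the headroom fraction, then round up,
    # because a fraction of a server cannot be provisioned.
    target_qps = peak_qps * (1 + headroom)
    return math.ceil(target_qps / qps_per_server)

# Peak read load from the URL shortener estimate; 2,000 QPS per server is assumed.
print(servers_needed(peak_qps=23_148, qps_per_server=2_000))
```

The same shape of calculation works for cache memory or shard count: take the estimated demand, add headroom, and divide by the capacity of one unit.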

Load Balancing

Load balancers distribute incoming traffic across multiple servers to prevent any single server from becoming a bottleneck.

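import random
from collections import defaultdict

class LoadBalancer:
    def __init__(self, servers):
        self.servers = servers
        self.current = 0
        self.connections = defaultdict(int)  # active connections per server

    def round_robin(self):
        # Cycle through servers in order, wrapping at the end.
        server = self.servers[self.current]
        self.current = (self.current + 1) % len(self.servers)
        return server

    def least_connections(self):
        # Pick the server with the fewest active connections and count the new one.
        server = min(self.servers, key=lambda s: self.connections[s])
        self.connections[server] += 1
        return server

    def release(self, server):
        # Call when a request finishes so the count reflects active connections.
        self.connections[server] = max(0, self.connections[server] - 1)

    def weighted(self, weights):
        # Weighted random choice: each server is picked in proportion to its weight.
        servers = list(weights)
        return random.choices(servers, weights=[weights[s] for s in servers], k=1)[0]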

Load Balancing Algorithms

Algorithm	Distribution	Use Case
Round Robin	Sequential	Equal-capacity servers
Least Connections	Fewest active requests	Variable request duration
IP Hash	Same client to same server	Session persistence
Weighted Round Robin	Performance-based distribution	Heterogeneous servers
Random	Randomized selection	Large server pools

Health checks are critical — load balancers should remove unresponsive servers from rotation and reintroduce them only after they pass consecutive health checks.
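
The health check rule above can be sketched as a small state tracker. This is a minimal illustration, not how any particular load balancer implements it: the class name and the two thresholds (`unhealthy_after`, `healthy_after`) are assumptions, and the probe itself (an HTTP or TCP check) is left to the caller, who reports each result through `record`.

```python
class HealthTracker:
    """Tracks consecutive health check results per server."""

    def __init__(self, servers, unhealthy_after=3, healthy_after=2):
        self.unhealthy_after = unhealthy_after  # failures before removal (assumed)
        self.healthy_after = healthy_after      # passes before reintroduction (assumed)
        self.in_rotation = {s: True for s in servers}
        # Consecutive results that disagree with the server's current state.
        self.streak = {s: 0 for s in servers}

    def record(self, server, passed):
        # A result that matches the current state resets the streak.
        if passed == self.in_rotation[server]:
            self.streak[server] = 0
            return
        self.streak[server] += 1
        threshold = self.healthy_after if passed else self.unhealthy_after
        if self.streak[server] >= threshold:
            # Enough consecutive contrary results: flip the server's state.
            self.in_rotation[server] = passed
            self.streak[server] = 0

    def healthy_servers(self):
        return [s for s, ok in self.in_rotation.items() if ok]
```

Requiring several consecutive results in both directions keeps a single dropped probe from removing a healthy server, and keeps a flapping server from bouncing in and out of rotation.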

Caching

Caching stores frequently accessed data in high-speed storage to reduce latency and decrease load on primary data sources.

import json
import redis

class CacheAside:
    """Application manages the cache: check it first, fetch from the database on a miss."""
    def __init__(self, redis_client, db_client):
        self.cache = redis_client   # e.g. a redis.Redis instance
        self.db = db_client         # assumed to expose query(sql, *params)

    def get_user(self, user_id):
        cache_key = f"user:{user_id}"
        cached = self.cache.get(cache_key)
        if cached is not None:
            return json.loads(cached)

        # Cache miss: read from the database and populate the cache for one hour.
        user = self.db.query("SELECT * FROM users WHERE id = ?", user_id)
        if user:
            self.cache.setex(cache_key, 3600, json.dumps(user))
        return user

    def invalidate(self, user_id):
        # Call after updating the user so the next read fetches fresh data.
        self.cache.delete(f"user:{user_id}")

Cache Strategies

Strategy	Behavior	Use Case
Cache-aside	App checks cache, writes on miss	Read-heavy workloads
Write-through	Write to cache and DB synchronously	Read-write balance
Write-behind	Write to cache, async DB write	High write throughput
Read-through	Cache fetches from DB on miss	Simplifies app logic
TTL	Time-based automatic eviction	Stale data is acceptable
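
To make the difference between write-through and write-behind concrete, here is a minimal in-memory sketch. Both class names are illustrative, and `store` is assumed to be any dict-like backing store standing in for the database; a real write-behind cache would flush from a background worker and handle failed writes.

```python
class WriteThroughCache:
    """Every write goes to the backing store and the cache before returning."""

    def __init__(self, store):
        self.cache = {}
        self.store = store  # dict-like stand-in for the database (assumed)

    def write(self, key, value):
        # Persist first so the cache never holds data the store has not accepted.
        self.store[key] = value
        self.cache[key] = value

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.store.get(key)
        if value is not None:
            self.cache[key] = value
        return value


class WriteBehindCache:
    """Writes land in the cache immediately and reach the store on flush()."""

    def __init__(self, store):
        self.cache = {}
        self.store = store
        self.dirty = set()  # keys written to the cache but not yet persisted

    def write(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)

    def flush(self):
        # In production this would run asynchronously, often in batches.
        for key in self.dirty:
            self.store[key] = self.cache[key]
        self.dirty.clear()
```

Write-behind trades durability for throughput: anything still in `dirty` is lost if the cache node fails before the next flush.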

Cache eviction policies:

LRU (Least Recently Used) — evicts items not accessed longest
LFU (Least Frequently Used) — evicts items accessed least often
FIFO (First In, First Out) — evicts oldest items regardless of access
TTL (Time To Live) — evicts items after a fixed duration
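
As an illustration of the LRU policy, the sketch below uses `collections.OrderedDict` to keep keys in access order. It is a teaching example with an assumed class name; for memoizing function calls, Python's built-in `functools.lru_cache` is usually the better choice.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # least recently used first, most recent last

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry
```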

When caching helps:

Same data read repeatedly (user profiles, product catalogs)
Computationally expensive results (aggregated analytics)
API responses that change infrequently

When caching hurts:

Rapidly changing data (stock prices, live scores)
Data that requires strong consistency
Cache stampedes — many clients requesting the same expired key simultaneously
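
One common mitigation for cache stampedes is to let a single caller recompute a missing key while the others wait for its result. The sketch below shows the idea for threads within one process; the class name and the `loader` callback are assumptions, TTL handling is omitted, and a multi-server deployment would need a distributed lock or request coalescing at the cache tier instead.

```python
import threading

class SingleFlightCache:
    """In-process cache where only one thread recomputes a missing key."""

    def __init__(self):
        self.values = {}
        self.locks = {}
        self.guard = threading.Lock()  # protects the locks dict

    def get(self, key, loader):
        if key in self.values:
            return self.values[key]
        with self.guard:
            lock = self.locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have filled the key while we waited.
            if key not in self.values:
                self.values[key] = loader(key)
            return self.values[key]
```

With this in place, eight concurrent requests for the same missing key result in one call to `loader` rather than eight.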

Content Delivery Network (CDN)

A CDN is a globally distributed network of proxy servers that caches static (and sometimes dynamic) content at edge locations close to users. CDNs reduce latency, offload origin servers, and absorb traffic spikes.

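class CDNEdge:
    def __init__(self, origin_url, edge_locations=()):
        self.origin = origin_url
        self.edge_locations = list(edge_locations)  # region names to warm in prefetch
        self.cache = {}

    def fetch_from_origin(self, path):
        # Placeholder: a real edge would issue an HTTP request to the origin here.
        return f"content of {self.origin}{path}"

    def serve(self, path, user_region):
        # Serve from the regional edge cache, falling back to the origin on a miss.
        edge_key = f"{path}:{user_region}"
        if edge_key in self.cache:
            return self.cache[edge_key]

        content = self.fetch_from_origin(path)
        self.cache[edge_key] = content
        return content

    def prefetch(self, paths):
        # Warm every edge location before traffic arrives.
        for path in paths:
            content = self.fetch_from_origin(path)
            for region in self.edge_locations:
                self.cache[f"{path}:{region}"] = content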

CDNs are essential for global-scale applications. Cloudflare, AWS CloudFront, and Fastly provide edge caching, DDoS protection, and origin shielding.

Rate Limiting

Rate limiting controls how many requests a client can make within a time window. It prevents abuse, ensures fair usage, and protects backend services from traffic spikes.

import time
from collections import defaultdict

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity          # maximum burst size
        self.tokens = capacity
        self.refill_rate = refill_rate    # tokens added per second
        # monotonic() is unaffected by system clock adjustments.
        self.last_refill = time.monotonic()

    def allow_request(self):
        # Credit tokens for the time elapsed since the last call, capped at capacity.
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

class SlidingWindow:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.requests = defaultdict(list)

    def allow(self, client_id):
        now = time.time()
        window_start = now - self.window

        self.requests[client_id] = [
            t for t in self.requests[client_id] if t > window_start
        ]

        if len(self.requests[client_id]) < self.limit:
            self.requests[client_id].append(now)
            return True
        return False

Rate limiting algorithms:

Algorithm	Behavior	Trade-offs
Token bucket	Tokens refill at a fixed rate	Allows bursts up to capacity
Leaky bucket	Requests processed at fixed rate	Smooths traffic, no bursts
Fixed window	Count per clock-aligned window	Simple, but allows bursts of up to 2x the limit across a window boundary
Sliding window	Count per rolling time window	More accurate, more memory

Rate limiting is typically implemented at the API gateway or load balancer level.
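
The boundary problem with fixed windows is easier to see in code. The following sketch is an assumed minimal implementation with an injectable `clock` parameter so the behavior can be demonstrated deterministically; counters for old windows are never cleaned up here, which a real limiter would need to do.

```python
import time
from collections import defaultdict

class FixedWindow:
    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so the example can be driven by a fake clock
        self.counts = defaultdict(int)

    def allow(self, client_id):
        # All timestamps in the same clock-aligned window share one counter.
        window_id = int(self.clock() // self.window)
        key = (client_id, window_id)
        if self.counts[key] < self.limit:
            self.counts[key] += 1
            return True
        return False

# Demonstration with a fake clock: limit of 2 requests per 60-second window.
now = [59.0]
limiter = FixedWindow(limit=2, window_seconds=60, clock=lambda: now[0])
burst = [limiter.allow("client"), limiter.allow("client")]   # end of window 0
now[0] = 60.0
burst += [limiter.allow("client"), limiter.allow("client")]  # start of window 1
print(burst)  # four requests accepted within about one second
```

A sliding window would reject the third and fourth requests, because all four fall inside the same rolling 60-second span.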

Database Design

SQL vs NoSQL

Dimension	SQL	NoSQL
Schema	Fixed, predefined	Flexible, dynamic
Consistency	Strong (ACID)	Varies (eventual to strong)
Queries	Complex joins, aggregations	Simple key-based lookups
Scaling	Primarily vertical	Primarily horizontal (sharding)
Best for	Transactions, relationships	High throughput, flexible data

When to Use Each

Choose SQL (PostgreSQL, MySQL) for transactional data, complex relationships, and ACID requirements. Financial systems, inventory management, and user accounts are SQL domains.

Choose NoSQL (Cassandra, DynamoDB, MongoDB) for high write throughput, flexible schemas, and horizontal scaling. Time-series data, session stores, and product catalogs are NoSQL domains.

Database Patterns

Read replicas distribute read traffic across copies of the primary database. All writes go to the primary; reads can go to any replica. This pattern improves read throughput but introduces replication lag.

import random

class ReadReplicaManager:
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def write(self, query, params):
        # All writes go to the primary.
        return self.primary.execute(query, params)

    def read(self, query, params):
        # Reads are spread across replicas; results may lag behind the primary.
        replica = random.choice(self.replicas)
        return replica.execute(query, params)

Sharding splits data across multiple database instances. Each shard holds a subset of data. The shard key must be chosen carefully to avoid hot spots.

import hashlib

class ShardManager:
    def __init__(self, shards):
        self.shards = shards

    def get_shard(self, key):
        shard_id = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(self.shards)
        return self.shards[shard_id]

    def write(self, key, data):
        shard = self.get_shard(key)
        shard.write(key, data)

Sharding strategies:

Range-based: Split by key range (users A-M on shard 1, N-Z on shard 2). Simple but can cause hot spots.
Hash-based: Apply hash function to shard key. Better distribution but breaks range queries.
Directory-based: Lookup table maps keys to shards. Most flexible but adds a lookup hop.
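
Since the ShardManager above covers the hash-based strategy, here is a minimal sketch of range-based routing for comparison. The class name and the boundary layout are assumptions, and keys are assumed to be lowercase strings so that the A-M / N-Z split from the list above maps to a single boundary at "n".

```python
import bisect

class RangeShardRouter:
    def __init__(self, boundaries, shards):
        # boundaries must be sorted; boundaries[i] is the first key on shards[i + 1].
        if len(shards) != len(boundaries) + 1:
            raise ValueError("need exactly one more shard than boundaries")
        self.boundaries = boundaries
        self.shards = shards

    def get_shard(self, key):
        # Binary search for the range that contains the key.
        return self.shards[bisect.bisect_right(self.boundaries, key)]

router = RangeShardRouter(boundaries=["n"], shards=["shard-1", "shard-2"])
print(router.get_shard("alice"), router.get_shard("zed"))
```

Because neighbouring keys stay on the same shard, range scans remain cheap, which is exactly what hash-based sharding gives up. The cost is that a skewed key distribution concentrates load on one shard.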

Connection pooling reuses database connections instead of opening a new one per request. It reduces connection overhead and prevents database overload.

from queue import Queue
import psycopg2

class ConnectionPool:
    def __init__(self, conn_string, pool_size=10):
        self.pool = Queue(maxsize=pool_size)
        for _ in range(pool_size):
            conn = psycopg2.connect(conn_string)
            self.pool.put(conn)

    def acquire(self):
        return self.pool.get()

    def release(self, conn):
        self.pool.put(conn)

Consistent Hashing

Consistent hashing distributes keys across nodes with minimal redistribution when nodes join or leave. Each node is assigned positions on a hash ring; each key is assigned to the nearest node clockwise.

import hashlib
import bisect

class ConsistentHashRing:
    def __init__(self, virtual_nodes=150):
        self.ring = {}
        self.sorted_keys = []
        self.virtual_nodes = virtual_nodes

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node_id):
        for i in range(self.virtual_nodes):
            vnode = f"{node_id}:{i}"
            h = self._hash(vnode)
            self.ring[h] = node_id
        self.sorted_keys = sorted(self.ring.keys())

    def remove_node(self, node_id):
        for i in range(self.virtual_nodes):
            vnode = f"{node_id}:{i}"
            h = self._hash(vnode)
            self.ring.pop(h, None)
        self.sorted_keys = sorted(self.ring.keys())

    def get_node(self, key):
        if not self.ring:
            return None
        h = self._hash(key)
        idx = bisect.bisect_right(self.sorted_keys, h)
        if idx == len(self.sorted_keys):
            idx = 0
        return self.ring[self.sorted_keys[idx]]

Consistent hashing is used by Amazon DynamoDB, Cassandra, and Discord for data partitioning. Virtual nodes help balance load when physical nodes have different capacities.
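
To see what "minimal redistribution" means in practice, the sketch below compares how many keys change owner when a fifth node joins, under plain modulo hashing versus a hash ring. It is a self-contained experiment with assumed helper names, not a production ring. In theory modulo hashing moves about 80% of keys when going from 4 to 5 nodes, while the ring moves about 20%, since only the new node's share changes hands.

```python
import bisect
import hashlib

def h(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def modulo_owner(key, nodes):
    return nodes[h(key) % len(nodes)]

def build_ring(nodes, vnodes=150):
    # Each node is placed on the ring vnodes times; sort by ring position.
    ring = sorted((h(f"{n}:{i}"), n) for n in nodes for i in range(vnodes))
    positions = [pos for pos, _ in ring]
    owners = [node for _, node in ring]
    return positions, owners

def ring_owner(key, ring):
    positions, owners = ring
    # First ring position clockwise from the key's hash, wrapping around.
    idx = bisect.bisect_right(positions, h(key)) % len(positions)
    return owners[idx]

def moved_fraction(keys, before, after):
    return sum(1 for k in keys if before(k) != after(k)) / len(keys)

keys = [f"user:{i}" for i in range(10_000)]
old_nodes = ["a", "b", "c", "d"]
new_nodes = old_nodes + ["e"]

modulo_moved = moved_fraction(
    keys,
    lambda k: modulo_owner(k, old_nodes),
    lambda k: modulo_owner(k, new_nodes),
)
old_ring, new_ring = build_ring(old_nodes), build_ring(new_nodes)
ring_moved = moved_fraction(
    keys,
    lambda k: ring_owner(k, old_ring),
    lambda k: ring_owner(k, new_ring),
)
print(f"modulo: {modulo_moved:.0%} moved, ring: {ring_moved:.0%} moved")
```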

Message Queues and Async Processing

Asynchronous processing decouples components and smooths traffic spikes. Instead of waiting for a task to complete, the system places the task into a queue and workers process it independently.

class MessageQueue:
    def __init__(self):
        self.queues = {}
        self.dead_letter = []

    def publish(self, topic, message):
        self.queues.setdefault(topic, []).append(message)

    def subscribe(self, topic, worker):
        # Take the pending batch and reset the queue first; removing items from
        # a list while iterating over it would skip messages.
        messages = self.queues.get(topic, [])
        self.queues[topic] = []
        for msg in messages:
            try:
                worker.process(msg)
            except Exception as e:
                # Failed messages go to the dead-letter list for later inspection.
                self.dead_letter.append({"topic": topic, "message": msg, "error": str(e)})

When to use async processing:

Email or push notification delivery
Image/video transcoding
Report generation
Payment processing confirmation
Data synchronization between services

Message queue vs pub/sub:

Pattern	Delivery	Use Case
Message queue	One consumer per message	Task distribution
Pub/Sub	All subscribers receive	Event broadcasting

Common tools: RabbitMQ, Apache Kafka, Amazon SQS, Google Pub/Sub, Redis Streams.
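
The delivery difference between the two patterns can be sketched in a few lines. These in-memory classes are illustrative only (the names are assumptions) and ignore persistence, acknowledgements, and ordering guarantees that real brokers provide.

```python
from collections import defaultdict, deque

class WorkQueue:
    """Message queue semantics: each message goes to exactly one consumer."""

    def __init__(self):
        self.messages = deque()

    def publish(self, message):
        self.messages.append(message)

    def consume(self):
        # popleft removes the message, so no other consumer can receive it.
        return self.messages.popleft() if self.messages else None


class PubSub:
    """Pub/sub semantics: every subscriber of a topic receives every message."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self.subscribers[topic]:
            callback(message)
```

Use the queue shape when work must be done once (send this email), and the pub/sub shape when several independent services each need to react to the same event (order created).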

API Design Patterns

REST

REST uses standard HTTP methods (GET, POST, PUT, DELETE) and stateless communication. It is simple, cacheable, and widely supported.

from flask import Flask, jsonify, request

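app = Flask(__name__)
# `db` below is assumed to be a database handle configured elsewhere;
# the `?` placeholder style depends on the driver in use.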

@app.route("/api/users/<user_id>", methods=["GET"])
def get_user(user_id):
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    if not user:
        return jsonify({"error": "not found"}), 404
    return jsonify(user)

@app.route("/api/users", methods=["POST"])
def create_user():
    data = request.get_json()
    user_id = db.execute("INSERT INTO users (name, email) VALUES (?, ?)",
                         data["name"], data["email"])
    return jsonify({"id": user_id}), 201

GraphQL

GraphQL lets clients request exactly the data they need. It reduces over-fetching and under-fetching but adds query complexity.

schema = """
type Query {
    user(id: ID!): User
    users(limit: Int): [User]
}
type User {
    id: ID!
    name: String
    orders: [Order]
}
"""

gRPC

gRPC uses Protocol Buffers for efficient binary serialization. It supports unary calls as well as client, server, and bidirectional streaming, which makes it a good fit for high-throughput internal services.

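syntax = "proto3";

service UserService {
    rpc GetUser (GetUserRequest) returns (User);
    rpc ListUsers (ListUsersRequest) returns (stream User);
}

message GetUserRequest {
    string user_id = 1;
}

// Illustrative definitions for the message types referenced above.
message ListUsersRequest { int32 limit = 1; }
message User { string user_id = 1; string name = 2; }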

When to use each:

Protocol	Strengths	Weaknesses	Best For
REST	Simple, cacheable, ubiquitous	Over-fetching, no real-time	Public APIs
GraphQL	Flexible queries, strong typing	Complex caching, N+1 problem	Complex UIs
gRPC	Fast, streaming, typed	Binary encoding, less browser support	Internal services

Microservices Communication

Synchronous vs Asynchronous

Synchronous (REST/gRPC) — services call each other directly. Simple to implement but creates tight coupling and cascading failure risk.

Asynchronous (message queues/events) — services communicate through events. Loose coupling enables independent scaling and better fault isolation.

import asyncio

class OrderService:
    def __init__(self, event_bus, payment_service, inventory_service):
        self.events = event_bus
        self.payments = payment_service
        self.inventory = inventory_service

    async def place_order(self, order_data):
        # save_order is assumed to persist the order and return it with an id.
        order = await self.save_order(order_data)
        # Publish the event and return immediately; the payment and inventory
        # services react to it asynchronously.
        self.events.publish("order.created", {
            "order_id": order.id,
            "user_id": order.user_id,
            "total": order.total
        })
        return order

Service Discovery

In dynamic environments where services scale up and down, service discovery lets services find each other without hardcoded addresses.

# Kubernetes service definition
apiVersion: v1
kind: Service
metadata:
  name: user-service
spec:
  selector:
    app: user-service
  ports:
  - port: 80
    targetPort: 8080

Tools: Kubernetes DNS, Consul, ZooKeeper, Netflix Eureka.

Reliability Patterns

Circuit Breaker

A circuit breaker prevents cascading failures by failing fast when a downstream dependency is unhealthy.

from enum import Enum
import time

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(self, failure_threshold=5, success_threshold=3, timeout=30):
        self.failure_count = 0
        self.success_count = 0
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout = timeout
        self.state = CircuitState.CLOSED
        self.last_failure = None

    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if self.last_failure and time.time() - self.last_failure > self.timeout:
                # Timeout elapsed: let trial requests through.
                self.state = CircuitState.HALF_OPEN
                self.success_count = 0
            else:
                raise Exception("Circuit breaker is OPEN: request blocked")

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.on_failure()
            raise
        self.on_success()
        return result

    def on_success(self):
        if self.state == CircuitState.HALF_OPEN:
            # Close the circuit only after enough consecutive trial successes.
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
                self.success_count = 0
        else:
            self.failure_count = 0

    def on_failure(self):
        self.failure_count += 1
        self.last_failure = time.time()
        # Any failure during a trial reopens the circuit immediately.
        if (self.state == CircuitState.HALF_OPEN
                or self.failure_count >= self.failure_threshold):
            self.state = CircuitState.OPEN

Retry with Exponential Backoff

Transient failures (network timeouts, connection resets) can be handled with retries, but retrying without backoff can cause a retry storm.

import time
import random

def retry_with_backoff(func, max_retries=3, base_delay=0.1):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

Timeouts

Every external call must have a timeout. Without one, a slow dependency can exhaust connection pools and cascade failures.

import asyncio

async def call_with_timeout(func, timeout_ms=500):
    try:
        return await asyncio.wait_for(func(), timeout=timeout_ms / 1000)
    except asyncio.TimeoutError:
        raise Exception("Request timed out")

Availability Numbers

Uptime %	Downtime per year	Example
99%	3.65 days	Internal tools
99.9%	8.76 hours	Many SaaS products
99.99%	52.56 minutes	Enterprise systems
99.999%	5.26 minutes	Critical infrastructure
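
The downtime figures follow directly from the uptime percentage, and the same arithmetic shows why chained dependencies erode availability. The helper names below are illustrative, and the serial calculation assumes component failures are independent, which real systems only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability_pct):
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR

def serial_availability(*availabilities_pct):
    """Availability of a request path that needs every component to be up."""
    result = 1.0
    for pct in availabilities_pct:
        result *= pct / 100
    return result * 100

print(round(downtime_minutes_per_year(99.99), 2))          # minutes per year
print(round(serial_availability(99.9, 99.9, 99.9), 2))     # three 99.9% services in series
```

Three services at 99.9% each give roughly 99.7% end to end, which is about 26 hours of downtime per year instead of 8.76.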

System Design Process

Step 1: Requirements Clarification

Define the scope before drawing any boxes. Ask about:

Functional requirements: what features does the system need?
Non-functional requirements: what scale, latency, availability, and consistency?
Constraints: time, budget, team size, existing infrastructure

Step 2: Capacity Estimation

Estimate traffic, storage, bandwidth, and server count. Use back-of-the-envelope calculations to validate assumptions before designing.

Step 3: High-Level Design

Draw the system architecture: clients, load balancers, API gateways, application servers, databases, caches, message queues, and CDNs. Explain data flow end to end.

Step 4: Deep Dive

Focus on one or two components. Discuss database schema, caching strategy, sharding approach, and consistency tradeoffs. Justify every decision.

Step 5: Bottlenecks and Scale

Identify single points of failure. Discuss how the system handles 10x the current load. Address data replication, partition tolerance, and disaster recovery.

Step 6: Wrap Up

Summarize the design, restate tradeoffs, and suggest future improvements.

Common System Designs

URL Shortener

flowchart LR
    Client --> LB[Load Balancer]
    LB --> API[API Server]
    API --> Cache[Redis Cache]
    API --> DB[(Database)]
    API --> MQ[Message Queue]
    MQ --> Analytics[Analytics Worker]

Components:

Short-code generator: Base62 encoding of a unique numeric ID produces the short code
Database stores mapping of short code to long URL
Cache serves popular URLs — 95%+ cache hit rate expected
301 redirect for permanent URLs (browsers cache it, so repeat clicks may bypass the server); 302 when every click must reach the server for analytics tracking
Analytics worker processes click data asynchronously

Capacity: 100M daily active users, 10M new URLs/day, 1B redirects/day. Database sharded by URL hash. Even an 80% cache hit rate cuts database read load by 5x; at a 95% hit rate the reduction is 20x.
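
The Base62 step in the component list can be sketched as follows. The alphabet order (digits, then lowercase, then uppercase) is an arbitrary assumption; any fixed ordering of the 62 symbols works as long as encoding and decoding agree. The numeric ID is assumed to come from something like a database sequence or a distributed ID generator.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 symbols

def base62_encode(n):
    """Encode a non-negative integer ID as a Base62 string."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, remainder = divmod(n, 62)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))

def base62_decode(code):
    """Inverse of base62_encode."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n

print(base62_encode(123456789))
```

Seven characters cover 62^7 IDs, about 3.5 trillion codes, which is far more than the 3.65 billion URLs per year implied by 10M new URLs per day.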

News Feed (Twitter-style)

flowchart LR
    Client --> LB[Load Balancer]
    LB --> API[API Server]
    API --> Cache[Feed Cache]
    API --> Graph[(Social Graph DB)]
    API --> Timeline[(Timeline DB)]
    API --> MQ[Fanout Queue]
    MQ --> Worker[Fanout Worker]

Two approaches:

Pull: Generate the feed on read. When a user opens the app, fetch recent posts from followed users and merge them. This works well for users who follow few accounts or check infrequently.

Push: Pre-compute the feed on write. When a user posts, fan out the post to all followers' timelines. This works well for authors with few followers, but fanning out a celebrity's post to millions of followers is expensive, so large systems use a hybrid approach.

Fanout service:

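class FanoutService:
    def __init__(self, cache):
        self.cache = cache  # Redis-style client exposing lpush and ltrim

    def fanout_to_followers(self, post, author_id, followers):
        for follower in followers:
            timeline_key = f"timeline:{follower}"
            self.cache.lpush(timeline_key, post)
            # LTRIM bounds are inclusive: 0..499 keeps the newest 500 entries.
            self.cache.ltrim(timeline_key, 0, 499)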

Reads are cheapest when feeds are pre-computed (fanout on write). For users with millions of followers, however, fanout on read avoids the write amplification of updating millions of timeline caches for a single post.
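
A hybrid of the two approaches might look like the sketch below: ordinary authors are pushed to follower timelines at write time, while posts from high-follower authors are merged in at read time. Everything here is an in-memory illustration with assumed names; the `celebrity_threshold` default of 10,000 followers is an arbitrary placeholder, and timestamps are passed in explicitly to keep the example deterministic.

```python
from collections import defaultdict

class HybridFeed:
    def __init__(self, celebrity_threshold=10_000):
        self.celebrity_threshold = celebrity_threshold  # assumed cutoff
        self.followers = defaultdict(set)        # author -> users following them
        self.following = defaultdict(set)        # user -> authors they follow
        self.timelines = defaultdict(list)       # user -> pushed (ts, post_id) entries
        self.posts_by_author = defaultdict(list)

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) >= self.celebrity_threshold

    def post(self, author, post_id, ts):
        entry = (ts, post_id)
        self.posts_by_author[author].append(entry)
        # Fanout on write only for authors below the threshold.
        if not self.is_celebrity(author):
            for follower in self.followers[author]:
                self.timelines[follower].append(entry)

    def read_feed(self, user, limit=20):
        # Start from the pre-computed timeline, then pull celebrity posts.
        # A set removes duplicates if an author crossed the threshold over time.
        entries = set(self.timelines[user])
        for author in self.following[user]:
            if self.is_celebrity(author):
                entries.update(self.posts_by_author[author])
        newest_first = sorted(entries, reverse=True)
        return [post_id for _, post_id in newest_first[:limit]]
```

The write cost of a celebrity post stays constant regardless of follower count, at the price of a small merge on each read by their followers.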

Modern Trends (2025-2026)

AI Infrastructure Design

LLM serving, embedding pipelines, and RAG architectures are now common system design topics. Key considerations:

GPU resource management — batch inference, model parallelism, quantization
Vector databases — embedding storage and similarity search (Pinecone, Weaviate, Milvus)
RAG pipelines — indexing, chunking, retrieval, and augmentation
Cost bounding — model inference often dwarfs other infrastructure costs
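
The chunking, retrieval, and augmentation steps of a RAG pipeline can be sketched end to end with a toy example. The important caveat is that `embed` here is a bag-of-words stand-in, not a real embedding model, and the linear scan in `retrieve` stands in for a vector database query; all function names are assumptions made for illustration.

```python
import math
from collections import Counter

def chunk(text, size=8, overlap=2):
    """Split text into overlapping word windows (requires size > overlap)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text):
    # Stand-in for a real embedding model: a sparse bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=1):
    # Rank chunks by similarity to the query and keep the top k.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, chunks):
    # Augmentation: prepend the retrieved context to the user's question.
    context = "\n".join(retrieve(query, chunks))
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = ["redis is an in-memory cache", "kafka is a distributed log"]
print(build_prompt("what is kafka", docs))
```

In a production pipeline, indexing (chunk and embed) runs offline, while retrieval and prompt construction run in the request path, so their latency counts against the response budget.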

An architecture for an LLM-powered chatbot at scale:

flowchart LR
    Client --> LB[Load Balancer]
    LB --> Router[Request Router]
    Router --> Cache[Response Cache]
    Router --> RAG[RAG Pipeline]
    RAG --> VectorDB[(Vector DB)]
    RAG --> LLM[Inference Cluster]
    LLM --> GPU[GPU Workers]

Object Storage as the Database

Systems like Snowflake and WarpStream use object storage (S3, Azure Blob) as a core architectural component, and Amazon Aurora applies the same idea with a purpose-built distributed storage layer. In each case compute is separated from storage, enabling independent scaling of each layer.

Decomposing Databases

Databases are disaggregating into separate engines for parsing, optimization, execution, and storage. This decomposition allows each component to scale independently and be optimized for specific workloads.

Performance Optimization Checklist

Cache frequently accessed data at multiple layers (CDN, app, database)
Use read replicas for read-heavy workloads
Add connection pooling to manage database connections
Set timeouts on every external call (start with 500ms)
Implement circuit breakers for downstream dependencies
Use exponential backoff with jitter for retries
Optimize slow queries with proper indexing
Use async processing for non-critical operations
Compress responses (gzip/brotli)
Monitor latency distributions, not just averages
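
The last item is worth a short illustration, since averages hide tail latency. The sketch below uses a nearest-rank percentile on a made-up sample in which 2% of requests are slow; the helper name and the sample data are assumptions chosen for the example.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# 98 fast requests at 20 ms and 2 slow ones at 2,000 ms (synthetic data).
latencies_ms = [20] * 98 + [2000] * 2
mean_ms = sum(latencies_ms) / len(latencies_ms)

print(f"mean={mean_ms:.1f} ms")
print(f"p50={percentile(latencies_ms, 50)} ms, p99={percentile(latencies_ms, 99)} ms")
```

The mean of about 60 ms describes no actual request: the median user sees 20 ms while the p99 user waits 2 seconds, and only the percentile view reveals that.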
