
System Design Interview Guide: Complete Patterns and Approaches

Published: February 26, 2026 Updated: May 24, 2026 Larry Qu 14 min read

System design questions are the single most important factor in leveling and compensation at big tech companies. The system design round is where offers are won or lost — coding rounds verify you can write correct code, but system design tests whether you can think like an engineer who builds real products at scale.

In 2026, the landscape has shifted dramatically. GenAI system design has emerged as a standalone interview category. Companies now evaluate cost-aware architecture, observability, and operational maturity alongside traditional scalability. In-person rounds rose from 24% in 2022 to 38% in 2025, driven by AI cheating concerns. This guide covers the fundamental patterns, company-specific formats, and updated expectations for today’s interviews.

What is System Design?

System design involves making architectural decisions about how software systems should be built. It encompasses:

  • Functional requirements: What the system should do
  • Non-functional requirements: Performance, scalability, reliability, cost
  • Technical constraints: Budget, timeline, existing infrastructure, team size

The interviewer evaluates four core dimensions: problem navigation (breaking a vague prompt into manageable pieces), solution design (applying caching, sharding, load balancing, and queues coherently), trade-off reasoning (explicitly stating alternatives and defending choices), and communication (thinking out loud, treating the interviewer as a design partner).

Common System Design Concepts

CAP Theorem

The CAP Theorem states that a distributed system can provide only two of three guarantees — consistency, availability, and partition tolerance — simultaneously. Network partitions are unavoidable, so the real choice is between CP (consistency over availability) and AP (availability over consistency).

graph TD
    subgraph "CAP Theorem"
        C[Consistency<br/>All nodes see same data]
        A[Availability<br/>Every request gets a response]
        P[Partition Tolerance<br/>System works despite network failures]
        C -- CP Systems --> P
        A -- AP Systems --> P
    end

CP systems (HBase, MongoDB with write concern majority) prioritize consistency. During a partition, they may reject writes to ensure all nodes agree. AP systems (Cassandra, DynamoDB) prioritize availability. During a partition, they accept writes and resolve conflicts later via last-write-wins or CRDTs.
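
The last-write-wins resolution mentioned above can be sketched in a few lines. This is a minimal illustration, not how any particular database implements it: the `Versioned` record, the `lww_merge` helper, and the timestamps are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    """A value tagged with its write timestamp and the writing node's id."""
    value: str
    timestamp: float
    node_id: str


def lww_merge(a: Versioned, b: Versioned) -> Versioned:
    """Keep the write with the newer timestamp; break ties by node id."""
    return max(a, b, key=lambda v: (v.timestamp, v.node_id))


# Two replicas accepted conflicting writes while partitioned.
replica_a = Versioned("blue", timestamp=100.0, node_id="a")
replica_b = Versioned("green", timestamp=105.0, node_id="b")
winner = lww_merge(replica_a, replica_b)
print(winner.value)  # green
```

Note the trade-off worth stating in an interview: the losing write is silently discarded, and the outcome depends on how well the nodes' clocks agree.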

ACID vs BASE

| ACID | BASE |
| --- | --- |
| Atomicity | Basically Available |
| Consistency | Soft state |
| Isolation | Eventual consistency |
| Durability | |

Choose ACID (PostgreSQL, MySQL) for financial transactions, inventory management, and systems where correctness is non-negotiable. Choose BASE (Cassandra, DynamoDB) for high-throughput systems where availability matters more than immediate consistency.

Consistency Patterns

  • Strong consistency: All reads return the latest write. Required for financial ledgers and booking systems. Cost: higher latency during replication.
  • Eventual consistency: Reads may return stale data but will converge over time. Used in DNS, social feeds, and content delivery. Cost: complex conflict resolution.
  • Causal consistency: Related events appear in the correct order. Used in collaborative editing and social media comments.
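
One common way to track the ordering that causal consistency needs is a vector clock. The article does not name a mechanism, so treat this as an assumed technique; `compare` and the clock contents are hypothetical. Each clock maps a node name to the number of events that node has seen.

```python
def compare(a: dict, b: dict) -> str:
    """Compare two vector clocks and report their causal relationship."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happened before b
    if b_le_a:
        return "after"       # a happened after b
    return "concurrent"      # neither caused the other


post = {"alice": 1}
reply = {"alice": 1, "bob": 1}   # bob replied after seeing alice's post
unrelated = {"carol": 1}         # carol wrote without seeing either

print(compare(post, reply))       # before
print(compare(reply, unrelated))  # concurrent
```

A comment thread must show `post` before `reply`, while `unrelated` can appear in any position relative to them.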

The 2026 Interview Landscape

GenAI System Design: The New Standalone Category

AI and LLM-related interview questions have tripled since 2023. Companies now ask:

  • “Design a retrieval-augmented chatbot for enterprise search”
  • “Design an AI coding assistant”
  • “Design an LLM-powered document search system”
  • “Design a model serving platform for 10k requests/second”

These questions test RAG pipeline design, model routing (using cheaper models for simple queries), prompt architecture, token cost optimization, and safety concerns like prompt injection prevention.

Seven Concepts Gaining Prominence

  1. Cost-aware architecture — Discuss budget implications of every design decision, not just scalability
  2. Observability — SLIs/SLOs, distributed tracing (OpenTelemetry), and alerting runbooks are expected
  3. Security and privacy by design — Authentication, authorization, PII segregation, GDPR compliance
  4. Event-driven architecture — Event sourcing, CQRS, and asynchronous communication patterns
  5. Multi-region active-active — Geo-distributed systems with failover and conflict resolution
  6. Resilience patterns — Circuit breakers, bulkheads, retries with backoff, chaos engineering
  7. Vector databases — Embedding storage and similarity search for RAG systems (Pinecone, Weaviate, Milvus)
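
As an illustration of item 6, here is a minimal sketch of retries with exponential backoff. The function name, the default delays, and the use of full jitter are assumptions made for this example; the `sleep` parameter exists only so the example can run without waiting.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=4, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep):
    """Call fn until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Full jitter: wait a random time up to the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))


calls = {"count": 0}


def flaky():
    """Fail twice with a transient error, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = retry_with_backoff(flaky, sleep=lambda seconds: None)
print(result, calls["count"])  # ok 3
```

Jitter matters because clients that retry on the same schedule tend to hit a recovering service at the same moment.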

Updated Hardware Benchmarks

Modern servers are significantly more powerful than what older prep materials assume. Using outdated numbers signals that you have not worked with production systems recently.

| Operation | Modern (2025-2026) |
| --- | --- |
| L1 cache reference | 0.5 ns |
| Main memory reference | 100 ns |
| SSD random read | 16 µs |
| Network round trip (same DC) | 500 µs |
| Network round trip (cross-region) | 50-100 ms |
| Typical server throughput | 1,000-10,000 req/s per core |

Company-Specific Interview Formats

Each company structures system design interviews differently. Understanding these nuances gives you a strategic advantage.

Google

Format: 45-60 minutes, 1 round at L4-L5, up to 3 rounds for senior roles (L6+). The interview flows through five phases: problem statement (~2 min), requirements gathering (~5 min), high-level design (~15 min), deep dive into 1-2 components (~15-20 min), and scalability/bottleneck discussion (~5-10 min).

What they look for: Google-scale thinking from the start (billions of users, geo-distributed infrastructure). Questions often mirror Google products (YouTube, Maps, Drive, Search). For SRE roles, Google now uses NALSD (Non-Abstract Large System Design), where candidates scale an existing system instead of designing from scratch.

2025-2026 changes: Return to in-person interviews at major engineering sites (Bay Area, Seattle, NYC, Bangalore). New Google Hiring Assessment pre-screening tool. GenAI questions are appearing with increasing frequency.

Amazon

Format: 45-60 minutes, 1-2 system design rounds during “The Loop” alongside coding, behavioral, and Bar Raiser interviews. Amazon’s 16 Leadership Principles permeate every evaluation.

What they look for: Frame design decisions using LP language: “Customer Obsession” to justify UX choices, “Insist on the Highest Standards” for monitoring and operational rigor, “Think Big” for scalability plans. Expect deep dives into failure modes, retries, idempotency, and monitoring.

2025-2026 changes: Custom anti-cheating question variants ensure every candidate gets a unique prompt. Expect more AWS-specific discussions (DynamoDB, S3, SQS).

Meta

Format: 45 minutes (roughly 35 minutes of actual design). Infrastructure engineers get a “System Design” interview; product engineers get a “Product Architecture” interview including UX considerations. E4-E5: 1 round; E6+ (Staff): 2 mandatory rounds — failing either blocks the hire.

What they look for: Product sense, ML-aware design, caching layers (Memcache, TAO), and ranking algorithms. Meta’s “Pirate X” loop specifically tests API and product design thinking.

2025-2026 changes: AI-assisted coding rounds where candidates use an integrated AI assistant in CoderPad. Full screen-sharing with background blur disabled for cheating detection.

Netflix

Format: Unique, one-off questions for every candidate instead of recycling standard problems. Frequently conducted without any shared drawing tool — you must walk through the design using only words.

What they look for: Practical experience grounded in past work, trade-off reasoning, operational thinking. Netflix cares about availability over consistency and testing in production.

Seven Must-Know Design Patterns

System design reduces to a set of recurring patterns. Mastering these patterns lets you recognize the problem structure quickly and apply the right solution.

1. Scaling Reads

Read traffic is often the first bottleneck. Social feeds, product catalogs, and content platforms commonly see read-to-write ratios of 100:1 or higher.

class ReadScalingStrategy:
    def __init__(self, primary_db, replicas, cache_client):
        self.primary = primary_db
        self.replicas = replicas
        self.cache = cache_client

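    def get_user_profile(self, user_id):
        # Cache-aside: check the cache first, fall back to a replica on a miss.
        cache_key = f"user:{user_id}"
        cached = self.cache.get(cache_key)
        if cached is not None:
            return cached
        # Spread reads across replicas; the replica choice need not be stable.
        replica = self.replicas[hash(user_id) % len(self.replicas)]
        profile = replica.query("SELECT * FROM users WHERE id = ?", user_id)
        # Expire after an hour so stale profiles eventually refresh.
        self.cache.setex(cache_key, 3600, profile)
        return profile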

Progression: Indexes and query tuning → Read replicas → Application-level caching → CDN. Each step adds capacity but introduces trade-offs (replication lag, cache invalidation, stale data).

2. Scaling Writes

Scaling writes is harder because every write must land in the correct place and coordination is complex.

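import hashlib

class ShardManager:
    def __init__(self, shards):
        self.shards = shards

    def get_shard(self, key):
        # Built-in hash() is salted per process for strings, so use a
        # stable digest to route a key to the same shard on every server.
        digest = hashlib.sha256(str(key).encode()).digest()
        shard_id = int.from_bytes(digest[:8], "big") % len(self.shards)
        return self.shards[shard_id]

    def write_post(self, post):
        # Shard by user ID so one user's posts live together.
        shard = self.get_shard(post["user_id"])
        shard.write(post)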

Approaches: Sharding splits data across servers by a key (user ID, geographic region). Partitioning separates data by type or feature. The key challenge is picking a shard key that balances load — user IDs work for social feeds, but product categories fail for e-commerce because some categories dominate. For write bursts, use buffering with queues and consider shedding load instead of crashing the system.

3. Real-time Updates

Many systems need to push updates to users — notifications, chat messages, dashboards, or collaborative editing.

graph LR
    subgraph "Progression of Real-time Mechanisms"
        Polling[HTTP Polling] --> SSE[Server-Sent Events]
        SSE --> WebSocket[WebSockets]
    end

Polling is the simplest option but inefficient. Server-sent events work for one-way updates from server to client. WebSockets handle bidirectional communication. On the backend, pub/sub systems (Redis, Kafka) work for lightweight updates. Collaborative editing requires stateful servers and consistent hashing to keep related users on the same machine.
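
The consistent hashing mentioned above can be sketched as a hash ring with virtual nodes. This is a simplified illustration under stated assumptions: the `HashRing` class, the server names, the use of SHA-256, and 100 virtual nodes per server are all choices made for the example.

```python
import bisect
import hashlib


def stable_hash(key: str) -> int:
    """Hash that is identical across processes, unlike built-in hash() for str."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each server gets many points on the ring to even out the load.
        self._ring = sorted(
            (stable_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first point past the key, wrapping at the end.
        idx = bisect.bisect(self._points, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["ws-1", "ws-2", "ws-3"])
# Every editor of the same document lands on the same stateful server.
print(ring.node_for("doc-42") == ring.node_for("doc-42"))  # True
```

When a server leaves the ring, only the keys it owned move to a new server; keys on the remaining servers stay where they were, which is what keeps most live sessions undisturbed.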

4. Long-Running Tasks

Operations like video encoding, report generation, or bulk data processing cannot run synchronously.

import uuid
from queue import Queue

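class TaskScheduler:
    def __init__(self):
        self.queue = Queue()
        self.results = {}

    def submit_task(self, task_fn, *args):
        task_id = str(uuid.uuid4())
        # Record the job before queuing so status lookups never miss it.
        self.results[task_id] = {"status": "queued"}
        self.queue.put((task_id, task_fn, args))
        return task_id

    def run_next(self):
        # A worker calls this in a loop; it blocks until a job is available.
        task_id, task_fn, args = self.queue.get()
        try:
            self.results[task_id] = {"status": "done", "result": task_fn(*args)}
        except Exception as exc:
            self.results[task_id] = {"status": "failed", "error": str(exc)}

    def get_status(self, task_id):
        return self.results.get(task_id, {"status": "unknown"})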

Pattern: Accept the request, place the job in a queue, return a job ID. Workers process the job while the user polls status or receives a callback. Choose queue technology based on needs: Redis (simple), SQS (retries, dead-letter queues), Kafka (replay capabilities), Temporal or Step Functions (multi-step workflows).

5. Dealing with Contention

When multiple users compete for the same resource — limited tickets, auction bids, inventory — you need coordination.

from contextlib import contextmanager

class InventoryManager:
    def __init__(self, db):
        self.db = db

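    @contextmanager
    def pessimistic_lock(self, item_id):
        # Hold a row lock for the duration of the with-block.
        self.db.execute("BEGIN")
        try:
            self.db.execute("SELECT quantity FROM inventory WHERE id = ? FOR UPDATE", item_id)
            yield
            self.db.execute("COMMIT")
        except Exception:
            # Release the lock and undo partial writes on any failure.
            self.db.execute("ROLLBACK")
            raise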

    def reserve_ticket(self, event_id, user_id):
        with self.pessimistic_lock(event_id):
            row = self.db.query("SELECT quantity FROM inventory WHERE id = ?", event_id)
            if row.quantity > 0:
                self.db.execute("UPDATE inventory SET quantity = quantity - 1 WHERE id = ?", event_id)
                self.db.execute("INSERT INTO reservations (event_id, user_id) VALUES (?, ?)", event_id, user_id)
                return True
        return False

Within a single database, transactions and locks work. Across distributed systems, you may need distributed locks (Redis Redlock, ZooKeeper) or two-phase commits. For high contention scenarios, batch requests and process them in waves rather than updating in real time.
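
For the single-database case, an alternative to holding an explicit lock is to fold the stock check into the write itself. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table layout and the `reserve` helper are assumptions for illustration, and this is a different technique from the `FOR UPDATE` lock shown earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (id INTEGER PRIMARY KEY, quantity INTEGER NOT NULL)"
)
conn.execute("INSERT INTO inventory (id, quantity) VALUES (1, 2)")
conn.commit()


def reserve(conn, item_id):
    """Decrement stock only if some remains; check and write are one statement."""
    cur = conn.execute(
        "UPDATE inventory SET quantity = quantity - 1 "
        "WHERE id = ? AND quantity > 0",
        (item_id,),
    )
    conn.commit()
    # rowcount is 1 when a row was updated, 0 when the item was sold out.
    return cur.rowcount == 1


results = [reserve(conn, 1) for _ in range(3)]
print(results)  # [True, True, False]
```

Because the condition and the decrement happen in one statement, two buyers cannot both take the last unit, and no lock is held across application code.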

6. Large Blobs

Images, videos, and large documents cannot pass through application servers without overwhelming bandwidth.

import boto3

class BlobStorage:
    def __init__(self):
        self.s3 = boto3.client("s3")

    def generate_upload_url(self, file_name, content_type):
        return self.s3.generate_presigned_url(
            "put_object",
            Params={
                "Bucket": "media-bucket",
                "Key": f"uploads/{file_name}",
                "ContentType": content_type
            },
            ExpiresIn=3600
        )

    def generate_download_url(self, object_key):
        return self.s3.generate_presigned_url(
            "get_object",
            Params={"Bucket": "media-bucket", "Key": object_key},
            ExpiresIn=3600
        )

Pattern: Presigned URLs let clients upload directly to storage (S3, GCS). Downloads go through a CDN (CloudFront, Cloudflare). For consistency, keep metadata in sync with blob storage and support resumable uploads for large files.

7. Multi-Step Processes

Payment processing, order fulfillment, and user onboarding span multiple services where each step may fail.

class SagaOrchestrator:
    def __init__(self):
        self.steps = []
        self.compensations = []

    def add_step(self, action, compensation):
        self.steps.append(action)
        self.compensations.append(compensation)

    def execute(self, *args):
        completed = []
        for i, step in enumerate(self.steps):
            try:
                step(*args)
                completed.append(i)
            except Exception:
                for j in reversed(completed):
                    self.compensations[j](*args)
                raise

Simple workflows use database transactions. Complex workflows need the saga pattern, event sourcing, or workflow engines (Temporal, Step Functions). These provide retries, timeouts, and state management but add operational overhead.

Building Blocks

Load Balancing

import random

class LoadBalancer:
    def __init__(self, servers, weights=None):
        self.servers = servers
        self.weights = weights or {s: 1 for s in servers}
        self.current = 0

    def round_robin(self):
        server = self.servers[self.current]
        self.current = (self.current + 1) % len(self.servers)
        return server

    def weighted_random(self):
        # random.choices picks one server with probability proportional to
        # its weight and always returns a server.
        servers = list(self.weights)
        weights = list(self.weights.values())
        return random.choices(servers, weights=weights, k=1)[0]

Algorithms: Round Robin (equal-capacity servers), Least Connections (variable request duration), IP Hash (session persistence), Weighted Random (heterogeneous servers).
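
Least Connections is listed above but not shown in the class. A minimal sketch, assuming the balancer is told when each request starts and finishes; the class and method names are hypothetical.

```python
class LeastConnectionsBalancer:
    def __init__(self, servers):
        # Track how many requests are currently in flight on each server.
        self.active = {server: 0 for server in servers}

    def acquire(self):
        """Pick the server with the fewest in-flight requests."""
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        """Call when a request finishes."""
        self.active[server] -= 1


lb = LeastConnectionsBalancer(["a", "b"])
first = lb.acquire()   # both idle, so the first server wins the tie
second = lb.acquire()  # "a" is busy, so "b" is chosen
lb.release(first)      # the request on "a" completes
third = lb.acquire()   # "a" is idle again
print(first, second, third)  # a b a
```

Unlike round robin, a server stuck on slow requests stops receiving new ones until it catches up.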

Caching Strategies

| Strategy | Behavior | Use Case |
| --- | --- | --- |
| Cache-aside | App checks cache, writes on miss | Read-heavy workloads |
| Write-through | Write to cache and DB synchronously | Read-write balance |
| Write-behind | Write to cache, async DB write | High write throughput |
| Read-through | Cache fetches from DB on miss | Simplifies app logic |

Eviction policies: LRU (least recently used), LFU (least frequently used), FIFO (first in, first out), TTL (time-based).
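
LRU is the eviction policy interviewers most often ask candidates to implement. A compact sketch using the standard library's `OrderedDict`; the class name and the capacity of 2 are chosen only for the example.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # touch "a", so "b" becomes the oldest entry
cache.put("c", 3)  # over capacity: "b" is evicted
print(cache.get("b"), cache.get("a"), cache.get("c"))  # None 1 3
```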

Rate Limiting

import time
from collections import defaultdict

class SlidingWindowRateLimiter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.requests = defaultdict(list)

    def allow(self, client_id):
        now = time.time()
        window_start = now - self.window
        self.requests[client_id] = [t for t in self.requests[client_id] if t > window_start]
        if len(self.requests[client_id]) < self.limit:
            self.requests[client_id].append(now)
            return True
        return False

Algorithms: Token bucket (allows bursts), Leaky bucket (smooths traffic), Fixed window (simple but spike-prone at boundaries), Sliding window (more accurate, more memory).
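
For comparison with the sliding window limiter, here is a minimal token bucket sketch. The class name and parameters are assumptions; the injectable `clock` is there so the example runs deterministically and is not required by the algorithm.

```python
import time


class TokenBucket:
    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity        # maximum burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1):
        now = self.clock()
        # Refill for the time elapsed, never exceeding the bucket capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


fake_time = [0.0]
bucket = TokenBucket(capacity=2, refill_rate=1, clock=lambda: fake_time[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]
fake_time[0] = 1.0  # one second later, one token has been refilled
after_wait = bucket.allow()
print(burst, after_wait)  # [True, True, False] True
```

The bucket stores two numbers per client regardless of traffic, whereas the sliding window limiter stores one timestamp per request.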

Step-by-Step Design Approach

Step 1: Requirements Clarification

Spend the first 3-5 minutes asking questions. This is a primary evaluation signal — skipping it is the fastest way to fail.

  • Who are the users? How many?
  • What are the key features?
  • What are the read/write ratios?
  • What are the latency and availability requirements?
  • Any geographic or compliance considerations?

Step 2: Capacity Estimation

Quick math to validate architectural decisions. Use round numbers and keep the arithmetic simple. Add 30% headroom for traffic spikes and maintenance.

def estimate_traffic(daily_active_users, requests_per_user):
    daily_requests = daily_active_users * requests_per_user
    average_qps = daily_requests / 86400
    peak_qps = average_qps * 3
    return {"average_qps": average_qps, "peak_qps": peak_qps}

estimates = estimate_traffic(daily_active_users=100_000_000, requests_per_user=10)
print(estimates)
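
The same back-of-the-envelope style applies to storage, including the 30% headroom mentioned above. The function name and the input numbers below are illustrative assumptions, not figures from any real system.

```python
def estimate_storage(daily_writes, bytes_per_write, retention_days, headroom=0.30):
    """Return the bytes needed for the retention window, plus headroom."""
    raw_bytes = daily_writes * bytes_per_write * retention_days
    return raw_bytes * (1 + headroom)


# 100 million writes per day at 1 KB each, kept for 100 days.
total_bytes = estimate_storage(
    daily_writes=100_000_000,
    bytes_per_write=1_000,
    retention_days=100,
)
print(f"{total_bytes / 1e12:.1f} TB")  # 13.0 TB
```

Round inputs keep the arithmetic easy to say out loud: 100 million × 1 KB × 100 days is 10 TB, and 30% headroom brings it to 13 TB.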

Step 3: High-Level Design

Draw the system architecture: clients, load balancers, API gateways, application servers, databases, caches, message queues, and CDNs. Explain the data flow end to end.

graph LR
    Client --> LB[Load Balancer]
    LB --> API[API Servers]
    API --> Cache[Redis Cache]
    API --> DB[(Database)]
    API --> MQ[Message Queue]
    MQ --> Workers[Background Workers]
    CDN[CDN] --> Client

Step 4: Deep Dive

Focus on 1-2 components. Discuss database schema, caching strategy, sharding approach, and consistency trade-offs. Justify every decision with explicit alternatives.

Step 5: Identify Bottlenecks

Find single points of failure. Discuss how the system handles 10x the current load. Address data replication, partition tolerance, and disaster recovery.

Step 6: Wrap Up

Summarize the design, restate trade-offs, and suggest future improvements. Mention monitoring, observability, and operational concerns — these signal production experience.

Top 15 Questions by Frequency

Based on 853 system design questions across FAANG companies, these are the most common problems:

| Rank | Question | Key Concepts Tested |
| --- | --- | --- |
| 1 | URL Shortener | Hashing, database sharding, caching, redirects |
| 2 | Social Media Feed | Fan-out strategies, ranking, message queues |
| 3 | Messaging System | WebSockets, message delivery guarantees, encryption |
| 4 | Video Streaming (YouTube) | Transcoding, CDN, adaptive streaming (HLS/DASH) |
| 5 | Photo Sharing (Instagram) | Object storage, CDN, feed generation |
| 6 | Collaborative Editing (Google Docs) | CRDTs/OT, WebSockets, versioning |
| 7 | Web Crawler | BFS/DFS, URL frontier, bloom filters, politeness |
| 8 | Rate Limiter | Token bucket, sliding window, distributed counters |
| 9 | Cloud File Storage (Drive/Dropbox) | Chunk storage, deduplication, sync |
| 10 | Autocomplete/Typeahead | Trie, ranking, in-memory cache, <100ms latency |
| 11 | Ride-Sharing (Uber) | Geohashing, QuadTrees, real-time matching |
| 12 | Distributed Cache | Consistent hashing, eviction policies, replication |
| 13 | Event Booking (Ticketmaster) | Concurrency control, queueing, idempotency |
| 14 | Notification System | Multi-channel delivery, priority queues, retries |
| 15 | Distributed Task Scheduler | Job queuing, worker pools, exactly-once processing |

Tiered difficulty: Beginners should master questions 1-10 (most common across levels). Mid-level candidates need questions 1-13. Senior and staff candidates must handle all 15 plus GenAI system design questions.

GenAI System Design

RAG Pipeline Architecture

Retrieval-Augmented Generation (RAG) combines search with generation — retrieve relevant chunks from a vector store, then feed them as context into an LLM prompt.

graph LR
    Query[User Query] --> Encoder[Query Encoder]
    Encoder --> VectorDB[(Vector Database)]
    VectorDB --> Retrieve[Top-K Retrieval]
    Retrieve --> Prompt[Context Assembly]
    Prompt --> LLM[LLM Inference]
    LLM --> Validate[Output Validator]
    Validate --> Response[Final Response]
    Feedback[User Feedback] --> Retrain[Retrieval Tuning]

Key design decisions:

  • Chunking strategy: Semantic segmentation of documents into retrievable chunks (256-512 tokens)
  • Embedding model: SentenceTransformers, text-embedding-3-small, or BGE embeddings
  • Vector database: Pinecone, Weaviate, Milvus, or pgvector for embedding storage
  • Model routing: Use smaller/distilled models for routine queries, larger LLMs for complex reasoning
  • Caching: Multi-layer caching (retrieval, prompt, and response) to reduce cost and latency
  • Safety: Content filters, prompt injection prevention, confidence estimation with fallback

LLM Serving Infrastructure

class InferenceRouter:
    def __init__(self, small_model, large_model):
        self.small = small_model
        self.large = large_model
        self.cache = {}

    def route(self, prompt, complexity):
        cache_key = prompt  # key on the prompt text; distinct prompts can share a hash() value
        if cache_key in self.cache:
            return self.cache[cache_key]
        if complexity == "simple":
            response = self.small.generate(prompt)
        else:
            response = self.large.generate(prompt)
        self.cache[cache_key] = response
        return response

Considerations: GPU resource management (batch inference, quantization, model parallelism), token cost optimization (prompt compression, caching, model tiering), and monitoring (per-token cost, latency distributions, hallucination rates).

Common Mistakes

Jumping into design without clarifying requirements is the most frequently cited reason for rejection. Spend the first 3-5 minutes gathering requirements. Other common mistakes:

  • Skipping back-of-the-envelope calculations signals you cannot reason about scale
  • Ignoring trade-offs — do not just pick a technology, explain why it beats alternatives
  • Happy-path-only design — always discuss failover, replication, retries, and circuit breakers
  • Over-engineering — start simple and add complexity only when requirements demand it
  • Poor time management — spending 15 minutes on database schema leaves no time for caching or scaling
  • Designing in silence — narrate your thinking so the interviewer can assess your reasoning
  • Outdated hardware numbers — using old server benchmarks signals unfamiliarity with modern production systems

Conclusion

System design interviews evaluate your ability to make reasoned architectural trade-offs under uncertainty. The candidates who succeed are not the ones who memorize the most architectures — they are the ones who recognize patterns, reason through trade-offs explicitly, and communicate clearly.

Structure your approach: clarify requirements, estimate scale, design the data model, outline the core flow, then discuss bottlenecks and trade-offs. In 2026, add GenAI system design familiarity, cost-aware thinking, and operational maturity to your preparation. The thought process matters more than the final design.
