System Design Interview Guide: Complete Patterns and Approaches

System design rounds are often the deciding factor in leveling, and therefore compensation, at big tech companies. This is where offers are won or lost: coding rounds verify that you can write correct code, but system design tests whether you can think like an engineer who builds real products at scale.

In 2026, the landscape has shifted dramatically. GenAI system design has emerged as a standalone interview category. Companies now evaluate cost-aware architecture, observability, and operational maturity alongside traditional scalability. In-person rounds rose from 24% in 2022 to 38% in 2025, driven by AI cheating concerns. This guide covers the fundamental patterns, company-specific formats, and updated expectations for today’s interviews.

What is System Design?

System design involves making architectural decisions about how software systems should be built. It encompasses:

Functional requirements: What the system should do
Non-functional requirements: Performance, scalability, reliability, cost
Technical constraints: Budget, timeline, existing infrastructure, team size

The interviewer evaluates four core dimensions: problem navigation (breaking a vague prompt into manageable pieces), solution design (applying caching, sharding, load balancing, and queues coherently), trade-off reasoning (explicitly stating alternatives and defending choices), and communication (thinking out loud, treating the interviewer as a design partner).

Common System Design Concepts

CAP Theorem

The CAP Theorem states that a distributed system can provide only two of three guarantees — consistency, availability, and partition tolerance — simultaneously. Network partitions are unavoidable, so the real choice is between CP (consistency over availability) and AP (availability over consistency).

graph TD
    subgraph "CAP Theorem"
        C[Consistency<br/>All nodes see same data]
        A[Availability<br/>Every request gets a response]
        P[Partition Tolerance<br/>System works despite network failures]
        C -- CP Systems --> P
        A -- AP Systems --> P
    end

CP systems (HBase, MongoDB with write concern majority) prioritize consistency. During a partition, they may reject writes to ensure all nodes agree. AP systems (Cassandra, DynamoDB) prioritize availability. During a partition, they accept writes and resolve conflicts later via last-write-wins or CRDTs.
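To make the AP conflict-resolution idea concrete, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. The class name `GCounter` and the replica IDs are illustrative assumptions, not the API of any particular database.

```python
class GCounter:
    """Grow-only counter CRDT: each replica increments only its own slot."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> increments observed from that replica

    def increment(self, amount=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)


# Two replicas accept writes independently during a partition...
a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
# ...and converge once they exchange state.
a.merge(b)
b.merge(a)
```

Both replicas end up with the same total without coordinating during the partition. Last-write-wins is simpler still, but it silently discards one of two concurrent updates.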

ACID vs BASE

ACID	BASE
Atomicity	Basically Available
Consistency	Soft state
Isolation	Eventual consistency
Durability	—

Choose ACID (PostgreSQL, MySQL) for financial transactions, inventory management, and systems where correctness is non-negotiable. Choose BASE (Cassandra, DynamoDB) for high-throughput systems where availability matters more than immediate consistency.

Consistency Patterns

Strong consistency: All reads return the latest write. Required for financial ledgers and booking systems. Cost: higher latency during replication.
Eventual consistency: Reads may return stale data but will converge over time. Used in DNS, social feeds, and content delivery. Cost: complex conflict resolution.
Causal consistency: Related events appear in the correct order. Used in collaborative editing and social media comments.

The 2026 Interview Landscape

GenAI System Design: The New Standalone Category

AI and LLM-related interview questions have tripled since 2023. Companies now ask:

“Design a retrieval-augmented chatbot for enterprise search”
“Design an AI coding assistant”
“Design an LLM-powered document search system”
“Design a model serving platform for 10k requests/second”

These questions test RAG pipeline design, model routing (using cheaper models for simple queries), prompt architecture, token cost optimization, and safety concerns like prompt injection prevention.

Seven Concepts Gaining Prominence

Cost-aware architecture — Discuss budget implications of every design decision, not just scalability
Observability — SLIs/SLOs, distributed tracing (OpenTelemetry), and alerting runbooks are expected
Security and privacy by design — Authentication, authorization, PII segregation, GDPR compliance
Event-driven architecture — Event sourcing, CQRS, and asynchronous communication patterns
Multi-region active-active — Geo-distributed systems with failover and conflict resolution
Resilience patterns — Circuit breakers, bulkheads, retries with backoff, chaos engineering
Vector databases — Embedding storage and similarity search for RAG systems (Pinecone, Weaviate, Milvus)
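As one concrete instance of the resilience patterns listed above, the following is a minimal circuit breaker sketch. The class names, default thresholds, and the injectable `clock` parameter are assumptions made for illustration; production libraries typically add per-endpoint state, metrics, and thread safety.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is short-circuited."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the circuit is open keeps a struggling dependency from tying up threads and connections in its callers.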

Updated Hardware Benchmarks

Modern servers are significantly more powerful than what older prep materials assume. Using outdated numbers signals that you have not worked with production systems recently.

Operation	Modern (2025-2026)
L1 cache reference	0.5 ns
Main memory reference	100 ns
SSD random read	16 µs
Network round trip (same DC)	500 µs
Network round trip (cross-region)	50-100 ms
Typical server throughput	1,000-10,000 req/s per core
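One way to use these numbers is to check how many sequential operations fit inside a latency budget. The sketch below does that arithmetic with the figures from the table; the 100 ms budget and the helper name `sequential_ops_within` are assumptions chosen for illustration.

```python
# Latencies from the table above, in nanoseconds.
LATENCY_NS = {
    "l1_cache": 0.5,
    "main_memory": 100,
    "ssd_random_read": 16_000,
    "same_dc_round_trip": 500_000,
    "cross_region_round_trip": 50_000_000,  # lower bound of the 50-100 ms range
}


def sequential_ops_within(budget_ms, operation):
    """How many back-to-back operations of one kind fit in a latency budget."""
    budget_ns = budget_ms * 1_000_000
    return int(budget_ns // LATENCY_NS[operation])


budget_ms = 100  # hypothetical end-to-end latency target
for op in LATENCY_NS:
    print(f"{op}: {sequential_ops_within(budget_ms, op):,}")
```

With a 100 ms budget, only two cross-region round trips fit, against 200 round trips inside one data center. That is why request paths avoid chained cross-region calls.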

Company-Specific Interview Formats

Each company structures system design interviews differently. Understanding these nuances gives you a strategic advantage.

Google

Format: 45-60 minutes, 1 round at L4-L5, up to 3 rounds for senior roles (L6+). The interview flows through five phases: problem statement (~2 min), requirements gathering (~5 min), high-level design (~15 min), deep dive into 1-2 components (~15-20 min), and scalability/bottleneck discussion (~5-10 min).

What they look for: Google-scale thinking from the start (billions of users, geo-distributed infrastructure). Questions often mirror Google products (YouTube, Maps, Drive, Search). For SRE roles, Google now uses NALSD (Non-Abstract Large System Design), where candidates scale an existing system instead of designing from scratch.

2025-2026 changes: Return to in-person interviews at major engineering sites (Bay Area, Seattle, NYC, Bangalore). New Google Hiring Assessment pre-screening tool. GenAI questions are appearing with increasing frequency.

Amazon

Format: 45-60 minutes, 1-2 system design rounds during “The Loop” alongside coding, behavioral, and Bar Raiser interviews. Amazon’s 16 Leadership Principles permeate every evaluation.

What they look for: Frame design decisions using LP language: “Customer Obsession” to justify UX choices, “Dive Deep” for monitoring and failure analysis, “Think Big” for scalability plans. Expect deep dives into failure modes, retries, idempotency, and monitoring.

2025-2026 changes: Custom anti-cheating question variants ensure every candidate gets a unique prompt. Expect more AWS-specific discussions (DynamoDB, S3, SQS).

Netflix

Format: Unique, one-off questions for every candidate instead of recycling standard problems. Frequently conducted without any shared drawing tool — you must walk through the design using only words.

What they look for: Practical experience grounded in past work, trade-off reasoning, operational thinking. Netflix cares about availability over consistency and testing in production.

Seven Must-Know Design Patterns

System design reduces to a set of recurring patterns. Mastering these patterns lets you recognize the problem structure quickly and apply the right solution.

1. Scaling Reads

Read traffic is often the first bottleneck. Social feeds, product catalogs, and content platforms all have read-to-write ratios exceeding 100:1.

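class ReadScalingStrategy:
    """Cache-aside reads: check the cache, fall back to a read replica, then populate the cache."""

    def __init__(self, primary_db, replicas, cache_client):
        self.primary = primary_db   # writes go to the primary
        self.replicas = replicas    # reads are spread across replicas
        self.cache = cache_client   # Redis-style client exposing get/setex

    def get_user_profile(self, user_id):
        key = f"user:{user_id}"
        cached = self.cache.get(key)
        if cached is not None:  # an empty but valid cached value still counts as a hit
            return cached
        # Cache miss: pick a replica for this user (assumes integer user IDs).
        replica = self.replicas[hash(user_id) % len(self.replicas)]
        profile = replica.query("SELECT * FROM users WHERE id = ?", user_id)
        # Cache for one hour; replicas can lag the primary, so this value may be stale.
        self.cache.setex(key, 3600, profile)
        return profile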

Progression: Indexes and query tuning → Read replicas → Application-level caching → CDN. Each step adds capacity but introduces trade-offs (replication lag, cache invalidation, stale data).
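The cache invalidation trade-off shows up on the write path. Below is a minimal sketch of one common approach: write to the database, then delete the cached entry so the next read repopulates it. `DictCache` and `ProfileStore` are hypothetical in-memory stand-ins, and a plain dict plays the role of the primary database.

```python
class DictCache:
    """In-memory stand-in for a Redis-style cache (hypothetical helper)."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value

    def delete(self, key):
        self.data.pop(key, None)


class ProfileStore:
    def __init__(self, db, cache):
        self.db = db        # plain dict standing in for the primary database
        self.cache = cache

    def read(self, user_id):
        key = f"user:{user_id}"
        cached = self.cache.get(key)
        if cached is not None:
            return cached
        profile = self.db.get(user_id)
        if profile is not None:
            self.cache.set(key, profile)
        return profile

    def update(self, user_id, profile):
        # Write to the database first, then drop the cached copy so the
        # next read repopulates it with fresh data.
        self.db[user_id] = profile
        self.cache.delete(f"user:{user_id}")
```

Deleting rather than overwriting the cache entry keeps the write path simple. With read replicas in play, a read that races the replication lag can still re-cache an old value, so a TTL remains a useful backstop.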

2. Scaling Writes

Scaling writes is harder because every write must land in the correct place and coordination is complex.

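import hashlib

class ShardManager:
    def __init__(self, shards):
        self.shards = shards

    def get_shard(self, key):
        # Use a stable hash: Python's built-in hash() is salted per process for
        # strings, so the same key could map to a different shard after a restart.
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        shard_id = int.from_bytes(digest[:8], "big") % len(self.shards)
        return self.shards[shard_id]

    def write_post(self, post):
        # Shard by user ID so all of a user's posts land on the same shard.
        shard = self.get_shard(post["user_id"])
        shard.write(post)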

Approaches: Sharding splits data across servers by a key (user ID, geographic region). Partitioning separates data by type or feature. The key challenge is picking a shard key that balances load — user IDs work for social feeds, but product categories fail for e-commerce because some categories dominate. For write bursts, use buffering with queues and consider shedding load instead of crashing the system.
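The buffering and load-shedding idea can be sketched with a bounded in-process queue. This is an illustration under simplifying assumptions: `WriteBuffer` is a hypothetical name, the sink is a plain list, and a real system would use a durable queue such as SQS or Kafka.

```python
import queue


class WriteBuffer:
    """Bounded buffer that sheds load instead of growing without limit."""

    def __init__(self, max_pending):
        self.pending = queue.Queue(maxsize=max_pending)
        self.shed_count = 0

    def submit(self, write):
        try:
            self.pending.put_nowait(write)
            return True
        except queue.Full:
            # Reject rather than block; the caller can answer with HTTP 429 or 503.
            self.shed_count += 1
            return False

    def drain(self, sink, batch_size):
        """Move up to batch_size buffered writes into sink (a list here)."""
        moved = 0
        while moved < batch_size:
            try:
                sink.append(self.pending.get_nowait())
            except queue.Empty:
                break
            moved += 1
        return moved
```

Rejecting the excess writes keeps memory and latency bounded during a burst, and the shed count is a natural metric to alert on.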

3. Real-time Updates

Many systems need to push updates to users — notifications, chat messages, dashboards, or collaborative editing.

graph LR
    subgraph "Progression of Real-time Mechanisms"
        Polling[HTTP Polling] --> SSE[Server-Sent Events]
        SSE --> WebSocket[WebSockets]
    end

Polling is the simplest option but inefficient. Server-sent events work for one-way updates from server to client. WebSockets handle bidirectional communication. On the backend, pub/sub systems (Redis, Kafka) work for lightweight updates. Collaborative editing requires stateful servers and consistent hashing to keep related users on the same machine.
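Consistent hashing, mentioned above for pinning related users to one machine, can be sketched as a hash ring with virtual nodes. The names `HashRing` and `stable_hash`, the server names, and the choice of 100 virtual nodes per server are assumptions for illustration.

```python
import bisect
import hashlib


def stable_hash(value):
    """Process-independent hash (Python's built-in hash() is salted for strings)."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each node gets several virtual points on the ring to even out load.
        self.ring = sorted(
            (stable_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self.points = [point for point, _ in self.ring]

    def node_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self.points, stable_hash(key)) % len(self.ring)
        return self.ring[idx][1]


ring = HashRing(["ws-1", "ws-2", "ws-3"])
server = ring.node_for("doc:42")  # every editor of doc:42 is routed to this server
```

When a server is added, only the keys that now fall before one of its ring points move, and they all move to the new server. With plain modulo hashing, most keys would be reassigned.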

4. Long-Running Tasks

Operations like video encoding, report generation, or bulk data processing cannot run synchronously.

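import uuid
from queue import Queue

class TaskScheduler:
    def __init__(self):
        self.queue = Queue()  # in-process stand-in for Redis, SQS, or Kafka
        self.results = {}

    def submit_task(self, task_fn, *args):
        task_id = str(uuid.uuid4())
        # Record the status at submission so unknown IDs are not reported as pending.
        self.results[task_id] = {"status": "processing"}
        self.queue.put((task_id, task_fn, args))
        return task_id

    def run_next(self):
        # Worker side: take one job off the queue and store its outcome.
        task_id, task_fn, args = self.queue.get()
        try:
            self.results[task_id] = {"status": "done", "result": task_fn(*args)}
        except Exception as exc:
            self.results[task_id] = {"status": "failed", "error": str(exc)}

    def get_status(self, task_id):
        return self.results.get(task_id, {"status": "not_found"})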

Pattern: Accept the request, place the job in a queue, return a job ID. Workers process the job while the user polls status or receives a callback. Choose queue technology based on needs: Redis (simple), SQS (retries, dead-letter queues), Kafka (replay capabilities), Temporal or Step Functions (multi-step workflows).

5. Dealing with Contention

When multiple users compete for the same resource — limited tickets, auction bids, inventory — you need coordination.

from contextlib import contextmanager

class InventoryManager:
    def __init__(self, db):
        self.db = db

    @contextmanager
    def pessimistic_lock(self, item_id):
        self.db.execute("BEGIN")
        try:
            # FOR UPDATE holds a row lock until the transaction ends.
            self.db.execute("SELECT quantity FROM inventory WHERE id = ? FOR UPDATE", item_id)
            yield
        except Exception:
            self.db.execute("ROLLBACK")  # release the lock and undo partial writes
            raise
        else:
            self.db.execute("COMMIT")

    def reserve_ticket(self, event_id, user_id):
        with self.pessimistic_lock(event_id):
            row = self.db.query("SELECT quantity FROM inventory WHERE id = ?", event_id)
            if row.quantity > 0:
                self.db.execute("UPDATE inventory SET quantity = quantity - 1 WHERE id = ?", event_id)
                self.db.execute("INSERT INTO reservations (event_id, user_id) VALUES (?, ?)", event_id, user_id)
                return True
        return False

Within a single database, transactions and locks work. Across distributed systems, you may need distributed locks (Redis Redlock, ZooKeeper) or two-phase commits. For high contention scenarios, batch requests and process them in waves rather than updating in real time.
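For the single-database case, a runnable alternative to explicit row locks is an atomic conditional update. This swaps the `FOR UPDATE` technique for a guarded `UPDATE`, partly because SQLite, used here so the example runs anywhere, has no `FOR UPDATE`. The table layout follows the example above; the data is made up.

```python
import sqlite3


def reserve_ticket(conn, event_id, user_id):
    """Reserve one ticket using a conditional UPDATE; returns True on success."""
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute(
            "UPDATE inventory SET quantity = quantity - 1 "
            "WHERE id = ? AND quantity > 0",
            (event_id,),
        )
        if cur.rowcount == 0:
            return False  # sold out (or unknown event): nothing was changed
        conn.execute(
            "INSERT INTO reservations (event_id, user_id) VALUES (?, ?)",
            (event_id, user_id),
        )
        return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (id INTEGER PRIMARY KEY, quantity INTEGER NOT NULL)")
conn.execute("CREATE TABLE reservations (event_id INTEGER, user_id INTEGER)")
conn.execute("INSERT INTO inventory VALUES (1, 2)")
conn.commit()

# Three users compete for two tickets; the third request is refused.
results = [reserve_ticket(conn, 1, user) for user in (101, 102, 103)]
```

Because the check and the decrement happen in one statement, the database serializes competing requests and the quantity cannot drop below zero.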

6. Large Blobs

Images, videos, and large documents cannot pass through application servers without overwhelming bandwidth.

import boto3

class BlobStorage:
    def __init__(self):
        self.s3 = boto3.client("s3")

    def generate_upload_url(self, file_name, content_type):
        return self.s3.generate_presigned_url(
            "put_object",
            Params={
                "Bucket": "media-bucket",
                "Key": f"uploads/{file_name}",
                "ContentType": content_type
            },
            ExpiresIn=3600
        )

    def generate_download_url(self, object_key):
        return self.s3.generate_presigned_url(
            "get_object",
            Params={"Bucket": "media-bucket", "Key": object_key},
            ExpiresIn=3600
        )

Pattern: Presigned URLs let clients upload directly to storage (S3, GCS). Downloads go through a CDN (CloudFront, Cloudflare). For consistency, keep metadata in sync with blob storage and support resumable uploads for large files.

7. Multi-Step Processes

Payment processing, order fulfillment, and user onboarding span multiple services where each step may fail.

class SagaOrchestrator:
    def __init__(self):
        self.steps = []
        self.compensations = []

    def add_step(self, action, compensation):
        self.steps.append(action)
        self.compensations.append(compensation)

    def execute(self, *args):
        completed = []
        for i, step in enumerate(self.steps):
            try:
                step(*args)
                completed.append(i)
            except Exception:
                for j in reversed(completed):
                    self.compensations[j](*args)
                raise

Simple workflows use database transactions. Complex workflows need the saga pattern, event sourcing, or workflow engines (Temporal, Step Functions). These provide retries, timeouts, and state management but add operational overhead.

Building Blocks

Load Balancing

import random

class LoadBalancer:
    def __init__(self, servers, weights=None):
        self.servers = servers
        self.weights = weights or {s: 1 for s in servers}
        self.current = 0

    def round_robin(self):
        server = self.servers[self.current]
        self.current = (self.current + 1) % len(self.servers)
        return server

    def weighted_random(self):
        total = sum(self.weights.values())
        r = random.uniform(0, total)
        upto = 0
        for server, weight in self.weights.items():
            upto += weight
            if r <= upto:
                return server

Algorithms: Round Robin (equal-capacity servers), Least Connections (variable request duration), IP Hash (session persistence), Weighted Random (heterogeneous servers).
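Least Connections is listed above but not shown in code. A minimal sketch follows, assuming the balancer is told when each request finishes; the class and method names are hypothetical.

```python
class LeastConnectionsBalancer:
    def __init__(self, servers):
        self.active = {server: 0 for server in servers}  # open connections per server

    def acquire(self):
        # Pick the server with the fewest in-flight requests (ties: first listed).
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call when the request finishes so the count reflects current load.
        self.active[server] -= 1
```

Unlike round robin, this adapts when some requests run much longer than others, at the cost of tracking per-server state.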

Caching Strategies

Strategy	Behavior	Use Case
Cache-aside	App checks cache, writes on miss	Read-heavy workloads
Write-through	Write to cache and DB synchronously	Read-write balance
Write-behind	Write to cache, async DB write	High write throughput
Read-through	Cache fetches from DB on miss	Simplifies app logic

Eviction policies: LRU (least recently used), LFU (least frequently used), FIFO (first in, first out), TTL (time-based).
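LRU is the eviction policy candidates are most often asked to sketch. A minimal version built on the standard library's `OrderedDict` might look like this; the class name `LRUCache` is illustrative.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key, default=None):
        if key not in self.items:
            return default
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry


cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" becomes the most recently used entry
cache.put("c", 3)   # capacity exceeded, so "b" is evicted
```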

Rate Limiting

import time
from collections import defaultdict

class SlidingWindowRateLimiter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.requests = defaultdict(list)

    def allow(self, client_id):
        now = time.time()
        window_start = now - self.window
        self.requests[client_id] = [t for t in self.requests[client_id] if t > window_start]
        if len(self.requests[client_id]) < self.limit:
            self.requests[client_id].append(now)
            return True
        return False

Algorithms: Token bucket (allows bursts), Leaky bucket (smooths traffic), Fixed window (simple but spike-prone at boundaries), Sliding window (more accurate, more memory).
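For comparison with the sliding window limiter, here is a minimal token bucket sketch. The `clock` parameter is an assumption added so the behavior can be tested without sleeping, and a distributed limiter would keep this state in a shared store rather than in process memory.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1):
        now = self.clock()
        # Refill for the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The bucket stores two numbers per client instead of a list of timestamps, which is why it uses less memory than the sliding window log while still allowing short bursts up to `capacity`.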

Step-by-Step Design Approach

Step 1: Requirements Clarification

Spend the first 3-5 minutes asking questions. This is a primary evaluation signal — skipping it is the fastest way to fail.

Who are the users? How many?
What are the key features?
What are the read/write ratios?
What are the latency and availability requirements?
Any geographic or compliance considerations?

Step 2: Capacity Estimation

Quick math to validate architectural decisions. Use round numbers and keep the arithmetic simple. Add 30% headroom for traffic spikes and maintenance.

def estimate_traffic(daily_active_users, requests_per_user, peak_factor=3):
    daily_requests = daily_active_users * requests_per_user
    average_qps = daily_requests / 86400  # seconds per day
    peak_qps = average_qps * peak_factor  # assumes peak is about 3x the daily average
    return {"average_qps": average_qps, "peak_qps": peak_qps}

estimates = estimate_traffic(daily_active_users=100_000_000, requests_per_user=10)
print(estimates)  # roughly 11,600 average QPS and 34,700 peak QPS
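The 30% headroom rule can be folded into a server-count estimate. The sketch below is self-contained; the per-server throughput of 5,000 requests per second is a hypothetical figure, and `servers_needed` is an illustrative helper name.

```python
import math


def servers_needed(peak_qps, qps_per_server, headroom=0.30):
    """Servers required to serve peak_qps while keeping spare capacity."""
    required_capacity = peak_qps * (1 + headroom)
    return math.ceil(required_capacity / qps_per_server)


# 100M daily users x 10 requests each, with peak assumed to be 3x the average.
peak_qps = 100_000_000 * 10 / 86_400 * 3
print(servers_needed(peak_qps, qps_per_server=5_000))
```

Rounding up matters here: a fractional server still has to be provisioned as a whole one.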

Step 3: High-Level Design

Draw the system architecture: clients, load balancers, API gateways, application servers, databases, caches, message queues, and CDNs. Explain the data flow end to end.

graph LR
    Client --> LB[Load Balancer]
    LB --> API[API Servers]
    API --> Cache[Redis Cache]
    API --> DB[(Database)]
    API --> MQ[Message Queue]
    MQ --> Workers[Background Workers]
    CDN[CDN] --> Client

Step 4: Deep Dive

Focus on 1-2 components. Discuss database schema, caching strategy, sharding approach, and consistency trade-offs. Justify every decision with explicit alternatives.

Step 5: Identify Bottlenecks

Find single points of failure. Discuss how the system handles 10x the current load. Address data replication, partition tolerance, and disaster recovery.

Step 6: Wrap Up

Summarize the design, restate trade-offs, and suggest future improvements. Mention monitoring, observability, and operational concerns — these signal production experience.

Top 15 Questions by Frequency

Rank	Question	Key Concepts Tested
1	URL Shortener	Hashing, database sharding, caching, redirects
2	Social Media Feed	Fan-out strategies, ranking, message queues
3	Messaging System	WebSockets, message delivery guarantees, encryption
4	Video Streaming (YouTube)	Transcoding, CDN, adaptive streaming (HLS/DASH)
5	Photo Sharing (Instagram)	Object storage, CDN, feed generation
6	Collaborative Editing (Google Docs)	CRDTs/OT, WebSockets, versioning
7	Web Crawler	BFS/DFS, URL frontier, bloom filters, politeness
8	Rate Limiter	Token bucket, sliding window, distributed counters
9	Cloud File Storage (Drive/Dropbox)	Chunk storage, deduplication, sync
10	Autocomplete/Typeahead	Trie, ranking, in-memory cache, <100ms latency
11	Ride-Sharing (Uber)	Geohashing, QuadTrees, real-time matching
12	Distributed Cache	Consistent hashing, eviction policies, replication
13	Event Booking (Ticketmaster)	Concurrency control, queueing, idempotency
14	Notification System	Multi-channel delivery, priority queues, retries
15	Distributed Task Scheduler	Job queuing, worker pools, exactly-once processing

GenAI System Design

RAG Pipeline Architecture

Retrieval-Augmented Generation (RAG) combines search with generation — retrieve relevant chunks from a vector store, then feed them as context into an LLM prompt.

graph LR
    Query[User Query] --> Encoder[Query Encoder]
    Encoder --> VectorDB[(Vector Database)]
    VectorDB --> Retrieve[Top-K Retrieval]
    Retrieve --> Prompt[Context Assembly]
    Prompt --> LLM[LLM Inference]
    LLM --> Validate[Output Validator]
    Validate --> Response[Final Response]
    Feedback[User Feedback] --> Retrain[Retrieval Tuning]

Key design decisions:

Chunking strategy: Semantic segmentation of documents into retrievable chunks (256-512 tokens)
Embedding model: SentenceTransformers, text-embedding-3-small, or BGE embeddings
Vector database: Pinecone, Weaviate, Milvus, or pgvector for embedding storage
Model routing: Use smaller/distilled models for routine queries, larger LLMs for complex reasoning
Caching: Multi-layer caching (retrieval, prompt, and response) to reduce cost and latency
Safety: Content filters, prompt injection prevention, confidence estimation with fallback

LLM Serving Infrastructure

class InferenceRouter:
    def __init__(self, small_model, large_model):
        self.small = small_model
        self.large = large_model
        self.cache = {}

    def route(self, prompt, complexity):
        # Key on the tier and the full prompt: hash(prompt) can collide, which
        # would return another prompt's response.
        cache_key = (complexity, prompt)
        if cache_key in self.cache:
            return self.cache[cache_key]
        if complexity == "simple":
            response = self.small.generate(prompt)  # cheaper model for routine queries
        else:
            response = self.large.generate(prompt)  # larger model for complex reasoning
        self.cache[cache_key] = response
        return response

Considerations: GPU resource management (batch inference, quantization, model parallelism), token cost optimization (prompt compression, caching, model tiering), and monitoring (per-token cost, latency distributions, hallucination rates).

Common Mistakes

Jumping into design without clarifying requirements is one of the most commonly cited reasons for rejection. Spend the first 3-5 minutes gathering requirements. Other common mistakes:

Skipping back-of-the-envelope calculations signals you cannot reason about scale
Ignoring trade-offs — do not just pick a technology, explain why it beats alternatives
Happy-path-only design — always discuss failover, replication, retries, and circuit breakers
Over-engineering — start simple and add complexity only when requirements demand it
Poor time management — spending 15 minutes on database schema leaves no time for caching or scaling
Designing in silence — narrate your thinking so the interviewer can assess your reasoning
Outdated hardware numbers — using old server benchmarks signals unfamiliarity with modern production systems

Conclusion

System design interviews evaluate your ability to make reasoned architectural trade-offs under uncertainty. The candidates who succeed are not the ones who memorize the most architectures — they are the ones who recognize patterns, reason through trade-offs explicitly, and communicate clearly.

Structure your approach: clarify requirements, estimate scale, design the data model, outline the core flow, then discuss bottlenecks and trade-offs. In 2026, add GenAI system design familiarity, cost-aware thinking, and operational maturity to your preparation. The thought process matters more than the final design.

Resources

Designing Data-Intensive Applications — Martin Kleppmann’s definitive guide to distributed systems
System Design Interview — An Insider’s Guide — Alex Xu’s practical interview preparation
System Design Primer — Comprehensive open-source resource with curated content
ByteByteGo — Visual explanations of real-world system architectures
High Scalability Blog — Real-world architecture breakdowns
AWS Well-Architected Framework — Production architecture best practices
