Introduction
Understanding ClickHouse’s internal architecture helps developers and DBAs optimize queries, design efficient tables, and troubleshoot performance issues. ClickHouse’s unique design prioritizes analytical query performance through columnar storage and vectorized processing.
This article explores ClickHouse’s internal mechanisms, including the MergeTree storage engine, query processing pipeline, and the architectural decisions that enable its exceptional performance.
Architecture Overview
System Architecture
┌──────────────────────────────────────────────────────────────┐
│                   ClickHouse Architecture                    │
├──────────────────────────────────────────────────────────────┤
│                         Client Layer                         │
│   ┌──────────────┐  ┌──────────────┐  ┌─────────────────┐    │
│   │ HTTP Client  │  │  TCP Client  │  │ Native Protocol │    │
│   └──────────────┘  └──────────────┘  └─────────────────┘    │
├──────────────────────────────────────────────────────────────┤
│                    Query Processing Layer                    │
│   ┌──────────────┐  ┌──────────────┐  ┌─────────────────┐    │
│   │    Parser    │  │  Optimizer   │  │   Interpreter   │    │
│   └──────────────┘  └──────────────┘  └─────────────────┘    │
├──────────────────────────────────────────────────────────────┤
│                    Core Execution Engine                     │
│   ┌──────────────────────────────────────────────────────┐   │
│   │       Vectorized Query Pipeline (Processors)         │   │
│   └──────────────────────────────────────────────────────┘   │
├──────────────────────────────────────────────────────────────┤
│                     Storage Engine Layer                     │
│   ┌──────────────────────────────────────────────────────┐   │
│   │  MergeTree Family (MergeTree, ReplicatedMergeTree)   │   │
│   └──────────────────────────────────────────────────────┘   │
├──────────────────────────────────────────────────────────────┤
│                      File System Layer                       │
└──────────────────────────────────────────────────────────────┘
Query Processing Flow
SQL Query
      │
      ▼
┌─────────────┐
│   Parser    │  1. Parse SQL to AST
└─────────────┘
      │
      ▼
┌─────────────┐
│  Analyzer   │  2. Name resolution, type checking
└─────────────┘
      │
      ▼
┌─────────────┐
│  Optimizer  │  3. Query plan optimization
└─────────────┘
      │
      ▼
┌─────────────┐
│ Interpreter │  4. Build pipeline
└─────────────┘
      │
      ▼
┌─────────────┐
│  Pipeline   │  5. Vectorized execution
└─────────────┘
      │
      ▼
Result Set
MergeTree Storage Engine
Storage Structure
ClickHouse stores data in immutable parts, where each part contains rows sorted by the ORDER BY key. Part directories are named {partition}_{min_block}_{max_block}_{level}:
/var/lib/clickhouse/data/database/table/
├── 202601_1_10_2/        ← one data part
│   ├── checksums.txt     ← file checksums
│   ├── columns.txt       ← column names and types
│   ├── count.txt         ← row count
│   ├── primary.idx       ← sparse primary index
│   ├── {column1}.bin     ← compressed column data
│   ├── {column1}.mrk2    ← marks (file offsets per granule)
│   ├── {column2}.bin
│   ├── {column2}.mrk2
│   └── ...
(Wide format shown; small parts may use the compact format, which stores all columns in a single data.bin.)
Data Storage Format
Column File (.bin):
┌─────────────────────────────────┐
│           Column Data           │
│   ┌──────┬──────┬──────┬─────┐  │
│   │ Row1 │ Row2 │ Row3 │ ... │  │
│   └──────┴──────┴──────┴─────┘  │
└─────────────────────────────────┘

Primary Index (primary.idx):
┌─────────────────────────────────┐
│        Primary Key Index        │
│  ┌───────┐ ┌───────┐ ┌───────┐  │
│  │ Key 1 │ │ Key 2 │ │ Key 3 │  │
│  └───────┘ └───────┘ └───────┘  │
└─────────────────────────────────┘
Merge Process
-- Background merge process:
-- 1. Select parts to merge (similar partition, similar size)
-- 2. Read data from source parts
-- 3. Sort and merge
-- 4. Write to new part
-- 5. Replace old parts
-- Manual merge
OPTIMIZE TABLE events FINAL;
-- Check merge progress
SELECT
partition,
sum(rows) as rows,
count() as parts
FROM system.parts
WHERE active AND table = 'events'
GROUP BY partition;
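The five merge steps above amount to a streaming k-way merge of already-sorted parts. A toy Python sketch of the idea (not ClickHouse's actual C++ implementation; the part contents are invented):

```python
import heapq

def merge_parts(parts):
    """Merge several data parts into one new part.

    Each part is a list of rows already sorted by the ORDER BY key,
    so merging is a streaming k-way merge: read from all source
    parts, emit one sorted result, then drop the old parts.
    """
    return list(heapq.merge(*parts))

# Three small parts, each sorted by key
old_parts = [[1, 4, 7], [2, 5], [3, 6, 8]]
new_part = merge_parts(old_parts)  # replaces the old parts on disk
```

Because every input is sorted, the merge never needs to re-sort the full data set; it only keeps one cursor per part.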
Columnar Storage
Data Layout
Row-oriented (Traditional):
┌──────┬────────┬────────────┐
│ name │ salary │ department │
├──────┼────────┼────────────┤
│ John │ 50000  │ Eng        │
│ Jane │ 60000  │ Sales      │
│ Bob  │ 70000  │ Eng        │
└──────┴────────┴────────────┘

Column-oriented (ClickHouse):
┌───────────────────────────────┐
│ name: [John, Jane, Bob]       │
├───────────────────────────────┤
│ salary: [50000, 60000, 70000] │
├───────────────────────────────┤
│ department: [Eng, Sales, Eng] │
└───────────────────────────────┘
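A toy model of why the columnar layout wins for analytics: an aggregate over one column reads a single contiguous array and skips the others entirely (illustrative Python using the sample rows above):

```python
# The same three rows in both layouts
rows = [("John", 50000, "Eng"), ("Jane", 60000, "Sales"), ("Bob", 70000, "Eng")]
columns = {
    "name": ["John", "Jane", "Bob"],
    "salary": [50000, 60000, 70000],
    "department": ["Eng", "Sales", "Eng"],
}

def avg_salary_rows(rows):
    # Row store: every full row is read just to pick out one field
    return sum(r[1] for r in rows) / len(rows)

def avg_salary_columns(cols):
    # Column store: only the salary array is touched; name and
    # department are never read from disk at all
    sal = cols["salary"]
    return sum(sal) / len(sal)
```

Both return the same answer, but the columnar version's I/O scales with one column, not the full row width.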
Compression
ClickHouse uses multiple compression algorithms:
-- Compression codecs
-- LZ4: Default, fast
-- ZSTD: Better compression, slower
-- Delta: For ordered data
-- T64: For numeric data
-- Specify codec
CREATE TABLE compressed (
id UInt64,
value Float64 CODEC(Delta, ZSTD)
) ENGINE = MergeTree()
ORDER BY id;
-- Default codec
-- Without a CODEC clause, columns use LZ4 (or the table/server default)
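A minimal sketch of the idea behind the Delta codec: ordered data is rewritten as small consecutive differences, which a general-purpose codec (LZ4/ZSTD) then compresses far better. Toy Python, not the real codec implementation:

```python
def delta_encode(values):
    # Store the first value, then each difference from its predecessor
    return values[:1] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    # A running sum reverses the encoding exactly
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out

# Monotonic timestamps become mostly tiny, repetitive integers
timestamps = [1700000000, 1700000001, 1700000003, 1700000003]
encoded = delta_encode(timestamps)
```

This is why CODEC(Delta, ZSTD) is a common pairing for ordered numeric columns: Delta shrinks the value range, ZSTD then squeezes the repetition.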
Query Processing
Vectorized Execution
ClickHouse processes data in vectors (batches), not row-by-row:
// Simplified vectorized processing
for (batch in data) {
// Process entire batch with SIMD
result = batch.column1 + batch.column2; // Vectorized add
}
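In runnable form, the same idea looks like this: operate on whole column chunks at a time so per-row overhead is amortized. This is a pure-Python stand-in; real ClickHouse does this with SIMD loops over typed arrays, and the block size below merely echoes the max_block_size setting:

```python
def add_columns(col_a, col_b):
    # One tight loop over an entire chunk, not a function call per row
    return [a + b for a, b in zip(col_a, col_b)]

def sum_in_blocks(column, block_size=65536):
    # Process the column in fixed-size blocks (cf. max_block_size);
    # each block fits in cache and would be a SIMD loop in C++
    total = 0
    for start in range(0, len(column), block_size):
        total += sum(column[start:start + block_size])
    return total
```

The payoff is that branch and dispatch costs are paid once per block of thousands of rows instead of once per row.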
Query Pipeline
-- Example: Simple aggregation
SELECT event_name, count() FROM events GROUP BY event_name
-- Pipeline stages (only those the query needs are built):
-- 1. Source  → Read from column files
-- 2. Filter  → Apply WHERE conditions
-- 3. GroupBy → Hash aggregation
-- 4. Sort    → ORDER BY
-- 5. Limit   → Apply LIMIT
-- 6. Sink    → Return results
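The stages compose naturally as pull-based steps, which is roughly how processors connect in the pipeline. A schematic Python sketch with invented sample events (the real pipeline passes column blocks, not single rows):

```python
def run_pipeline(rows):
    stream = iter(rows)                               # 1. source
    stream = (r for r in stream if r != "ignored")    # 2. filter (WHERE)
    counts = {}                                       # 3. hash aggregation
    for name in stream:
        counts[name] = counts.get(name, 0) + 1
    return dict(sorted(counts.items()))               # 4. sort, then sink

result = run_pipeline(["click", "view", "click", "ignored"])
```

Each stage only pulls what the next stage demands, so filters and limits naturally short-circuit upstream work.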
Parallel Execution
-- ClickHouse uses multiple threads per query
-- 1. Split data by parts
-- 2. Process each part in parallel
-- 3. Merge results
-- Control parallelism
SET max_threads = 8;
-- Check query threads
EXPLAIN PIPELINE
SELECT count() FROM events;
Indexing
Primary Index
-- Primary index structure (primary.idx)
-- Contains first key value of every index_granularity (default 8192) rows
-- Example: Primary key (event_name, timestamp)
-- Index contains:
-- event1, 2026-01-01 00:00:00
-- event1, 2026-01-01 08:00:00
-- event2, 2026-01-01 12:00:00
-- ...
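Because the index is sparse, a lookup binary-searches the per-granule first keys and returns only the granules that might contain the value; everything else is skipped without being read. A toy Python sketch of that pruning step (illustrative, with invented marks):

```python
import bisect

def candidate_granules(index_keys, key):
    """index_keys[i] is the first primary-key value of granule i
    (each granule covers index_granularity rows, 8192 by default).
    Return the granule numbers that may contain `key`."""
    # The granule just before the first mark >= key may still hold it
    lo = max(bisect.bisect_left(index_keys, key) - 1, 0)
    # Granules whose first key is > key cannot match an equality lookup
    hi = bisect.bisect_right(index_keys, key) - 1
    return list(range(lo, hi + 1))

# Sparse index over 4 granules (~32768 rows): only first keys are kept
marks = ["event1", "event1", "event2", "event3"]
```

With ~8192 rows per mark, a billion-row part needs only ~120K index entries, small enough to stay in memory.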
Secondary Indexes
-- Skip index (bloom filter)
ALTER TABLE events ADD INDEX idx_name event_name
TYPE bloom_filter
GRANULARITY 4;
-- MinMax index (for ranges)
ALTER TABLE events ADD INDEX idx_timestamp timestamp
TYPE minmax
GRANULARITY 4;
-- Set index (for exact values)
ALTER TABLE events ADD INDEX idx_status status
TYPE set(1000)
GRANULARITY 1;
-- Note: ADD INDEX applies only to new parts; run
-- ALTER TABLE events MATERIALIZE INDEX idx_status
-- to build the index over existing data.
Memory Management
Memory Pools
-- ClickHouse uses multiple memory pools:
-- 1. Query memory
-- 2. Background pool (merges)
-- 3. System pool (dictionaries)
-- Control query memory (values in bytes)
SET max_memory_usage = 16000000000;  -- ~16 GB
-- Spill large aggregations/sorts to disk beyond a threshold
SET max_bytes_before_external_group_by = 4000000000;  -- ~4 GB
SET max_bytes_before_external_sort = 4000000000;
-- Check memory usage
SELECT
query,
formatReadableSize(memory_usage) as memory
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY memory_usage DESC
LIMIT 10;
Data Types Implementation
Type Categories
-- Numeric: Int8-Int256, UInt8-UInt256
-- Float: Float32, Float64
-- String: String, FixedString
-- Date: Date, Date32, DateTime, DateTime64
-- Array: Array(T)
-- Tuple: Tuple(T1, T2, ...)
-- Nested: Nested(...)
-- Nullable: Nullable(T)
-- LowCardinality: LowCardinality(T)
Column Implementation
-- Each data type has specialized Column implementation
-- Example: UInt64 column
-- ┌──────────────────────────────────────┐
-- │ ColumnUInt64 (ColumnVector<UInt64>)  │
-- │  └─ data: contiguous UInt64 array    │
-- └──────────────────────────────────────┘
-- Note: NULLs are not stored here; Nullable(UInt64) wraps this
-- column in a ColumnNullable that adds a separate null mask.
-- Operations:
-- - filter(mask) → Selection
-- - slice(offset, length) → Subset
-- - replicate(counts) → Repetition
-- - +, -, *, / → New column with result
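A toy Python equivalent of those column operations (the real implementation is C++ templates over packed arrays; this only models the interface):

```python
class ColumnUInt64:
    """Toy model of a typed column with the operations listed above."""

    def __init__(self, data):
        self.data = list(data)

    def filter(self, mask):
        # Selection: keep rows where the mask is truthy
        return ColumnUInt64(v for v, keep in zip(self.data, mask) if keep)

    def slice(self, offset, length):
        # Subset: contiguous range of rows
        return ColumnUInt64(self.data[offset:offset + length])

    def replicate(self, counts):
        # Repetition: row i appears counts[i] times
        return ColumnUInt64(v for v, n in zip(self.data, counts)
                            for _ in range(n))

    def __add__(self, other):
        # Arithmetic always produces a brand-new column
        return ColumnUInt64(a + b for a, b in zip(self.data, other.data))
```

Every operation returns a fresh column, which is what makes the execution pipeline easy to parallelize: columns are immutable values flowing between processors.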
Distributed Processing
Sharding
-- Sharding key determines data distribution
CREATE TABLE events (
id UInt64,
user_id UInt32,
event_name String
) ENGINE = Distributed(
cluster_3_shards,
database,
events_local,
sipHash64(user_id) -- Shard by user_id
);
-- Data flow:
-- 1. Calculate shard: hash(user_id) % num_shards
-- 2. Route query to appropriate shard
-- 3. Aggregate results
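The routing step can be sketched in a few lines (Python stand-in: ClickHouse uses sipHash64 and honors per-shard weights, both simplified away here; the hash function below is an arbitrary stand-in):

```python
def shard_for(sharding_key_value, num_shards, hash_fn):
    # Hash the sharding key, then take it modulo the shard count
    return hash_fn(sharding_key_value) % num_shards

# Knuth-style multiplicative hash as a deterministic stand-in for sipHash64
toy_hash = lambda x: x * 2654435761 % 2**32

# All events of a given user always land on the same shard
shard_a = shard_for(42, 3, toy_hash)
shard_b = shard_for(42, 3, toy_hash)
```

Keeping one user's rows on one shard is what lets per-user aggregations run shard-locally without a network shuffle.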
Replication
-- ReplicatedMergeTree maintains copies
-- ZooKeeper/Keeper coordinates
-- Write path:
-- 1. Write to local part
-- 2. Add entry to ZooKeeper queue
-- 3. Other replicas pull and apply
-- Read path:
-- 1. Read from local (preferred)
-- 2. Or read from replica if local unavailable
Materialized Views
Implementation
-- Materialized view structure
CREATE MATERIALIZED VIEW hourly_stats
ENGINE = SummingMergeTree()
ORDER BY (event_name, hour)
AS SELECT
event_name,
toStartOfHour(timestamp) as hour,
count() as cnt,
sum(value) as total_value
FROM events
GROUP BY event_name, hour;
-- How it works:
-- 1. An INSERT into the source table triggers the view
-- 2. The view's SELECT runs over the newly inserted block only
-- 3. The result is written to the view's target table
-- (It is an insert trigger, not a background watcher; pre-existing
--  data is not backfilled automatically.)
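That insert-trigger behavior can be modeled like this (schematic Python; `transform` stands in for the view's SELECT, and the tables are plain lists):

```python
def insert(source_table, mv_target, block, transform):
    # Writing a block to the source table fires the view's SELECT
    # over that block only; the result lands in the target table
    source_table.extend(block)
    mv_target.extend(transform(block))

events, hourly = [], []

def count_by_name(block):
    # Stand-in for "SELECT name, count() ... GROUP BY name"
    return [(n, sum(1 for b in block if b == n))
            for n in sorted(set(block))]

insert(events, hourly, ["click", "click", "view"], count_by_name)
```

Because the view only ever sees freshly inserted blocks, a SummingMergeTree target is used to fold partial aggregates together during background merges.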
Performance Characteristics
Why ClickHouse is Fast
- Columnar Storage: Less I/O for analytical queries
- Vectorized Execution: SIMD processing
- Compression: Less memory/bandwidth
- Primary Key Index: Efficient data skipping
- Parallel Processing: Multi-core utilization
- Skip Indexes: Additional data pruning
- MPP Architecture: Distributed processing
Benchmark Results
-- Typical performance (rough orders of magnitude; heavily hardware- and schema-dependent):
-- - 1B row aggregation: < 5 seconds
-- - 10B row scan: < 30 seconds
-- - Point query on 1B rows: < 100ms
Best Practices
Table Design
-- Order by columns that are:
-- 1. Commonly filtered (=, IN)
-- 2. Commonly sorted
-- 3. Lower-cardinality columns placed first
-- Good
ORDER BY (event_name, timestamp, user_id)
-- Partition by time for TTL and efficient scans
PARTITION BY toYYYYMM(timestamp)
Query Optimization
-- Filter on partition key first
SELECT * FROM events WHERE timestamp >= '2026-01-01' AND event_name = 'click';
-- Use a projection for common query patterns
ALTER TABLE events ADD PROJECTION daily_counts (
    SELECT event_name, toDate(timestamp), count()
    GROUP BY event_name, toDate(timestamp)
);
-- Limit columns
SELECT event_name, count() FROM events GROUP BY event_name;
Common Pitfalls
Pitfall 1: Wrong Primary Key
-- Bad: High cardinality first
ORDER BY (user_id, timestamp)
-- Good: Filter columns first
ORDER BY (event_name, timestamp)
Pitfall 2: Too Many Partitions
-- Bad: Too fine-grained
PARTITION BY toYYYYMMDD(timestamp)
-- Good: Monthly partitions
PARTITION BY toYYYYMM(timestamp)
Pitfall 3: Not Using Batch Inserts
-- Bad: Row-by-row
INSERT INTO events VALUES (1, 'a');
-- Good: Batch insert
INSERT INTO events VALUES (1,'a'),(2,'b'),...;
Conclusion
ClickHouse’s internal architecture combines columnar storage, vectorized execution, and efficient indexing to deliver exceptional analytical performance. Understanding these internals helps optimize queries and design efficient tables.
In the next article, we’ll explore ClickHouse trends in 2025-2026, including new features, AI integrations, and cloud developments.