Introduction
Understanding ClickHouse’s internal architecture helps developers and DBAs optimize queries, design efficient tables, and troubleshoot performance issues. ClickHouse’s unique design prioritizes analytical query performance through columnar storage and vectorized processing.
This article explores ClickHouse’s internal mechanisms, including the MergeTree storage engine, query processing pipeline, and the architectural decisions that enable its exceptional performance.
Architecture Overview
System Architecture
┌──────────────────────────────────────────────────────────────┐
│                   ClickHouse Architecture                    │
├──────────────────────────────────────────────────────────────┤
│                         Client Layer                         │
│   ┌──────────────┐  ┌──────────────┐  ┌─────────────────┐    │
│   │ HTTP Client  │  │  TCP Client  │  │ Native Protocol │    │
│   └──────────────┘  └──────────────┘  └─────────────────┘    │
├──────────────────────────────────────────────────────────────┤
│                    Query Processing Layer                    │
│   ┌──────────────┐  ┌──────────────┐  ┌─────────────────┐    │
│   │    Parser    │  │  Optimizer   │  │   Interpreter   │    │
│   └──────────────┘  └──────────────┘  └─────────────────┘    │
├──────────────────────────────────────────────────────────────┤
│                    Core Execution Engine                     │
│   ┌──────────────────────────────────────────────────────┐   │
│   │       Vectorized Query Pipeline (Processors)         │   │
│   └──────────────────────────────────────────────────────┘   │
├──────────────────────────────────────────────────────────────┤
│                     Storage Engine Layer                     │
│   ┌──────────────────────────────────────────────────────┐   │
│   │  MergeTree Family (MergeTree, ReplicatedMergeTree)   │   │
│   └──────────────────────────────────────────────────────┘   │
├──────────────────────────────────────────────────────────────┤
│                      File System Layer                       │
└──────────────────────────────────────────────────────────────┘
Query Processing Flow
SQL Query
      │
      ▼
┌─────────────┐
│   Parser    │  1. Parse SQL to AST
└─────────────┘
      │
      ▼
┌─────────────┐
│  Analyzer   │  2. Name resolution, type checking
└─────────────┘
      │
      ▼
┌─────────────┐
│  Optimizer  │  3. Query plan optimization
└─────────────┘
      │
      ▼
┌─────────────┐
│ Interpreter │  4. Build pipeline
└─────────────┘
      │
      ▼
┌─────────────┐
│  Pipeline   │  5. Vectorized execution
└─────────────┘
      │
      ▼
Result Set
MergeTree Storage Engine
Storage Structure
ClickHouse stores data in immutable parts, where each part contains rows sorted by the ORDER BY key. Part directories are named {partition}_{min_block}_{max_block}_{level}:
/var/lib/clickhouse/data/database/table/
├── 202601_1_10_2/        ← one data part
│   ├── checksums.txt     ← file checksums
│   ├── columns.txt       ← column names and types
│   ├── count.txt         ← row count
│   ├── primary.idx       ← sparse primary index
│   ├── {column1}.bin     ← compressed column data
│   ├── {column1}.mrk2    ← marks (file offsets per granule)
│   ├── {column2}.bin
│   ├── {column2}.mrk2
│   └── ...
(Wide format shown; small parts may use the compact format, which stores all columns in a single data.bin.)
Data Storage Format
Column File (.bin):
┌─────────────────────────────────┐
│           Column Data           │
│   ┌──────┬──────┬──────┬─────┐  │
│   │ Row1 │ Row2 │ Row3 │ ... │  │
│   └──────┴──────┴──────┴─────┘  │
└─────────────────────────────────┘

Primary Index (primary.idx):
┌─────────────────────────────────┐
│        Primary Key Index        │
│  ┌───────┐ ┌───────┐ ┌───────┐  │
│  │ Key 1 │ │ Key 2 │ │ Key 3 │  │
│  └───────┘ └───────┘ └───────┘  │
└─────────────────────────────────┘
Merge Process
-- Background merge process:
-- 1. Select parts to merge (similar partition, similar size)
-- 2. Read data from source parts
-- 3. Sort and merge
-- 4. Write to new part
-- 5. Replace old parts
-- Manual merge
OPTIMIZE TABLE events FINAL;
-- Check merge progress
SELECT
partition,
sum(rows) as rows,
count() as parts
FROM system.parts
WHERE active AND table = 'events'
GROUP BY partition;
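The five merge steps above amount to a streaming k-way merge of already-sorted parts. A toy Python sketch of the idea (not ClickHouse's actual C++ implementation; the part contents are invented):

```python
import heapq

def merge_parts(parts):
    """Merge several data parts into one new part.

    Each part is a list of rows already sorted by the ORDER BY key,
    so merging is a streaming k-way merge: read from all source
    parts, emit one sorted result, then drop the old parts.
    """
    return list(heapq.merge(*parts))

# Three small parts, each sorted by key
old_parts = [[1, 4, 7], [2, 5], [3, 6, 8]]
new_part = merge_parts(old_parts)  # replaces the old parts on disk
```

Because every input is sorted, the merge never needs to re-sort the full data set; it only keeps one cursor per part.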
Columnar Storage
Data Layout
Row-oriented (Traditional):
┌──────┬────────┬────────────┐
│ name │ salary │ department │
├──────┼────────┼────────────┤
│ John │ 50000  │ Eng        │
│ Jane │ 60000  │ Sales      │
│ Bob  │ 70000  │ Eng        │
└──────┴────────┴────────────┘

Column-oriented (ClickHouse):
┌───────────────────────────────┐
│ name: [John, Jane, Bob]       │
├───────────────────────────────┤
│ salary: [50000, 60000, 70000] │
├───────────────────────────────┤
│ department: [Eng, Sales, Eng] │
└───────────────────────────────┘
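A toy model of why the columnar layout wins for analytics: an aggregate over one column reads a single contiguous array and skips the others entirely (illustrative Python using the sample rows above):

```python
# The same three rows in both layouts
rows = [("John", 50000, "Eng"), ("Jane", 60000, "Sales"), ("Bob", 70000, "Eng")]
columns = {
    "name": ["John", "Jane", "Bob"],
    "salary": [50000, 60000, 70000],
    "department": ["Eng", "Sales", "Eng"],
}

def avg_salary_rows(rows):
    # Row store: every full row is read just to pick out one field
    return sum(r[1] for r in rows) / len(rows)

def avg_salary_columns(cols):
    # Column store: only the salary array is touched; name and
    # department are never read from disk at all
    sal = cols["salary"]
    return sum(sal) / len(sal)
```

Both return the same answer, but the columnar version's I/O scales with one column, not the full row width.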
Compression
ClickHouse uses multiple compression algorithms:
-- Compression codecs
-- LZ4: Default, fast
-- ZSTD: Better compression, slower
-- Delta: For ordered data
-- T64: For numeric data
-- Specify codec
CREATE TABLE compressed (
id UInt64,
value Float64 CODEC(Delta, ZSTD)
) ENGINE = MergeTree()
ORDER BY id;
-- Default codec
-- Without a CODEC clause, columns use LZ4 (or the table/server default)
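A minimal sketch of the idea behind the Delta codec: ordered data is rewritten as small consecutive differences, which a general-purpose codec (LZ4/ZSTD) then compresses far better. Toy Python, not the real codec implementation:

```python
def delta_encode(values):
    # Store the first value, then each difference from its predecessor
    return values[:1] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    # A running sum reverses the encoding exactly
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out

# Monotonic timestamps become mostly tiny, repetitive integers
timestamps = [1700000000, 1700000001, 1700000003, 1700000003]
encoded = delta_encode(timestamps)
```

This is why CODEC(Delta, ZSTD) is a common pairing for ordered numeric columns: Delta shrinks the value range, ZSTD then squeezes the repetition.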
Query Processing
Vectorized Execution
ClickHouse processes data in vectors (batches), not row-by-row:
// Simplified vectorized processing
for (batch in data) {
// Process entire batch with SIMD
result = batch.column1 + batch.column2; // Vectorized add
}
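In runnable form, the same idea looks like this: operate on whole column chunks at a time so per-row overhead is amortized. This is a pure-Python stand-in; real ClickHouse does this with SIMD loops over typed arrays, and the block size below merely echoes the max_block_size setting:

```python
def add_columns(col_a, col_b):
    # One tight loop over an entire chunk, not a function call per row
    return [a + b for a, b in zip(col_a, col_b)]

def sum_in_blocks(column, block_size=65536):
    # Process the column in fixed-size blocks (cf. max_block_size);
    # each block fits in cache and would be a SIMD loop in C++
    total = 0
    for start in range(0, len(column), block_size):
        total += sum(column[start:start + block_size])
    return total
```

The payoff is that branch and dispatch costs are paid once per block of thousands of rows instead of once per row.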
Query Pipeline
-- Example: Simple aggregation
SELECT event_name, count() FROM events GROUP BY event_name
-- Pipeline stages (only those the query needs are built):
-- 1. Source  → Read from column files
-- 2. Filter  → Apply WHERE conditions
-- 3. GroupBy → Hash aggregation
-- 4. Sort    → ORDER BY
-- 5. Limit   → Apply LIMIT
-- 6. Sink    → Return results
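The stages compose naturally as pull-based steps, which is roughly how processors connect in the pipeline. A schematic Python sketch with invented sample events (the real pipeline passes column blocks, not single rows):

```python
def run_pipeline(rows):
    stream = iter(rows)                               # 1. source
    stream = (r for r in stream if r != "ignored")    # 2. filter (WHERE)
    counts = {}                                       # 3. hash aggregation
    for name in stream:
        counts[name] = counts.get(name, 0) + 1
    return dict(sorted(counts.items()))               # 4. sort, then sink

result = run_pipeline(["click", "view", "click", "ignored"])
```

Each stage only pulls what the next stage demands, so filters and limits naturally short-circuit upstream work.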
Parallel Execution
-- ClickHouse uses multiple threads per query
-- 1. Split data by parts
-- 2. Process each part in parallel
-- 3. Merge results
-- Control parallelism
SET max_threads = 8;
-- Check query threads
EXPLAIN PIPELINE
SELECT count() FROM events;
Indexing
Primary Index
-- Primary index structure (primary.idx)
-- Contains first key value of every index_granularity (default 8192) rows
-- Example: Primary key (event_name, timestamp)
-- Index contains:
-- event1, 2026-01-01 00:00:00
-- event1, 2026-01-01 08:00:00
-- event2, 2026-01-01 12:00:00
-- ...
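Because the index is sparse, a lookup binary-searches the per-granule first keys and returns only the granules that might contain the value; everything else is skipped without being read. A toy Python sketch of that pruning step (illustrative, with invented marks):

```python
import bisect

def candidate_granules(index_keys, key):
    """index_keys[i] is the first primary-key value of granule i
    (each granule covers index_granularity rows, 8192 by default).
    Return the granule numbers that may contain `key`."""
    # The granule just before the first mark >= key may still hold it
    lo = max(bisect.bisect_left(index_keys, key) - 1, 0)
    # Granules whose first key is > key cannot match an equality lookup
    hi = bisect.bisect_right(index_keys, key) - 1
    return list(range(lo, hi + 1))

# Sparse index over 4 granules (~32768 rows): only first keys are kept
marks = ["event1", "event1", "event2", "event3"]
```

With ~8192 rows per mark, a billion-row part needs only ~120K index entries, small enough to stay in memory.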
Secondary Indexes
-- Skip index (bloom filter)
ALTER TABLE events ADD INDEX idx_name event_name
TYPE bloom_filter
GRANULARITY 4;
-- MinMax index (for ranges)
ALTER TABLE events ADD INDEX idx_timestamp timestamp
TYPE minmax
GRANULARITY 4;
-- Set index (for exact values)
ALTER TABLE events ADD INDEX idx_status status
TYPE set(1000)
GRANULARITY 1;
-- Note: ADD INDEX applies only to new parts; run
-- ALTER TABLE events MATERIALIZE INDEX idx_status
-- to build the index over existing data.
Memory Management
Memory Pools
-- ClickHouse uses multiple memory pools:
-- 1. Query memory
-- 2. Background pool (merges)
-- 3. System pool (dictionaries)
-- Control query memory (values in bytes)
SET max_memory_usage = 16000000000;  -- ~16 GB
-- Spill large aggregations/sorts to disk beyond a threshold
SET max_bytes_before_external_group_by = 4000000000;  -- ~4 GB
SET max_bytes_before_external_sort = 4000000000;
-- Check memory usage
SELECT
query,
formatReadableSize(memory_usage) as memory
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY memory_usage DESC
LIMIT 10;
Data Types Implementation
Type Categories
-- Numeric: Int8-Int256, UInt8-UInt256
-- Float: Float32, Float64
-- String: String, FixedString
-- Date: Date, Date32, DateTime, DateTime64
-- Array: Array(T)
-- Tuple: Tuple(T1, T2, ...)
-- Nested: Nested(...)
-- Nullable: Nullable(T)
-- LowCardinality: LowCardinality(T)
Column Implementation
-- Each data type has specialized Column implementation
-- Example: UInt64 column
-- ┌──────────────────────────────────────┐
-- │ ColumnUInt64 (ColumnVector<UInt64>)  │
-- │  └─ data: contiguous UInt64 array    │
-- └──────────────────────────────────────┘
-- Note: NULLs are not stored here; Nullable(UInt64) wraps this
-- column in a ColumnNullable that adds a separate null mask.
-- Operations:
-- - filter(mask) → Selection
-- - slice(offset, length) → Subset
-- - replicate(counts) → Repetition
-- - +, -, *, / → New column with result
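A toy Python equivalent of those column operations (the real implementation is C++ templates over packed arrays; this only models the interface):

```python
class ColumnUInt64:
    """Toy model of a typed column with the operations listed above."""

    def __init__(self, data):
        self.data = list(data)

    def filter(self, mask):
        # Selection: keep rows where the mask is truthy
        return ColumnUInt64(v for v, keep in zip(self.data, mask) if keep)

    def slice(self, offset, length):
        # Subset: contiguous range of rows
        return ColumnUInt64(self.data[offset:offset + length])

    def replicate(self, counts):
        # Repetition: row i appears counts[i] times
        return ColumnUInt64(v for v, n in zip(self.data, counts)
                            for _ in range(n))

    def __add__(self, other):
        # Arithmetic always produces a brand-new column
        return ColumnUInt64(a + b for a, b in zip(self.data, other.data))
```

Every operation returns a fresh column, which is what makes the execution pipeline easy to parallelize: columns are immutable values flowing between processors.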
Distributed Processing
Sharding
-- Sharding key determines data distribution
CREATE TABLE events (
id UInt64,
user_id UInt32,
event_name String
) ENGINE = Distributed(
cluster_3_shards,
database,
events_local,
sipHash64(user_id) -- Shard by user_id
);
-- Data flow:
-- 1. Calculate shard: hash(user_id) % num_shards
-- 2. Route query to appropriate shard
-- 3. Aggregate results
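The routing step can be sketched in a few lines (Python stand-in: ClickHouse uses sipHash64 and honors per-shard weights, both simplified away here; the hash function below is an arbitrary stand-in):

```python
def shard_for(sharding_key_value, num_shards, hash_fn):
    # Hash the sharding key, then take it modulo the shard count
    return hash_fn(sharding_key_value) % num_shards

# Knuth-style multiplicative hash as a deterministic stand-in for sipHash64
toy_hash = lambda x: x * 2654435761 % 2**32

# All events of a given user always land on the same shard
shard_a = shard_for(42, 3, toy_hash)
shard_b = shard_for(42, 3, toy_hash)
```

Keeping one user's rows on one shard is what lets per-user aggregations run shard-locally without a network shuffle.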
Replication
-- ReplicatedMergeTree maintains copies
-- ZooKeeper/Keeper coordinates
-- Write path:
-- 1. Write to local part
-- 2. Add entry to ZooKeeper queue
-- 3. Other replicas pull and apply
-- Read path:
-- 1. Read from local (preferred)
-- 2. Or read from replica if local unavailable
Materialized Views
Implementation
-- Materialized view structure
CREATE MATERIALIZED VIEW hourly_stats
ENGINE = SummingMergeTree()
ORDER BY (event_name, hour)
AS SELECT
event_name,
toStartOfHour(timestamp) as hour,
count() as cnt,
sum(value) as total_value
FROM events
GROUP BY event_name, hour;
-- How it works:
-- 1. An INSERT into the source table triggers the view
-- 2. The view's SELECT runs over the newly inserted block only
-- 3. The result is written to the view's target table
-- (It is an insert trigger, not a background watcher; pre-existing
--  data is not backfilled automatically.)
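That insert-trigger behavior can be modeled like this (schematic Python; `transform` stands in for the view's SELECT, and the tables are plain lists):

```python
def insert(source_table, mv_target, block, transform):
    # Writing a block to the source table fires the view's SELECT
    # over that block only; the result lands in the target table
    source_table.extend(block)
    mv_target.extend(transform(block))

events, hourly = [], []

def count_by_name(block):
    # Stand-in for "SELECT name, count() ... GROUP BY name"
    return [(n, sum(1 for b in block if b == n))
            for n in sorted(set(block))]

insert(events, hourly, ["click", "click", "view"], count_by_name)
```

Because the view only ever sees freshly inserted blocks, a SummingMergeTree target is used to fold partial aggregates together during background merges.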
Performance Characteristics
Why ClickHouse is Fast
- Columnar Storage: Less I/O for analytical queries
- Vectorized Execution: SIMD processing
- Compression: Less memory/bandwidth
- Primary Key Index: Efficient data skipping
- Parallel Processing: Multi-core utilization
- Skip Indexes: Additional data pruning
- MPP Architecture: Distributed processing
Benchmark Results
-- Typical performance (rough orders of magnitude; heavily hardware- and schema-dependent):
-- - 1B row aggregation: < 5 seconds
-- - 10B row scan: < 30 seconds
-- - Point query on 1B rows: < 100ms
Best Practices
Table Design
-- Order by columns that are:
-- 1. Commonly filtered (=, IN)
-- 2. Commonly sorted
-- 3. Lower-cardinality columns placed first
-- Good
ORDER BY (event_name, timestamp, user_id)
-- Partition by time for TTL and efficient scans
PARTITION BY toYYYYMM(timestamp)
Query Optimization
-- Filter on partition key first
SELECT * FROM events WHERE timestamp >= '2026-01-01' AND event_name = 'click';
-- Use a projection for common query patterns
ALTER TABLE events ADD PROJECTION daily_counts (
    SELECT event_name, toDate(timestamp), count()
    GROUP BY event_name, toDate(timestamp)
);
-- Limit columns
SELECT event_name, count() FROM events GROUP BY event_name;
Common Pitfalls
Pitfall 1: Wrong Primary Key
-- Bad: High cardinality first
ORDER BY (user_id, timestamp)
-- Good: Filter columns first
ORDER BY (event_name, timestamp)
Pitfall 2: Too Many Partitions
-- Bad: Too fine-grained
PARTITION BY toYYYYMMDD(timestamp)
-- Good: Monthly partitions
PARTITION BY toYYYYMM(timestamp)
Pitfall 3: Not Using Batch Inserts
-- Bad: Row-by-row
INSERT INTO events VALUES (1, 'a');
-- Good: Batch insert
INSERT INTO events VALUES (1,'a'),(2,'b'),...;
Conclusion
ClickHouse’s internal architecture combines columnar storage, vectorized execution, and efficient indexing to deliver exceptional analytical performance. Understanding these internals helps optimize queries and design efficient tables.
In the next article, we’ll explore ClickHouse trends in 2025-2026, including new features, AI integrations, and cloud developments.