Skip to main content

ClickHouse: The Complete Guide to Columnar Analytics Database

Published: March 5, 2026 Updated: May 23, 2026 Larry Qu 17 min read

Introduction

ClickHouse is an open-source column-oriented database management system that enables real-time analytical data processing. Known for exceptional query performance on massive datasets, ClickHouse powers analytics at major companies like Cloudflare, Spotify, and eBay.

In 2026, ClickHouse continues to lead the OLAP space. Recent releases (25.x through 26.1) have brought lightweight UPDATEs, GA for vector similarity search, a native Parquet reader, automatic join reordering, and the JSON data type. The company raised billions in funding and expanded beyond analytics into AI workloads.

This guide covers ClickHouse fundamentals — data types, installation, SQL dialect, table engines, materialized views, projections, dictionaries, time-series functions, and production best practices. See the companion articles on operations, internal architecture, and use cases.

What is ClickHouse?

ClickHouse is a column-oriented DBMS designed for Online Analytical Processing (OLAP). It excels at processing complex queries on petabyte-scale data with sub-second latency.

Key Characteristics

  • Columnar Storage: Data stored by columns, not rows — enables high compression and fast scans
  • Vectorized Execution: Processes data in batches using SIMD CPU instructions
  • Distributed Architecture: Scale horizontally across multiple nodes
  • Compression: Highly efficient compression (5-10x typical) through columnar layout and codecs
  • Real-time Ingestion: Supports high-throughput data loading with sub-second visibility
  • SQL Support: Familiar SQL dialect with ClickHouse-specific extensions
  • MPP Architecture: Massively Parallel Processing across CPU cores and nodes

ClickHouse vs Other Databases

Aspect ClickHouse PostgreSQL Elasticsearch
Type Columnar OLAP Row-based OLTP Search engine
Query Speed Milliseconds Seconds Seconds
Data Size PB scale TB scale PB scale
Joins Good (since 25.x) Full Limited
SQL Support Extensive Full Limited
Transactions None Full ACID Document-level
UPDATE/DELETE Async mutations / lightweight Standard Standard
Compression 5-10x 1-3x 1-2x

ClickHouse Data Types

Numeric Types

-- Integer types
CREATE TABLE integers (
    a UInt8,     -- 0 to 255
    b UInt16,    -- 0 to 65535
    c UInt32,    -- 0 to 4294967295
    d UInt64,    -- 0 to 2^64-1
    e Int8,      -- -128 to 127
    f Int32,     -- -2^31 to 2^31-1
    g Int64,     -- -2^63 to 2^63-1
    h Int128,    -- for very large integers
    i Int256     -- for astronomical values
);

-- Use UInt over Int when values are always positive (more efficient)

-- Floating-point
CREATE TABLE measurements (
    temperature Float32,   -- ~7 decimal digits
    pressure Float64       -- ~15 decimal digits
);

-- Float has precision issues — use Decimal for financial data
SELECT 1 - 0.9;  -- 0.09999999999999998

-- Decimal for precision
CREATE TABLE financials (
    revenue Decimal(15, 2),  -- 15 total digits, 2 after decimal
    tax_rate Decimal(5, 4)
);

String Types

-- String is variable-length, no encoding enforced
CREATE TABLE text_data (
    name String,
    description String
);

-- FixedString(n) for fixed-size data (more efficient for known lengths)
CREATE TABLE codes (
    iso_code FixedString(2),  -- Exactly 2 bytes
    sha256_hash FixedString(32)
);

Date and Time Types

CREATE TABLE events (
    event_date Date,               -- 2 bytes, range 1970-2149
    event_datetime DateTime,       -- 4 bytes, second precision
    event_datetime64 DateTime64(3), -- configurable sub-second precision
    event_date32 Date32            -- extended range, 4 bytes
);

-- Insert examples
INSERT INTO events VALUES (
    '2026-05-23',
    '2026-05-23 14:30:00',
    toDateTime64('2026-05-23 14:30:00.123', 3),
    '2026-05-23'
);

-- Time extraction
SELECT
    toYear(event_date) AS year,
    toMonth(event_date) AS month,
    toDayOfMonth(event_date) AS day,
    toHour(event_datetime) AS hour,
    toDayOfWeek(event_date) AS weekday
FROM events;

Array Types

CREATE TABLE arrays (
    id UInt64,
    scores Array(Int32),
    tags Array(String),
    nested Array(Array(Float64))
);

INSERT INTO arrays VALUES (1, [95, 87, 92], ['sql', 'python'], [[1.5, 2.3]]);

-- Array functions
SELECT arrayCount(x -> x > 90, [95, 87, 92]) AS above_90;  -- 2
SELECT arraySum([10, 20, 30]) AS total;                     -- 60
SELECT arraySort([3, 1, 2]);                                 -- [1, 2, 3]
SELECT arrayFilter(x -> x > 1, [1, 2, 3]);                  -- [2, 3]
SELECT arrayMap(x -> x * 2, [1, 2, 3]);                     -- [2, 4, 6]
SELECT arrayJoin([1, 2, 3]) AS num;                          -- expands rows: 1, 2, 3

Tuple and Nested Types

-- Tuple groups related values of different types
CREATE TABLE locations (
    id UInt64,
    coordinates Tuple(lat Float64, lng Float64),
    address Tuple(street String, city String, zip String)
);

INSERT INTO locations VALUES (1, (40.7128, -74.0060), ('Broadway', 'NYC', '10001'));

-- Access tuple elements by index (1-based) or name
SELECT coordinates.1 AS lat, coordinates.2 AS lng FROM locations;
SELECT address.city AS city FROM locations;

-- Nested type: an array of structs within a column
CREATE TABLE user_actions (
    user_id UInt64,
    actions Nested(
        action_type String,
        timestamp DateTime,
        duration UInt32
    )
);

INSERT INTO user_actions VALUES (1,
    ['page_view', 'click', 'purchase'],
    ['2026-05-23 10:00:00', '2026-05-23 10:01:00', '2026-05-23 10:05:00'],
    [30, 5, 120]
);

-- Querying nested columns
SELECT user_id, actions.action_type, actions.timestamp
FROM user_actions
ARRAY JOIN actions;

Nullable and LowCardinality

-- Nullable for optional values
CREATE TABLE profiles (
    id UInt64,
    name String,
    email Nullable(String),   -- can be NULL
    phone Nullable(String)
);

INSERT INTO profiles VALUES (1, 'Alice', '[email protected]', NULL);

-- LowCardinality optimizes columns with few distinct values (< 10k)
-- Uses dictionary encoding: less storage, faster scans
CREATE TABLE events (
    timestamp DateTime,
    country LowCardinality(String),  -- ~200 distinct values
    browser LowCardinality(String),  -- ~20 distinct values
    os LowCardinality(String)        -- ~10 distinct values
) ENGINE = MergeTree()
ORDER BY timestamp;

JSON Data Type

ClickHouse supports native JSON as a first-class column type (GA since 25.x). It automatically parses JSON at insert time and stores sub-keys as typed subcolumns.

-- Enable JSON type (default in recent versions)
SET allow_experimental_object_type = 1;

CREATE TABLE api_logs (
    timestamp DateTime,
    payload JSON
) ENGINE = MergeTree()
ORDER BY timestamp;

INSERT INTO api_logs VALUES
    (now(), '{"user": "alice", "action": "login", "ip": "10.0.0.1"}'),
    (now(), '{"user": "bob", "action": "purchase", "amount": 49.99, "items": 3}');

-- Query JSON sub-keys directly
SELECT
    payload.user,
    payload.action,
    payload.amount
FROM api_logs
WHERE payload.action = 'purchase';

-- Type coercion and path discovery
SELECT payload.amount, toTypeName(payload.amount) FROM api_logs;

ClickHouse SQL Dialect

ClickHouse SQL looks familiar but has important differences from standard SQL.

Key Differences from PostgreSQL

Feature PostgreSQL ClickHouse
UPDATE/DELETE Standard, fast Mutations (async, heavy) or lightweight (25.7+)
Transactions Full ACID None
JOINs Full support Good, prefer dictionary for key lookups
CTEs Recursive supported Non-recursive only
Subqueries Full Full, but correlated limited
Stored procedures Yes (PL/pgSQL) No
Triggers Yes No (use materialized views)
Full-text search tsvector ngram/token bloom filters
Array operations Limited Excellent (first-class type)
Approximate aggregates Extensions Built-in (quantile, uniq)
Window functions Full Full (since 21.x)

Mutations: How ClickHouse Handles Updates

Traditional UPDATE/DELETE using mutations is asynchronous and rewrites entire data parts.

-- Mutation: async, rewrites affected parts
ALTER TABLE events UPDATE status = 'processed' WHERE created_at < '2026-01-01';
ALTER TABLE events DELETE WHERE created_at < '2025-01-01';

-- Check mutation progress
SELECT * FROM system.mutations WHERE table = 'events';

-- Cancel a stuck mutation
KILL MUTATION WHERE database = 'default' AND table = 'events';

Since version 25.7, ClickHouse supports lightweight UPDATE using a patch-part mechanism — it writes only the modified values plus metadata, rather than rewriting entire columns. This delivers up to 2,400x speedup for single-row updates.

-- Lightweight UPDATE (25.7+, enabled by default in 25.8)
SET apply_lightweight_updates = 1;
UPDATE events SET status = 'archived' WHERE id = 42;

Lightweight DELETE has been GA since version 23.3.

-- Lightweight DELETE (23.3+, GA)
DELETE FROM events WHERE id = 42;

ClickHouse Installation

# Single-node development instance
docker run -d \
    --name clickhouse-server \
    --ulimit nofile=262144 \
    -p 8123:8123 \
    -p 9000:9000 \
    clickhouse/clickhouse-server

# With persistent storage
docker run -d \
    --name clickhouse-server \
    -v clickhouse_data:/var/lib/clickhouse \
    -p 8123:8123 \
    -p 9000:9000 \
    clickhouse/clickhouse-server

# Connect via native client
docker exec -it clickhouse-server clickhouse-client

Linux Package Installation

# Ubuntu/Debian
sudo apt-get install -y apt-transport-https ca-certificates dirmngr
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 8919F6BD2B48D754
echo "deb https://packages.clickhouse.com/deb stable main" | sudo tee /etc/apt/sources.list.d/clickhouse.list
sudo apt-get update
sudo apt-get install -y clickhouse-server clickhouse-client

# Start the server
sudo service clickhouse-server start

# Connect
clickhouse-client

Python Integration

pip install clickhouse-connect
import clickhouse_connect

client = clickhouse_connect.get_client(
    host='localhost',
    port=8123,
    username='default',
    password=''
)

client.command("""
    CREATE TABLE IF NOT EXISTS page_views (
        id UInt64,
        url String,
        user_id UInt32,
        timestamp DateTime,
        duration Float64
    ) ENGINE = MergeTree()
    ORDER BY (url, timestamp)
""")

client.insert('page_views', [
    [1, '/home', 101, '2026-05-23 10:00:00', 45.2],
    [2, '/pricing', 102, '2026-05-23 10:05:00', 120.0],
], column_names=['id', 'url', 'user_id', 'timestamp', 'duration'])

result = client.query("""
    SELECT url, count() AS views, avg(duration) AS avg_duration
    FROM page_views
    GROUP BY url
    ORDER BY views DESC
""")
print(result.result_rows)
client.close()

Table Engines

MergeTree Family

MergeTree is the primary table engine family. All production tables use some variant.

-- Standard MergeTree
CREATE TABLE events (
    id UInt64,
    event_name String,
    user_id UInt32,
    timestamp DateTime,
    value Float64
) ENGINE = MergeTree()
ORDER BY (event_name, timestamp)
PARTITION BY toYYYYMM(timestamp)
TTL timestamp + INTERVAL 3 MONTH;

-- ReplacingMergeTree: deduplicates rows with the same ORDER BY key
-- The version column determines which row to keep (higher = newer)
CREATE TABLE latest_events (
    id UInt64,
    version UInt32,
    status String,
    data String
) ENGINE = ReplacingMergeTree(version)
ORDER BY id;

INSERT INTO latest_events VALUES (1, 1, 'pending', 'data_v1');
INSERT INTO latest_events VALUES (1, 2, 'processed', 'data_v2');

SELECT * FROM latest_events FINAL;
-- Only row with version=2 appears

-- SummingMergeTree: pre-aggregates numeric columns
CREATE TABLE daily_metrics (
    metric_date Date,
    metric_name String,
    value SummingMergeTree(Int64),
    count UInt64
) ENGINE = SummingMergeTree()
ORDER BY (metric_date, metric_name);

-- AggregatingMergeTree: stores intermediate aggregation states
CREATE TABLE hourly_stats (
    hour DateTime,
    metric_name String,
    avg_value AggregateFunction(avg, Float64),
    quantiles AggregateFunction(quantile(0.5, 0.95), Float64)
) ENGINE = AggregatingMergeTree()
ORDER BY (hour, metric_name);

Special Engines

-- Log family: lightweight, for small tables or staging data
CREATE TABLE log_table (
    id UInt64,
    data String
) ENGINE = Log;       -- TinyLog, Log, StripeLog

-- Memory: in-memory, data lost on restart
CREATE TABLE cache_table (
    key String,
    value String
) ENGINE = Memory;

-- Buffer: writes to memory then flushes to another table
CREATE TABLE buffer_events AS events
ENGINE = Buffer(currentDatabase(), events, 16, 10, 100, 10000, 1000000, 10000000, 100000000);

-- Distributed: sharded access across a cluster
CREATE TABLE distributed_events AS events
ENGINE = Distributed(cluster_name, database_name, events, sipHash64(user_id));

Integration Engines

-- MySQL: query a remote MySQL table directly
CREATE TABLE mysql_orders
ENGINE = MySQL('mysql-host:3306', 'ecommerce', 'orders', 'user', 'password');

-- PostgreSQL: query a remote PostgreSQL table
CREATE TABLE pg_users
ENGINE = PostgreSQL('pg-host:5432', 'analytics', 'users', 'user', 'password');

-- Kafka: consume from Kafka topics
CREATE TABLE kafka_stream
ENGINE = Kafka()
SETTINGS
    kafka_broker_list = 'localhost:9092',
    kafka_topic_list = 'user-events',
    kafka_group_name = 'clickhouse-consumer',
    kafka_format = 'JSONEachRow';

-- S3: query data in S3-compatible storage
CREATE TABLE s3_logs
ENGINE = S3('https://s3.amazonaws.com/bucket/logs/*.tsv', 'TSV');

-- URL: query an HTTP endpoint
CREATE TABLE api_data
ENGINE = URL('https://api.example.com/data', 'JSONEachRow');

-- File: read local files
CREATE TABLE file_data
ENGINE = File('CSV', '/data/export.csv');

Materialized Views

Materialized views are ClickHouse’s primary mechanism for incremental data transformation. Unlike PostgreSQL, ClickHouse materialized views are insert triggers — they fire on INSERT to the source table and push transformed data into a target table.

-- Source table
CREATE TABLE page_views (
    timestamp DateTime,
    user_id UInt64,
    page_url String,
    duration UInt32
) ENGINE = MergeTree()
ORDER BY (page_url, timestamp);

-- Target table for pre-aggregated results
CREATE TABLE daily_page_stats (
    day Date,
    page_url String,
    views UInt64,
    unique_users AggregateFunction(uniq, UInt64),
    avg_duration AggregateFunction(avg, UInt32)
) ENGINE = AggregatingMergeTree()
ORDER BY (day, page_url);

-- Materialized view: transforms data on insert
CREATE MATERIALIZED VIEW daily_page_stats_mv
TO daily_page_stats
AS SELECT
    toDate(timestamp) AS day,
    page_url,
    count() AS views,
    uniqState(user_id) AS unique_users,
    avgState(duration) AS avg_duration
FROM page_views
GROUP BY day, page_url;

Key rules:

  • Use the TO clause (never POPULATE on tables larger than a few million rows)
  • Each MV is an insert trigger — it does not retroactively see existing data
  • Materialized views can chain: the target of one MV can be the source of another
  • MVs cannot see UPDATE/DELETE on the source table

Projections

Projections let you define alternative sort orders or pre-computed aggregations within the same table. They are maintained automatically and used transparently by the query optimizer.

-- Add a projection to an existing table for common query patterns
ALTER TABLE page_views
ADD PROJECTION hourly_count (
    SELECT
        toStartOfHour(timestamp) AS hour,
        page_url,
        count() AS views
    GROUP BY hour, page_url
);

-- Materialize the projection (builds it for existing data)
ALTER TABLE page_views MATERIALIZE PROJECTION hourly_count;

-- Query the table — ClickHouse automatically uses the projection when beneficial
SELECT toStartOfHour(timestamp) AS hour, count() AS views
FROM page_views
WHERE timestamp >= now() - INTERVAL 7 DAY
GROUP BY hour;

Projections vs. Materialized Views:

Aspect Projections Materialized Views
Consistency Atomic with source table May lag (async trigger)
Query transparency Automatic optimizer selection Must query target table
JOIN support No Yes
Storage Stored within table parts Separate table
Maintenance Automatic with partition merges Manual target table management
Lightweight UPDATE Not supported Compatible

Dictionaries

Dictionaries provide in-memory key-value storage for fast lookups, eliminating JOIN overhead for dimension data.

-- Create a dictionary from a ClickHouse table
CREATE DICTIONARY country_lookup
(
    country_code String,
    country_name String,
    region String,
    is_eu UInt8
)
PRIMARY KEY country_code
SOURCE(CLICKHOUSE(TABLE 'country_dim'))
LAYOUT(HASHED())
LIFETIME(MIN 300 MAX 600);

-- Lookup a single value with dictGet
SELECT dictGet('country_lookup', 'region', 'US');

-- Use dictionary in a query (avoids JOIN)
SELECT
    e.user_id,
    dictGet('country_lookup', 'region', e.country_code) AS region,
    count() AS events
FROM events AS e
GROUP BY region;

-- Inspect dictionary content
SELECT * FROM dictionary('country_lookup') WHERE is_eu = 1;

Dictionaries support multiple source types: ClickHouse tables, PostgreSQL, MySQL, HTTP endpoints, and local files. The LAYOUT option determines the in-memory data structure — HASHED() for general use, FLAT() for small dictionaries, RANGE_HASHED() for time-range lookups.

Basic SQL Operations

Creating Tables

-- MergeTree with explicit partition and TTL
CREATE TABLE events (
    id UInt64,
    event_name String,
    user_id UInt32,
    timestamp DateTime,
    value Float64
) ENGINE = MergeTree()
ORDER BY (event_name, timestamp)
PARTITION BY toYYYYMM(timestamp)
TTL timestamp + INTERVAL 3 MONTH;

-- Table with sampling (for approximate queries)
CREATE TABLE sampled_events (
    id UInt64,
    event_name String,
    user_id UInt32,
    timestamp DateTime
) ENGINE = MergeTree()
ORDER BY (event_name, timestamp)
SAMPLE BY user_id;

-- Replicated table (requires ZooKeeper)
CREATE TABLE replicated_events (
    id UInt64,
    name String,
    value Float64
) ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{shard}/events',
    '{replica}',
    ORDER BY id
);

INSERT Operations

-- Single row
INSERT INTO events (id, event_name, user_id, timestamp, value)
VALUES (1, 'click', 100, '2026-05-23 10:00:00', 10.5);

-- Multiple rows
INSERT INTO events VALUES
    (2, 'view', 101, '2026-05-23 10:05:00', 5.5),
    (3, 'purchase', 102, '2026-05-23 10:10:00', 99.99);

-- From SELECT (1 million rows)
INSERT INTO events (event_name, user_id, timestamp, value)
SELECT 'auto_event', number, now(), rand() % 1000
FROM numbers(1000000);

-- Batch inserts of 1,000+ rows for best performance
-- Tiny batches create too many parts and trigger merges

SELECT Queries

-- Basic aggregation
SELECT
    event_name,
    count() AS event_count,
    uniq(user_id) AS unique_users,
    avg(value) AS avg_value,
    min(value) AS min_val,
    max(value) AS max_val,
    quantile(0.95)(value) AS p95_value
FROM events
WHERE timestamp >= '2026-01-01'
GROUP BY event_name
ORDER BY event_count DESC
LIMIT 10;

-- Time series bucketing
SELECT
    toStartOfHour(timestamp) AS hour,
    event_name,
    count() AS events
FROM events
WHERE timestamp >= now() - INTERVAL 7 DAY
GROUP BY hour, event_name
ORDER BY hour;

-- Approximate distinct count (faster than exact for large datasets)
SELECT uniqHLL12(user_id) AS approx_unique_users FROM events;

-- Top-N with sampling
SELECT event_name, count() AS cnt
FROM events SAMPLE 0.1
GROUP BY event_name
ORDER BY cnt DESC
LIMIT 5;

JOINs

CREATE TABLE users (id UInt32, name String, country String) ENGINE = TinyLog;
CREATE TABLE orders (id UInt64, user_id UInt32, amount Float64) ENGINE = TinyLog;

INSERT INTO users VALUES (1, 'Alice', 'US'), (2, 'Bob', 'UK'), (3, 'Carol', 'US');
INSERT INTO orders VALUES (1, 1, 100), (2, 1, 200), (3, 2, 150);

-- Standard JOIN
SELECT u.name, o.amount
FROM users u
LEFT JOIN orders o ON u.id = o.user_id;

-- GLOBAL JOIN: broadcasts right table to all nodes (use in distributed queries)
SELECT u.name, o.amount
FROM users u
GLOBAL LEFT JOIN orders o ON u.id = o.user_id;

-- For key-value lookups, prefer dictionaries over JOINs
SELECT name, dictGet('order_totals', 'total', id) FROM users;

Time Series Functions

ClickHouse provides powerful functions for working with time-series data.

-- Bucketing functions
SELECT
    toStartOfInterval(timestamp, INTERVAL 1 HOUR) AS bucket,
    -- or: toStartOfHour(timestamp) AS bucket,
    count() AS events
FROM events
GROUP BY bucket
ORDER BY bucket;

-- Date truncation (standard SQL syntax, available since 24.x)
SELECT date_trunc('month', timestamp) AS month, count()
FROM events
GROUP BY month;

-- Time-based difference calculations
SELECT
    event_name,
    timestamp,
    neighbor(timestamp, -1) OVER (ORDER BY timestamp) AS prev_timestamp,
    dateDiff('second', prev_timestamp, timestamp) AS seconds_since_prev
FROM events;

-- Fill gaps in time series
SELECT
    toStartOfHour(timestamp) AS hour,
    count() AS events
FROM events
WHERE timestamp >= now() - INTERVAL 3 DAY
GROUP BY hour
ORDER BY hour
WITH FILL STEP toIntervalHour(1);

Window Functions

-- Running total
SELECT
    id,
    value,
    sum(value) OVER (ORDER BY id) AS running_total
FROM series;

-- Moving average (7-day window)
SELECT
    toDate(timestamp) AS day,
    value,
    avg(value) OVER (ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS ma_7d
FROM daily_metrics;

-- Ranking
SELECT
    department,
    employee,
    salary,
    rank() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank,
    dense_rank() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_dense_rank
FROM employees;

GROUP BY Extensions

-- Subtotals across all dimension combinations
SELECT event_name, country, count() AS cnt
FROM events
GROUP BY event_name, country WITH CUBE;

-- Hierarchical rollup
SELECT event_name, toDate(timestamp) AS date, count()
FROM events
GROUP BY event_name, date WITH ROLLUP;

-- Grand total row appended
SELECT event_name, count() AS cnt
FROM events
GROUP BY event_name WITH TOTALS;

Table Functions

Table functions let you query external data sources directly without creating tables.

-- Query files in S3
SELECT count() FROM s3('https://s3.amazonaws.com/bucket/logs/2026-05-*.parquet');

-- Query remote ClickHouse table
SELECT * FROM remote('remote-host:9000', 'database', 'events', 'user', 'password');

-- Query a URL
SELECT * FROM url('https://api.example.com/v1/metrics', 'JSONEachRow');

-- Query local file
SELECT * FROM file('/data/export.csv', 'CSV', 'col1 UInt64, col2 String');

-- Query another ClickHouse table in the same server
SELECT * FROM merge('database_name', 'table_pattern');

EXPLAIN and Query Profiling

-- See the query execution plan
EXPLAIN SELECT event_name, count() FROM events WHERE timestamp > '2026-01-01' GROUP BY event_name;

-- With detailed pipeline information
EXPLAIN PIPELINE SELECT count() FROM events;

-- Estimated query cost (rows to read, parts to scan)
EXPLAIN ESTIMATE SELECT count() FROM events WHERE event_name = 'click';

Best Practices

Table Design

-- ORDER BY is for data sorting, NOT unique identification
-- Choose columns commonly used in WHERE filters and ORDER BY clauses

-- Partition by time for TTL-based data lifecycle
PARTITION BY toYYYYMM(timestamp)

-- Use appropriate types: UInt > Int (positive values), LowCardinality (categorical strings)
-- Use FixedString(n) for known-length codes (ISO country codes, hashes)

-- Index granularity: 8192 is default for large tables
-- Lower (1024) for tables with many point queries
-- Higher (16384) for tables with mostly full scans
SETTINGS index_granularity = 8192;

-- Add skip indexes for high-cardinality filter columns
ALTER TABLE events ADD INDEX value_idx value TYPE minmax GRANULARITY 4;

Query Optimization

-- Filter on partition key to prune parts
-- Good: scans only relevant partitions
SELECT count() FROM events WHERE timestamp >= '2026-01-01' AND timestamp < '2026-02-01';

-- Bad: full table scan
SELECT count() FROM events WHERE value > 100;

-- Only SELECT needed columns
-- Good
SELECT event_name, count() FROM events GROUP BY event_name;
-- Bad (reads all columns, then discards most)
SELECT * FROM events;

-- Use projections for common query patterns
ALTER TABLE events ADD PROJECTION by_hour (
    SELECT toStartOfHour(timestamp), event_name, count()
    GROUP BY toStartOfHour(timestamp), event_name
);

Data Ingestion

  • Batch inserts to at least 1,000 rows per statement
  • Avoid single-row INSERTs — they create too many data parts
  • Use asynchronous inserts for high-throughput ingestion from multiple producers
  • Prefer JSONEachRow format for streaming data
-- Enable asynchronous inserts
SET async_insert = 1;
SET wait_for_async_insert = 1;

Common Pitfalls

Pitfall 1: Too Many Data Parts

-- Each INSERT creates a part; too many parts degrade SELECT performance
-- Solution: batch inserts
INSERT INTO events VALUES (1, 'a'), (2, 'b'), (3, 'c');  -- one part

-- Force merge (async by default, runs in background)
OPTIMIZE TABLE events FINAL;

Pitfall 2: Wrong Data Types

-- Float for money → precision errors
-- Bad
CREATE TABLE bad (amount Float64);
-- Good
CREATE TABLE good (amount Decimal(15, 2));

-- String for enum-like columns → wasted storage
-- Bad
CREATE TABLE logs (level String);
-- Good
CREATE TABLE logs (level LowCardinality(String));

Pitfall 3: Missing Partition Key

-- Always partition time-based data for efficient TTL and partition pruning
CREATE TABLE events (
    timestamp DateTime
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY timestamp;

Pitfall 4: Using JOINs for Dimension Lookups

-- Joining a billion-row fact table to a million-row dimension table is slow
-- Bad
SELECT e.*, u.name FROM events e LEFT JOIN users u ON e.user_id = u.id;

-- Good: use a dictionary
SELECT e.*, dictGet('users', 'name', e.user_id) AS user_name FROM events e;

Pitfall 5: SELECT * on Wide Tables

-- ClickHouse is columnar — reading all columns reads ALL column files
-- Bad
SELECT * FROM wide_table WHERE timestamp > '2026-01-01';

-- Good
SELECT event_name, count() FROM wide_table WHERE timestamp > '2026-01-01' AND event_name = 'click';

Resources

Comments

👍 Was this article helpful?