Introduction
Real-time analytics enables organizations to analyze data as it arrives, supporting use cases from operational dashboards to fraud detection. This guide covers leading technologies and implementation patterns.
Technology Comparison
| Aspect | ClickHouse | Apache Druid | Materialized Views |
|---|---|---|---|
| Architecture | Column-oriented DB | Column-oriented + inverted index | Query acceleration |
| Latency | Sub-second | Sub-second | Sub-second |
| Scalability | Petabyte+ | Petabyte+ | Depends on base DB |
| Query Model | SQL | SQL | Pre-computed |
| Ingestion | Batch + Streaming | Streaming preferred | Batch |
| Best For | Ad-hoc analytics | Time-series, events | Known queries |
ClickHouse
Architecture
ClickHouse is a column-oriented database optimized for analytical queries. It scales out through Distributed tables, which route queries and inserts across shards, with replication handled by the ReplicatedMergeTree family of engines.
Table Design
-- Create the local table
CREATE TABLE events
(
    event_time DateTime,
    event_type String,
    user_id UInt64,
    platform String,
    country String,
    revenue Float32,
    properties JSON
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_type, event_time, user_id)
TTL event_time + INTERVAL 90 DAY
SETTINGS index_granularity = 8192;

-- Create distributed version
CREATE TABLE events_distributed
AS events
ENGINE = Distributed('cluster', 'default', 'events', rand());
Queries and Aggregations
-- Time-series aggregation
SELECT
    toStartOfHour(event_time) AS hour,
    event_type,
    count() AS events,
    uniqExact(user_id) AS unique_users,
    sum(revenue) AS total_revenue
FROM events
WHERE event_time >= now() - INTERVAL 24 HOUR
GROUP BY hour, event_type
ORDER BY hour DESC;

-- Rolling window analysis (7 days = 168 hours)
SELECT
    event_type,
    count() / (7 * 24) AS events_per_hour_avg,
    quantile(0.95)(revenue) AS revenue_p95
FROM events
WHERE event_time >= now() - INTERVAL 7 DAY
GROUP BY event_type;

-- User funnel analysis
SELECT
    countIf(event_type = 'page_view') AS page_views,
    countIf(event_type = 'add_to_cart') AS add_to_carts,
    countIf(event_type = 'purchase') AS purchases,
    purchases / page_views AS conversion_rate
FROM events
WHERE event_time >= now() - INTERVAL 30 DAY;
Python Integration
# clickhouse_client.py
from clickhouse_driver import Client
import pandas as pd


class ClickHouseAnalytics:
    def __init__(self, host, port, database):
        self.client = Client(host=host, port=port, database=database)

    def execute(self, query: str) -> list:
        """Execute a query and return raw results."""
        return self.client.execute(query)

    def insert_dataframe(self, df: pd.DataFrame, table: str):
        """Insert a DataFrame into ClickHouse."""
        columns = ', '.join(df.columns)
        values = df.values.tolist()
        query = f"INSERT INTO {table} ({columns}) VALUES"
        self.client.execute(query, values)

    def query_to_dataframe(self, query: str) -> pd.DataFrame:
        """Execute a query and return a DataFrame."""
        rows, col_types = self.client.execute(query, with_column_types=True)
        columns = [col[0] for col in col_types]
        return pd.DataFrame(rows, columns=columns)

    def create_realtime_view(self):
        """Create a materialized view for real-time hourly aggregation."""
        query = """
        CREATE MATERIALIZED VIEW hourly_events_mv
        ENGINE = SummingMergeTree()
        ORDER BY (event_type, hour)
        AS SELECT
            toStartOfHour(event_time) AS hour,
            event_type,
            count() AS events,
            sum(revenue) AS revenue
        FROM events
        GROUP BY hour, event_type
        """
        self.client.execute(query)
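A minimal usage sketch of the wrapper above. The host, port, and helper name are illustrative assumptions; `build_hourly_query` simply assembles the same hourly aggregation shown earlier so the query text can be reused across the class methods.

```python
# Hypothetical usage of the ClickHouseAnalytics wrapper above.
# Connection details and the helper name are assumptions for illustration.

def build_hourly_query(table: str, hours: int) -> str:
    """Return the hourly aggregation SQL for the trailing `hours` hours."""
    return (
        "SELECT toStartOfHour(event_time) AS hour, event_type, "
        "count() AS events, sum(revenue) AS total_revenue "
        f"FROM {table} "
        f"WHERE event_time >= now() - INTERVAL {hours} HOUR "
        "GROUP BY hour, event_type ORDER BY hour DESC"
    )

if __name__ == "__main__":
    # Requires a running ClickHouse server:
    # analytics = ClickHouseAnalytics("localhost", 9000, "default")
    # df = analytics.query_to_dataframe(build_hourly_query("events", 24))
    print(build_hourly_query("events", 24))
```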
Apache Druid
Architecture
Druid combines columnar storage with bitmap indexes and time-based partitioning into segments, making it well suited to time-series and event data. Streaming ingestion from Kafka is a first-class feature.
Ingestion Configuration
# druid_ingestion_spec.yaml (Kafka supervisor spec; submitted to Druid as JSON)
ioConfig:
  type: kafka
  topic: "events"
  useEarliestOffset: true
  consumerProperties:
    bootstrap.servers: "kafka:9092"
    group.id: "druid-consumer"
dataSchema:
  dataSource: "events"
  timestampSpec:
    column: "event_time"
    format: "auto"
  granularitySpec:
    type: "uniform"
    segmentGranularity: "hour"
    queryGranularity: "minute"
  dimensionsSpec:
    dimensions:
      - "event_type"
      - "platform"
      - "country"
      - {type: "string", name: "user_id", createBitmapIndex: false}
  metricsSpec:
    - {type: "count", name: "count"}
    - {type: "doubleSum", name: "revenue_sum", fieldName: "revenue"}
    - {type: "hyperUnique", name: "unique_users", fieldName: "user_id"}
tuningConfig:
  type: "kafka"
  maxRowsPerSegment: 5000000
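Kafka supervisors are started by POSTing the spec as JSON to the Overlord's `/druid/indexer/v1/supervisor` endpoint. A hedged sketch of assembling that payload; the router URL and broker address are illustrative, and the helper mirrors only the fields shown in the spec above.

```python
import json

# Sketch: build a Kafka supervisor payload mirroring the YAML spec above.
# In a real deployment, POST the JSON to
# http://<router>:8888/druid/indexer/v1/supervisor (host is an assumption).

def build_supervisor_spec(datasource: str, topic: str, brokers: str) -> dict:
    """Assemble a minimal Kafka supervisor spec as a nested dict."""
    return {
        "type": "kafka",
        "spec": {
            "ioConfig": {
                "type": "kafka",
                "topic": topic,
                "useEarliestOffset": True,
                "consumerProperties": {"bootstrap.servers": brokers},
            },
            "dataSchema": {
                "dataSource": datasource,
                "timestampSpec": {"column": "event_time", "format": "auto"},
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "hour",
                    "queryGranularity": "minute",
                },
            },
            "tuningConfig": {"type": "kafka", "maxRowsPerSegment": 5_000_000},
        },
    }

if __name__ == "__main__":
    spec = build_supervisor_spec("events", "events", "kafka:9092")
    print(json.dumps(spec, indent=2))
```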
Queries
-- Time-series query
-- Druid exposes the ingestion timestamp as __time; with rollup enabled,
-- each stored row may represent many events, so use SUM("count").
SELECT
    TIME_FLOOR(__time, 'PT1H') AS hour,
    event_type,
    SUM("count") AS events,
    SUM(revenue_sum) AS revenue
FROM events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1, 2
ORDER BY hour DESC;

-- TopN query
SELECT
    event_type,
    SUM("count") AS events
FROM events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY event_type
ORDER BY events DESC
LIMIT 10;

-- Time series with filtering
SELECT
    TIME_FLOOR(__time, 'PT5M') AS five_min,
    SUM("count") AS events,
    SUM(revenue_sum) AS revenue
FROM events
WHERE event_type IN ('purchase', 'subscription')
    AND country = 'US'
    AND __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1
ORDER BY five_min DESC;
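Druid serves these queries over its JSON-over-HTTP SQL API: POST a `{"query": ...}` body to `/druid/v2/sql`. A minimal sketch using only the standard library; the router URL is an assumption for illustration.

```python
import json
import urllib.request

# Assumed router address for illustration; adjust for your deployment.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

def druid_sql_payload(sql: str) -> bytes:
    """Encode a query body for Druid's SQL endpoint."""
    return json.dumps({"query": sql, "resultFormat": "object"}).encode("utf-8")

def run_druid_sql(sql: str) -> list:
    """POST the query and return decoded rows (requires a live cluster)."""
    req = urllib.request.Request(
        DRUID_SQL_URL,
        data=druid_sql_payload(sql),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

With `resultFormat: "object"`, Druid returns each row as a JSON object keyed by column name, which maps cleanly onto a list of dicts.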
Materialized Views
PostgreSQL Materialized Views
-- Create materialized view
CREATE MATERIALIZED VIEW hourly_sales_mv AS
SELECT
    DATE_TRUNC('hour', created_at) AS hour,
    product_category,
    COUNT(*) AS order_count,
    SUM(total_amount) AS revenue,
    COUNT(DISTINCT customer_id) AS unique_customers
FROM orders
WHERE created_at >= '2024-01-01'
GROUP BY 1, 2
WITH DATA;

-- Create index for query performance
CREATE INDEX idx_hourly_sales_mv_hour
ON hourly_sales_mv (hour DESC);

-- REFRESH ... CONCURRENTLY requires a unique index on the view
CREATE UNIQUE INDEX idx_hourly_sales_mv_key
ON hourly_sales_mv (hour, product_category);

-- Refresh without locking out readers
REFRESH MATERIALIZED VIEW CONCURRENTLY hourly_sales_mv;

-- Schedule refresh
-- Add to cron: 0 * * * * psql -c "REFRESH MATERIALIZED VIEW CONCURRENTLY hourly_sales_mv;"
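An alternative to cron is refreshing from application code. A hedged sketch assuming the `psycopg2` driver is installed; the DSN and function names are illustrative.

```python
# Sketch: refresh the materialized view from Python instead of cron.
# psycopg2 is an assumed dependency; the DSN below is illustrative.

def refresh_statement(view: str, concurrently: bool = True) -> str:
    """Build the REFRESH statement; CONCURRENTLY needs a unique index on the view."""
    keyword = "CONCURRENTLY " if concurrently else ""
    return f"REFRESH MATERIALIZED VIEW {keyword}{view}"

def refresh_view(dsn: str, view: str) -> None:
    """Connect and run the refresh (requires a reachable PostgreSQL server)."""
    import psycopg2  # assumed dependency
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(refresh_statement(view))

# Example call (needs a live database):
# refresh_view("dbname=analytics user=app", "hourly_sales_mv")
```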
Snowflake Materialized Views
-- Snowflake maintains materialized views automatically; no manual refresh
CREATE MATERIALIZED VIEW mv_hourly_revenue AS
SELECT
    DATE_TRUNC('HOUR', created_at) AS hour,
    region,
    SUM(revenue) AS total_revenue,
    COUNT(*) AS transaction_count
FROM transactions
WHERE created_at >= '2024-01-01'  -- non-deterministic functions such as
                                  -- CURRENT_DATE() are not allowed in MVs
GROUP BY 1, 2;

-- Note: Snowflake materialized views are limited to a single table (no joins),
-- support only certain aggregates, and incur credit costs for background maintenance.
Implementation Checklist
- Choose technology based on use case
- Design data model for analytical queries
- Configure ingestion pipeline
- Set up auto-refresh for materialized views
- Implement query optimization
- Configure data retention policies
Summary
Real-time analytics requires careful technology selection:
- ClickHouse: Best for ad-hoc analytical queries with large volumes
- Druid: Ideal for time-series event data with high cardinality
- Materialized Views: Query acceleration for predictable workloads
All three approaches can achieve sub-second query performance when properly implemented.