Introduction
Real-time analytics enables organizations to analyze data as it arrives, supporting use cases from operational dashboards to fraud detection. This guide covers leading technologies and implementation patterns.
Technology Comparison
| Aspect | ClickHouse | Apache Druid | Materialized Views |
|---|---|---|---|
| Architecture | Column-oriented DB | Column-oriented + inverted index | Query acceleration |
| Latency | Sub-second | Sub-second | Sub-second |
| Scalability | Petabyte+ | Petabyte+ | Depends on base DB |
| Query Model | SQL | SQL | Pre-computed |
| Ingestion | Batch + Streaming | Streaming preferred | Batch |
| Best For | Ad-hoc analytics | Time-series, events | Known queries |
ClickHouse
Architecture
ClickHouse is a column-oriented database optimized for analytical queries. It scales out through Distributed tables, which route queries and inserts across shards, with replication handled by the ReplicatedMergeTree family of engines.
Table Design
-- Create the local table
CREATE TABLE events
(
    event_time DateTime,
    event_type String,
    user_id UInt64,
    platform String,
    country String,
    revenue Float32,
    properties JSON
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_type, event_time, user_id)
TTL event_time + INTERVAL 90 DAY
SETTINGS index_granularity = 8192;

-- Create distributed version
CREATE TABLE events_distributed
AS events
ENGINE = Distributed('cluster', 'default', 'events', rand());
Queries and Aggregations
-- Time-series aggregation
SELECT
    toStartOfHour(event_time) AS hour,
    event_type,
    count() AS events,
    uniqExact(user_id) AS unique_users,
    sum(revenue) AS total_revenue
FROM events
WHERE event_time >= now() - INTERVAL 24 HOUR
GROUP BY hour, event_type
ORDER BY hour DESC;

-- Rolling window analysis (7 days = 168 hours)
SELECT
    event_type,
    count() / (7 * 24) AS events_per_hour_avg,
    quantile(0.95)(revenue) AS revenue_p95
FROM events
WHERE event_time >= now() - INTERVAL 7 DAY
GROUP BY event_type;

-- User funnel analysis
SELECT
    countIf(event_type = 'page_view') AS page_views,
    countIf(event_type = 'add_to_cart') AS add_to_carts,
    countIf(event_type = 'purchase') AS purchases,
    purchases / page_views AS conversion_rate
FROM events
WHERE event_time >= now() - INTERVAL 30 DAY;
Python Integration
# clickhouse_client.py
from clickhouse_driver import Client
import pandas as pd


class ClickHouseAnalytics:
    def __init__(self, host, port, database):
        self.client = Client(host=host, port=port, database=database)

    def execute(self, query: str) -> list:
        """Execute a query and return raw results."""
        return self.client.execute(query)

    def insert_dataframe(self, df: pd.DataFrame, table: str):
        """Insert a DataFrame into ClickHouse."""
        columns = ', '.join(df.columns)
        values = df.values.tolist()
        query = f"INSERT INTO {table} ({columns}) VALUES"
        self.client.execute(query, values)

    def query_to_dataframe(self, query: str) -> pd.DataFrame:
        """Execute a query and return a DataFrame."""
        rows, col_types = self.client.execute(query, with_column_types=True)
        columns = [col[0] for col in col_types]
        return pd.DataFrame(rows, columns=columns)

    def create_realtime_view(self):
        """Create a materialized view for real-time hourly aggregation."""
        query = """
        CREATE MATERIALIZED VIEW hourly_events_mv
        ENGINE = SummingMergeTree()
        ORDER BY (event_type, hour)
        AS SELECT
            toStartOfHour(event_time) AS hour,
            event_type,
            count() AS events,
            sum(revenue) AS revenue
        FROM events
        GROUP BY hour, event_type
        """
        self.client.execute(query)
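A minimal usage sketch of the wrapper above. The host, port, and helper name are illustrative assumptions; `build_hourly_query` simply assembles the same hourly aggregation shown earlier so the query text can be reused across the class methods.

```python
# Hypothetical usage of the ClickHouseAnalytics wrapper above.
# Connection details and the helper name are assumptions for illustration.

def build_hourly_query(table: str, hours: int) -> str:
    """Return the hourly aggregation SQL for the trailing `hours` hours."""
    return (
        "SELECT toStartOfHour(event_time) AS hour, event_type, "
        "count() AS events, sum(revenue) AS total_revenue "
        f"FROM {table} "
        f"WHERE event_time >= now() - INTERVAL {hours} HOUR "
        "GROUP BY hour, event_type ORDER BY hour DESC"
    )

if __name__ == "__main__":
    # Requires a running ClickHouse server:
    # analytics = ClickHouseAnalytics("localhost", 9000, "default")
    # df = analytics.query_to_dataframe(build_hourly_query("events", 24))
    print(build_hourly_query("events", 24))
```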
Apache Druid
Architecture
Druid combines columnar storage with bitmap indexes and time-based partitioning into segments, making it well suited to time-series and event data. Streaming ingestion from Kafka is a first-class feature.
Ingestion Configuration
# druid_ingestion_spec.yaml (Kafka supervisor spec; submitted to Druid as JSON)
ioConfig:
  type: kafka
  topic: "events"
  useEarliestOffset: true
  consumerProperties:
    bootstrap.servers: "kafka:9092"
    group.id: "druid-consumer"
dataSchema:
  dataSource: "events"
  timestampSpec:
    column: "event_time"
    format: "auto"
  granularitySpec:
    type: "uniform"
    segmentGranularity: "hour"
    queryGranularity: "minute"
  dimensionsSpec:
    dimensions:
      - "event_type"
      - "platform"
      - "country"
      - {type: "string", name: "user_id", createBitmapIndex: false}
  metricsSpec:
    - {type: "count", name: "count"}
    - {type: "doubleSum", name: "revenue_sum", fieldName: "revenue"}
    - {type: "hyperUnique", name: "unique_users", fieldName: "user_id"}
tuningConfig:
  type: "kafka"
  maxRowsPerSegment: 5000000
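Kafka supervisors are started by POSTing the spec as JSON to the Overlord's `/druid/indexer/v1/supervisor` endpoint. A hedged sketch of assembling that payload; the router URL and broker address are illustrative, and the helper mirrors only the fields shown in the spec above.

```python
import json

# Sketch: build a Kafka supervisor payload mirroring the YAML spec above.
# In a real deployment, POST the JSON to
# http://<router>:8888/druid/indexer/v1/supervisor (host is an assumption).

def build_supervisor_spec(datasource: str, topic: str, brokers: str) -> dict:
    """Assemble a minimal Kafka supervisor spec as a nested dict."""
    return {
        "type": "kafka",
        "spec": {
            "ioConfig": {
                "type": "kafka",
                "topic": topic,
                "useEarliestOffset": True,
                "consumerProperties": {"bootstrap.servers": brokers},
            },
            "dataSchema": {
                "dataSource": datasource,
                "timestampSpec": {"column": "event_time", "format": "auto"},
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "hour",
                    "queryGranularity": "minute",
                },
            },
            "tuningConfig": {"type": "kafka", "maxRowsPerSegment": 5_000_000},
        },
    }

if __name__ == "__main__":
    spec = build_supervisor_spec("events", "events", "kafka:9092")
    print(json.dumps(spec, indent=2))
```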
Queries
-- Time-series query
-- Druid exposes the ingestion timestamp as __time; with rollup enabled,
-- each stored row may represent many events, so use SUM("count").
SELECT
    TIME_FLOOR(__time, 'PT1H') AS hour,
    event_type,
    SUM("count") AS events,
    SUM(revenue_sum) AS revenue
FROM events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1, 2
ORDER BY hour DESC;

-- TopN query
SELECT
    event_type,
    SUM("count") AS events
FROM events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY event_type
ORDER BY events DESC
LIMIT 10;

-- Time series with filtering
SELECT
    TIME_FLOOR(__time, 'PT5M') AS five_min,
    SUM("count") AS events,
    SUM(revenue_sum) AS revenue
FROM events
WHERE event_type IN ('purchase', 'subscription')
    AND country = 'US'
    AND __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1
ORDER BY five_min DESC;
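Druid serves these queries over its JSON-over-HTTP SQL API: POST a `{"query": ...}` body to `/druid/v2/sql`. A minimal sketch using only the standard library; the router URL is an assumption for illustration.

```python
import json
import urllib.request

# Assumed router address for illustration; adjust for your deployment.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

def druid_sql_payload(sql: str) -> bytes:
    """Encode a query body for Druid's SQL endpoint."""
    return json.dumps({"query": sql, "resultFormat": "object"}).encode("utf-8")

def run_druid_sql(sql: str) -> list:
    """POST the query and return decoded rows (requires a live cluster)."""
    req = urllib.request.Request(
        DRUID_SQL_URL,
        data=druid_sql_payload(sql),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

With `resultFormat: "object"`, Druid returns each row as a JSON object keyed by column name, which maps cleanly onto a list of dicts.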
Materialized Views
PostgreSQL Materialized Views
-- Create materialized view
CREATE MATERIALIZED VIEW hourly_sales_mv AS
SELECT
    DATE_TRUNC('hour', created_at) AS hour,
    product_category,
    COUNT(*) AS order_count,
    SUM(total_amount) AS revenue,
    COUNT(DISTINCT customer_id) AS unique_customers
FROM orders
WHERE created_at >= '2024-01-01'
GROUP BY 1, 2
WITH DATA;

-- Create index for query performance
CREATE INDEX idx_hourly_sales_mv_hour
ON hourly_sales_mv (hour DESC);

-- REFRESH ... CONCURRENTLY requires a unique index on the view
CREATE UNIQUE INDEX idx_hourly_sales_mv_key
ON hourly_sales_mv (hour, product_category);

-- Refresh without locking out readers
REFRESH MATERIALIZED VIEW CONCURRENTLY hourly_sales_mv;

-- Schedule refresh
-- Add to cron: 0 * * * * psql -c "REFRESH MATERIALIZED VIEW CONCURRENTLY hourly_sales_mv;"
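An alternative to cron is refreshing from application code. A hedged sketch assuming the `psycopg2` driver is installed; the DSN and function names are illustrative.

```python
# Sketch: refresh the materialized view from Python instead of cron.
# psycopg2 is an assumed dependency; the DSN below is illustrative.

def refresh_statement(view: str, concurrently: bool = True) -> str:
    """Build the REFRESH statement; CONCURRENTLY needs a unique index on the view."""
    keyword = "CONCURRENTLY " if concurrently else ""
    return f"REFRESH MATERIALIZED VIEW {keyword}{view}"

def refresh_view(dsn: str, view: str) -> None:
    """Connect and run the refresh (requires a reachable PostgreSQL server)."""
    import psycopg2  # assumed dependency
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(refresh_statement(view))

# Example call (needs a live database):
# refresh_view("dbname=analytics user=app", "hourly_sales_mv")
```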
Snowflake Materialized Views
-- Snowflake maintains materialized views automatically; no manual refresh
CREATE MATERIALIZED VIEW mv_hourly_revenue AS
SELECT
    DATE_TRUNC('HOUR', created_at) AS hour,
    region,
    SUM(revenue) AS total_revenue,
    COUNT(*) AS transaction_count
FROM transactions
WHERE created_at >= '2024-01-01'  -- non-deterministic functions such as
                                  -- CURRENT_DATE() are not allowed in MVs
GROUP BY 1, 2;

-- Note: Snowflake materialized views are limited to a single table (no joins),
-- support only certain aggregates, and incur credit costs for background maintenance.
Implementation Checklist
- Choose technology based on use case
- Design data model for analytical queries
- Configure ingestion pipeline
- Set up auto-refresh for materialized views
- Implement query optimization
- Configure data retention policies
Summary
Real-time analytics requires careful technology selection:
- ClickHouse: Best for ad-hoc analytical queries with large volumes
- Druid: Ideal for time-series event data with high cardinality
- Materialized Views: Query acceleration for predictable workloads
All three approaches can achieve sub-second query performance when properly implemented.