Introduction
ClickHouse continues to evolve rapidly in 2025-2026, with major developments including vector similarity search, enhanced AI integrations, and significant cloud growth. The company has raised substantial venture funding and is expanding beyond traditional analytics into AI and machine learning workloads.
This article explores the latest ClickHouse developments, new features, and emerging patterns in the ClickHouse ecosystem.
Latest Releases
Version 25.x Features
-- Vector similarity search (beta in 25.x)
-- Index creation (the index type takes a method and a distance metric)
ALTER TABLE embeddings ADD INDEX vec_idx embedding
TYPE vector_similarity('hnsw', 'cosineDistance')
GRANULARITY 1;
-- Query with vector similarity
SELECT id, text,
    cosineDistance(embedding, [0.1, 0.2, ...]) AS dist
FROM embeddings
ORDER BY dist
LIMIT 10;
-- Iceberg table support (improved). The engine attaches to an
-- existing Iceberg table in object storage; the schema is read
-- from Iceberg metadata, so no column list is given.
CREATE TABLE iceberg_table
ENGINE = IcebergS3('s3://bucket/iceberg/warehouse/test_table');
Performance Improvements
-- Better query parallelism
-- Faster merges
-- Improved memory management
-- Check version
SELECT version();
-- New functions
SELECT formatRow('JSONEachRow', number, 'test') FROM numbers(10);
Vector Similarity Search
Setting Up Vector Search
-- Vector similarity indexes are experimental in 25.x; enable them first
SET allow_experimental_vector_similarity_index = 1;
-- Create vector table
CREATE TABLE document_embeddings (
id UInt64,
document_id UInt64,
text String,
embedding Array(Float32) -- 384 dimensions
) ENGINE = MergeTree()
ORDER BY document_id;
-- Create vector index
ALTER TABLE document_embeddings
ADD INDEX vec_idx embedding
TYPE vector_similarity('hnsw', 'cosineDistance')
GRANULARITY 1;
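The cosine metric the index uses is easy to make concrete in plain Python. This is a sketch for intuition only; the `cosine_distance` helper below is hypothetical and not part of ClickHouse or its client libraries:

```python
import math

def cosine_distance(a, b):
    """Mirror of ClickHouse's cosineDistance: 1 minus cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Same direction -> distance 0; orthogonal -> distance 1
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

Lower distance means more similar, which is why the queries in this article `ORDER BY` the distance ascending.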
Python Vector Search
import clickhouse_connect
import numpy as np
from sentence_transformers import SentenceTransformer
# Connect
client = clickhouse_connect.get_client(
    host='localhost',
    port=8123
)
# Create table
client.command("""
    CREATE TABLE IF NOT EXISTS embeddings (
        id UInt64,
        text String,
        vector Array(Float32)
    ) ENGINE = MergeTree()
    ORDER BY id
""")
# Generate embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
docs = [
    "Python is a programming language",
    "Machine learning is AI",
    "Data science combines statistics and programming"
]
# clickhouse_connect inserts rows via client.insert, not client.command
rows = [[i, text, model.encode(text).tolist()]
        for i, text in enumerate(docs)]
client.insert('embeddings', rows, column_names=['id', 'text', 'vector'])
# Search
query = "artificial intelligence"
query_vector = model.encode(query).tolist()
results = client.query(f"""
    SELECT id, text,
           cosineDistance(vector, {query_vector}) AS distance
    FROM embeddings
    ORDER BY distance
    LIMIT 5
""")
for row in results.result_rows:
    print(f"ID: {row[0]}, Text: {row[1]}, Distance: {row[2]:.4f}")
AI Integration
LangChain Integration
# Using LangChain with ClickHouse
from langchain_community.vectorstores import Clickhouse
from langchain_community.embeddings import SentenceTransformerEmbeddings
# Create embeddings
embeddings = SentenceTransformerEmbeddings(
    model_name="all-MiniLM-L6-v2"
)
# Create vector store (connection settings go through ClickhouseSettings)
from langchain_community.vectorstores import ClickhouseSettings
vector_store = Clickhouse(
    embedding=embeddings,
    config=ClickhouseSettings(host="localhost", port=8123)
)
# Add documents
from langchain.schema import Document
docs = [
    Document(page_content="Python tutorial", metadata={"source": "tutorial"}),
    Document(page_content="ML guide", metadata={"source": "guide"})
]
vector_store.add_documents(docs)
# Similarity search
results = vector_store.similarity_search("programming", k=2)
print(results)
RAG Pipeline
import clickhouse_connect
import openai
from sentence_transformers import SentenceTransformer
class ClickHouseRAG:
    """RAG with ClickHouse vector storage."""

    def __init__(self, config):
        self.client = clickhouse_connect.get_client(**config)
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self._init_tables()

    def _init_tables(self):
        """Initialize tables."""
        self.client.command("""
            CREATE TABLE IF NOT EXISTS chunks (
                id UInt64,
                document_id UInt64,
                chunk_text String,
                vector Array(Float32)
            ) ENGINE = MergeTree()
            ORDER BY (document_id, id)
        """)

    def ingest(self, document_id, text, chunk_size=500):
        """Ingest document with embeddings."""
        chunks = [text[i:i + chunk_size]
                  for i in range(0, len(text), chunk_size)]
        rows = [[i, document_id, chunk, self.model.encode(chunk).tolist()]
                for i, chunk in enumerate(chunks)]
        self.client.insert(
            'chunks', rows,
            column_names=['id', 'document_id', 'chunk_text', 'vector']
        )

    def retrieve(self, query, top_k=5):
        """Retrieve relevant chunks."""
        query_vector = self.model.encode(query).tolist()
        results = self.client.query(f"""
            SELECT chunk_text,
                   cosineDistance(vector, {query_vector}) AS distance
            FROM chunks
            ORDER BY distance
            LIMIT {top_k}
        """)
        return [r[0] for r in results.result_rows]

    def answer(self, question):
        """Generate answer with RAG."""
        chunks = self.retrieve(question)
        context = "\n\n".join(chunks)
        prompt = f"""Context: {context}

Question: {question}

Answer:"""
        # openai >= 1.0 API (ChatCompletion.create was removed)
        response = openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content
# Usage
rag = ClickHouseRAG({'host': 'localhost', 'port': 8123})
rag.ingest(1, "Your long document text here...")
answer = rag.answer("What is the main topic?")
print(answer)
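Fixed-size chunking can cut sentences mid-word, so context at a boundary is lost to both neighboring chunks. A common refinement is overlapping chunks, where each chunk repeats the tail of the previous one. A minimal sketch; `chunk_with_overlap` is a hypothetical helper, not part of the class above:

```python
def chunk_with_overlap(text, chunk_size=500, overlap=50):
    """Split text into chunk_size pieces sharing `overlap` characters."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_with_overlap("abcdefghij", chunk_size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

The overlap trades a little storage for better recall on queries that match text straddling a chunk boundary.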
Cloud and Managed Services
ClickHouse Cloud
# ClickHouse Cloud (managed service)
# - Auto-scaling
# - Managed replication
# - Built-in backup
# Connect to cloud
# 1. Sign up at clickhouse.cloud
# 2. Get connection string
# 3. Connect
# clickhouse-client speaks the native protocol; ClickHouse Cloud
# exposes it on secure port 9440 (8443 is the HTTPS interface)
clickhouse-client \
  --host CH-host.clickhouse.cloud \
  --port 9440 \
  --secure \
  --user default \
  --password your_password
Self-Hosted Optimizations
<!-- Production config (config.xml) -->
<clickhouse>
    <!-- Efficient resource usage; value is in bytes (32 GiB) -->
    <max_server_memory_usage>34359738368</max_server_memory_usage>
    <max_concurrent_queries>100</max_concurrent_queries>
    <!-- Fast compression -->
    <compression>
        <case>
            <method>zstd</method>
            <level>3</level>
        </case>
    </compression>
    <!-- Merge optimization (MergeTree settings go under merge_tree) -->
    <merge_tree>
        <max_parts_to_merge_at_once>50</max_parts_to_merge_at_once>
    </merge_tree>
</clickhouse>
New SQL Features
Enhanced Functions
-- JSON improvements
SELECT JSONExtractKeys('{"a":1,"b":2}'); -- ['a','b']
-- Map functions
SELECT map('a', 1, 'b', 2);
-- Array-based time series helpers
SELECT arrayDifference([1, 2, 3, 4]); -- [0,1,1,1]
SELECT arrayCumSum([1, 2, 3, 4]); -- [1,3,6,10]
-- SQL standard functions
SELECT * FROM table WHERE column IS NOT DISTINCT FROM NULL;
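The difference and running-total behavior shown above maps directly onto plain Python. This is a sketch for intuition, not the server implementation:

```python
def array_difference(xs):
    """First element 0, then pairwise deltas (like ClickHouse arrayDifference)."""
    return [0] + [b - a for a, b in zip(xs, xs[1:])]

def array_cum_sum(xs):
    """Running totals (like ClickHouse arrayCumSum)."""
    out, total = [], 0
    for x in xs:
        total += x
        out.append(total)
    return out

print(array_difference([1, 2, 3, 4]))  # [0, 1, 1, 1]
print(array_cum_sum([1, 2, 3, 4]))     # [1, 3, 6, 10]
```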
Table Functions
-- S3 table function
SELECT * FROM s3('s3://bucket/*.parquet');
-- Remote table function
SELECT * FROM remote('localhost', 'database', 'table');
-- Generate series (range takes optional start, end, step; end is exclusive)
SELECT range(10); -- [0,1,2,...,9]
SELECT range(10, 21, 2); -- [10,12,14,16,18,20]
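ClickHouse's `range` uses the same exclusive-end, stepped semantics as Python's `range`, which makes series expressions easy to sanity-check locally. A plain-Python sketch (the `ch_range` name is a hypothetical local mirror, not a client call):

```python
def ch_range(*args):
    """Local mirror of ClickHouse range(): exclusive end, optional start/step."""
    return list(range(*args))

print(ch_range(10))         # 0 through 9
print(ch_range(10, 21, 2))  # [10, 12, 14, 16, 18, 20]
```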
Ecosystem Growth
Growing Tools
- ClickHouse Keeper: ZooKeeper replacement
- clickhouse-operator (Altinity): Kubernetes operator
- chDB: Embedded, in-process ClickHouse
- Tabix: Web UI
- ClickVisual: Log analysis and visualization
Integration Partners
- dbt-clickhouse: dbt adapter
- Airbyte: Data integration
- Fivetran: ELT pipelines
- Kafka: Streaming ingestion
- S3: Object storage
Best Practices for 2026
Recommended Configuration
<!-- Production-ready settings -->
<clickhouse>
    <!-- Memory (server-level, in bytes: 32 GiB) -->
    <max_server_memory_usage>34359738368</max_server_memory_usage>
    <!-- Query-level settings belong in a user profile (8 GiB spill threshold) -->
    <profiles>
        <default>
            <max_bytes_before_external_group_by>8589934592</max_bytes_before_external_group_by>
            <max_threads>16</max_threads>
            <max_parallel_replicas>4</max_parallel_replicas>
        </default>
    </profiles>
    <!-- Compression -->
    <compression>
        <case>
            <method>zstd</method>
            <level>3</level>
        </case>
    </compression>
    <!-- Merges (MergeTree settings) -->
    <merge_tree>
        <max_parts_to_merge_at_once>100</max_parts_to_merge_at_once>
    </merge_tree>
</clickhouse>
Migration Tips
-- From other databases:
-- 1. Export to Parquet/CSV
-- 2. Create ClickHouse tables
-- 3. Import with INSERT
-- Performance tuning:
-- 1. Partition by time
-- 2. Choose good primary key
-- 3. Add secondary indexes
-- 4. Use materialized views
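For the import step, inserting in large batches rather than row-by-row is the single biggest importer win. A generic batching helper can be sketched in plain Python; `batched` is a hypothetical utility whose output you would feed to `clickhouse_connect`'s `client.insert`:

```python
def batched(rows, batch_size=100_000):
    """Yield rows in fixed-size batches for bulk INSERTs."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

# Example: 250 rows in batches of 100 -> sizes 100, 100, 50
sizes = [len(b) for b in batched(range(250), batch_size=100)]
print(sizes)  # [100, 100, 50]
```

Batch sizes in the tens of thousands of rows keep part creation overhead low, which matters because every INSERT creates a new MergeTree part.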
Future Roadmap
Expected Developments
- Enhanced vector search: More index types, better performance
- Cloud features: More managed capabilities
- AI/ML: Better integration with ML pipelines
- Performance: Continued optimization
- Security: Enhanced access control
Conclusion
ClickHouse in 2025-2026 is expanding beyond traditional analytics into AI and ML workloads. With vector search, better cloud integration, and continued performance improvements, ClickHouse is well-positioned for the next generation of data applications.
In the next article, we’ll explore AI applications with ClickHouse, including vector search, RAG pipelines, and ML feature engineering.