Introduction
ClickHouse continues to evolve rapidly in 2025-2026, with major developments including vector similarity search, enhanced AI integrations, and significant cloud growth. The company has raised substantial venture funding and is expanding beyond traditional analytics into AI and machine learning workloads.
This article explores the latest ClickHouse developments, new features, and emerging patterns in the ClickHouse ecosystem.
Latest Releases
Version 25.x Features
-- Vector similarity search (beta in 25.x)
-- Index creation (the index type takes a method and a distance metric)
ALTER TABLE embeddings ADD INDEX vec_idx embedding
TYPE vector_similarity('hnsw', 'cosineDistance')
GRANULARITY 1;
-- Query with vector similarity
SELECT id, text,
    cosineDistance(embedding, [0.1, 0.2, ...]) AS dist
FROM embeddings
ORDER BY dist
LIMIT 10;
-- Iceberg table support (improved). The engine attaches to an
-- existing Iceberg table in object storage; the schema is read
-- from Iceberg metadata, so no column list is given.
CREATE TABLE iceberg_table
ENGINE = IcebergS3('s3://bucket/iceberg/warehouse/test_table');
Performance Improvements
-- Better query parallelism
-- Faster merges
-- Improved memory management
-- Check version
SELECT version();
-- New functions
SELECT formatRow('JSONEachRow', number, 'test') FROM numbers(10);
Vector Similarity Search
Setting Up Vector Search
-- Vector similarity indexes are experimental in 25.x; enable them first
SET allow_experimental_vector_similarity_index = 1;
-- Create vector table
CREATE TABLE document_embeddings (
id UInt64,
document_id UInt64,
text String,
embedding Array(Float32) -- 384 dimensions
) ENGINE = MergeTree()
ORDER BY document_id;
-- Create vector index
ALTER TABLE document_embeddings
ADD INDEX vec_idx embedding
TYPE vector_similarity('hnsw', 'cosineDistance')
GRANULARITY 1;
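The cosine metric the index uses is easy to make concrete in plain Python. This is a sketch for intuition only; the `cosine_distance` helper below is hypothetical and not part of ClickHouse or its client libraries:

```python
import math

def cosine_distance(a, b):
    """Mirror of ClickHouse's cosineDistance: 1 minus cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Same direction -> distance 0; orthogonal -> distance 1
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

Lower distance means more similar, which is why the queries in this article `ORDER BY` the distance ascending.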
Python Vector Search
import clickhouse_connect
import numpy as np
from sentence_transformers import SentenceTransformer
# Connect
client = clickhouse_connect.get_client(
    host='localhost',
    port=8123
)
# Create table
client.command("""
    CREATE TABLE IF NOT EXISTS embeddings (
        id UInt64,
        text String,
        vector Array(Float32)
    ) ENGINE = MergeTree()
    ORDER BY id
""")
# Generate embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
docs = [
    "Python is a programming language",
    "Machine learning is AI",
    "Data science combines statistics and programming"
]
# clickhouse_connect inserts rows via client.insert, not client.command
rows = [[i, text, model.encode(text).tolist()]
        for i, text in enumerate(docs)]
client.insert('embeddings', rows, column_names=['id', 'text', 'vector'])
# Search
query = "artificial intelligence"
query_vector = model.encode(query).tolist()
results = client.query(f"""
    SELECT id, text,
           cosineDistance(vector, {query_vector}) AS distance
    FROM embeddings
    ORDER BY distance
    LIMIT 5
""")
for row in results.result_rows:
    print(f"ID: {row[0]}, Text: {row[1]}, Distance: {row[2]:.4f}")
AI Integration
LangChain Integration
# Using LangChain with ClickHouse
from langchain_community.vectorstores import Clickhouse
from langchain_community.embeddings import SentenceTransformerEmbeddings
# Create embeddings
embeddings = SentenceTransformerEmbeddings(
    model_name="all-MiniLM-L6-v2"
)
# Create vector store (connection settings go through ClickhouseSettings)
from langchain_community.vectorstores import ClickhouseSettings
vector_store = Clickhouse(
    embedding=embeddings,
    config=ClickhouseSettings(host="localhost", port=8123)
)
# Add documents
from langchain.schema import Document
docs = [
    Document(page_content="Python tutorial", metadata={"source": "tutorial"}),
    Document(page_content="ML guide", metadata={"source": "guide"})
]
vector_store.add_documents(docs)
# Similarity search
results = vector_store.similarity_search("programming", k=2)
print(results)
RAG Pipeline
import clickhouse_connect
import openai
from sentence_transformers import SentenceTransformer
class ClickHouseRAG:
    """RAG with ClickHouse vector storage."""

    def __init__(self, config):
        self.client = clickhouse_connect.get_client(**config)
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self._init_tables()

    def _init_tables(self):
        """Initialize tables."""
        self.client.command("""
            CREATE TABLE IF NOT EXISTS chunks (
                id UInt64,
                document_id UInt64,
                chunk_text String,
                vector Array(Float32)
            ) ENGINE = MergeTree()
            ORDER BY (document_id, id)
        """)

    def ingest(self, document_id, text, chunk_size=500):
        """Ingest document with embeddings."""
        chunks = [text[i:i + chunk_size]
                  for i in range(0, len(text), chunk_size)]
        rows = [[i, document_id, chunk, self.model.encode(chunk).tolist()]
                for i, chunk in enumerate(chunks)]
        self.client.insert(
            'chunks', rows,
            column_names=['id', 'document_id', 'chunk_text', 'vector']
        )

    def retrieve(self, query, top_k=5):
        """Retrieve relevant chunks."""
        query_vector = self.model.encode(query).tolist()
        results = self.client.query(f"""
            SELECT chunk_text,
                   cosineDistance(vector, {query_vector}) AS distance
            FROM chunks
            ORDER BY distance
            LIMIT {top_k}
        """)
        return [r[0] for r in results.result_rows]

    def answer(self, question):
        """Generate answer with RAG."""
        chunks = self.retrieve(question)
        context = "\n\n".join(chunks)
        prompt = f"""Context: {context}

Question: {question}

Answer:"""
        # openai >= 1.0 API (ChatCompletion.create was removed)
        response = openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content
# Usage
rag = ClickHouseRAG({'host': 'localhost', 'port': 8123})
rag.ingest(1, "Your long document text here...")
answer = rag.answer("What is the main topic?")
print(answer)
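Fixed-size chunking can cut sentences mid-word, so context at a boundary is lost to both neighboring chunks. A common refinement is overlapping chunks, where each chunk repeats the tail of the previous one. A minimal sketch; `chunk_with_overlap` is a hypothetical helper, not part of the class above:

```python
def chunk_with_overlap(text, chunk_size=500, overlap=50):
    """Split text into chunk_size pieces sharing `overlap` characters."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_with_overlap("abcdefghij", chunk_size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

The overlap trades a little storage for better recall on queries that match text straddling a chunk boundary.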
Cloud and Managed Services
ClickHouse Cloud
# ClickHouse Cloud (managed service)
# - Auto-scaling
# - Managed replication
# - Built-in backup
# Connect to cloud
# 1. Sign up at clickhouse.cloud
# 2. Get connection string
# 3. Connect
# clickhouse-client speaks the native protocol; ClickHouse Cloud
# exposes it on secure port 9440 (8443 is the HTTPS interface)
clickhouse-client \
  --host CH-host.clickhouse.cloud \
  --port 9440 \
  --secure \
  --user default \
  --password your_password
Self-Hosted Optimizations
<!-- Production config (config.xml) -->
<clickhouse>
    <!-- Efficient resource usage; value is in bytes (32 GiB) -->
    <max_server_memory_usage>34359738368</max_server_memory_usage>
    <max_concurrent_queries>100</max_concurrent_queries>
    <!-- Fast compression -->
    <compression>
        <case>
            <method>zstd</method>
            <level>3</level>
        </case>
    </compression>
    <!-- Merge optimization (MergeTree settings go under merge_tree) -->
    <merge_tree>
        <max_parts_to_merge_at_once>50</max_parts_to_merge_at_once>
    </merge_tree>
</clickhouse>
New SQL Features
Enhanced Functions
-- JSON improvements
SELECT JSONExtractKeys('{"a":1,"b":2}'); -- ['a','b']
-- Map functions
SELECT map('a', 1, 'b', 2);
-- Array-based time series helpers
SELECT arrayDifference([1, 2, 3, 4]); -- [0,1,1,1]
SELECT arrayCumSum([1, 2, 3, 4]); -- [1,3,6,10]
-- SQL standard functions
SELECT * FROM table WHERE column IS NOT DISTINCT FROM NULL;
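The difference and running-total behavior shown above maps directly onto plain Python. This is a sketch for intuition, not the server implementation:

```python
def array_difference(xs):
    """First element 0, then pairwise deltas (like ClickHouse arrayDifference)."""
    return [0] + [b - a for a, b in zip(xs, xs[1:])]

def array_cum_sum(xs):
    """Running totals (like ClickHouse arrayCumSum)."""
    out, total = [], 0
    for x in xs:
        total += x
        out.append(total)
    return out

print(array_difference([1, 2, 3, 4]))  # [0, 1, 1, 1]
print(array_cum_sum([1, 2, 3, 4]))     # [1, 3, 6, 10]
```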
Table Functions
-- S3 table function
SELECT * FROM s3('s3://bucket/*.parquet');
-- Remote table function
SELECT * FROM remote('localhost', 'database', 'table');
-- Generate series (range takes optional start, end, step; end is exclusive)
SELECT range(10); -- [0,1,2,...,9]
SELECT range(10, 21, 2); -- [10,12,14,16,18,20]
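ClickHouse's `range` uses the same exclusive-end, stepped semantics as Python's `range`, which makes series expressions easy to sanity-check locally. A plain-Python sketch (the `ch_range` name is a hypothetical local mirror, not a client call):

```python
def ch_range(*args):
    """Local mirror of ClickHouse range(): exclusive end, optional start/step."""
    return list(range(*args))

print(ch_range(10))         # 0 through 9
print(ch_range(10, 21, 2))  # [10, 12, 14, 16, 18, 20]
```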
Ecosystem Growth
Growing Tools
- ClickHouse Keeper: ZooKeeper replacement
- clickhouse-operator (Altinity): Kubernetes operator
- chDB: Embedded, in-process ClickHouse
- Tabix: Web UI
- ClickVisual: Log analysis and visualization
Integration Partners
- dbt-clickhouse: dbt adapter
- Airbyte: Data integration
- Fivetran: ELT pipelines
- Kafka: Streaming ingestion
- S3: Object storage
Best Practices for 2026
Recommended Configuration
<!-- Production-ready settings -->
<clickhouse>
    <!-- Memory (server-level, in bytes: 32 GiB) -->
    <max_server_memory_usage>34359738368</max_server_memory_usage>
    <!-- Query-level settings belong in a user profile (8 GiB spill threshold) -->
    <profiles>
        <default>
            <max_bytes_before_external_group_by>8589934592</max_bytes_before_external_group_by>
            <max_threads>16</max_threads>
            <max_parallel_replicas>4</max_parallel_replicas>
        </default>
    </profiles>
    <!-- Compression -->
    <compression>
        <case>
            <method>zstd</method>
            <level>3</level>
        </case>
    </compression>
    <!-- Merges (MergeTree settings) -->
    <merge_tree>
        <max_parts_to_merge_at_once>100</max_parts_to_merge_at_once>
    </merge_tree>
</clickhouse>
Migration Tips
-- From other databases:
-- 1. Export to Parquet/CSV
-- 2. Create ClickHouse tables
-- 3. Import with INSERT
-- Performance tuning:
-- 1. Partition by time
-- 2. Choose good primary key
-- 3. Add secondary indexes
-- 4. Use materialized views
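For the import step, inserting in large batches rather than row-by-row is the single biggest importer win. A generic batching helper can be sketched in plain Python; `batched` is a hypothetical utility whose output you would feed to `clickhouse_connect`'s `client.insert`:

```python
def batched(rows, batch_size=100_000):
    """Yield rows in fixed-size batches for bulk INSERTs."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

# Example: 250 rows in batches of 100 -> sizes 100, 100, 50
sizes = [len(b) for b in batched(range(250), batch_size=100)]
print(sizes)  # [100, 100, 50]
```

Batch sizes in the tens of thousands of rows keep part creation overhead low, which matters because every INSERT creates a new MergeTree part.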
Future Roadmap
Expected Developments
- Enhanced vector search: More index types, better performance
- Cloud features: More managed capabilities
- AI/ML: Better integration with ML pipelines
- Performance: Continued optimization
- Security: Enhanced access control
Conclusion
ClickHouse in 2025-2026 is expanding beyond traditional analytics into AI and ML workloads. With vector search, better cloud integration, and continued performance improvements, ClickHouse is well-positioned for the next generation of data applications.
In the next article, we’ll explore AI applications with ClickHouse, including vector search, RAG pipelines, and ML feature engineering.