Introduction
DuckDB continues to evolve rapidly in 2025-2026, with major releases bringing powerful new features including vector similarity search, improved performance, and expanded ecosystem support. Adoption has grown quickly, making DuckDB one of the most widely used engines for embedded analytics.
This article explores the latest DuckDB developments, new features, and emerging patterns in the DuckDB ecosystem.
Version 1.x Releases
Latest Stable: v1.4.4 (February 2026)
# Installation
pip install duckdb==1.4.4
# Verify version
duckdb --version
Key Features in v1.x
-- 1. Improved ARRAY type (v0.10+)
CREATE TABLE vectors (
id INTEGER,
embedding DOUBLE[1536]
);
-- 2. Quick per-column statistics with SUMMARIZE
SUMMARIZE metrics;
-- 3. Enhanced window functions
SELECT
LAG(value) OVER (ORDER BY date) as prev_value,
LEAD(value) OVER (ORDER BY date) as next_value
FROM metrics;
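LAG and LEAD pull values from the previous and next rows in the window's order. A stdlib-only sketch of the same semantics, for intuition (illustrative only, not how DuckDB implements it):

```python
def lag_lead(values):
    """Return (prev, next) pairs mirroring LAG(value)/LEAD(value) over an ordered column."""
    out = []
    for i, v in enumerate(values):
        prev_value = values[i - 1] if i > 0 else None                 # LAG(value)
        next_value = values[i + 1] if i < len(values) - 1 else None   # LEAD(value)
        out.append((prev_value, next_value))
    return out

print(lag_lead([10, 20, 30]))  # → [(None, 20), (10, 30), (20, None)]
```

As in SQL, the first row has no LAG value and the last row has no LEAD value, which is why both sides fall back to None (NULL in SQL).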
Performance Improvements
-- Faster aggregations (illustrative numbers; exact timings vary by hardware)
-- Query: aggregation over 100M rows
-- v1.3: ~2.5s
-- v1.4: ~1.8s (~28% faster)
-- Better parallel scanning:
-- up to 3x faster on multi-core systems
Vector Similarity Search
VSS Extension
DuckDB v0.10+ introduced vector similarity search support through the vss extension:
-- Install vector search extension
INSTALL vss;
LOAD vss;
-- Create vector table
CREATE TABLE document_embeddings (
id INTEGER,
document TEXT,
embedding FLOAT[384]
);
-- Create HNSW index; pick the metric that matches your queries (cosine here).
-- On a persistent database, first run:
-- SET hnsw_enable_experimental_persistence = true;
CREATE INDEX idx_embedding ON document_embeddings
USING HNSW (embedding) WITH (metric = 'cosine');
-- Insert vector data
INSERT INTO document_embeddings VALUES
(1, 'Python tutorial', [0.1, 0.2, 0.3, ...]),
(2, 'JavaScript guide', [0.2, 0.1, 0.4, ...]);
Similarity Search Queries
-- Cosine similarity search
SELECT
id,
document,
array_cosine_distance(embedding, [0.1, 0.2, 0.3, ...]) as distance
FROM document_embeddings
ORDER BY distance
LIMIT 5;
-- With an HNSW index in place, the planner serves this same
-- ORDER BY distance ... LIMIT shape from the index; verify with EXPLAIN
EXPLAIN
SELECT
  id,
  document,
  array_cosine_distance(embedding, [0.1, 0.2, 0.3, ...]) as distance
FROM document_embeddings
ORDER BY distance
LIMIT 5;
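array_cosine_distance returns 1 minus the cosine similarity of two vectors: 0 for vectors pointing the same way, 1 for orthogonal, 2 for opposite. A stdlib-only sketch of the metric and the brute-force top-N scan that an HNSW index approximates (for intuition; DuckDB's implementation is vectorized C++):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_n(query, rows, n=5):
    """Brute-force nearest-neighbour search over (id, vector) rows."""
    return sorted(rows, key=lambda r: cosine_distance(r[1], query))[:n]

docs = [(1, [0.1, 0.2, 0.3]), (2, [0.3, 0.2, 0.1])]
print(top_n([0.1, 0.2, 0.3], docs, n=1))  # nearest document first
```

The brute-force version is O(rows) per query; HNSW trades a little accuracy for near-logarithmic lookups, which is why it matters at embedding-table scale.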
Python Integration with Vectors
import duckdb
import numpy as np
from sentence_transformers import SentenceTransformer
# Enable vss extension
con = duckdb.connect()
con.execute("INSTALL vss")
con.execute("LOAD vss")
# Create table
con.execute("""
CREATE TABLE embeddings (
id INTEGER,
text TEXT,
vector FLOAT[384]
)
""")
# Generate embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
texts = ["Python tutorial", "JavaScript guide"]
for i, text in enumerate(texts):
    vector = model.encode(text)
    con.execute(
        "INSERT INTO embeddings VALUES (?, ?, ?)",
        (i, text, list(vector)),
    )
# Search
query = "programming language tutorial"
query_vector = model.encode(query)
result = con.execute("""
SELECT id, text, array_cosine_distance(vector, ?) as distance
FROM embeddings
ORDER BY distance
LIMIT 5
""", [list(query_vector)]).fetchall()
print(result)
Extension Ecosystem
Popular Extensions
-- 1. JSON extension
INSTALL json;
LOAD json;
-- 2. Spatial/GIS extension
INSTALL spatial;
LOAD spatial;
-- 3. HTTP file system
INSTALL httpfs;
LOAD httpfs;
-- 4. MySQL scanner (extension name: mysql)
INSTALL mysql;
LOAD mysql;
-- 5. PostgreSQL scanner (extension name: postgres)
INSTALL postgres;
LOAD postgres;
Reading from S3
-- Install HTTPFS for S3 support
INSTALL httpfs;
LOAD httpfs;
-- Configure S3 credentials; since v0.10, secrets are the preferred mechanism
CREATE SECRET s3_creds (
  TYPE S3,
  KEY_ID 'your_key',
  SECRET 'your_secret',
  REGION 'us-east-1'
);
-- (the legacy SET s3_access_key_id / s3_secret_access_key / s3_region
--  variables still work but are deprecated)
-- Query directly from S3
SELECT * FROM read_parquet('s3://bucket/data/*.parquet');
Reading from Databases
-- MySQL (connection string is a space-separated key=value list)
INSTALL mysql;
LOAD mysql;
ATTACH 'host=localhost user=user password=pass database=mydb' AS mysql_db (TYPE mysql);
SELECT * FROM mysql_db.mysql_table LIMIT 100;
-- PostgreSQL (accepts a libpq connection string or URI)
INSTALL postgres;
LOAD postgres;
ATTACH 'postgresql://user:pass@localhost/mydb' AS pg_db (TYPE postgres);
SELECT * FROM pg_db.pg_table LIMIT 100;
Python Ecosystem Integration
Pandas Integration
import duckdb
import pandas as pd
# Pandas to DuckDB
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
con = duckdb.connect()
con.execute("CREATE TABLE test AS SELECT * FROM df")
# DuckDB to Pandas
df = con.execute("SELECT * FROM test").df()
# Directly query DataFrame
con.execute("SELECT * FROM df WHERE a > 1").df()
Polars Integration
import duckdb
import polars as pl
# Polars to DuckDB
df = pl.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
con = duckdb.connect()
con.execute("CREATE TABLE test AS SELECT * FROM df")
# DuckDB to Polars
df = con.execute("SELECT * FROM test").pl()
# Query Polars DataFrame
con.execute("SELECT * FROM df WHERE a > 1").pl()
DBT Integration
# Install dbt-duckdb
pip install dbt-duckdb
# Create dbt project
dbt init analytics_project
# Configure profiles.yml
analytics_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: analytics.db
      threads: 4
# dbt model example
# models/staging/staging_orders.sql
SELECT
order_id,
customer_id,
amount,
created_at
FROM read_csv_auto('orders.csv')
WHERE status = 'completed'
Cloud and Serverless
DuckDB in Serverless Functions
# AWS Lambda with DuckDB
import duckdb
import boto3
import json

def handler(event, context):
    # Download data from S3 into Lambda's writable /tmp directory
    s3 = boto3.client('s3')
    s3.download_file('data-bucket', 'events.csv', '/tmp/events.csv')
    # Query with an in-memory DuckDB database
    conn = duckdb.connect(':memory:')
    result = conn.execute("""
        SELECT date_trunc('day', event_time) as day,
               COUNT(*) as events
        FROM read_csv_auto('/tmp/events.csv')
        GROUP BY 1
        ORDER BY 1
    """).fetchall()
    # default=str makes the date values JSON-serializable
    return {
        'statusCode': 200,
        'body': json.dumps(result, default=str)
    }
DuckDB + MotherDuck
# MotherDuck - managed DuckDB cloud service
# Connect with the standard duckdb client via an md: path
# (requires a MotherDuck auth token, e.g. in the motherduck_token
# environment variable)
import duckdb
con = duckdb.connect('md:my_database')
# Query shared datasets
result = con.execute("SELECT * FROM shared_data.table").df()
Data Science Workflows
Jupyter Integration
# In a Jupyter notebook
%pip install duckdb
import duckdb
import pandas as pd
# Create connection
con = duckdb.connect('analytics.duckdb')
# Query and display (a %sql magic requires the separate jupysql package;
# with plain duckdb, call .df() for a rendered DataFrame)
con.execute("SELECT * FROM events LIMIT 10").df()
# Load a large remote dataset efficiently
df = con.execute("""
    SELECT *
    FROM read_parquet('s3://data/*.parquet')
    WHERE date > '2025-01-01'
""").df()
# Close connection
con.close()
ML Feature Engineering
import duckdb
import numpy as np
from sklearn.model_selection import train_test_split
# Create features in DuckDB
con = duckdb.connect('features.duckdb')
con.execute("""
CREATE TABLE features AS
SELECT
user_id,
COUNT(*) as total_events,
AVG(duration) as avg_duration,
SUM(CASE WHEN action = 'purchase' THEN 1 ELSE 0 END) as purchase_count,
MAX(event_time) as last_event
FROM events
GROUP BY user_id
""")
# Export features and labels (assumes a `labels` table keyed by user_id)
data = con.execute("""
    SELECT f.total_events, f.avg_duration, f.purchase_count, l.label
    FROM features f JOIN labels l USING (user_id)
""").df()
X = data.drop(columns='label').to_numpy()
y = data['label'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Best Practices for 2026
Recommended Configuration
# Production DuckDB setup
import duckdb

config = {
    'memory_limit': '8GB',
    'threads': 8,
    'checkpoint_threshold': '16GB',
    'enable_profiling': 'query_tree',
}
con = duckdb.connect('analytics.duckdb', config=config)
New Patterns
-- Use CTEs for complex queries
WITH base AS (
SELECT * FROM read_csv_auto('data.csv')
),
filtered AS (
SELECT * FROM base WHERE date > '2025-01-01'
)
SELECT
category,
COUNT(*) as count,
SUM(amount) as total
FROM filtered
GROUP BY category;
-- Use window functions for analytics
SELECT *,
AVG(salary) OVER (PARTITION BY department) as dept_avg,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank
FROM employees;
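Partitioned window functions compute per-group values without collapsing rows the way GROUP BY does. The same logic as the query above in stdlib-only Python, for intuition (illustrative; DuckDB evaluates windows far more efficiently):

```python
from collections import defaultdict

def dept_stats(employees):
    """Mirror AVG(salary) OVER (PARTITION BY department) and
    RANK() OVER (PARTITION BY department ORDER BY salary DESC)."""
    by_dept = defaultdict(list)
    for name, dept, salary in employees:
        by_dept[dept].append(salary)
    out = []
    for name, dept, salary in employees:
        salaries = by_dept[dept]
        dept_avg = sum(salaries) / len(salaries)
        # RANK: 1 + count of strictly higher salaries in the partition
        # (ties share a rank, matching SQL RANK semantics)
        dept_rank = 1 + sum(1 for s in salaries if s > salary)
        out.append((name, dept, salary, dept_avg, dept_rank))
    return out

rows = dept_stats([("a", "eng", 100), ("b", "eng", 80), ("c", "ops", 90)])
```

Note that every input row survives with its partition's aggregate attached, which is exactly what makes window functions convenient for analytics.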
Community and Ecosystem
Growing Ecosystem
- dbt-duckdb: dbt adapter for DuckDB
- DuckDB-Wasm: Browser-based DuckDB
- MotherDuck: Cloud-managed DuckDB
- Cheatography: DuckDB cheat sheet
- Awesome DuckDB: Curated resources
Conclusion
DuckDB in 2025-2026 represents a significant evolution from its origins as a simple embedded database. With vector search, improved extensions, and growing cloud support, DuckDB is well-positioned for both traditional analytics and emerging AI workloads.
In the next article, we’ll explore AI applications with DuckDB, including vector search integration, ML pipelines, and RAG implementations.