
DuckDB Trends 2025-2026: New Features, Extensions, and Emerging Use Cases

Introduction

DuckDB continues to evolve rapidly in 2025-2026, with major releases bringing powerful new features including vector similarity search, improved performance, and expanded ecosystem support. Adoption has grown quickly, and DuckDB is now one of the most widely used engines for embedded analytics.

This article explores the latest DuckDB developments, new features, and emerging patterns in the DuckDB ecosystem.


Version 1.x Releases

Latest Stable at the Time of Writing: v1.4.4 (February 2026)

# Installation
pip install duckdb==1.4.4

# Verify version
duckdb --version

Key Features in v1.x

-- 1. Improved ARRAY type (v0.10+)
CREATE TABLE vectors (
    id INTEGER,
    embedding DOUBLE[1536]
);

-- 2. Better statistics (maintained automatically; inspect how the
--    optimizer uses them with EXPLAIN ANALYZE)
EXPLAIN ANALYZE SELECT count(*) FROM vectors;

-- 3. Enhanced window functions
SELECT 
    LAG(value) OVER (ORDER BY date) as prev_value,
    LEAD(value) OVER (ORDER BY date) as next_value
FROM metrics;

Performance Improvements

-- Faster aggregations (illustrative numbers; rerun on your own data)
-- Query: 100M-row aggregation
-- v1.3: ~2.5s
-- v1.4: ~1.8s (~28% faster)

-- Better parallel scanning
-- Up to 3x faster on multi-core systems

VSS Extension

DuckDB v0.10+ introduced vector similarity search support through the vss extension:

-- Install vector search extension
INSTALL vss;
LOAD vss;

-- Create vector table
CREATE TABLE document_embeddings (
    id INTEGER,
    document TEXT,
    embedding FLOAT[384]
);

-- Create HNSW index (use the cosine metric so cosine-distance
-- queries can use it; for on-disk databases, HNSW persistence is
-- experimental: SET hnsw_enable_experimental_persistence = true)
CREATE INDEX idx_embedding ON document_embeddings 
USING HNSW (embedding) WITH (metric = 'cosine');

-- Insert vector data
INSERT INTO document_embeddings VALUES
(1, 'Python tutorial', [0.1, 0.2, 0.3, ...]),
(2, 'JavaScript guide', [0.2, 0.1, 0.4, ...]);

Similarity Search Queries

-- Cosine similarity search
SELECT 
    id,
    document,
    array_cosine_distance(embedding, [0.1, 0.2, 0.3, ...]) as distance
FROM document_embeddings
ORDER BY distance
LIMIT 5;

-- Using the HNSW index (faster): the index accelerates
-- ORDER BY <distance> ... LIMIT k queries, not WHERE filters
SELECT 
    id,
    document,
    array_cosine_distance(embedding, [0.1, 0.2, 0.3, ...]) as distance
FROM document_embeddings
ORDER BY array_cosine_distance(embedding, [0.1, 0.2, 0.3, ...])
LIMIT 5;
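For intuition, array_cosine_distance(a, b) is 1 minus the cosine similarity of the two vectors. A minimal pure-Python sketch of the formula:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0.0 for identical directions, 1.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # -> 0.0 (same direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # -> 1.0 (orthogonal)
```

Because the distance depends only on direction, not magnitude, it is a common choice for comparing text embeddings.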

Python Integration with Vectors

import duckdb
import numpy as np
from sentence_transformers import SentenceTransformer

# Enable vss extension
con = duckdb.connect()
con.execute("INSTALL vss")
con.execute("LOAD vss")

# Create table
con.execute("""
    CREATE TABLE embeddings (
        id INTEGER,
        text TEXT,
        vector FLOAT[384]
    )
""")

# Generate embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
texts = ["Python tutorial", "JavaScript guide"]

for i, text in enumerate(texts):
    vector = model.encode(text)
    con.execute(
        "INSERT INTO embeddings VALUES (?, ?, ?)",
        (i, text, list(vector))
    )

# Search
query = "programming language tutorial"
query_vector = model.encode(query)
result = con.execute("""
    SELECT id, text, array_cosine_distance(vector, ?) as distance
    FROM embeddings
    ORDER BY distance
    LIMIT 5
""", [list(query_vector)]).fetchall()

print(result)

Extension Ecosystem

-- 1. JSON extension
INSTALL json;
LOAD json;

-- 2. Spatial/GIS extension
INSTALL spatial;
LOAD spatial;

-- 3. HTTP file system
INSTALL httpfs;
LOAD httpfs;

-- 4. MySQL scanner
INSTALL mysql;
LOAD mysql;

-- 5. PostgreSQL scanner
INSTALL postgres;
LOAD postgres;

Reading from S3

-- Install HTTPFS for S3 support
INSTALL httpfs;
LOAD httpfs;

-- Configure credentials; CREATE SECRET is the recommended approach
-- in recent versions (the older SET s3_access_key_id=... variables
-- still work)
CREATE SECRET (
    TYPE S3,
    KEY_ID 'your_key',
    SECRET 'your_secret',
    REGION 'us-east-1'
);

-- Query directly from S3
SELECT * FROM read_parquet('s3://bucket/data/*.parquet');

Reading from Databases

-- MySQL
INSTALL mysql;
LOAD mysql;

ATTACH 'mysql://user:pass@localhost/mydb' AS mysql_db;

SELECT * FROM mysql_db.mysql_table LIMIT 100;

-- PostgreSQL
INSTALL postgres;
LOAD postgres;

ATTACH 'postgresql://user:pass@localhost/mydb' AS pg_db;

SELECT * FROM pg_db.pg_table LIMIT 100;

Python Ecosystem Integration

Pandas Integration

import duckdb
import pandas as pd

# Pandas to DuckDB
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
con = duckdb.connect()
con.execute("CREATE TABLE test AS SELECT * FROM df")

# DuckDB to Pandas
df = con.execute("SELECT * FROM test").df()

# Directly query DataFrame
con.execute("SELECT * FROM df WHERE a > 1").df()

Polars Integration

import duckdb
import polars as pl  # interchange goes through Apache Arrow (install pyarrow)

# Polars to DuckDB
df = pl.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
con = duckdb.connect()
con.execute("CREATE TABLE test AS SELECT * FROM df")

# DuckDB to Polars
df = con.execute("SELECT * FROM test").pl()

# Query Polars DataFrame
con.execute("SELECT * FROM df WHERE a > 1").pl()

DBT Integration

# Install dbt-duckdb
pip install dbt-duckdb

# Create dbt project
dbt init analytics_project

# Configure profiles.yml
analytics_project:
  outputs:
    dev:
      type: duckdb
      path: analytics.db
      threads: 4

-- dbt model example
-- models/staging/staging_orders.sql
SELECT 
    order_id,
    customer_id,
    amount,
    created_at
FROM read_csv_auto('orders.csv')
WHERE status = 'completed'

Cloud and Serverless

DuckDB in Serverless Functions

# AWS Lambda with DuckDB
import duckdb
import boto3
import json

def handler(event, context):
    # Download data from S3 to /tmp, Lambda's only writable path
    s3 = boto3.client('s3')
    s3.download_file('data-bucket', 'events.csv', '/tmp/events.csv')
    
    # Load into an in-memory DuckDB database
    conn = duckdb.connect(':memory:')
    
    # Query
    result = conn.execute("""
        SELECT date_trunc('day', event_time) as day, 
               COUNT(*) as events
        FROM read_csv_auto('/tmp/events.csv')
        GROUP BY 1
        ORDER BY 1
    """).fetchall()
    
    return {
        'statusCode': 200,
        'body': json.dumps(result, default=str)  # default=str serializes dates
    }

DuckDB + MotherDuck

# MotherDuck - managed DuckDB cloud service
# (check the current MotherDuck docs; the service evolves quickly)

# Connect via the regular duckdb client using an md: path
# import duckdb
# con = duckdb.connect('md:my_database')  # reads motherduck_token from the environment

# Query shared datasets
# result = con.execute("SELECT * FROM shared_data.table").df()

Data Science Workflows

Jupyter Integration

# In a Jupyter notebook
%pip install duckdb jupysql duckdb-engine

import duckdb
import pandas as pd

# Create connection
con = duckdb.connect('analytics.duckdb')

# SQL cell magic (provided by jupysql, not by duckdb itself)
%load_ext sql
%sql duckdb:///analytics.duckdb
%sql SELECT * FROM events LIMIT 10

# Load large dataset efficiently
df = con.execute("""
    SELECT * 
    FROM read_parquet('s3://data/*.parquet')
    WHERE date > '2025-01-01'
""").df()

# Close connection
con.close()

ML Feature Engineering

import duckdb
import numpy as np
from sklearn.model_selection import train_test_split

# Create features in DuckDB (the 'label' column here is a
# hypothetical target derived from purchase behaviour)
con = duckdb.connect('features.duckdb')
con.execute("""
    CREATE TABLE features AS
    SELECT 
        user_id,
        COUNT(*) as total_events,
        AVG(duration) as avg_duration,
        SUM(CASE WHEN action = 'purchase' THEN 1 ELSE 0 END) as purchase_count,
        MAX(event_time) as last_event,
        MAX(CASE WHEN action = 'purchase' THEN 1 ELSE 0 END) as label
    FROM events
    GROUP BY user_id
""")

# Export numeric feature columns and the label as NumPy arrays
data = con.execute("""
    SELECT total_events, avg_duration, purchase_count, label
    FROM features
    WHERE user_id IS NOT NULL
""").fetchnumpy()
X = np.column_stack([data['total_events'], data['avg_duration'], data['purchase_count']])
y = data['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Best Practices for 2026

# Production DuckDB setup
config = {
    'memory_limit': '8GB',
    'threads': 8,
    'checkpoint_threshold': '16GB',
}

# Pass the config when opening the database
con = duckdb.connect('analytics.duckdb', config=config)

# Profiling is enabled per-session
con.execute("SET enable_profiling = 'query_tree'")

New Patterns

-- Use CTEs for complex queries
WITH base AS (
    SELECT * FROM read_csv_auto('data.csv')
),
filtered AS (
    SELECT * FROM base WHERE date > '2025-01-01'
)
SELECT 
    category,
    COUNT(*) as count,
    SUM(amount) as total
FROM filtered
GROUP BY category;

-- Use window functions for analytics
SELECT *,
    AVG(salary) OVER (PARTITION BY department) as dept_avg,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank
FROM employees;

Community and Ecosystem

Growing Ecosystem

  • dbt-duckdb: dbt adapter for DuckDB
  • DuckDB-Wasm: Browser-based DuckDB
  • MotherDuck: Cloud-managed DuckDB
  • Cheatography: DuckDB cheat sheet
  • Awesome DuckDB: Curated resources

Conclusion

DuckDB in 2025-2026 represents a significant evolution from its origins as a simple embedded database. With vector search, improved extensions, and growing cloud support, DuckDB is well-positioned for both traditional analytics and emerging AI workloads.

In the next article, we’ll explore AI applications with DuckDB, including vector search integration, ML pipelines, and RAG implementations.
