
DuckDB: The Complete Guide to Embedded Analytical Database

Introduction

DuckDB is an open-source, embedded analytical database that has revolutionized how data analysts and scientists work with data. Often called “SQLite for Analytics,” DuckDB provides high-performance OLAP capabilities in a simple, portable package that runs entirely within the application process.

In 2026, DuckDB continues to gain momentum as the go-to solution for analytical workloads. This comprehensive guide covers DuckDB fundamentals, SQL operations, and practical examples for modern data analysis.


What is DuckDB?

DuckDB is an in-process, embedded SQL OLAP database management system. Unlike traditional database servers, DuckDB runs entirely within the application, eliminating the need for separate server processes and network communication.

Key Characteristics

  • Embedded: Runs in-process with your application
  • Columnar Storage: Optimized for analytical queries
  • Vectorized Execution: Processes data in batches for speed
  • Zero Dependencies: Single binary, no external requirements
  • Portable: Databases stored as single files
  • SQL Support: Full SQL for analytics
  • ACID Transactions: Reliable data handling

DuckDB vs Other Databases

| Aspect       | DuckDB         | SQLite          | PostgreSQL      | ClickHouse      |
|--------------|----------------|-----------------|-----------------|-----------------|
| Type         | OLAP           | OLTP            | Hybrid          | OLAP            |
| Architecture | Embedded       | Embedded        | Client-Server   | Client-Server   |
| Storage      | Columnar       | Row-based       | Row-based       | Columnar        |
| Complexity   | Simple         | Simple          | Complex         | Complex         |
| Best For     | Analytics      | Embedded apps   | General purpose | Heavy analytics |
| Parallelism  | Multi-threaded | Single-threaded | Multi-threaded  | Distributed     |

DuckDB Data Types

Numeric Types

-- Integer types
CREATE TABLE numbers (
    tiny TINYINT,       -- 8-bit
    small SMALLINT,     -- 16-bit
    medium INTEGER,     -- 32-bit
    big BIGINT,         -- 64-bit
    huge HUGEINT        -- 128-bit
);

-- Floating-point types
CREATE TABLE measurements (
    id INTEGER,
    temperature DOUBLE,
    precision_value DECIMAL(10,2)
);

-- Insert numeric data
INSERT INTO numbers VALUES (127, 32767, 2147483647, 9223372036854775807, 170141183460469231731687303715884105727);
INSERT INTO measurements VALUES (1, 23.5, 12345678.90);

String Types

-- String types
CREATE TABLE text_examples (
    id INTEGER,
    fixed_length CHAR(10),     -- Alias for VARCHAR; DuckDB does not pad
    variable VARCHAR(100),     -- The declared length is not enforced
    unlimited TEXT             -- Also an alias for VARCHAR
);

-- Special string functions
SELECT LENGTH('hello');
SELECT UPPER('hello'), LOWER('HELLO');
SELECT SUBSTRING('hello world', 1, 5);  -- 'hello'
SELECT REPLACE('hello world', 'world', 'duckdb');  -- 'hello duckdb'

Date and Time Types

-- Date and time types
CREATE TABLE events (
    event_date DATE,
    event_time TIME,
    event_datetime TIMESTAMP,
    event_ts TIMESTAMPTZ
);

-- Insert date/time values
INSERT INTO events VALUES 
    ('2026-03-05', '14:30:00', '2026-03-05 14:30:00', '2026-03-05 14:30:00+00');

-- Date/time functions
SELECT CURRENT_DATE;
SELECT CURRENT_TIMESTAMP;
SELECT DATE '2026-01-01' + INTERVAL 30 DAY;
SELECT datediff('day', DATE '2026-01-01', DATE '2026-03-05');  -- 63
SELECT EXTRACT(YEAR FROM DATE '2026-03-05');  -- 2026

Array Types (v0.10+)

-- Array type for vectors and multi-value columns
CREATE TABLE vectors (
    id INTEGER,
    embedding DOUBLE[3]
);

INSERT INTO vectors VALUES (1, [0.1, 0.2, 0.3]);

-- Array operations
SELECT array_length([1, 2, 3, 4]);  -- 4
SELECT [1, 2, 3] || [4, 5];         -- [1, 2, 3, 4, 5]
SELECT array_distance([1.0, 2.0, 3.0]::FLOAT[3], [1.0, 2.0, 4.0]::FLOAT[3]);  -- 1.0

JSON Type

-- JSON type (via the json extension, bundled and autoloaded in recent releases)
INSTALL json;
LOAD json;

CREATE TABLE config_data (
    id INTEGER,
    data JSON
);

INSERT INTO config_data VALUES (1, '{"theme": "dark", "settings": {"lang": "en"}}');

-- JSON extraction
SELECT json_extract(data, '$.theme') FROM config_data;                  -- "dark" (JSON)
SELECT json_extract_string(data, '$.settings.lang') FROM config_data;  -- en (VARCHAR)

DuckDB Installation

CLI Installation

# macOS
brew install duckdb

# Ubuntu/Debian
wget https://github.com/duckdb/duckdb/releases/download/v1.4.4/duckdb_cli-linux-amd64.tar.gz
tar -xf duckdb_cli-linux-amd64.tar.gz
sudo mv duckdb /usr/local/bin/

# Windows
# Download from https://github.com/duckdb/duckdb/releases

# Verify installation
duckdb --version

Python Integration

# Install first: pip install duckdb
import duckdb

# Create or connect to database
con = duckdb.connect('analytics.db')

# Run queries
result = con.execute("SELECT 1 as col").fetchall()
print(result)  # [(1,)]

# Close connection
con.close()

# Or use context manager
with duckdb.connect('analytics.db') as con:
    result = con.execute("SELECT * FROM read_csv_auto('data.csv')").df()

R Integration

# Install package
install.packages("duckdb")

library(DBI)
con <- dbConnect(duckdb::duckdb(), "analytics.db")

# Run query
dbGetQuery(con, "SELECT 1 as col")

# Disconnect
dbDisconnect(con)

Node.js Integration

// Install first: npm install duckdb
const duckdb = require('duckdb');
const db = new duckdb.Database('analytics.db');

db.all("SELECT 1 as col", (err, rows) => {
    console.log(rows);
});

db.close();

Basic SQL Operations

Creating Tables

-- Create table with constraints
CREATE TABLE employees (
    id INTEGER PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    email VARCHAR(100) UNIQUE,
    department VARCHAR(50),
    salary DECIMAL(10,2),
    hire_date DATE DEFAULT CURRENT_DATE
);

-- Create table from CSV
CREATE TABLE sales AS 
SELECT * FROM read_csv_auto('sales_data.csv');

-- Create table from Parquet
CREATE TABLE events AS 
SELECT * FROM read_parquet('events/*.parquet');

INSERT, UPDATE, DELETE

-- Insert single row
INSERT INTO employees (id, name, email, department, salary)
VALUES (1, 'John Doe', '[email protected]', 'Engineering', 75000.00);

-- Insert multiple rows
INSERT INTO employees VALUES 
    (2, 'Jane Smith', '[email protected]', 'Sales', 65000.00),
    (3, 'Bob Wilson', '[email protected]', 'Marketing', 70000.00);

-- Update rows
UPDATE employees 
SET salary = salary * 1.1 
WHERE department = 'Engineering';

-- Delete rows
DELETE FROM employees WHERE salary < 50000;

SELECT Queries

-- Basic SELECT
SELECT * FROM employees;
SELECT name, salary FROM employees;

-- WHERE clause
SELECT * FROM employees WHERE department = 'Engineering';
SELECT * FROM employees WHERE salary BETWEEN 60000 AND 80000;

-- ORDER BY
SELECT * FROM employees ORDER BY salary DESC;
SELECT * FROM employees ORDER BY department, salary DESC;

-- LIMIT
SELECT * FROM employees ORDER BY salary DESC LIMIT 10;

-- DISTINCT
SELECT DISTINCT department FROM employees;

Aggregate Functions

-- Basic aggregates
SELECT 
    COUNT(*) as total_employees,
    AVG(salary) as avg_salary,
    MIN(salary) as min_salary,
    MAX(salary) as max_salary,
    SUM(salary) as total_salary
FROM employees;

-- GROUP BY
SELECT 
    department,
    COUNT(*) as count,
    AVG(salary) as avg_salary
FROM employees
GROUP BY department
HAVING AVG(salary) > 60000;

-- Multiple aggregations
SELECT 
    department,
    COUNT(*) as headcount,
    AVG(salary) as avg_salary,
    MIN(salary) as min_salary,
    MAX(salary) as max_salary
FROM employees
GROUP BY department
ORDER BY avg_salary DESC;

JOIN Operations

-- Create tables for the examples (OR REPLACE redefines the earlier employees table)
CREATE TABLE departments (id INTEGER, name VARCHAR(50));
CREATE OR REPLACE TABLE employees (id INTEGER, name VARCHAR(100), dept_id INTEGER, salary DECIMAL(10,2));

-- INNER JOIN
SELECT e.name, d.name as department
FROM employees e
JOIN departments d ON e.dept_id = d.id;

-- LEFT JOIN
SELECT e.name, d.name as department
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.id;

-- Multiple joins
SELECT 
    e.name,
    d.name as department,
    p.name as project
FROM employees e
JOIN departments d ON e.dept_id = d.id
JOIN employee_projects ep ON e.id = ep.employee_id
JOIN projects p ON ep.project_id = p.id;

Subqueries

-- Subquery in WHERE
SELECT * FROM employees 
WHERE salary > (SELECT AVG(salary) FROM employees);

-- Common table expression (CTE)
WITH dept_avg AS (
    SELECT department, AVG(salary) as avg_salary
    FROM employees
    GROUP BY department
)
SELECT * FROM dept_avg WHERE avg_salary > 70000;

-- Correlated subquery
SELECT e.name, e.salary
FROM employees e
WHERE e.salary > (SELECT AVG(salary) FROM employees WHERE department = e.department);

Working with Data Files

Reading CSV Files

-- Auto-detect schema
SELECT * FROM read_csv_auto('data.csv');

-- Specify options
SELECT * FROM read_csv(
    'data.csv',
    header=true,
    delimiter=',',
    columns={'id': 'INTEGER', 'name': 'VARCHAR', 'value': 'DOUBLE'}
);

-- Query without importing
SELECT department, COUNT(*), AVG(salary)
FROM read_csv_auto('employees.csv')
GROUP BY department;

Reading Parquet Files

-- Read single Parquet file
SELECT * FROM read_parquet('data.parquet');

-- Read multiple Parquet files (glob)
SELECT * FROM read_parquet('data/*.parquet');

-- Read from folder
SELECT * FROM read_parquet('events/2024/*.parquet');

Reading JSON Files

-- Read JSON lines
SELECT * FROM read_json_auto('data.jsonl');

-- Read JSON array
SELECT * FROM read_json('data.json', format='array');

-- Extract JSON data
SELECT 
    json_extract(data, '$.name') as name,
    json_extract(data, '$.age') as age
FROM read_json_auto('users.json');

Exporting Data

-- Export to CSV
COPY (SELECT * FROM employees) TO 'employees.csv' (FORMAT CSV, HEADER);

-- Export to Parquet
COPY (SELECT * FROM employees) TO 'employees.parquet' (FORMAT PARQUET);

-- Export to JSON
COPY (SELECT * FROM employees) TO 'employees.json' (FORMAT JSON);

Advanced Queries

Window Functions

-- Running total
SELECT 
    date,
    amount,
    SUM(amount) OVER (ORDER BY date) as running_total
FROM daily_sales;

-- Rank within groups
SELECT 
    name,
    department,
    salary,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank,
    DENSE_RANK() OVER (ORDER BY salary DESC) as overall_rank
FROM employees;

-- Moving average
SELECT 
    date,
    price,
    AVG(price) OVER (
        ORDER BY date 
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) as moving_avg_7
FROM stock_prices;

-- Lead and Lag
SELECT 
    name,
    hire_date,
    LAG(hire_date) OVER (ORDER BY hire_date) as prev_hire,
    LEAD(hire_date) OVER (ORDER BY hire_date) as next_hire
FROM employees;

Pivot Operations

-- Pivot table (DuckDB's simplified PIVOT syntax)
PIVOT employees
ON department
USING SUM(salary);

-- Manual pivot with conditional aggregation
SELECT 
    '2024' as year,
    SUM(CASE WHEN quarter = 'Q1' THEN revenue END) as Q1,
    SUM(CASE WHEN quarter = 'Q2' THEN revenue END) as Q2,
    SUM(CASE WHEN quarter = 'Q3' THEN revenue END) as Q3,
    SUM(CASE WHEN quarter = 'Q4' THEN revenue END) as Q4
FROM quarterly_sales
GROUP BY 1;

Complex Aggregations

-- Multiple GROUPING SETS
SELECT 
    department,
    location,
    SUM(salary) as total_salary
FROM employees
GROUP BY 
    GROUPING SETS (
        (department, location),
        (department),
        (location),
        ()
    );

-- ROLLUP
SELECT 
    department, 
    location,
    SUM(salary) as total
FROM employees
GROUP BY ROLLUP (department, location);

-- CUBE
SELECT 
    department,
    location,
    SUM(salary) as total
FROM employees
GROUP BY CUBE (department, location);

Views and Materialization

Creating Views

-- Create view
CREATE VIEW engineering_salary AS
SELECT name, salary, hire_date
FROM employees
WHERE department = 'Engineering';

-- Use view
SELECT * FROM engineering_salary ORDER BY salary DESC;

-- Create or replace view
CREATE OR REPLACE VIEW high_earners AS
SELECT name, salary, department
FROM employees
WHERE salary > 80000;

Querying Views

-- View with aggregation
CREATE VIEW dept_summary AS
SELECT 
    department,
    COUNT(*) as employees,
    AVG(salary) as avg_salary
FROM employees
GROUP BY department;

-- Query view with additional filters
SELECT * FROM dept_summary WHERE employees > 5;

Best Practices

Query Optimization

-- Use EXPLAIN to analyze queries
EXPLAIN SELECT * FROM employees WHERE department = 'Engineering';

-- Filter early
-- Good: Filter before aggregation
SELECT department, AVG(salary)
FROM employees
WHERE salary > 50000
GROUP BY department;

-- Use appropriate data types
CREATE TABLE test (
    id INTEGER,        -- Not VARCHAR
    value DECIMAL(10,2)  -- Not TEXT
);

Data Loading

-- Use COPY for bulk loading
COPY large_table FROM 'data.csv' (AUTO_DETECT TRUE);

-- CSV reads are parallelized automatically in recent DuckDB versions
SELECT * FROM read_csv_auto('data.csv');

-- Batch inserts: prefer one multi-row INSERT over many single-row statements
INSERT INTO my_table VALUES (1, 'a'), (2, 'b'), (3, 'c');

Common Pitfalls

Pitfall 1: Not Using Prepared Statements

# Bad: String formatting
query = f"SELECT * FROM users WHERE id = {user_id}"

# Good: Parameterized queries
con.execute("SELECT * FROM users WHERE id = ?", [user_id])

Pitfall 2: Large Result Sets in Memory

# Bad: materializing every row as Python tuples
result = con.execute("SELECT * FROM large_table").fetchall()

# Good: fetch in batches (or pull a more compact pandas DataFrame with .df())
res = con.execute("SELECT * FROM large_table")
while batch := res.fetchmany(10_000):
    process(batch)  # handle one chunk at a time

Pitfall 3: Ignoring Parallelism

-- Enable parallelism for large scans
SET threads=8;

-- Check current setting
SELECT current_setting('threads');

Conclusion

DuckDB provides an excellent solution for analytical workloads, combining the simplicity of embedded databases with powerful OLAP capabilities. Its zero-configuration setup and excellent performance make it ideal for data analysis, prototyping, and production analytics.

In the next article, we’ll explore DuckDB operations including configuration, performance tuning, and deployment strategies.
