Introduction
DuckDB is an open-source, embedded analytical database that has revolutionized how data analysts and scientists work with data. Often called “SQLite for Analytics,” DuckDB provides high-performance OLAP capabilities in a simple, portable package that runs entirely within the application process.
In 2026, DuckDB continues to gain momentum as the go-to solution for analytical workloads. This comprehensive guide covers DuckDB fundamentals, SQL operations, and practical examples for modern data analysis.
What is DuckDB?
DuckDB is an in-process, embedded SQL OLAP database management system. Unlike traditional database servers, DuckDB runs entirely within the application, eliminating the need for separate server processes and network communication.
Key Characteristics
- Embedded: Runs in-process with your application
- Columnar Storage: Optimized for analytical queries
- Vectorized Execution: Processes data in batches for speed
- Zero Dependencies: Single binary, no external requirements
- Portable: Databases stored as single files
- SQL Support: Full SQL for analytics
- ACID Transactions: Reliable data handling
DuckDB vs Other Databases
| Aspect | DuckDB | SQLite | PostgreSQL | ClickHouse |
|---|---|---|---|---|
| Type | OLAP | OLTP | Hybrid | OLAP |
| Architecture | Embedded | Embedded | Client-Server | Client-Server |
| Storage | Columnar | Row-based | Row-based | Columnar |
| Complexity | Simple | Simple | Complex | Complex |
| Best For | Analytics | Embedded apps | General purpose | Heavy analytics |
| Parallelism | Multi-threaded | Single-threaded | Multi-threaded | Distributed |
DuckDB Data Types
Numeric Types
-- Integer types
CREATE TABLE numbers (
tiny TINYINT,
small SMALLINT,
integer INTEGER,
bigint BIGINT,
hugeint HUGEINT
);
-- Floating-point types
CREATE TABLE measurements (
id INTEGER,
temperature DOUBLE,
precision_value DECIMAL(10,2)
);
-- Insert numeric data
INSERT INTO numbers VALUES (127, 32767, 2147483647, 9223372036854775807, 170141183460469231731687303715884105727);
INSERT INTO measurements VALUES (1, 23.5, 12345678.90);
String Types
-- String types
CREATE TABLE text_examples (
id INTEGER,
fixed_length CHAR(10), -- CHAR is an alias for VARCHAR; length is not enforced
variable VARCHAR(100), -- length is a hint only, not enforced
unlimited TEXT -- TEXT is also an alias for VARCHAR
);
-- Special string functions
SELECT LENGTH('hello');
SELECT UPPER('hello'), LOWER('HELLO');
SELECT SUBSTRING('hello world', 1, 5); -- 'hello'
SELECT REPLACE('hello world', 'world', 'duckdb'); -- 'hello duckdb'
Date and Time Types
-- Date and time types
CREATE TABLE events (
event_date DATE,
event_time TIME,
event_datetime TIMESTAMP,
event_ts TIMESTAMPTZ
);
-- Insert date/time values
INSERT INTO events VALUES
('2026-03-05', '14:30:00', '2026-03-05 14:30:00', '2026-03-05 14:30:00+00');
-- Date/time functions
SELECT CURRENT_DATE;
SELECT CURRENT_TIMESTAMP;
SELECT DATE '2026-01-01' + INTERVAL 30 DAY;
SELECT DATEDIFF('day', DATE '2026-01-01', DATE '2026-03-05'); -- 63
SELECT EXTRACT(YEAR FROM DATE '2026-03-05'); -- 2026
Array Types (v0.10+)
-- Array type for vectors and multi-value columns
CREATE TABLE vectors (
id INTEGER,
embedding DOUBLE[3]
);
INSERT INTO vectors VALUES (1, [0.1, 0.2, 0.3]);
-- Array operations
SELECT array_length([1, 2, 3, 4]); -- 4
SELECT [1, 2, 3] || [4, 5]; -- [1,2,3,4,5]
SELECT array_distance([1, 2, 3]::FLOAT[3], [1, 2, 4]::FLOAT[3]); -- 1.0
JSON Type
-- JSON type (via json extension)
INSTALL json;
LOAD json;
CREATE TABLE config_data (
id INTEGER,
data JSON
);
INSERT INTO config_data VALUES (1, '{"theme": "dark", "settings": {"lang": "en"}}');
-- JSON extraction
SELECT json_extract(data, '$.theme') FROM config_data;
SELECT json_extract_string(data, '$.settings.lang') FROM config_data; -- 'en'
DuckDB Installation
CLI Installation
# macOS
brew install duckdb
# Ubuntu/Debian
wget https://github.com/duckdb/duckdb/releases/download/v1.4.4/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip
sudo mv duckdb /usr/local/bin/
# Windows
# Download from https://github.com/duckdb/duckdb/releases
# Verify installation
duckdb --version
Python Integration
pip install duckdb
import duckdb
# Create or connect to database
con = duckdb.connect('analytics.db')
# Run queries
result = con.execute("SELECT 1 as col").fetchall()
print(result) # [(1,)]
# Close connection
con.close()
# Or use context manager
with duckdb.connect('analytics.db') as con:
    result = con.execute("SELECT * FROM read_csv_auto('data.csv')").df()  # .df() requires pandas
R Integration
# Install package
install.packages("duckdb")
library(DBI)
con <- dbConnect(duckdb::duckdb(), "analytics.db")
# Run query
dbGetQuery(con, "SELECT 1 as col")
# Disconnect
dbDisconnect(con)
Node.js Integration
npm install duckdb
const duckdb = require('duckdb');
const db = new duckdb.Database('analytics.db');
db.all("SELECT 1 as col", (err, rows) => {
console.log(rows);
});
db.close();
Basic SQL Operations
Creating Tables
-- Create table with constraints
CREATE TABLE employees (
id INTEGER PRIMARY KEY,
name VARCHAR(100) NOT NULL,
email VARCHAR(100) UNIQUE,
department VARCHAR(50),
salary DECIMAL(10,2),
hire_date DATE DEFAULT CURRENT_DATE
);
-- Create table from CSV
CREATE TABLE sales AS
SELECT * FROM read_csv_auto('sales_data.csv');
-- Create table from Parquet
CREATE TABLE events AS
SELECT * FROM read_parquet('events/*.parquet');
INSERT, UPDATE, DELETE
-- Insert single row
INSERT INTO employees (id, name, email, department, salary)
VALUES (1, 'John Doe', '[email protected]', 'Engineering', 75000.00);
-- Insert multiple rows
INSERT INTO employees VALUES
(2, 'Jane Smith', '[email protected]', 'Sales', 65000.00),
(3, 'Bob Wilson', '[email protected]', 'Marketing', 70000.00);
-- Update rows
UPDATE employees
SET salary = salary * 1.1
WHERE department = 'Engineering';
-- Delete rows
DELETE FROM employees WHERE salary < 50000;
SELECT Queries
-- Basic SELECT
SELECT * FROM employees;
SELECT name, salary FROM employees;
-- WHERE clause
SELECT * FROM employees WHERE department = 'Engineering';
SELECT * FROM employees WHERE salary BETWEEN 60000 AND 80000;
-- ORDER BY
SELECT * FROM employees ORDER BY salary DESC;
SELECT * FROM employees ORDER BY department, salary DESC;
-- LIMIT
SELECT * FROM employees ORDER BY salary DESC LIMIT 10;
-- DISTINCT
SELECT DISTINCT department FROM employees;
Aggregate Functions
-- Basic aggregates
SELECT
COUNT(*) as total_employees,
AVG(salary) as avg_salary,
MIN(salary) as min_salary,
MAX(salary) as max_salary,
SUM(salary) as total_salary
FROM employees;
-- GROUP BY
SELECT
department,
COUNT(*) as count,
AVG(salary) as avg_salary
FROM employees
GROUP BY department
HAVING AVG(salary) > 60000;
-- Multiple aggregations
SELECT
department,
COUNT(*) as headcount,
AVG(salary) as avg_salary,
MIN(salary) as min_salary,
MAX(salary) as max_salary
FROM employees
GROUP BY department
ORDER BY avg_salary DESC;
JOIN Operations
-- Create tables for the join examples (redefining employees with a dept_id)
CREATE TABLE departments (id INTEGER, name VARCHAR(50));
CREATE OR REPLACE TABLE employees (id INTEGER, name VARCHAR(100), dept_id INTEGER, salary DECIMAL(10,2));
-- INNER JOIN
SELECT e.name, d.name as department
FROM employees e
JOIN departments d ON e.dept_id = d.id;
-- LEFT JOIN
SELECT e.name, d.name as department
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.id;
-- Multiple joins
SELECT
e.name,
d.name as department,
p.name as project
FROM employees e
JOIN departments d ON e.dept_id = d.id
JOIN employee_projects ep ON e.id = ep.employee_id
JOIN projects p ON ep.project_id = p.id;
Subqueries
-- Subquery in WHERE
SELECT * FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
-- Subquery in FROM (CTE)
WITH dept_avg AS (
SELECT department, AVG(salary) as avg_salary
FROM employees
GROUP BY department
)
SELECT * FROM dept_avg WHERE avg_salary > 70000;
-- Correlated subquery
SELECT e.name, e.salary
FROM employees e
WHERE e.salary > (SELECT AVG(salary) FROM employees WHERE department = e.department);
Working with Data Files
Reading CSV Files
-- Auto-detect schema
SELECT * FROM read_csv_auto('data.csv');
-- Specify options
SELECT * FROM read_csv(
'data.csv',
header=true,
delimiter=',',
columns={'id': 'INTEGER', 'name': 'VARCHAR', 'value': 'DOUBLE'}
);
-- Query without importing
SELECT department, COUNT(*), AVG(salary)
FROM read_csv_auto('employees.csv')
GROUP BY department;
Reading Parquet Files
-- Read single Parquet file
SELECT * FROM read_parquet('data.parquet');
-- Read multiple Parquet files (glob)
SELECT * FROM read_parquet('data/*.parquet');
-- Read from folder
SELECT * FROM read_parquet('events/2024/*.parquet');
Reading JSON Files
-- Read JSON lines
SELECT * FROM read_json_auto('data.jsonl');
-- Read JSON array
SELECT * FROM read_json('data.json', format='array');
-- Extract JSON data
SELECT
json_extract(data, '$.name') as name,
json_extract(data, '$.age') as age
FROM read_json_auto('users.json');
Exporting Data
-- Export to CSV
COPY (SELECT * FROM employees) TO 'employees.csv' (FORMAT CSV, HEADER);
-- Export to Parquet
COPY (SELECT * FROM employees) TO 'employees.parquet' (FORMAT PARQUET);
-- Export to JSON
COPY (SELECT * FROM employees) TO 'employees.json' (FORMAT JSON);
Advanced Queries
Window Functions
-- Running total
SELECT
date,
amount,
SUM(amount) OVER (ORDER BY date) as running_total
FROM daily_sales;
-- Rank within groups
SELECT
name,
department,
salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank,
DENSE_RANK() OVER (ORDER BY salary DESC) as overall_rank
FROM employees;
-- Moving average
SELECT
date,
price,
AVG(price) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) as moving_avg_7
FROM stock_prices;
-- Lead and Lag
SELECT
name,
hire_date,
LAG(hire_date) OVER (ORDER BY hire_date) as prev_hire,
LEAD(hire_date) OVER (ORDER BY hire_date) as next_hire
FROM employees;
Pivot Operations
-- Pivot table: one column per department, aggregating salary
PIVOT employees
ON department
USING SUM(salary);
-- Transpose with GROUP BY
SELECT
'2024' as year,
SUM(CASE WHEN quarter = 'Q1' THEN revenue END) as Q1,
SUM(CASE WHEN quarter = 'Q2' THEN revenue END) as Q2,
SUM(CASE WHEN quarter = 'Q3' THEN revenue END) as Q3,
SUM(CASE WHEN quarter = 'Q4' THEN revenue END) as Q4
FROM quarterly_sales
GROUP BY 1;
Complex Aggregations
-- Multiple GROUPING SETS
SELECT
department,
location,
SUM(salary) as total_salary
FROM employees
GROUP BY
GROUPING SETS (
(department, location),
(department),
(location),
()
);
-- ROLLUP
SELECT
department,
location,
SUM(salary) as total
FROM employees
GROUP BY ROLLUP (department, location);
-- CUBE
SELECT
department,
location,
SUM(salary) as total
FROM employees
GROUP BY CUBE (department, location);
Views and Materialization
Creating Views
-- Create view
CREATE VIEW engineering_salary AS
SELECT name, salary, hire_date
FROM employees
WHERE department = 'Engineering';
-- Use view
SELECT * FROM engineering_salary ORDER BY salary DESC;
-- Create or replace view
CREATE OR REPLACE VIEW high_earners AS
SELECT name, salary, department
FROM employees
WHERE salary > 80000;
Querying Views
-- View with aggregation
CREATE VIEW dept_summary AS
SELECT
department,
COUNT(*) as employees,
AVG(salary) as avg_salary
FROM employees
GROUP BY department;
-- Query view with additional filters
SELECT * FROM dept_summary WHERE employees > 5;
Best Practices
Query Optimization
-- Use EXPLAIN to analyze queries
EXPLAIN SELECT * FROM employees WHERE department = 'Engineering';
-- Filter early
-- Good: Filter before aggregation
SELECT department, AVG(salary)
FROM employees
WHERE salary > 50000
GROUP BY department;
-- Use appropriate data types
CREATE TABLE test (
id INTEGER, -- Not VARCHAR
value DECIMAL(10,2) -- Not TEXT
);
Data Loading
-- Use COPY for bulk loading
COPY large_table FROM 'data.csv' (AUTO_DETECT TRUE);
-- CSV reads are parallelized automatically; tune the thread count if needed
SET threads = 8;
-- Batch inserts (one statement, many rows)
INSERT INTO my_table VALUES (1, 'a'), (2, 'b'), (3, 'c');
Common Pitfalls
Pitfall 1: Not Using Prepared Statements
# Bad: String formatting
query = f"SELECT * FROM users WHERE id = {user_id}"
# Good: Parameterized queries
con.execute("SELECT * FROM users WHERE id = ?", [user_id])
Pitfall 2: Large Result Sets in Memory
# Bad: materializing millions of rows at once
result = con.execute("SELECT * FROM large_table").fetchall()
# Good: stream in batches with fetchmany
cur = con.execute("SELECT * FROM large_table")
while batch := cur.fetchmany(10000):
    handle(batch)  # process each batch
Pitfall 3: Ignoring Parallelism
-- Enable parallelism for large scans
SET threads=8;
-- Check current setting
SELECT current_setting('threads');
Conclusion
DuckDB provides an excellent solution for analytical workloads, combining the simplicity of embedded databases with powerful OLAP capabilities. Its zero-configuration setup and excellent performance make it ideal for data analysis, prototyping, and production analytics.
In the next article, we’ll explore DuckDB operations including configuration, performance tuning, and deployment strategies.