Introduction
DuckDB is an open-source, embedded analytical database that has revolutionized how data analysts and scientists work with data. Often called “SQLite for Analytics,” DuckDB provides high-performance OLAP capabilities in a simple, portable package that runs entirely within the application process.
In 2026, DuckDB continues to gain momentum as the go-to solution for analytical workloads. This comprehensive guide covers DuckDB fundamentals, SQL operations, and practical examples for modern data analysis.
What is DuckDB?
DuckDB is an in-process, embedded SQL OLAP database management system. Unlike traditional database servers, DuckDB runs entirely within the application, eliminating the need for separate server processes and network communication.
Key Characteristics
- Embedded: Runs in-process with your application
- Columnar Storage: Optimized for analytical queries
- Vectorized Execution: Processes data in batches for speed
- Zero Dependencies: Single binary, no external requirements
- Portable: Databases stored as single files
- SQL Support: Full SQL for analytics
- ACID Transactions: Reliable data handling
DuckDB vs Other Databases
| Aspect | DuckDB | SQLite | PostgreSQL | ClickHouse |
|---|---|---|---|---|
| Type | OLAP | OLTP | Hybrid | OLAP |
| Architecture | Embedded | Embedded | Client-Server | Client-Server |
| Storage | Columnar | Row-based | Row-based | Columnar |
| Complexity | Simple | Simple | Complex | Complex |
| Best For | Analytics | Embedded apps | General purpose | Heavy analytics |
| Parallelism | Multi-threaded | Single-threaded | Multi-threaded | Distributed |
DuckDB Data Types
Numeric Types
-- Integer types
CREATE TABLE numbers (
tiny TINYINT,
small SMALLINT,
integer INTEGER,
bigint BIGINT,
hugeint HUGEINT
);
-- Floating-point types
CREATE TABLE measurements (
id INTEGER,
temperature DOUBLE,
precision_value DECIMAL(10,2)
);
-- Insert numeric data
INSERT INTO numbers VALUES (127, 32767, 2147483647, 9223372036854775807, 170141183460469231731687303715884105727);
INSERT INTO measurements VALUES (1, 23.5, 12345678.90);
String Types
-- String types
CREATE TABLE text_examples (
id INTEGER,
fixed_length CHAR(10), -- CHAR is an alias for VARCHAR; length is not enforced
variable VARCHAR(100), -- length is a hint only, not enforced
unlimited TEXT -- TEXT is also an alias for VARCHAR
);
-- Special string functions
SELECT LENGTH('hello');
SELECT UPPER('hello'), LOWER('HELLO');
SELECT SUBSTRING('hello world', 1, 5); -- 'hello'
SELECT REPLACE('hello world', 'world', 'duckdb'); -- 'hello duckdb'
Date and Time Types
-- Date and time types
CREATE TABLE events (
event_date DATE,
event_time TIME,
event_datetime TIMESTAMP,
event_ts TIMESTAMPTZ
);
-- Insert date/time values
INSERT INTO events VALUES
('2026-03-05', '14:30:00', '2026-03-05 14:30:00', '2026-03-05 14:30:00+00');
-- Date/time functions
SELECT CURRENT_DATE;
SELECT CURRENT_TIMESTAMP;
SELECT DATE '2026-01-01' + INTERVAL 30 DAY;
SELECT DATEDIFF('day', DATE '2026-01-01', DATE '2026-03-05'); -- 63
SELECT EXTRACT(YEAR FROM DATE '2026-03-05'); -- 2026
Array Types (v0.10+)
-- Array type for vectors and multi-value columns
CREATE TABLE vectors (
id INTEGER,
embedding DOUBLE[3]
);
INSERT INTO vectors VALUES (1, [0.1, 0.2, 0.3]);
-- Array operations
SELECT array_length([1, 2, 3, 4]); -- 4
SELECT [1, 2, 3] || [4, 5]; -- [1,2,3,4,5]
SELECT array_distance([1, 2, 3]::FLOAT[3], [1, 2, 4]::FLOAT[3]); -- 1.0
JSON Type
-- JSON type (via json extension)
INSTALL json;
LOAD json;
CREATE TABLE config_data (
id INTEGER,
data JSON
);
INSERT INTO config_data VALUES (1, '{"theme": "dark", "settings": {"lang": "en"}}');
-- JSON extraction
SELECT json_extract(data, '$.theme') FROM config_data;
SELECT json_extract_string(data, '$.settings.lang') FROM config_data; -- 'en'
DuckDB Installation
CLI Installation
# macOS
brew install duckdb
# Ubuntu/Debian
wget https://github.com/duckdb/duckdb/releases/download/v1.4.4/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip
sudo mv duckdb /usr/local/bin/
# Windows
# Download from https://github.com/duckdb/duckdb/releases
# Verify installation
duckdb --version
Python Integration
pip install duckdb
import duckdb
# Create or connect to database
con = duckdb.connect('analytics.db')
# Run queries
result = con.execute("SELECT 1 as col").fetchall()
print(result) # [(1,)]
# Close connection
con.close()
# Or use context manager
with duckdb.connect('analytics.db') as con:
    result = con.execute("SELECT * FROM read_csv_auto('data.csv')").df()  # .df() requires pandas
R Integration
# Install package
install.packages("duckdb")
library(DBI)
con <- dbConnect(duckdb::duckdb(), "analytics.db")
# Run query
dbGetQuery(con, "SELECT 1 as col")
# Disconnect
dbDisconnect(con)
Node.js Integration
npm install duckdb
const duckdb = require('duckdb');
const db = new duckdb.Database('analytics.db');
db.all("SELECT 1 as col", (err, rows) => {
console.log(rows);
});
db.close();
Basic SQL Operations
Creating Tables
-- Create table with constraints
CREATE TABLE employees (
id INTEGER PRIMARY KEY,
name VARCHAR(100) NOT NULL,
email VARCHAR(100) UNIQUE,
department VARCHAR(50),
salary DECIMAL(10,2),
hire_date DATE DEFAULT CURRENT_DATE
);
-- Create table from CSV
CREATE TABLE sales AS
SELECT * FROM read_csv_auto('sales_data.csv');
-- Create table from Parquet
CREATE TABLE events AS
SELECT * FROM read_parquet('events/*.parquet');
INSERT, UPDATE, DELETE
-- Insert single row
INSERT INTO employees (id, name, email, department, salary)
VALUES (1, 'John Doe', '[email protected]', 'Engineering', 75000.00);
-- Insert multiple rows
INSERT INTO employees VALUES
(2, 'Jane Smith', '[email protected]', 'Sales', 65000.00),
(3, 'Bob Wilson', '[email protected]', 'Marketing', 70000.00);
-- Update rows
UPDATE employees
SET salary = salary * 1.1
WHERE department = 'Engineering';
-- Delete rows
DELETE FROM employees WHERE salary < 50000;
SELECT Queries
-- Basic SELECT
SELECT * FROM employees;
SELECT name, salary FROM employees;
-- WHERE clause
SELECT * FROM employees WHERE department = 'Engineering';
SELECT * FROM employees WHERE salary BETWEEN 60000 AND 80000;
-- ORDER BY
SELECT * FROM employees ORDER BY salary DESC;
SELECT * FROM employees ORDER BY department, salary DESC;
-- LIMIT
SELECT * FROM employees ORDER BY salary DESC LIMIT 10;
-- DISTINCT
SELECT DISTINCT department FROM employees;
Aggregate Functions
-- Basic aggregates
SELECT
COUNT(*) as total_employees,
AVG(salary) as avg_salary,
MIN(salary) as min_salary,
MAX(salary) as max_salary,
SUM(salary) as total_salary
FROM employees;
-- GROUP BY
SELECT
department,
COUNT(*) as count,
AVG(salary) as avg_salary
FROM employees
GROUP BY department
HAVING AVG(salary) > 60000;
-- Multiple aggregations
SELECT
department,
COUNT(*) as headcount,
AVG(salary) as avg_salary,
MIN(salary) as min_salary,
MAX(salary) as max_salary
FROM employees
GROUP BY department
ORDER BY avg_salary DESC;
JOIN Operations
-- Create tables for the join examples (redefining employees with a dept_id)
CREATE TABLE departments (id INTEGER, name VARCHAR(50));
CREATE OR REPLACE TABLE employees (id INTEGER, name VARCHAR(100), dept_id INTEGER, salary DECIMAL(10,2));
-- INNER JOIN
SELECT e.name, d.name as department
FROM employees e
JOIN departments d ON e.dept_id = d.id;
-- LEFT JOIN
SELECT e.name, d.name as department
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.id;
-- Multiple joins
SELECT
e.name,
d.name as department,
p.name as project
FROM employees e
JOIN departments d ON e.dept_id = d.id
JOIN employee_projects ep ON e.id = ep.employee_id
JOIN projects p ON ep.project_id = p.id;
Subqueries
-- Subquery in WHERE
SELECT * FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
-- Subquery in FROM (CTE)
WITH dept_avg AS (
SELECT department, AVG(salary) as avg_salary
FROM employees
GROUP BY department
)
SELECT * FROM dept_avg WHERE avg_salary > 70000;
-- Correlated subquery
SELECT e.name, e.salary
FROM employees e
WHERE e.salary > (SELECT AVG(salary) FROM employees WHERE department = e.department);
Working with Data Files
Reading CSV Files
-- Auto-detect schema
SELECT * FROM read_csv_auto('data.csv');
-- Specify options
SELECT * FROM read_csv(
'data.csv',
header=true,
delimiter=',',
columns={'id': 'INTEGER', 'name': 'VARCHAR', 'value': 'DOUBLE'}
);
-- Query without importing
SELECT department, COUNT(*), AVG(salary)
FROM read_csv_auto('employees.csv')
GROUP BY department;
Reading Parquet Files
-- Read single Parquet file
SELECT * FROM read_parquet('data.parquet');
-- Read multiple Parquet files (glob)
SELECT * FROM read_parquet('data/*.parquet');
-- Read from folder
SELECT * FROM read_parquet('events/2024/*.parquet');
Reading JSON Files
-- Read JSON lines
SELECT * FROM read_json_auto('data.jsonl');
-- Read JSON array
SELECT * FROM read_json('data.json', format='array');
-- Extract JSON data
SELECT
json_extract(data, '$.name') as name,
json_extract(data, '$.age') as age
FROM read_json_auto('users.json');
Exporting Data
-- Export to CSV
COPY (SELECT * FROM employees) TO 'employees.csv' (FORMAT CSV, HEADER);
-- Export to Parquet
COPY (SELECT * FROM employees) TO 'employees.parquet' (FORMAT PARQUET);
-- Export to JSON
COPY (SELECT * FROM employees) TO 'employees.json' (FORMAT JSON);
Advanced Queries
Window Functions
-- Running total
SELECT
date,
amount,
SUM(amount) OVER (ORDER BY date) as running_total
FROM daily_sales;
-- Rank within groups
SELECT
name,
department,
salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank,
DENSE_RANK() OVER (ORDER BY salary DESC) as overall_rank
FROM employees;
-- Moving average
SELECT
date,
price,
AVG(price) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) as moving_avg_7
FROM stock_prices;
-- Lead and Lag
SELECT
name,
hire_date,
LAG(hire_date) OVER (ORDER BY hire_date) as prev_hire,
LEAD(hire_date) OVER (ORDER BY hire_date) as next_hire
FROM employees;
Pivot Operations
-- Pivot table: one column per department, aggregating salary
PIVOT employees
ON department
USING SUM(salary);
-- Transpose with GROUP BY
SELECT
'2024' as year,
SUM(CASE WHEN quarter = 'Q1' THEN revenue END) as Q1,
SUM(CASE WHEN quarter = 'Q2' THEN revenue END) as Q2,
SUM(CASE WHEN quarter = 'Q3' THEN revenue END) as Q3,
SUM(CASE WHEN quarter = 'Q4' THEN revenue END) as Q4
FROM quarterly_sales
GROUP BY 1;
Complex Aggregations
-- Multiple GROUPING SETS
SELECT
department,
location,
SUM(salary) as total_salary
FROM employees
GROUP BY
GROUPING SETS (
(department, location),
(department),
(location),
()
);
-- ROLLUP
SELECT
department,
location,
SUM(salary) as total
FROM employees
GROUP BY ROLLUP (department, location);
-- CUBE
SELECT
department,
location,
SUM(salary) as total
FROM employees
GROUP BY CUBE (department, location);
Views and Materialization
Creating Views
-- Create view
CREATE VIEW engineering_salary AS
SELECT name, salary, hire_date
FROM employees
WHERE department = 'Engineering';
-- Use view
SELECT * FROM engineering_salary ORDER BY salary DESC;
-- Create or replace view
CREATE OR REPLACE VIEW high_earners AS
SELECT name, salary, department
FROM employees
WHERE salary > 80000;
Querying Views
-- View with aggregation
CREATE VIEW dept_summary AS
SELECT
department,
COUNT(*) as employees,
AVG(salary) as avg_salary
FROM employees
GROUP BY department;
-- Query view with additional filters
SELECT * FROM dept_summary WHERE employees > 5;
Best Practices
Query Optimization
-- Use EXPLAIN to analyze queries
EXPLAIN SELECT * FROM employees WHERE department = 'Engineering';
-- Filter early
-- Good: Filter before aggregation
SELECT department, AVG(salary)
FROM employees
WHERE salary > 50000
GROUP BY department;
-- Use appropriate data types
CREATE TABLE test (
id INTEGER, -- Not VARCHAR
value DECIMAL(10,2) -- Not TEXT
);
Data Loading
-- Use COPY for bulk loading
COPY large_table FROM 'data.csv' (AUTO_DETECT TRUE);
-- CSV reads are parallelized automatically; tune the thread count if needed
SET threads = 8;
-- Batch inserts (one statement, many rows)
INSERT INTO my_table VALUES (1, 'a'), (2, 'b'), (3, 'c');
Common Pitfalls
Pitfall 1: Not Using Prepared Statements
# Bad: String formatting
query = f"SELECT * FROM users WHERE id = {user_id}"
# Good: Parameterized queries
con.execute("SELECT * FROM users WHERE id = ?", [user_id])
Pitfall 2: Large Result Sets in Memory
# Bad: materializing millions of rows at once
result = con.execute("SELECT * FROM large_table").fetchall()
# Good: stream in batches with fetchmany
cur = con.execute("SELECT * FROM large_table")
while batch := cur.fetchmany(10000):
    handle(batch)  # process each batch
Pitfall 3: Ignoring Parallelism
-- Enable parallelism for large scans
SET threads=8;
-- Check current setting
SELECT current_setting('threads');
Conclusion
DuckDB provides an excellent solution for analytical workloads, combining the simplicity of embedded databases with powerful OLAP capabilities. Its zero-configuration setup and excellent performance make it ideal for data analysis, prototyping, and production analytics.
In the next article, we’ll explore DuckDB operations including configuration, performance tuning, and deployment strategies.