Data Engineering Hub

Practical guidance for building reliable, observable, and cost-effective data pipelines and platforms. This hub covers batch and streaming ETL/ELT, data lakehouse patterns, orchestration, real-time processing, data quality, governance, and the tools widely used in 2025–2026.

🚀 Getting started

New to data engineering? Start with these four guides:

ETL vs ELT: Modern Data Integration Patterns — when to use ETL vs ELT
Data Pipeline Orchestration: Complete Guide — orchestration concepts and tool comparisons
Stream Processing: Kafka & Flink — core streaming concepts and patterns
Data Lakehouse Architecture: Complete Guide — lakehouse patterns, engines, and trade-offs

📚 Grouped article index

All articles in content/data-engineering/ are organized below by topic area. Links use the file slugs (filename without .md). If an article’s link text should differ, update its title frontmatter. To regenerate this index programmatically, run ./scripts/update_index.py content/data-engineering --dry-run to preview the result, then run the same command without --dry-run to write it.

🧱 Core Concepts & Overviews

🗄️ Storage, Lakehouse & Warehouses

⚙️ Orchestration & Workflow

⚡ Streaming & Real-time

🧩 Transformation & Modeling

dbt-style transformations and modeling (see transformation articles; add a link here if dbt-specific articles are added)
Data Modeling for Analytics (Dimensional Modeling)
Time Series Analysis: Introduction

🧪 Data Quality, Observability & Testing

🔐 Governance, Catalog & Access

🧭 Architecture Patterns & Platforms

Data Mesh: Domain-Owned Data Platforms
Data Mesh: Implementation — Complete Guide
Platform engineering patterns for data teams (see platform articles; add a link here if a matching article is present)

🧠 Machine Learning & MLOps

🔧 Tools & Engine-Specific Guides

📚 Career & Miscellaneous

🔁 Index maintenance

To regenerate the index automatically, run ./scripts/update_index.py content/data-engineering --include-description. Run it with --dry-run first to preview the output.
After changes to article frontmatter (title, categories, tags), re-run the index generator and verify with hugo --quiet.
Keep the index grouped logically: add each new article to the most relevant group, and keep the lists alphabetical within each group.
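
The grouping and ordering rules above can be sketched in a few lines of Python. This is an illustration only, not the contents of ./scripts/update_index.py, whose internals are not shown here. The record shape (group, title, path), the helper names slug_for and build_index, and the [title](slug) link format are all assumptions made for the example.

```python
from collections import defaultdict
from pathlib import PurePath


def slug_for(path):
    """Return the file slug: the filename without its .md extension."""
    return PurePath(path).stem


def build_index(articles):
    """Render a grouped index (hypothetical helper, not the real script).

    `articles` is a list of dicts with assumed keys "group", "title", "path".
    Groups keep the order in which they first appear; titles are sorted
    alphabetically (case-insensitively) within each group.
    """
    grouped = defaultdict(list)
    for article in articles:
        grouped[article["group"]].append(article)

    lines = []
    for group, members in grouped.items():
        lines.append(group)
        for a in sorted(members, key=lambda a: a["title"].casefold()):
            lines.append(f"[{a['title']}]({slug_for(a['path'])})")
        lines.append("")  # blank line between groups
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    sample = [
        {"group": "Architecture", "title": "Data Mesh: Implementation",
         "path": "content/data-engineering/data-mesh-implementation.md"},
        {"group": "Streaming", "title": "Stream Processing",
         "path": "content/data-engineering/stream-processing.md"},
        {"group": "Architecture", "title": "Data Mesh: Domain-Owned",
         "path": "content/data-engineering/data-mesh.md"},
    ]
    print(build_index(sample), end="")
```

Sorting with casefold keeps the order stable when titles differ only in capitalization, which makes it easier to check that a newly added article landed in the right position.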

🎓 Who this hub is for

Data engineers building and operating ETL/ELT and streaming systems
Platform engineers creating self-service data platforms and pipelines
Analytics engineers using dbt and SQL to produce reliable BI datasets
SREs and operators responsible for pipeline reliability and cost control
Product engineers integrating real-time data into user-facing features

📖 External resources

Apache Kafka — https://kafka.apache.org/
Apache Flink — https://flink.apache.org/
Apache Airflow — https://airflow.apache.org/
dbt — https://www.getdbt.com/
Debezium — https://debezium.io/
ClickHouse — https://clickhouse.com/docs/