Data Engineering Hub
Practical guidance for building reliable, observable, and cost-effective data pipelines and platforms. This hub covers batch and streaming ETL/ELT, data lakehouse patterns, orchestration, real-time processing, data quality, governance, and the tools widely used in 2025–2026.
Getting started
New to data engineering? Start with these concise, high-value guides from the collection below:
- ETL vs ELT: Modern Data Integration Patterns – when to use ETL vs ELT
- Data Pipeline Orchestration: Complete Guide – orchestration concepts and tool comparisons
- Stream Processing: Kafka & Flink – core streaming concepts and patterns
- Data Lakehouse Architecture: Complete Guide – lakehouse patterns, engines, and trade-offs
Grouped article index
All articles in content/data-engineering/ are grouped below by topic area. Links use the file slugs (filename without .md). If an article’s link text should differ, update its title frontmatter. To regenerate this index programmatically, run ./scripts/update_index.py content/data-engineering --dry-run to preview the result, then run it again without --dry-run to write the changes.
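As a rough illustration of the slug and title rules described above, the following minimal sketch derives an index entry from an article file. It is not the actual ./scripts/update_index.py; the helper names (slug_for, title_for, index_entry) and the relative link format are assumptions made for this example, and the frontmatter parsing only handles a simple `title:` line.

```python
from pathlib import Path
import re


def slug_for(path: Path) -> str:
    """File slug: the filename without its .md extension."""
    return path.stem


def title_for(text: str, fallback: str) -> str:
    """Return the frontmatter title if one is present, else the fallback."""
    # Frontmatter is assumed to be a leading block delimited by '---' lines.
    match = re.match(r"\A---\s*\n(.*?)\n---\s*(?:\n|\Z)", text, re.DOTALL)
    if match:
        for line in match.group(1).splitlines():
            key, sep, value = line.partition(":")
            if sep and key.strip() == "title":
                # Strip surrounding quotes; fall back if the title is empty.
                return value.strip().strip("\"'") or fallback
    return fallback


def index_entry(path: Path, text: str) -> str:
    """Build one list entry; the link format here is hypothetical."""
    slug = slug_for(path)
    return f"- [{title_for(text, slug)}]({slug}/)"
```

With this sketch, an article without a title in its frontmatter falls back to showing its slug as the link text, which is why updating the title frontmatter is the way to change what the index displays.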
Core Concepts & Overviews
- Big Data Technologies: Introduction
- Data Engineering Fundamentals
- ETL vs ELT: Modern Data Integration Patterns
- Data Pipeline Orchestration: Complete Guide
- Data Pipeline Orchestration: Airflow, Prefect, Dagster
Storage, Lakehouse & Warehouses
- Data Lakehouse: Complete Guide
- Data Lakehouse: Delta Lake & Iceberg
- Data Warehouse Modernization: Cloud-Native Patterns
- Object Storage & Data Access patterns
Orchestration & Workflow
- Airflow: DAG Patterns & Best Practices
- Airflow Ops: Scaling, Executors, and Backfills
- Data Pipeline Orchestration: Complete Guide
Streaming & Real-time
- Stream Processing: Kafka & Flink
- Realtime Analytics: ClickHouse, Druid, Materialized Views
- Change Data Capture (CDC): Complete Guide
Transformation & Modeling
- [dbt-style transformations and modeling (see transformation articles)](/update slug if dbt-specific articles are added/)
- Data Modeling for Analytics (Dimensional Modeling)
- Time Series Analysis: Introduction
Data Quality, Observability & Testing
- Data Quality Management: Complete Guide
- Data Quality Validation: Monitoring & Observability
- Monitoring Data Pipelines: Metrics & Alerts
Governance, Catalog & Access
Architecture Patterns & Platforms
- Data Mesh: Domain-Owned Data Platforms
- Data Mesh: Implementation – Complete Guide
- [Platform engineering patterns for data teams (see platform articles)](/add slug if present/)
Machine Learning & MLOps
- MLOps: Fundamentals for Data Engineers
- ML Pipeline Automation for Data Engineers
- Privacy-Preserving Machine Learning
- Introduction to NLP
- Neural Networks & Deep Learning Fundamentals
Tools & Engine-Specific Guides
- Apache Spark: Big Data Processing
- Spark vs Flink: Choosing a Stream/Batch Engine
- ClickHouse for High-Performance Analytics
Career & Miscellaneous
Index maintenance
- To regenerate the index automatically: ./scripts/update_index.py content/data-engineering --include-description (dry-run first to preview).
- After changes to article frontmatter (title, categories, tags), re-run the index generator and verify with hugo --quiet.
- Keep the index grouped logically: add new articles to the most relevant group and keep the lists alphabetical within each group.
Who this hub is for
- Data engineers building and operating ETL/ELT and streaming systems
- Platform engineers creating self-service data platforms and pipelines
- Analytics engineers using dbt and SQL to produce reliable BI datasets
- SREs and operators responsible for pipeline reliability and cost control
- Product engineers integrating real-time data into user-facing features
External resources
- Apache Kafka – https://kafka.apache.org/
- Apache Flink – https://flink.apache.org/
- Apache Airflow – https://airflow.apache.org/
- dbt – https://www.getdbt.com/
- Debezium – https://debezium.io/
- ClickHouse – https://clickhouse.com/docs/