
Data Lakehouse Architecture: Complete Guide

Introduction

The data lakehouse represents the next evolution in data architecture, combining the best of data lakes and data warehouses into a unified platform. This comprehensive guide covers lakehouse architecture, implementation strategies, and best practices for building modern data infrastructure.

Key Statistics:

  • 85% of enterprises will adopt lakehouse architecture by 2027
  • Lakehouse platforms reduce data infrastructure costs by 40-60%
  • Delta Lake powers over 1 billion queries daily
  • Apache Iceberg is used by Netflix, Apple, and Airbnb for petabyte-scale tables

Understanding Data Lakehouse

What is a Data Lakehouse?

A data lakehouse combines the flexibility of data lakes with the reliability of data warehouses, enabling both BI and advanced analytics on a single platform.

┌──────────────────────────────────────────────────────────────────┐
│                   Data Lakehouse Architecture                    │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌─────────────────────────────────────────────────────────┐    │
│   │                   Lakehouse Platform                    │    │
│   │  ┌───────────────┐  ┌───────────────┐  ┌─────────────┐  │    │
│   │  │    Storage    │  │    Compute    │  │  Metadata   │  │    │
│   │  │   (S3/ADLS)   │  │ (Spark/Trino) │  │   (ACID)    │  │    │
│   │  └───────┬───────┘  └───────┬───────┘  └──────┬──────┘  │    │
│   │          └──────────────────┼─────────────────┘         │    │
│   │                             │                           │    │
│   └─────────────────────────────┼───────────────────────────┘    │
│                                 │                                │
│   ┌─────────────────────────────┼───────────────────────────┐    │
│   │                        Workloads                        │    │
│   │ ┌─────────┐  ┌─────────┐  ┌───────────┐  ┌────────────┐ │    │
│   │ │ BI/SQL  │  │   ML    │  │ Streaming │  │Data Science│ │    │
│   │ └─────────┘  └─────────┘  └───────────┘  └────────────┘ │    │
│   └─────────────────────────────────────────────────────────┘    │
│                                                                  │
│   Key Benefits:                                                  │
│   ✓ ACID transactions    ✓ Time travel    ✓ Schema evolution     │
│   ✓ Open formats         ✓ Low cost       ✓ Unified analytics    │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Lakehouse vs Data Lake vs Data Warehouse

Feature       Data Warehouse     Data Lake        Lakehouse
Schema        Schema-on-write    Schema-on-read   Schema-on-write + flex
Data Types    Structured         Any              Any + structured
ACID          Yes                No               Yes
Time Travel   Limited            No               Yes
Cost          High               Low              Medium
Use Cases     BI, Reporting      ML, Advanced     Unified

Core Lakehouse Technologies

Delta Lake

Delta Lake provides ACID transactions, time travel, and schema enforcement on data lakes:

from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr

spark = SparkSession.builder \
    .appName("DeltaLakeDemo") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Create a Delta table with id and value columns
data = spark.range(100).withColumn("value", col("id") * 10)
data.write.format("delta").save("/delta/events")

# Update the table (an ACID transaction)
deltaTable = DeltaTable.forPath(spark, "/delta/events")
deltaTable.update(
    condition=expr("id % 2 == 0"),
    set={"value": expr("value + 100")}
)

# Time travel: read an earlier version by number or by timestamp
df_v1 = spark.read.format("delta").option("versionAsOf", 1).load("/delta/events")
df_timestamp = spark.read.format("delta") \
    .option("timestampAsOf", "2024-01-01") \
    .load("/delta/events")

# Merge (upsert) a batch of changes into the table
source_df = spark.createDataFrame([(1, 500), (200, 999)], ["id", "value"])
deltaTable.alias("target").merge(
    source_df.alias("source"),
    "target.id = source.id"
).whenMatchedUpdate(set={"value": "source.value"}) \
 .whenNotMatchedInsert(values={"id": "source.id", "value": "source.value"}) \
 .execute()

# Vacuum files no longer referenced by the table (older than 7 days)
deltaTable.vacuum(retentionHours=168)
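
Delta can also evolve a table's schema on write. A minimal sketch, appending to the /delta/events table created above (the country column is illustrative):

from pyspark.sql.functions import col, lit

# Append rows carrying a new column; mergeSchema adds it to the table schema
evolved_df = spark.range(100, 105) \
    .withColumn("value", col("id") * 10) \
    .withColumn("country", lit("US"))
evolved_df.write.format("delta").mode("append") \
    .option("mergeSchema", "true") \
    .save("/delta/events")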

Apache Iceberg

Apache Iceberg is an open table format offering snapshot-based time travel, hidden partitioning, and safe schema evolution:

-- Iceberg table creation
CREATE TABLE analytics.events (
    event_id BIGINT,
    event_time TIMESTAMP,
    user_id STRING,
    event_type STRING,
    properties MAP<STRING, STRING>
) USING iceberg
PARTITIONED BY (days(event_time), bucket(16, user_id))
TBLPROPERTIES (
    'format-version' = '2',
    'write.distribution-mode' = 'hash'
);

-- Time travel queries
SELECT * FROM analytics.events VERSION AS OF 123456789;
SELECT * FROM analytics.events TIMESTAMP AS OF '2024-01-01 00:00:00';

-- Incremental reads (change data capture) query a changelog view,
-- created first with Iceberg's Spark procedure (catalog name is environment-specific)
CALL spark_catalog.system.create_changelog_view(
    table => 'analytics.events',
    changelog_view => 'events_changes'
);
SELECT * FROM events_changes
    WHERE _change_type IN ('INSERT', 'UPDATE_AFTER')
    ORDER BY _change_ordinal;
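
Spark can also read an Iceberg table incrementally between two snapshots via the DataFrame API. A minimal sketch (the snapshot IDs are illustrative; incremental reads cover append snapshots only):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read only the data committed after start-snapshot-id, up to end-snapshot-id
incremental_df = spark.read.format("iceberg") \
    .option("start-snapshot-id", "10963874102873") \
    .option("end-snapshot-id", "63874143573109") \
    .load("analytics.events")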

Lakehouse Implementation

Architecture Design

# Lakehouse Architecture Components
lakehouse_layers:
  ingestion:
    tools:
      - "Apache Kafka (streaming)"
      - "Debezium (CDC)"
      - "Airbyte (ELT)"
      - "Fivetran (managed)"
    patterns:
      - "Batch ingestion (hourly/daily)"
      - "CDC from databases"
      - "Event streaming"
  
  storage:
    format: "Delta Lake / Apache Iceberg"
    locations:
      - "Bronze (raw data)"
      - "Silver (cleaned, deduplicated)"
      - "Gold (business-level aggregates)"
    storage_backend:
      - "S3 (AWS)"
      - "ADLS Gen2 (Azure)"
      - "GCS (Google Cloud)"
  
  compute:
    engines:
      - "Apache Spark (batch)"
      - "Trino/Presto (ad-hoc SQL)"
      - "dbt (transformation)"
      - "Flink (streaming)"
  
  serving:
    tools:
      - "Apache Superset"
      - "Power BI"
      - "SageMaker"
      - "Databricks"

Data Pipeline Example

from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    approx_count_distinct, col, count, from_json, to_date, to_timestamp
)
from pyspark.sql.types import MapType, StringType, StructField, StructType

# Expected shape of the JSON payload on the Kafka "events" topic
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_time", StringType()),
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("properties", MapType(StringType(), StringType())),
])

def create_lakehouse_pipeline():
    spark = SparkSession.builder.getOrCreate()

    # BRONZE: raw ingestion from Kafka, partitioned by ingestion date
    def ingest_bronze():
        raw_df = spark.readStream \
            .format("kafka") \
            .option("kafka.bootstrap.servers", "localhost:9092") \
            .option("subscribe", "events") \
            .load() \
            .withColumn("date", to_date(col("timestamp")))

        raw_df.writeStream \
            .format("delta") \
            .option("checkpointLocation", "/checkpoints/bronze") \
            .partitionBy("date") \
            .start("/bronze/events")

    # SILVER: parse the raw payload, clean, and deduplicate
    def process_silver():
        bronze_df = spark.readStream \
            .format("delta") \
            .load("/bronze/events")

        silver_df = bronze_df \
            .withColumn("event", from_json(col("value").cast("string"), event_schema)) \
            .select(
                col("event.event_id").alias("event_id"),
                to_timestamp(col("event.event_time")).alias("event_time"),
                col("event.user_id").alias("user_id"),
                col("event.event_type").alias("event_type"),
                col("event.properties").alias("properties"),
                col("date")
            ) \
            .dropDuplicates(["event_id"])

        silver_df.writeStream \
            .format("delta") \
            .option("checkpointLocation", "/checkpoints/silver") \
            .outputMode("append") \
            .start("/silver/events")

    # GOLD: business-level aggregations
    def process_gold():
        silver_df = spark.readStream \
            .format("delta") \
            .load("/silver/events")

        # Exact distinct counts are unsupported in streaming aggregations,
        # so an approximate distinct count is used instead
        gold_df = silver_df \
            .groupBy("date", "event_type") \
            .agg(
                count("*").alias("event_count"),
                approx_count_distinct("user_id").alias("unique_users")
            )

        gold_df.writeStream \
            .format("delta") \
            .option("checkpointLocation", "/checkpoints/gold") \
            .outputMode("complete") \
            .start("/gold/daily_metrics")

    # Start all three layers, then block until a stream stops or fails
    ingest_bronze()
    process_silver()
    process_gold()
    spark.streams.awaitAnyTermination()

Data Mesh and Lakehouse

Data Mesh with Lakehouse

┌──────────────────────────────────────────────────────────────────┐
│                      Data Mesh Architecture                      │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌─────────────────────────────────────────────────────────┐    │
│   │                 Domain Teams (Product)                  │    │
│   │                                                         │    │
│   │   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐ │    │
│   │   │ Product │   │  Sales  │   │Marketing│   │ Support │ │    │
│   │   │ Domain  │   │ Domain  │   │ Domain  │   │ Domain  │ │    │
│   │   └────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘ │    │
│   │        │             │             │             │      │    │
│   │        └─────────────┴──────┬──────┴─────────────┘      │    │
│   └─────────────────────────────┼───────────────────────────┘    │
│                                 │                                │
│   ┌─────────────────────────────┼───────────────────────────┐    │
│   │        Platform Team (Lakehouse Infrastructure)         │    │
│   │  ┌───────────────────────────────────────────────────┐  │    │
│   │  │     Shared Lakehouse (Delta Lake / Iceberg)       │  │    │
│   │  │  • Catalog      • Governance      • Quality       │  │    │
│   │  └───────────────────────────────────────────────────┘  │    │
│   └─────────────────────────────────────────────────────────┘    │
│                                                                  │
│   Principles:                                                    │
│   1. Domain ownership                                            │
│   2. Data as a product                                           │
│   3. Self-serve data platform                                    │
│   4. Federated computational governance                          │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Best Practices

  1. Use tiered storage: Bronze → Silver → Gold for data quality
  2. Enable time travel: Keep history for debugging and compliance
  3. Implement schema evolution: Handle changing data schemas gracefully
  4. Optimize partitioning: Partition by frequently queried columns
  5. Use Z-ordering: Co-locate related data for faster queries (see the maintenance sketch after this list)
  6. Implement data quality checks: Validate data at each layer
  7. Use checkpoints: Ensure exactly-once processing
  8. Vacuum old files: Manage storage costs
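
For items 5 and 8, a minimal maintenance sketch using the Delta Lake Python API (requires Delta Lake 2.0+; the table path and column are illustrative):

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
dt = DeltaTable.forPath(spark, "/silver/events")

# Compact small files and cluster them by a frequently filtered column
dt.optimize().executeZOrderBy("user_id")

# Remove unreferenced files older than 7 days (168 hours)
dt.vacuum(168)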

Common Pitfalls

  • Skipping data quality: Not validating data at the bronze level (see the validation sketch after this list)
  • Poor partitioning: Creating too many small files
  • Ignoring compaction: Letting small files accumulate
  • No governance: Missing access controls and auditing
  • Over-optimizing: Premature optimization before understanding workloads
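
As a guard against the first pitfall, a minimal validation sketch (the path and rule are illustrative; tools like Great Expectations cover this more thoroughly):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
silver = spark.read.format("delta").load("/silver/events")

# Fail fast if required keys are missing before promoting data downstream
null_ids = silver.filter(col("event_id").isNull()).count()
if null_ids > 0:
    raise ValueError(f"{null_ids} silver rows have a null event_id")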

Tools and Technologies

Category           Tools
Table Formats      Delta Lake, Apache Iceberg, Apache Hudi
Compute            Spark, Trino, Presto, Flink
Orchestration      Airflow, Dagster, Prefect
Data Quality       Great Expectations, dbt tests
BI/Visualization   Superset, Power BI, Tableau

Conclusion

The data lakehouse architecture provides the best of both worlds: the flexibility and low-cost storage of data lakes combined with the reliability and performance of data warehouses. By adopting an open table format such as Delta Lake or Apache Iceberg, organizations can build robust data platforms that support diverse workloads while maintaining data quality and governance.
