Introduction
For years, data architects faced a difficult choice: use a data lake for flexibility and cost-effective storage of raw data, or use a data warehouse for structured, queryable, and governed data? The Data Lakehouse architecture solves this dilemma by combining the best of both worlds.
In this guide, we’ll explore how a Data Lakehouse works, the key technologies (Delta Lake, Apache Iceberg, Apache Hudi), and how to implement this modern data architecture.
What is a Data Lakehouse?
A Data Lakehouse is a new, open architecture that combines the flexibility of a data lake with the management features of a data warehouse. It enables:
- ACID transactions: Reliable concurrent reads and writes
- Schema enforcement: Data quality at ingestion
- Time travel: Query historical data versions
- Mixed workloads: BI and ML on the same data
- Open formats: Interoperability between tools
- Cost efficiency: Store data in cheap object storage
+-----------------------------------------------------------------+
|                   DATA LAKEHOUSE ARCHITECTURE                    |
+-----------------------------------------------------------------+

  APPLICATIONS
  +--------------+  +--------------+  +--------------+  +-----------+
  | BI/Analytics |  | Data         |  | ML/AI        |  | Data      |
  | Dashboards   |  | Science      |  | Models       |  | Apps      |
  +--------------+  +--------------+  +--------------+  +-----------+
                                  |
                                  v
  COMPUTE LAYER
  +-----------+  +--------------+  +-----------+  +----------------+
  | Spark     |  | Trino/Presto |  | DuckDB    |  | Native Readers |
  +-----------+  +--------------+  +-----------+  +----------------+
                                  |
                                  v
  STORAGE LAYER
  Lakehouse table formats:
  +------------+  +----------------+  +-------------+
  | Delta Lake |  | Apache Iceberg |  | Apache Hudi |
  +------------+  +----------------+  +-------------+
  Object storage:
  +-----------------------------------------------+
  | Amazon S3 / Google Cloud Storage / Azure ADLS |
  +-----------------------------------------------+
The Problem: Data Swamp vs Data Warehouse
Traditional Architecture Issues
+-----------------------------------------------------------------+
|                TRADITIONAL DUAL-SYSTEM APPROACH                  |
+-----------------------------------------------------------------+

  +------------------+                      +------------------+
  | Data Lake        |                      | Data Warehouse   |
  | (Raw Data)       |                      | (Processed)      |
  |                  |                      |                  |
  |  - Cheap         |   --- ETL/ELT --->   |  - Expensive     |
  |  - Flexible      |                      |  - Structured    |
  |  - Any size      |                      |  - Fast queries  |
  |  - No ACID       |                      |  - "Gold         |
  |  - "Data swamp"  |                      |    standard"     |
  +------------------+                      +------------------+

  Problems:
  - Data duplication (lake + warehouse)
  - ETL complexity and latency
  - Limited to predefined schemas
  - ML/AI requires separate pipelines
Key Technologies
Delta Lake
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and Big Data workloads. It provides:
- ACID transactions
- Schema enforcement
- Time travel
- Upserts and deletes
- Audit history
from delta import DeltaTable
from pyspark.sql import SparkSession
# Create a Spark session with Delta Lake
spark = SparkSession.builder \
    .appName("DeltaLakeExample") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
# Write data with Delta Lake
data = [
    ("Alice", 25, "NYC"),
    ("Bob", 30, "LA"),
    ("Charlie", 35, "NYC")
]
df = spark.createDataFrame(data, ["name", "age", "city"])
# Save as Delta table
df.write.format("delta") \
    .mode("overwrite") \
    .save("/mnt/delta/users")
# Read Delta table
delta_df = spark.read.format("delta").load("/mnt/delta/users")
# Upsert (merge) operation
def upsert_example(new_data_path):
    """Merge new data into the existing Delta table"""
    new_data = spark.read.format("json").load(new_data_path)
    delta_table = DeltaTable.forPath(spark, "/mnt/delta/users")
    delta_table.alias("target").merge(
        new_data.alias("source"),
        "target.name = source.name"
    ).whenMatchedUpdateAll() \
     .whenNotMatchedInsertAll() \
     .execute()
    print("Upsert completed successfully")
# Time travel - read previous version
def time_travel_example():
    """Query historical data versions"""
    # Read the current version
    current = spark.read.format("delta").load("/mnt/delta/users")
    # Read a specific version
    version_0 = spark.read.format("delta") \
        .option("versionAsOf", 0) \
        .load("/mnt/delta/users")
    # Read the table as of a timestamp
    as_of_date = spark.read.format("delta") \
        .option("timestampAsOf", "2024-01-01") \
        .load("/mnt/delta/users")
    return current, version_0, as_of_date
# Schema enforcement
def schema_enforcement_example():
    """Delta Lake enforces the table schema on write"""
    # This append will fail because 'age' has the wrong type
    bad_data = spark.createDataFrame([
        ("David", "invalid_age", "NYC")  # age should be an int
    ], ["name", "age", "city"])
    try:
        bad_data.write.format("delta") \
            .mode("append") \
            .save("/mnt/delta/users")
    except Exception as e:
        print(f"Schema enforcement prevented bad data: {e}")

    # Schema evolution: appending a new column requires mergeSchema
    extended_data = spark.createDataFrame([
        ("David", 28, "NYC", "premium")  # new 'tier' column
    ], ["name", "age", "city", "tier"])
    extended_data.write.format("delta") \
        .mode("append") \
        .option("mergeSchema", "true") \
        .save("/mnt/delta/users")
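The feature list above also mentions deletes and an audit history; here is a minimal sketch against the same (illustrative) table path:
# Delete rows matching a predicate
delta_table = DeltaTable.forPath(spark, "/mnt/delta/users")
delta_table.delete("city = 'LA'")

# Inspect the audit trail of writes, merges, and deletes
delta_table.history().show(truncate=False)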
Apache Iceberg
Apache Iceberg is a table format for large analytic datasets that provides:
- Schema evolution (add, rename, reorder columns)
- Partition evolution
- Hidden partitioning
- Time travel
- ACID transactions
from pyspark.sql import SparkSession
# Create Spark session with Iceberg
spark = SparkSession.builder \
    .appName("IcebergExample") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.iceberg_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.iceberg_catalog.type", "hive") \
    .getOrCreate()
# Create Iceberg table
spark.sql("""
    CREATE TABLE iceberg_catalog.db.sales (
        id BIGINT,
        product_id BIGINT,
        quantity INT,
        price DECIMAL(10,2),
        timestamp TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(timestamp))
""")

# Insert data
spark.sql("""
    INSERT INTO iceberg_catalog.db.sales
    VALUES
        (1, 101, 2, 29.99, TIMESTAMP '2024-01-15 10:00:00'),
        (2, 102, 1, 49.99, TIMESTAMP '2024-01-15 11:00:00'),
        (3, 101, 3, 29.99, TIMESTAMP '2024-01-16 10:00:00')
""")
# Schema evolution example
def schema_evolution():
    """Iceberg supports in-place schema evolution"""
    # Add a column
    spark.sql("""
        ALTER TABLE iceberg_catalog.db.sales
        ADD COLUMNS (category STRING)
    """)
    # Rename a column
    spark.sql("""
        ALTER TABLE iceberg_catalog.db.sales
        RENAME COLUMN quantity TO qty
    """)
    # Drop a column
    spark.sql("""
        ALTER TABLE iceberg_catalog.db.sales
        DROP COLUMN price
    """)
# Time travel with Iceberg
def time_travel_iceberg():
    """Query historical snapshots of the table"""
    # Read a specific snapshot (the id here is illustrative)
    snapshot_df = spark.read.format("iceberg") \
        .option("snapshot-id", 1234567890) \
        .load("iceberg_catalog.db.sales")
    # Read the table as of a timestamp (epoch milliseconds)
    timestamp_df = spark.read.format("iceberg") \
        .option("as-of-timestamp", "1704067200000") \
        .load("iceberg_catalog.db.sales")
    # Read the current snapshot
    current_df = spark.read.format("iceberg") \
        .load("iceberg_catalog.db.sales")
    return snapshot_df, timestamp_df, current_df
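Partition evolution is listed above but not shown; a minimal sketch using Iceberg's SQL extensions (enabled in the session config above), with an illustrative new partition spec:
# Change the partition spec without rewriting existing data;
# old files keep the old spec, new writes use the new one
spark.sql("ALTER TABLE iceberg_catalog.db.sales ADD PARTITION FIELD bucket(16, product_id)")
spark.sql("ALTER TABLE iceberg_catalog.db.sales DROP PARTITION FIELD days(timestamp)")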
Apache Hudi
Apache Hudi is a transactional data lake platform that provides:
- Upserts and deletes
- Incremental processing
- Copy-on-write (CoW) vs Merge-on-read (MoR)
- Timeline management
- Clustering and compaction
from pyspark.sql import SparkSession
# Create Spark session with Hudi
spark = SparkSession.builder \
    .appName("HudiExample") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
    .getOrCreate()
# Define Hudi table configuration
hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",  # or MERGE_ON_READ
    "hoodie.datasource.write.operation": "bulk_insert",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "date",
    "hoodie.datasource.write.partitionpath.field": "date",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.table": "orders",
    "hoodie.datasource.hive_sync.database": "default"
}
# Write data as Hudi table
data = [
    ("order_1", "product_a", 2, 59.98, "2024-01-15"),
    ("order_2", "product_b", 1, 29.99, "2024-01-15"),
    ("order_3", "product_c", 5, 149.95, "2024-01-16")
]
df = spark.createDataFrame(data, ["order_id", "product", "quantity", "total", "date"])
df.write.format("hudi") \
.options(**hudi_options) \
.mode("overwrite") \
.save("/mnt/hudi/orders")
# Incremental processing with Hudi
def incremental_processing(begin_time="20240115000000"):
    """Read only records committed after the given instant time"""
    incremental_df = spark.read.format("hudi") \
        .option("hoodie.datasource.query.type", "incremental") \
        .option("hoodie.datasource.read.begin.instanttime", begin_time) \
        .load("/mnt/hudi/orders")
    return incremental_df
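Since upserts are Hudi's headline feature, here is a minimal upsert sketch against the same (illustrative) table; it reuses hudi_options and switches the write operation to upsert:
# Update an existing order identified by its record key
updates = spark.createDataFrame(
    [("order_1", "product_a", 3, 89.97, "2024-01-15")],
    ["order_id", "product", "quantity", "total", "date"]
)
updates.write.format("hudi") \
    .options(**{**hudi_options, "hoodie.datasource.write.operation": "upsert"}) \
    .mode("append") \
    .save("/mnt/hudi/orders")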
Comparison: Delta Lake vs Iceberg vs Hudi
| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Created By | Databricks | Netflix, Apple | Uber |
| Spark Support | Native | Native | Native |
| Flink Support | Limited | Native | Native |
| Hive/Presto | Via connectors | Native | Native |
| Schema Evolution | Yes | Yes (more flexible) | Yes |
| Time Travel | Yes | Yes | Yes |
| Merge-on-Read | No | Yes | Yes |
| Primary Key | No | Yes | Yes |
| ACID | Yes | Yes | Yes |
| Deletes / Updates | Yes | Yes | Yes |
| ML Support | Good | Good | Good |
Data Lakehouse Architecture Patterns
Pattern 1: Simple Lakehouse
+-----------------------------------------------------------------+
|                        SIMPLE LAKEHOUSE                          |
+-----------------------------------------------------------------+

  Raw Data --> Bronze (Raw) --> Silver (Cleaned) --> Gold (Aggregated)

  Bronze: raw ingestion (JSON, CSV, Parquet)
    - Preserves the original data
    - Schema on read
    - All source data retained
        |
        v
  Silver: cleaned and enriched
    - Schema enforced
    - Deduplicated
    - Enriched with lookups
    - Business logic applied
        |
        v
  Gold: aggregated and ready for consumption
    - Business-level aggregates
    - Star schema dimensions
    - ML-ready feature tables
Pattern 2: Streaming Lakehouse
+-----------------------------------------------------------------+
|                       STREAMING LAKEHOUSE                        |
+-----------------------------------------------------------------+

  +-----------+      +-----------+      +------------+
  | Kafka     | ---> | Spark     | ---> | Delta Lake |
  | (Stream)  |      | Streaming |      | (Table)    |
  +-----------+      +-----------+      +------------+
                                              |
                                              v
                                        +-----------+
                                        | Streaming |
                                        | Apps      |
                                        +-----------+

  Also supports:
  +-----------+      +-----------+      +------------+
  | Kafka     | ---> | Flink     | ---> | Iceberg    |
  +-----------+      +-----------+      +------------+
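Neither streaming path is shown in code later in this guide, so here is a minimal sketch of the Kafka to Spark Structured Streaming to Delta path. It assumes a Delta-enabled Spark session (configured as in the earlier examples), the spark-sql-kafka connector on the classpath, and illustrative broker, topic, and path names:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StreamingLakehouse").getOrCreate()

# Read a Kafka topic as a stream (broker and topic are illustrative)
events = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "events") \
    .load()

# Persist the raw stream into a Delta table; the checkpoint enables exactly-once sinks
query = events.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value") \
    .writeStream.format("delta") \
    .option("checkpointLocation", "/mnt/delta/_checkpoints/events") \
    .outputMode("append") \
    .start("/mnt/delta/bronze/events")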
Implementation Examples
Building a Lakehouse Pipeline
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from delta import DeltaTable
class DataLakehousePipeline:
    """
    Example: building a complete lakehouse pipeline
    """

    def __init__(self, storage_path: str):
        self.storage_path = storage_path
        self.spark = self._create_spark_session()

    def _create_spark_session(self):
        return SparkSession.builder \
            .appName("LakehousePipeline") \
            .config("spark.sql.extensions",
                    "io.delta.sql.DeltaSparkSessionExtension") \
            .config("spark.sql.catalog.spark_catalog",
                    "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
            .getOrCreate()
    def ingest_bronze(self, source_path: str, table_name: str):
        """
        Ingest raw data into the Bronze layer
        """
        df = self.spark.read.format("json").load(source_path)
        # Stamp each batch so Bronze can be partitioned by ingestion date
        df = df.withColumn("ingestion_date", F.current_date())
        # Write to Bronze with schema on read
        df.write.format("delta") \
            .mode("append") \
            .partitionBy("ingestion_date") \
            .save(f"{self.storage_path}/bronze/{table_name}")
        return df
    def process_silver(self, table_name: str):
        """
        Clean and enrich data in the Silver layer
        """
        bronze_df = self.spark.read.format("delta") \
            .load(f"{self.storage_path}/bronze/{table_name}")
        # Apply transformations
        silver_df = bronze_df \
            .filter(F.col("status") != "deleted") \
            .dropDuplicates(["id"]) \
            .withColumn("processed_date", F.current_date()) \
            .withColumn("name_upper", F.upper(F.col("name")))
        # Write to Silver with schema enforcement
        silver_df.write.format("delta") \
            .mode("overwrite") \
            .option("mergeSchema", "true") \
            .save(f"{self.storage_path}/silver/{table_name}")
        return silver_df
    def aggregate_gold(self, table_name: str):
        """
        Create aggregated tables in the Gold layer
        """
        silver_df = self.spark.read.format("delta") \
            .load(f"{self.storage_path}/silver/{table_name}")
        # Create business-level aggregates
        gold_df = silver_df.groupBy("category", "processed_date") \
            .agg(
                F.count("*").alias("count"),
                F.sum("amount").alias("total_amount"),
                F.avg("amount").alias("avg_amount")
            )
        gold_df.write.format("delta") \
            .mode("overwrite") \
            .partitionBy("category") \
            .save(f"{self.storage_path}/gold/{table_name}_summary")
        return gold_df
    def enable_cdc(self, table_name: str):
        """
        Enable Change Data Capture on a Silver table and read its changes
        """
        table_path = f"{self.storage_path}/silver/{table_name}"
        # Turn on the change data feed (a one-time table property)
        self.spark.sql(
            f"ALTER TABLE delta.`{table_path}` "
            "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
        )
        # Read row-level changes; startingVersion should be at or after the
        # version where the change data feed was enabled (0 is illustrative)
        changes_df = self.spark.read.format("delta") \
            .option("readChangeFeed", "true") \
            .option("startingVersion", 0) \
            .load(table_path)
        return changes_df
# Usage example
pipeline = DataLakehousePipeline("s3://my-lakehouse/")

# Ingest raw data
pipeline.ingest_bronze("s3://raw-data/events/", "events")

# Clean and process
pipeline.process_silver("events")

# Create aggregates
pipeline.aggregate_gold("events")
Time Travel and Audit
def lakehouse_time_travel():
    """
    Demonstrating time travel and audit capabilities
    """
    spark = SparkSession.builder.getOrCreate()
    table_path = "s3://my-lakehouse/silver/customers"

    # 1. View the full commit history (audit trail)
    delta_table = DeltaTable.forPath(spark, table_path)
    history = delta_table.history()
    # +--------+------------+------+-----------+---------------------+
    # | version| timestamp  | user | operation | operationParameters |
    # +--------+------------+------+-----------+---------------------+
    # |       2| 2024-01-16 | ...  | WRITE     | {...}               |
    # |       1| 2024-01-15 | ...  | WRITE     | {...}               |
    # |       0| 2024-01-14 | ...  | WRITE     | {...}               |
    # +--------+------------+------+-----------+---------------------+

    # 2. Read a specific version
    df_v0 = spark.read.format("delta") \
        .option("versionAsOf", 0) \
        .load(table_path)

    # 3. Read the table as of a timestamp
    df_before = spark.read.format("delta") \
        .option("timestampAsOf", "2024-01-15T00:00:00") \
        .load(table_path)

    # 4. Read row-level changes between versions (requires the change
    #    data feed to be enabled on the table)
    changes = spark.read.format("delta") \
        .option("readChangeFeed", "true") \
        .option("startingVersion", 0) \
        .option("endingVersion", 2) \
        .load(table_path)

    # 5. Restore the table to a previous version
    delta_table.restoreToVersion(0)

    return history
Common Pitfalls
1. Not Using Appropriate Partitioning
# Anti-pattern: wrong partitioning strategy
def bad_partitioning(df):
    """
    Anti-pattern: partitioning by a high-cardinality field
    """
    df.write.format("delta") \
        .partitionBy("user_id") \
        .save("table")  # Too many partitions!
    # Result: too many small files, poor query performance

# Good pattern: partition by date / low-cardinality columns
def good_partitioning(df):
    """
    Partition by low-cardinality columns; handle high-cardinality
    columns with Z-ordering rather than partitions (Delta tables
    do not support bucketBy)
    """
    df.write.format("delta") \
        .partitionBy("date", "region") \
        .save("table")
    # Co-locate the high-cardinality column for data skipping
    DeltaTable.forPath(spark, "table") \
        .optimize() \
        .executeZOrderBy("user_id")
2. Too Many Small Files
# Anti-pattern: writing many small files
def bad_file_size(df):
    """
    Anti-pattern: writing without coalescing
    """
    # Each partition writes separately -> many small files
    df.write.format("delta") \
        .mode("overwrite") \
        .save("table")
    # Result: poor query performance, slow metadata operations

# Good pattern: optimize file size
def good_file_size(df):
    """
    Optimize file size during the write
    """
    df.repartition(100) \
        .write.format("delta") \
        .option("maxRecordsPerFile", 100000) \
        .mode("overwrite") \
        .save("table")
    # Or compact after the fact
    DeltaTable.forPath(spark, "table") \
        .optimize() \
        .executeCompaction()
3. Skipping Schema Enforcement
# Anti-pattern: no schema validation
def bad_schema(df):
    """
    Anti-pattern: appending data without any schema controls
    """
    df.write.format("delta") \
        .mode("append") \
        .save("table")
    # Result: schema drift and silent failures downstream

# Good pattern: enforce the schema
def good_schema(df, new_df):
    """
    Enforce the schema on write; evolve it only deliberately
    """
    df.write.format("delta") \
        .mode("append") \
        .option("mergeSchema", "false") \
        .save("table")
    # Use schema evolution only when a new column is intended
    new_df.write.format("delta") \
        .mode("append") \
        .option("mergeSchema", "true") \
        .save("table")
Best Practices
1. Follow the Medallion Architecture
| | Bronze (Raw) | Silver (Filtered) | Gold (Aggregated) |
|---|---|---|---|
| Content | Raw source data, all fields | Cleaned, validated, enriched | Business-level aggregates |
| Layout | One file per ingestion | Partitioned by date/region | Partitioned by business key |
| Schema | Schema on read | Schema enforced | Star schema |
| Used for | ML training, audit | Analytics, reporting | BI dashboards, executive reports |
2. Implement Proper Data Governance
def implement_governance():
    """
    Governance best practices for a Lakehouse
    """
    # 1. Enable column-level encryption for sensitive data
    # 2. Implement row-level security
    # 3. Use Unity Catalog or a similar service for central governance
    spark.sql("CREATE CATALOG IF NOT EXISTS lakehouse_catalog")
    # 4. Set up access controls
    # 5. Enable audit logging
    # 6. Use tags for classification
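As a hedged sketch of item 4, access controls on a governed catalog might look like the following (Unity Catalog-style SQL; the principals and table names are illustrative):
# Grant read access to an analyst group on a Silver table (names are illustrative)
spark.sql("GRANT SELECT ON TABLE lakehouse_catalog.silver.customers TO `analysts`")
# Revoke write access from a group that should only read
spark.sql("REVOKE MODIFY ON TABLE lakehouse_catalog.silver.customers FROM `contractors`")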
3. Optimize for Workload Patterns
def optimize_workloads():
    """
    Optimize based on query patterns
    """
    # For frequently filtered columns, use Z-ordering
    DeltaTable.forPath(spark, "table") \
        .optimize() \
        .executeZOrderBy("date", "region")

    # For frequently joined columns, co-locate the join key as well
    # (Delta tables do not support bucketBy; Z-order or cluster on the key instead)
    DeltaTable.forPath(spark, "table") \
        .optimize() \
        .executeZOrderBy("customer_id")

    # Liquid clustering (Delta Lake 3.0+) can replace static partitioning
    # and Z-ordering with automatic optimization
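A hedged sketch of liquid clustering, assuming Delta Lake 3.x (or Databricks) syntax; the table and column names are illustrative:
# Create a table with liquid clustering instead of static partitions
spark.sql("""
    CREATE TABLE IF NOT EXISTS events_clustered (
        event_id BIGINT,
        user_id BIGINT,
        event_date DATE
    )
    USING DELTA
    CLUSTER BY (event_date, user_id)
""")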
External Resources
- Delta Lake Documentation
- Apache Iceberg Documentation
- Apache Hudi Documentation
- Databricks Lakehouse
- The Lakehouse: A New Generation of Open Platforms
- Lakehouse vs Data Mesh
Conclusion
The Data Lakehouse represents a significant advancement in data architecture, combining the flexibility and cost-efficiency of data lakes with the reliability and performance of data warehouses. By leveraging open table formats like Delta Lake, Apache Iceberg, and Apache Hudi, organizations can build modern data platforms that support diverse workloads from BI analytics to machine learning.
Key takeaways:
- Choose the right table format based on your ecosystem and requirements
- Follow the medallion architecture (Bronze/Silver/Gold) for organized data flow
- Implement proper partitioning and file sizing for performance
- Enable governance and security at the table level
- Take advantage of time travel and ACID transactions for data reliability
The Lakehouse architecture is not just a technology choice; it's a paradigm shift that enables organizations to unify their data infrastructure and accelerate data-driven decision making.