⚡ Calmops

ETL vs ELT: Modern Data Integration Patterns

Introduction

The debate between ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) has evolved significantly with the advent of cloud data warehouses. Understanding when to use each approach is crucial for building efficient, scalable data infrastructure.

ETL vs ELT Overview

| Aspect | ETL | ELT |
| --- | --- | --- |
| Order | Extract → Transform → Load | Extract → Load → Transform |
| Data Storage | Staged in ETL tool | Raw in warehouse |
| Transformation Location | ETL server | Data warehouse |
| Flexibility | Less flexible | Highly flexible |
| Historical Data | Difficult to reprocess | Easy to reprocess |
| Best For | Sensitive data, simple transforms | Large volumes, complex analytics |

When to Use ETL

Use Cases

  • Data privacy requirements: Transform PII before reaching warehouse
  • Simple transformations: Minimal processing needed
  • Legacy systems: Integration with existing infrastructure
  • Real-time requirements: Lower latency processing

ETL Implementation

# etl_pipeline.py
import pandas as pd
from sqlalchemy import create_engine

class ETLPipeline:
    def __init__(self, source_config, target_config):
        self.source = create_engine(source_config['connection'])
        self.target = create_engine(target_config['connection'])
    
    def extract(self, query: str) -> pd.DataFrame:
        """Extract data from source."""
        return pd.read_sql(query, self.source)
    
    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        """Transform data."""
        # Clean PII
        df = df.drop(columns=['ssn', 'credit_card'], errors='ignore')
        
        # Standardize formats
        df['created_at'] = pd.to_datetime(df['created_at'])
        
        # Remove duplicates
        df = df.drop_duplicates()
        
        # Add derived fields
        df['year'] = df['created_at'].dt.year
        df['month'] = df['created_at'].dt.month
        
        return df
    
    def load(self, df: pd.DataFrame, table_name: str):
        """Load to target."""
        df.to_sql(table_name, self.target, if_exists='replace', index=False)
    
    def run(self, query: str, table_name: str):
        """Execute full ETL pipeline."""
        print("Extracting data...")
        df = self.extract(query)
        
        print("Transforming data...")
        df = self.transform(df)
        
        print(f"Loading {len(df)} rows...")
        self.load(df, table_name)
        
        print("ETL pipeline complete!")
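To make the pipeline above concrete, here is a hypothetical end-to-end run of the same extract, transform, and load steps against throwaway in-memory SQLite databases; the connection URLs, table names, and sample columns are illustrative stand-ins for real source and warehouse systems:

```python
import pandas as pd
from sqlalchemy import create_engine

# Throwaway in-memory databases standing in for source and target
source = create_engine('sqlite://')
target = create_engine('sqlite://')

# Seed the "source" system with sample rows, PII included
pd.DataFrame({
    'order_id': [1, 2],
    'ssn': ['123-45-6789', '987-65-4321'],
    'created_at': ['2024-01-15', '2024-02-03'],
}).to_sql('orders', source, index=False)

# Extract
df = pd.read_sql('SELECT * FROM orders', source)

# Transform: strip PII before the data ever reaches the target
df = df.drop(columns=['ssn', 'credit_card'], errors='ignore')
df['created_at'] = pd.to_datetime(df['created_at'])

# Load
df.to_sql('orders_clean', target, if_exists='replace', index=False)

result = pd.read_sql('SELECT * FROM orders_clean', target)
print(list(result.columns))
```

Note that the PII columns never exist in the target database, which is the property the "data privacy requirements" use case depends on.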

When to Use ELT

Use Cases

  • Large data volumes: Leverage warehouse compute
  • Ad-hoc analysis: Raw data available for multiple use cases
  • Complex transformations: SQL-based transformations
  • Frequent schema changes: Flexible schema-on-read
  • Data team autonomy: Analysts can self-serve
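The defining move in ELT is that raw data lands untouched and the transformation runs as SQL where the data lives. A minimal sketch, using an in-memory SQLite database as a stand-in for a cloud warehouse (table and column names are illustrative):

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite standing in for a cloud warehouse
warehouse = create_engine('sqlite://')

# E + L: land the raw data exactly as extracted, no preprocessing
raw = pd.DataFrame({
    'order_id': [1, 2, 3],
    'customer_id': [10, 10, 20],
    'total_amount': ['19.99', '5.00', '42.50'],  # raw strings, typed later
})
raw.to_sql('orders_raw', warehouse, index=False)

# T: transformation is SQL executed inside the warehouse,
# so casting and aggregation use warehouse compute
stg = pd.read_sql("""
    SELECT customer_id,
           COUNT(*) AS total_orders,
           SUM(CAST(total_amount AS REAL)) AS lifetime_value
    FROM orders_raw
    GROUP BY customer_id
""", warehouse)
print(stg.to_dict('records'))
```

Because `orders_raw` stays in the warehouse, a new metric is just another query over it; nothing has to be re-extracted, which is what makes reprocessing historical data cheap.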

ELT Implementation with dbt

# dbt_project.yml
name: analytics_warehouse
version: '1.0.0'

models:
  analytics:
    +materialized: table
    staging:
      +materialized: view

-- models/staging/stg_orders.sql
{{ config(materialized='view') }}

SELECT
    order_id,
    customer_id,
    CAST(total_amount AS DECIMAL(10,2)) AS total_amount,
    status,
    CAST(created_at AS TIMESTAMP) AS created_at,
    -- PII removed at transformation layer
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at) AS order_sequence
FROM {{ source('raw', 'orders_raw') }}
WHERE created_at >= '2024-01-01'

-- models/marts/fct_orders.sql
{{ config(materialized='table') }}

WITH customer_orders AS (
    SELECT
        customer_id,
        COUNT(*) AS total_orders,
        SUM(total_amount) AS lifetime_value,
        MIN(created_at) AS first_order_date,
        MAX(created_at) AS last_order_date
    FROM {{ ref('stg_orders') }}
    GROUP BY customer_id
)

SELECT
    o.*,
    c.lifetime_value,
    c.total_orders,
    DATEDIFF(day, c.first_order_date, c.last_order_date) AS customer_tenure_days,
    CASE 
        WHEN c.total_orders >= 10 THEN 'VIP'
        WHEN c.total_orders >= 5 THEN 'Regular'
        ELSE 'New'
    END AS customer_segment
FROM {{ ref('stg_orders') }} o
LEFT JOIN customer_orders c ON o.customer_id = c.customer_id

Data Ingestion Tools

Airbyte (ELT)

# airbyte_connection.yaml
source:
  name: postgres
  config:
    host: "source-db.example.com"
    port: 5432
    database: "sales"
    username: "replication_user"
    
destination:
  name: bigquery
  config:
    project_id: "analytics-project"
    dataset_id: "staging"

sync_mode: incremental
streams:
  - name: orders
    cursor_field: updated_at
    destination_sync_mode: append_dedup
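The incremental sync configured above works by tracking a cursor: on each run, only rows whose `updated_at` exceeds the last saved cursor value are extracted, and the cursor then advances. A simplified sketch of that logic (the state shape and sample rows are hypothetical, not Airbyte's internal format):

```python
# Saved cursor from the previous sync of the `orders` stream
state = {'orders': '2024-03-01T00:00:00'}

# Rows currently in the source table
rows = [
    {'order_id': 1, 'updated_at': '2024-02-15T09:30:00'},
    {'order_id': 2, 'updated_at': '2024-03-02T14:00:00'},
    {'order_id': 3, 'updated_at': '2024-03-05T08:45:00'},
]

def incremental_sync(rows, cursor):
    """Return rows newer than the cursor plus the advanced cursor value.

    ISO-8601 timestamps compare correctly as strings.
    """
    new_rows = [r for r in rows if r['updated_at'] > cursor]
    if new_rows:
        cursor = max(r['updated_at'] for r in new_rows)
    return new_rows, cursor

synced, new_cursor = incremental_sync(rows, state['orders'])
print(len(synced), new_cursor)
```

Order 1 predates the cursor and is skipped; only the two newer rows are shipped, and the cursor advances so the next run starts where this one left off.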

Fivetran

# Fivetran uses declarative configuration
# connectors/fivetran_connector.json
{
  "schema": "salesforce",
  "tables": [
    {
      "name": "account",
      "sync_mode": "incremental",
      "destination": "postgres"
    },
    {
      "name": "opportunity",
      "sync_mode": "incremental"
    }
  ]
}

Implementation Checklist

  • Assess data volume and transformation complexity
  • Choose approach based on use case
  • Select appropriate tools
  • Design transformation layer (dbt recommended)
  • Implement data quality checks
  • Set up monitoring and alerting
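The data quality step in the checklist can be sketched as a set of post-load assertions, roughly what dbt's built-in `not_null`, `unique`, and `accepted_values` tests express in SQL; the column names and accepted values here are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'order_id': [1, 2, 3],
    'status': ['shipped', 'pending', 'shipped'],
    'total_amount': [19.99, 5.00, 42.50],
})

def run_quality_checks(df: pd.DataFrame) -> list[str]:
    """Return a list of failed check descriptions (empty = all passed)."""
    failures = []
    if df['order_id'].isna().any():
        failures.append('order_id has nulls')
    if df['order_id'].duplicated().any():
        failures.append('order_id is not unique')
    if not df['status'].isin({'pending', 'shipped', 'cancelled'}).all():
        failures.append('unexpected status value')
    if (df['total_amount'] < 0).any():
        failures.append('negative total_amount')
    return failures

failures = run_quality_checks(df)
print(failures)
```

Wiring the returned failure list into alerting (rather than just printing it) covers the monitoring item as well.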

Summary

  1. ETL: Use when data privacy is critical, transformations are simple, or dealing with legacy systems

  2. ELT: Use for cloud-native warehouses with large volumes and complex analytical needs

  3. Modern Approach: Most teams use ELT with dbt for transformation, keeping raw data accessible while building curated marts

