Data Warehouse Modernization: From Legacy Systems to Cloud-Native Architecture

Created: March 8, 2026 · CalmOps · 20 min read

Introduction

Your legacy Teradata appliance is nearing end-of-life, your Oracle Exadata license renewal is due, and your analytics team is waiting hours for queries to finish. You are not alone. Organizations running on-premises data warehouses face mounting pressure to modernize — not because cloud is trendy, but because the gap between what modern analytics demands and what legacy systems deliver has become untenable. This article walks through every dimension of data warehouse modernization: why legacy systems fail, what cloud-native architectures offer, how to migrate safely, and what pitfalls to avoid. You will come away with a concrete migration playbook you can adapt to your organization.

Legacy Data Warehouse Challenges

Teradata: Massively Parallel Processing Limitations

Teradata pioneered massively parallel processing (MPP) for data warehousing. Its shared-nothing architecture distributes data across AMPs (Access Module Processors), each handling a slice of the data. This design works well for predictable, batch-oriented workloads. It breaks down when you need elasticity.

-- Teradata: adding a node requires redistributing all data
-- This operation can take days for multi-terabyte deployments
-- (illustrative pseudo-command; an actual expansion is a hardware install
--  followed by a reconfiguration run)
ALTER SYSTEM ADD NODE 'node5';
-- After this, you must redistribute data across all AMPs
-- Downtime is typically required

The procurement cycle for Teradata hardware runs 8-16 weeks. If your data volume grows 40% year-over-year, you must forecast capacity 18 months out and buy for the peak. That is capital expenditure at its most wasteful — you pay for hardware that sits idle 60-70% of the time.

Oracle Exadata: Licensing and Lock-In

Oracle Exadata combines specialized hardware with Oracle Database to deliver high-throughput analytics. The catch: licensing costs scale linearly (or worse) with data volume. A typical Exadata deployment with 100 TB of usable storage costs $500,000-$1,000,000 per year in licensing alone, plus hardware maintenance.

-- Oracle: partitioning helps but requires careful design
CREATE TABLE sales (
    sale_id      NUMBER,
    sale_date    DATE,
    amount       NUMBER
)
PARTITION BY RANGE (sale_date) (
    PARTITION p2024 VALUES LESS THAN (TO_DATE('2025-01-01','YYYY-MM-DD')),
    PARTITION p2025 VALUES LESS THAN (TO_DATE('2026-01-01','YYYY-MM-DD')),
    PARTITION p_max VALUES LESS THAN (MAXVALUE)
);
-- Partition pruning helps, but you still pay license fees per core
-- Adding partitions requires DDL locks on the table

Oracle’s pricing model punishes growth. More data means more storage, which means more cores, which means steeply higher licensing costs. Organizations often find themselves paying more for Oracle licenses than for the hardware or cloud infrastructure itself.

IBM Netezza: Proprietary Hardware

IBM Netezza used FPGA-accelerated appliances that were fast for their era but impossible to scale incrementally. The TwinFin and Striper generations were fixed configurations. You bought a node, bolted it on, and hoped the data redistribution completed before your next batch window.

# Netezza administration: checking appliance state
nzstate
# Illustrative output: system online, but one node/SPU degraded
# A degraded node means the entire appliance runs slower
# Replacing a node requires IBM field service — 4-8 hour SLA minimum

Netezza’s end-of-life announcements drove many organizations to urgent modernization programs. When IBM stopped selling Netezza appliances in 2015 and ended support, companies with petabyte-scale Netezza deployments had to replatform with no migration path from IBM.

Common Legacy Pain Points

Beyond vendor-specific issues, legacy warehouses share structural problems:

Rigid scaling. All three platforms require adding hardware nodes in fixed increments. You cannot scale compute without scaling storage, even when you only need more query throughput.

Mixed workload contention. Legacy systems cannot isolate ETL workloads from BI queries. A long-running aggregation job slows dashboards for everyone.

Disaster recovery complexity. Netezza and Teradata disaster recovery requires active-passive standby deployments that double infrastructure cost. Oracle Data Guard helps but adds licensing complexity.

Limited semi-structured data support. Legacy systems were designed for relational data. JSON, Parquet, and Avro require complex ETL transformations before loading.

Query performance degradation at scale. As data grows, query performance degrades non-linearly because statistics become stale, partitions grow uneven, and the optimizer cannot keep up.

-- Typical legacy pain point: stale statistics ruin query performance
-- In Teradata:
COLLECT STATISTICS ON sales COLUMN (sale_date);
-- If you forget to recollect after large loads, the optimizer makes bad decisions
-- and queries that ran in 2 minutes now run in 2 hours

Cloud-Native Architecture: Storage-Compute Separation

The defining architectural shift in cloud data warehouses is separating storage from compute. In legacy systems, storage and compute are coupled — each node has disks and CPUs, and adding capacity means adding both. Cloud-native warehouses decouple the two, enabling each to scale independently.

How Storage-Compute Separation Works

Cloud data warehouses store data in object storage (Amazon S3, Google Cloud Storage, Azure Blob Storage) and use a separate compute cluster for query execution. The storage layer is shared and durable; the compute layer is ephemeral and elastic.

                ┌───────────────────────┐
                │      Query Layer      │
                │ (Virtual Warehouses)  │
                │  ┌────┐ ┌────┐ ┌────┐ │
                │  │VW 1│ │VW 2│ │VW 3│ │  ← independent scaling
                │  └────┘ └────┘ └────┘ │
                └───────────┬───────────┘
                            │
                ┌───────────┴───────────┐
                │  Metadata & Caching   │
                │    (Cloud Service)    │
                └───────────┬───────────┘
                            │
                ┌───────────┴───────────┐
                │     Storage Layer     │
                │     (S3/GCS/Blob)     │
                │  ┌────┐ ┌────┐ ┌────┐ │  ← compressed, columnar
                │  │Part│ │Part│ │Part│ │    Parquet/ORC files
                │  └────┘ └────┘ └────┘ │
                └───────────────────────┘

Benefits include:

  • Compute elasticity. Spin up 10X compute for a monthly report, then shut it down. Pay only for the seconds it runs.
  • Storage durability. Object storage provides eleven nines (99.999999999%) of durability. No RAID, no replication management.
  • Workload isolation. Run ETL on one compute cluster, BI on another, data science on a third. They share the same data without contention.
  • Instant clones. Create zero-copy clones of multi-terabyte databases for development, testing, or what-if analysis.

-- Snowflake: create a zero-copy clone for testing
CREATE DATABASE dev_db CLONE prod_db;
-- The clone consumes no additional storage until you modify data
-- There is no equivalent capability on Teradata or Oracle

-- Snowflake: independent compute scaling
CREATE WAREHOUSE etl_wh
  WAREHOUSE_SIZE = 'X-LARGE'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE;

CREATE WAREHOUSE bi_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND = 300
  AUTO_RESUME = TRUE;
-- ETL warehouse and BI warehouse share the same data
-- but never compete for compute resources

Cloud Data Warehouse Comparison

Feature          | Snowflake                      | Google BigQuery                   | Amazon Redshift                       | Databricks SQL
Architecture     | Storage-compute separated      | Serverless (Dremel)               | Mixed (RA3: separated)                | Delta Lake + Photon
Scaling          | Auto-scale virtual warehouses  | Automatic slot-based              | Elastic resize + concurrency scaling  | Auto-scaling clusters
Storage format   | Proprietary compressed         | Capacitor columnar                | RA3 managed storage                   | Delta Lake (Parquet)
Pricing model    | Compute + storage (per sec)    | Per-slot or on-demand (per TB)    | Per-node hourly + storage             | DBU per hour
Data sharing     | Reader accounts + Marketplace  | Authorized views + Analytics Hub  | Datashare                             | Delta Sharing
Semi-structured  | VARIANT type (JSON, Avro, XML) | RECORD/REPEATED + JSON functions  | SUPER type                            | STRUCT/ARRAY + Spark
Time travel      | 1-90 days                      | 7 days (default)                  | 1-2 days (snapshots)                  | Delta time travel (configurable)
Partitioning     | Automatic micro-partitions     | Automatic (ingestion time)        | Distribution + sort keys              | Delta partitioning + Z-order
Minimum compute  | Resume from 0                  | 0 (serverless)                    | Minimum 2 nodes                       | Depends on cluster config
Query language   | SQL + Snowpark (Python/Java)   | SQL + UDFs (JS/SQL/Python)        | SQL + Python UDFs                     | SQL + Python + R + Scala
Multi-cloud      | AWS, GCP, Azure                | GCP only                          | AWS only                              | AWS, GCP, Azure

When Each Platform Excels

Snowflake suits organizations that value simplicity and data sharing. Its separation of storage and compute is the cleanest implementation. The VARIANT data type handles JSON without schema definition, making it ideal for semi-structured data.

BigQuery excels for serverless, high-throughput analytics. There is no infrastructure to manage — no clusters, no warehouses, no nodes. It is the best choice for Google Cloud-native shops and organizations that want zero ops overhead.

Redshift works well for AWS-native organizations with predictable, heavy ETL workloads. RA3 nodes provide storage-compute separation, and Redshift Spectrum queries external data in S3 directly. For consistently large, steady workloads, its reserved-capacity pricing can be very competitive.

Databricks SQL (Photon engine) excels when your data pipeline already uses Apache Spark. The Delta Lake format provides ACID transactions, schema enforcement, and time travel on data lake storage. It bridges data warehousing and data science on a single platform.

# Python: querying across cloud warehouses with SQLAlchemy
# (requires the dialect packages: snowflake-sqlalchemy, sqlalchemy-redshift, sqlalchemy-bigquery)
from sqlalchemy import create_engine

def get_warehouse_connection(warehouse_type: str, config: dict):
    """Return SQLAlchemy engine for the specified warehouse."""
    connections = {
        "snowflake": f"snowflake://{config['user']}:{config['password']}@{config['account']}/{config['database']}",
        "redshift": f"redshift+psycopg2://{config['user']}:{config['password']}@{config['host']}:5439/{config['database']}",
        "bigquery": f"bigquery://{config['project_id']}"
    }
    engine = create_engine(connections[warehouse_type])
    return engine

# Usage: run the same query across platforms
query = """
    SELECT DATE_TRUNC('month', order_date) AS month,
           COUNT(*) AS order_count,
           SUM(total) AS revenue
    FROM orders
    WHERE order_date >= '2026-01-01'
    GROUP BY 1
    ORDER BY 1
"""

Migration Strategies

Organizations choose from four migration approaches, each with different risk, cost, and benefit profiles.

Lift-and-Shift (Rehost)

Copy existing schemas, data, and ETL pipelines to the cloud warehouse with minimal changes. You keep the same data model, same transformation logic, same batch schedule.

Best for: Organizations under time pressure (license expiry, hardware EOL). You need to move fast and plan to optimize later.

-- Lift-and-shift example: a Kimball-style fact table
-- On Snowflake, this works with minimal changes from Teradata
CREATE TABLE fact_sales (
    sale_id        BIGINT AUTOINCREMENT,
    date_key       INTEGER NOT NULL,
    product_key    INTEGER NOT NULL,
    customer_key   INTEGER NOT NULL,
    quantity       INTEGER NOT NULL,
    unit_price     DECIMAL(10,2) NOT NULL,
    discount       DECIMAL(10,2) DEFAULT 0,
    total_amount   DECIMAL(12,2) NOT NULL
)
CLUSTER BY (date_key);

Risks: You carry forward legacy inefficiencies. Partitioning strategies tuned for Teradata’s AMP architecture may perform poorly on Snowflake’s micro-partitions. You miss opportunities to simplify the data model.

Replatform

Move to the cloud warehouse with moderate changes — adopt cloud-native file formats (Parquet over CSV), use COPY commands instead of bulk loaders, and adjust partitioning strategies.

-- Replatform: adopt cloud-native staging patterns
-- Instead of loading from Netezza's nzload:
-- First pass: VALIDATION_MODE checks the files and returns errors without loading rows
COPY INTO stage_orders
FROM 's3://data-landing/orders/2026/'
PATTERN = '.*\.parquet'
FILE_FORMAT = (TYPE = PARQUET)
VALIDATION_MODE = RETURN_ERRORS;
-- Rerun without VALIDATION_MODE to perform the actual load

Best for: Organizations that want to start optimizing without a full rewrite. Replatforming typically takes 2-4 months and captures 60-70% of cloud-native benefits.

Re-Architect

Redesign the entire data platform: data models, ingestion pipelines, transformation logic, and consumption patterns. This means adopting ELT over ETL, implementing dbt for transformation, and restructuring star schemas for cloud-native performance.

-- Re-architect: use Snowflake streams and tasks for incremental processing
-- Define a stream to capture changes
CREATE STREAM orders_stream ON TABLE orders;

-- Create a task that processes the stream incrementally
CREATE TASK process_orders
  WAREHOUSE = etl_wh
  SCHEDULE = '5 MINUTES'
WHEN
  SYSTEM$STREAM_HAS_DATA('orders_stream')
AS
  MERGE INTO fact_orders f
  USING (
      SELECT date_key, product_key, customer_key,
             COUNT(*) AS order_count,
             SUM(amount) AS total_amount
      FROM orders_stream
      -- Snowflake streams expose METADATA$ACTION as INSERT or DELETE;
      -- updates surface as DELETE + INSERT pairs, so filter on INSERT here
      WHERE METADATA$ACTION = 'INSERT'
      GROUP BY date_key, product_key, customer_key
  ) s
  ON f.date_key = s.date_key
     AND f.product_key = s.product_key
     AND f.customer_key = s.customer_key
  WHEN MATCHED THEN UPDATE SET
      order_count = s.order_count,
      total_amount = s.total_amount
  WHEN NOT MATCHED THEN INSERT
      (date_key, product_key, customer_key, order_count, total_amount)
      VALUES (s.date_key, s.product_key, s.customer_key,
              s.order_count, s.total_amount);
-- Tasks are created in a suspended state; start it with ALTER TASK process_orders RESUME;

Best for: Organizations starting fresh or facing fundamental limitations in their current architecture. The investment is substantial (6-12 months) but delivers the highest return.

Hybrid / Phased Approach

Maintain some workloads on the legacy system while moving others to the cloud. This is the most common real-world pattern.

# Python: orchestrating a hybrid migration
import asyncio
from datetime import datetime

class HybridMigrationOrchestrator:
    """Coordinate data flows between legacy and cloud warehouses."""

    def __init__(self, legacy_config: dict, cloud_config: dict):
        self.legacy = legacy_config
        self.cloud = cloud_config

    async def validate_parity(self, table: str, date: str) -> dict:
        """Compare row counts and checksums between legacy and cloud."""
        tasks = [
            self._query_legacy(table, date),
            self._query_cloud(table, date)
        ]
        legacy_result, cloud_result = await asyncio.gather(*tasks)
        return {
            "table": table,
            "legacy_rows": legacy_result["row_count"],
            "cloud_rows": cloud_result["row_count"],
            "match": legacy_result["row_count"] == cloud_result["row_count"],
            "timestamp": datetime.utcnow().isoformat()
        }

    async def _query_legacy(self, table: str, date: str) -> dict:
        # Simulate querying Teradata/Oracle
        return {"row_count": 1423050}

    async def _query_cloud(self, table: str, date: str) -> dict:
        # Simulate querying Snowflake/BigQuery
        return {"row_count": 1423050}

async def main():
    orchestrator = HybridMigrationOrchestrator(
        legacy_config={"host": "teradata-01", "user": "etl_user"},
        cloud_config={"account": "my_account", "database": "ANALYTICS"}
    )
    result = await orchestrator.validate_parity("fact_orders", "2026-04-01")
    print(f"Parity check: {result}")

ETL vs ELT Modernization

Legacy warehouses relied on ETL: extract data, transform it in a dedicated engine, then load the transformed result. Cloud-native warehouses flip this to ELT: load raw data first, transform it in the warehouse using its massive compute capacity.

Why ELT Wins in the Cloud

Cloud warehouses have near-unlimited compute. Running transformations inside the warehouse eliminates the need for a separate transformation server. It also preserves raw data, enabling reprocessing without re-extraction.

# Python: traditional ETL approach — transform before loading
def etl_pipeline(extract_path: str, load_path: str):
    """Traditional ETL: transform in Python, then load."""
    import pandas as pd

    # Extract
    df = pd.read_csv(extract_path)

    # Transform (expensive in Python for large datasets)
    df['order_date'] = pd.to_datetime(df['order_date'])
    df['total'] = df['quantity'] * df['unit_price'] * (1 - df['discount'])
    df_aggregated = df.groupby(
        [df['order_date'].dt.strftime('%Y-%m-%d'), 'product_id']
    ).agg({'total': 'sum', 'order_id': 'count'}).reset_index()

    # Load
    df_aggregated.to_csv(load_path, index=False)
    print(f"Loaded {len(df_aggregated)} aggregated rows")

-- Modern ELT: load raw data, transform in the warehouse
-- Step 1: Load raw data (minimal transformation)
COPY INTO raw_orders
FROM @external_stage/orders/
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- Step 2: Transform in the warehouse
CREATE OR REPLACE TABLE agg_daily_sales AS
SELECT
    DATE_TRUNC('day', order_date) AS sale_date,
    product_id,
    SUM(quantity * unit_price * (1 - COALESCE(discount, 0))) AS total_revenue,
    COUNT(DISTINCT order_id) AS order_count
FROM raw_orders
GROUP BY 1, 2;

-- The raw data remains available for reprocessing or alternative transforms

dbt for Data Transformation

dbt (data build tool) has become the standard for ELT transformations. It enables version-controlled, testable SQL transformations that run inside the warehouse.

-- dbt model: staging layer
-- models/staging/stg_orders.sql

WITH source AS (
    SELECT *
    FROM {{ source('raw', 'orders') }}
),

renamed AS (
    SELECT
        id AS order_id,
        customer_id,
        product_id,
        order_date::DATE AS order_date,
        quantity::INTEGER AS quantity,
        unit_price::DECIMAL(10,2) AS unit_price,
        discount::DECIMAL(4,3) AS discount,
        CURRENT_TIMESTAMP AS loaded_at
    FROM source
    WHERE id IS NOT NULL
)

SELECT * FROM renamed

-- dbt model: fact table
-- models/marts/fact_orders.sql

WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),

line_items AS (
    SELECT * FROM {{ ref('stg_line_items') }}
),

joined AS (
    SELECT
        orders.order_id,
        orders.customer_id,
        orders.order_date,
        line_items.product_id,
        line_items.quantity,
        line_items.unit_price,
        line_items.discount,
        (line_items.quantity * line_items.unit_price)
        * (1 - COALESCE(line_items.discount, 0)) AS total_amount
    FROM orders
    LEFT JOIN line_items USING (order_id)
)

SELECT * FROM joined

# dbt project configuration
# dbt_project.yml

name: 'warehouse_modernization'
version: '1.0.0'
config-version: 2

profile: 'snowflake'

model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target"
clean-targets:
  - "target"
  - "dbt_packages"

models:
  warehouse_modernization:
    staging:
      +materialized: view
      +schema: staging
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table
      +schema: analytics

Data Modeling Evolution

Legacy warehouses predominantly used Kimball dimensional modeling — star schemas with conformed dimensions. Cloud-native platforms enable more flexible approaches.

Kimball in the Cloud

Star schemas still work well on cloud warehouses. Snowflake’s automatic micro-partitioning eliminates manual distribution key decisions. Redshift’s sort keys map naturally to dimension hierarchies.

-- Kimball-style dimension with Type 2 slowly changing dimension
CREATE TABLE dim_customer (
    customer_key   INTEGER AUTOINCREMENT PRIMARY KEY,
    customer_id    VARCHAR(50) NOT NULL,
    customer_name  VARCHAR(200) NOT NULL,
    email          VARCHAR(200),
    segment        VARCHAR(50),
    effective_date DATE NOT NULL,
    end_date       DATE,
    is_current     BOOLEAN DEFAULT TRUE
);

MERGE INTO dim_customer AS target
USING (
    SELECT
        customer_id,
        customer_name,
        email,
        segment,
        '2026-04-01'::DATE AS effective_date
    FROM raw_customers
    WHERE loaded_at = (SELECT MAX(loaded_at) FROM raw_customers)
) AS source
ON target.customer_id = source.customer_id
   AND target.is_current = TRUE
WHEN MATCHED AND (
    target.customer_name != source.customer_name
    OR target.email != source.email
    OR target.segment != source.segment
) THEN UPDATE SET
    end_date = source.effective_date,
    is_current = FALSE;

-- A follow-up INSERT then adds the new current row for each changed customer
-- (and for customers not yet in the dimension) to complete the Type 2 pattern.

Data Vault 2.0

Data Vault modeling has gained traction for large-scale cloud warehouses. It provides auditability and flexibility for enterprises that need to track every data change.

-- Data Vault hub: business keys
CREATE TABLE hub_customer (
    customer_hash   CHAR(32) PRIMARY KEY,
    customer_id     VARCHAR(50) NOT NULL UNIQUE,
    load_date       TIMESTAMP_NTZ NOT NULL,
    record_source   VARCHAR(100) NOT NULL
);

-- Data Vault satellite: descriptive attributes
CREATE TABLE sat_customer_details (
    customer_hash   CHAR(32) NOT NULL REFERENCES hub_customer(customer_hash),
    load_date       TIMESTAMP_NTZ NOT NULL,
    customer_name   VARCHAR(200),
    email           VARCHAR(200),
    segment         VARCHAR(50),
    hash_diff       CHAR(32),
    PRIMARY KEY (customer_hash, load_date)
);

-- Data Vault link: transactions
CREATE TABLE link_order (
    order_hash      CHAR(32) PRIMARY KEY,
    customer_hash   CHAR(32) NOT NULL,
    order_id        VARCHAR(50) NOT NULL UNIQUE,
    load_date       TIMESTAMP_NTZ NOT NULL,
    record_source   VARCHAR(100) NOT NULL
);

When to Use Each Approach

Approach               | Best For                                                      | Avoid When
Kimball / Star Schema  | BI tools, predictable reporting, business user comprehension | Need full historical audit trail
Data Vault 2.0         | Large scale, heterogeneous sources, need auditability        | Simple domains, small teams, fast delivery
One Big Table (OBT)    | Data science, single-table queries, ML feature stores        | Complex reporting with multiple fact grains
Lakehouse (Delta Lake) | Unified batch + streaming, ML + BI on same data              | Pure SQL tooling required

Cost Optimization

Cloud data warehouses change cost dynamics — you trade fixed capex for variable opex. Without discipline, costs can spiral.

Snowflake Cost Controls

-- Set resource monitors to prevent cost overruns
CREATE RESOURCE MONITOR monthly_budget
  WITH CREDIT_QUOTA = 5000
  FREQUENCY = MONTHLY
  START_TIMESTAMP = '2026-04-01 00:00:00'
  TRIGGERS
    ON 80 PERCENT DO NOTIFY
    ON 100 PERCENT DO NOTIFY
    ON 110 PERCENT DO SUSPEND;

ALTER WAREHOUSE etl_wh SET RESOURCE_MONITOR = monthly_budget;

BigQuery Cost Controls

-- Use partitioned tables to limit query bytes
CREATE TABLE analytics.events
PARTITION BY DATE(event_timestamp)
CLUSTER BY event_type, user_id
AS
SELECT * FROM raw_events;

-- Before running expensive queries, estimate scanned bytes with a dry run
-- (bq query --dry_run, or the estimate shown in the BigQuery console), or
-- check how much data a full scan of the table would touch:
SELECT SUM(total_logical_bytes) / POW(1024, 4) AS logical_tb
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE table_name = 'events';

Redshift Cost Controls

-- Redshift WLM queues are defined in the cluster parameter group
-- (wlm_json_configuration), not through SQL DDL. An illustrative configuration:
-- [
--   {"query_group": ["reporting"], "query_concurrency": 5,
--    "memory_percent_to_use": 40, "concurrency_scaling": "auto"},
--   {"query_group": ["etl"], "query_concurrency": 3,
--    "memory_percent_to_use": 60}
-- ]
-- Apply it with: aws redshift modify-cluster-parameter-group
--   --parameter-group-name my-wlm --parameters
--   ParameterName=wlm_json_configuration,ParameterValue='[...]'

-- Route a session's queries to a queue and cap runaway statements
SET query_group TO 'reporting';
SET statement_timeout TO 300000;  -- milliseconds

Terraform for Infrastructure Management

# Terraform: Snowflake infrastructure as code
resource "snowflake_warehouse" "etl" {
  name           = "ETL_WH"
  warehouse_size = "x-large"
  auto_suspend   = 60
  auto_resume    = true

  resource_monitor = snowflake_resource_monitor.monthly_budget.name
}

resource "snowflake_resource_monitor" "monthly_budget" {
  name         = "MONTHLY_BUDGET"
  credit_quota = 5000
  frequency    = "MONTHLY"

  # Trigger attributes follow the Snowflake Terraform provider's schema;
  # confirm the exact names against your provider version.
  notify_triggers = [80, 100]
  suspend_trigger = 110
}

resource "snowflake_database" "analytics" {
  name    = "ANALYTICS"
  comment = "Production analytics database"
}

Data Sharing and Collaboration

Cloud data warehouses enable data sharing across organizational boundaries without copying data.

Snowflake Data Sharing

-- Create a share and add objects
CREATE SHARE partner_analytics;
GRANT USAGE ON DATABASE analytics TO SHARE partner_analytics;
GRANT USAGE ON SCHEMA analytics.marts TO SHARE partner_analytics;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.marts TO SHARE partner_analytics;
ALTER SHARE partner_analytics ADD ACCOUNTS = partner_account_locator;
-- The partner queries the data from their own Snowflake account
-- No data movement, no ETL, always current

Delta Sharing (Databricks)

-- Databricks: create a Delta Share
CREATE SHARE quarterly_report;
ALTER SHARE quarterly_report ADD TABLE analytics.marts.financial_summary;
ALTER SHARE quarterly_report ADD TABLE analytics.marts.customer_metrics;
-- Recipients access the share via their own Databricks workspace
-- or via any Delta Sharing compatible client (Python, Spark, Pandas)

Multi-Cloud Considerations

Organizations increasingly operate across multiple clouds. Cloud data warehouse decisions affect data residency, failover strategy, and vendor leverage.

Snowflake: Multi-Cloud by Design

Snowflake runs on AWS, GCP, and Azure with consistent features and SQL dialect. This enables multi-cloud strategies:

-- Snowflake: replicate a database to an account on another cloud for DR
-- (replication must be enabled for both accounts in the organization;
--  account names below are placeholders)
ALTER DATABASE analytics ENABLE REPLICATION TO ACCOUNTS myorg.azure_eu_account;

-- On the secondary account: create the replica and refresh it on a schedule
CREATE DATABASE analytics AS REPLICA OF myorg.aws_eu_account.analytics;
ALTER DATABASE analytics REFRESH;

Data Residency Patterns

# Python: enforce data residency routing
class DataResidencyRouter:
    """Route data to the correct cloud region based on residency requirements."""

    REGION_MAP = {
        "EU": {"cloud": "aws", "region": "eu-central-1", "account": "EU_ACCOUNT"},
        "US": {"cloud": "aws", "region": "us-east-1", "account": "US_ACCOUNT"},
        "APAC": {"cloud": "azure", "region": "southeastasia", "account": "APAC_ACCOUNT"},
    }

    def route_table(self, table: str, residency: str):
        config = self.REGION_MAP[residency]
        print(f"Routing {table} to {config['cloud']}/{config['region']}")
        # Execute CREATE TABLE with appropriate location binding
        return config

Migration Step-by-Step Methodology

Phase 1: Assessment (Weeks 1-4)

  1. Inventory workloads. List every ETL job, dashboard, report, and ad-hoc query. Map them to tables and schemas.
  2. Profile queries. Identify the top 50 queries by execution time and frequency; these drive performance requirements (a sample profiling query follows this list).
  3. Measure data volumes. Row counts, table sizes, growth rates. Understand what must be migrated immediately and what can wait.
  4. Document dependencies. Which tables feed which reports? What are SLA expectations?
  5. Choose migration approach per workload. Lift-and-shift for 80% of tables, re-architect for 20% (the complex or high-value ones).
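
If the legacy system is Teradata, DBQL query logging gives a quick way to rank workloads by resource consumption. A minimal sketch, assuming DBQL logging is enabled and the standard dbc.dbqlogtbl table is populated:

-- Rank users and workloads by CPU over the last 30 days (Teradata DBQL)
SELECT username,
       COUNT(*)          AS query_count,
       SUM(ampcputime)   AS total_cpu_seconds,
       SUM(totaliocount) AS total_io
FROM dbc.dbqlogtbl
WHERE CAST(starttime AS DATE) >= CURRENT_DATE - 30
GROUP BY username
ORDER BY total_cpu_seconds DESC;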

Phase 2: Foundation (Weeks 5-8)

  1. Provision cloud warehouse. Set up accounts, roles, warehouses, resource monitors, network policies.
  2. Establish data loading pipeline. Set up COPY commands, external stages, and scheduled ingestion.
  3. Replicate schema. Use tools like Schema Migration Tool (Snowflake), AWS DMS, or manual DDL conversion.
  4. Validate data fidelity. Row counts, checksums, NULL distributions. Every column must match (a minimal parity check is sketched below).
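
A minimal parity check runs the same aggregate fingerprint on both systems and compares the output; the table and column names below are placeholders for your own schema:

-- Run the same query on the legacy warehouse and the cloud warehouse, then diff the results
SELECT COUNT(*)                                               AS row_count,
       SUM(total_amount)                                      AS amount_sum,
       COUNT(DISTINCT customer_key)                           AS distinct_customers,
       SUM(CASE WHEN customer_key IS NULL THEN 1 ELSE 0 END)  AS null_customer_keys
FROM fact_orders
WHERE date_key BETWEEN 20260101 AND 20260131;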

Phase 3: Migration (Weeks 9-20)

  1. Parallel run. Run legacy and cloud systems simultaneously. Route read-only workloads to the cloud.
  2. Migrate ETL jobs batch by batch. Start with daily batch jobs, move to real-time streaming last.
  3. Optimize as you go. Clustering keys, materialized views, and caching. Iterate based on actual performance (examples follow this list).
  4. Retire legacy components. Once a workload is validated in the cloud, shut down the corresponding legacy job.
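
On Snowflake, "optimize as you go" is often as simple as adding a clustering key or materializing a hot aggregate once real query patterns emerge. A sketch, reusing the fact table from earlier:

-- Add a clustering key after observing which columns queries filter on
ALTER TABLE fact_sales CLUSTER BY (date_key);

-- Materialize an aggregate that dashboards hit constantly (Enterprise edition feature)
CREATE MATERIALIZED VIEW mv_daily_sales AS
SELECT date_key,
       SUM(total_amount) AS revenue,
       COUNT(*)          AS sale_count
FROM fact_sales
GROUP BY date_key;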

Phase 4: Optimization (Ongoing)

  1. Revisit data models. After 30 days of cloud-native operation, identify star schemas that can be denormalized.
  2. Automate cost monitoring. Set up alerts for anomalous spend (a sample usage query follows this list).
  3. Train the team. Help analysts use cloud-native features — zero-copy clones, time travel, query profiling.
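
On Snowflake, credit consumption is exposed through the ACCOUNT_USAGE schema, so a scheduled query or BI alert can watch for anomalous spend. A minimal sketch:

-- Daily credits per warehouse over the last 14 days
SELECT warehouse_name,
       DATE_TRUNC('day', start_time) AS usage_day,
       SUM(credits_used)             AS credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -14, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY usage_day DESC, credits DESC;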

Common Pitfalls and How to Avoid Them

Pitfall 1: Ignoring Concurrency Requirements

Legacy warehouses handle 10-50 concurrent queries. Cloud warehouses can handle 500+ — but only if you configure concurrency properly. Many teams migrate a query that ran in 5 seconds on Teradata and find it runs in 5 seconds on Snowflake for one user, but 30 seconds for ten.

Fix: Profile your query concurrency needs before migration. In Snowflake, use multi-cluster warehouses (MAX_CLUSTER_COUNT > 1). In BigQuery, provision slot reservations sized for your peak concurrency.
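
In Snowflake, that means a multi-cluster warehouse that adds clusters as concurrent demand rises (multi-cluster warehouses require Enterprise edition or higher):

-- Let the BI warehouse scale out under concurrent load
ALTER WAREHOUSE bi_wh SET
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 5
  SCALING_POLICY    = 'STANDARD';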

Pitfall 2: Keeping the Same Data Model

The Monstrosity Fact Table — a fact table with 47 dimensions and 120 measures — does not perform better in the cloud. It performs worse, because cloud warehouses charge by bytes scanned.

Fix: Split monster fact tables into focused fact tables per business process. Use dbt to manage the refactoring.
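
For illustration only (the wide table and its columns are hypothetical), the split might look like one focused fact per business process:

-- One fact per business process instead of a single 120-measure monolith
CREATE TABLE fact_order_lines AS
SELECT order_id, date_key, product_key, customer_key,
       quantity, unit_price, discount, total_amount
FROM legacy_wide_fact;

CREATE TABLE fact_shipments AS
SELECT order_id, ship_date_key, warehouse_key, carrier_key, shipping_cost
FROM legacy_wide_fact
WHERE ship_date_key IS NOT NULL;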

Pitfall 3: Missing Cost Visibility

The biggest surprise teams face: their first Snowflake bill is 3X the Teradata maintenance cost. Not because Snowflake is expensive, but because nobody suspended warehouses after loading.

Fix: Implement resource monitors day one. Set AUTO_SUSPEND = 60 (seconds) on all warehouses. Educate every user that a running warehouse costs money.
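
Applied to the warehouses created earlier, the day-one guardrails are one-line statements:

-- Suspend after 60 seconds of inactivity and attach the resource monitor
ALTER WAREHOUSE etl_wh SET AUTO_SUSPEND = 60;
ALTER WAREHOUSE bi_wh  SET AUTO_SUSPEND = 60;
ALTER WAREHOUSE bi_wh  SET RESOURCE_MONITOR = monthly_budget;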

Pitfall 4: Not Validating Semi-Structured Data

JSON columns that worked in Oracle or MongoDB may not translate cleanly to Snowflake’s VARIANT or BigQuery’s RECORD type. Deeply nested JSON arrays with varying schemas cause silent data loss or query failures.

-- Snowflake: flatten nested JSON correctly
SELECT
    event_id,
    event_timestamp,
    payload:user:id::STRING AS user_id,
    payload:metadata:page::STRING AS page,
    item.value:product_id::STRING AS product_id,
    item.value:price::FLOAT AS price
FROM raw_events,
LATERAL FLATTEN(input => payload:items) AS item;
-- Without LATERAL FLATTEN, array items inside JSON are invisible to queries

Fix: Test with your actual JSON structure during the assessment phase. Create a test table, load 1 million records, and validate every field is accessible.

Pitfall 5: Underestimating Reverse ETL

Data warehouses traditionally served BI tools. Modern architectures require reverse ETL — pushing transformed data back into operational systems (CRMs, marketing platforms). Cloud warehouses support this natively through Snowflake Tasks, BigQuery procedures, or dedicated reverse ETL tools (Hightouch, Census).

-- Snowflake: push aggregated metrics back to Salesforce via task
CREATE TASK push_customer_metrics_to_salesforce
  WAREHOUSE = etl_wh
  SCHEDULE = '1 HOUR'
AS
  -- write_to_salesforce_api is a custom stored procedure (for example, built on
  -- an external access integration); it is not a built-in Snowflake function
  CALL write_to_salesforce_api(
    object_type => 'Account',
    data_query => $$
        SELECT
            customer_id AS External_Id__c,
            total_revenue,
            lifetime_orders
        FROM analytics.marts.customer_kpi
        WHERE last_updated >= DATEADD(hour, -1, CURRENT_TIMESTAMP())
    $$
  );

Fix: Include reverse ETL in your architecture from the start. It is not an afterthought — it is a core integration pattern.

Pitfall 6: Neglecting Data Governance

Cloud warehouses make it trivially easy to share data. That is a feature and a risk. Without proper access controls, sensitive data can leak through data shares, zero-copy clones, or even query result caching.

-- Snowflake: implement row-level security
CREATE SECURE VIEW customer_sales AS
SELECT *
FROM fact_sales
WHERE customer_id IN (
    SELECT customer_id
    FROM dim_customer
    WHERE region = CURRENT_USER_REGION()  -- stands in for a custom mapping UDF or entitlement lookup
);

-- Mask sensitive columns
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('DATA_STEWARD', 'ADMIN') THEN val
    ELSE CONCAT(LEFT(val, 2), '****@', SPLIT_PART(val, '@', -1))
  END;

ALTER TABLE dim_customer
  MODIFY COLUMN email
  SET MASKING POLICY email_mask;

Conclusion

Data warehouse modernization is not a technology problem — it is a migration problem. The technology choices are clear: Snowflake, BigQuery, Redshift, and Databricks all deliver storage-compute separation, elastic scaling, and cloud-native economics. The hard part is the transition — inventorying workloads, managing parallel runs, validating data fidelity, and retraining teams.

Start with a thorough assessment. Choose lift-and-shift for the bulk of your tables to move quickly. Re-architect the 20% of workloads that drive 80% of the value. Implement cost controls on day one, not after the first surprise bill. Validate every row of data during migration. And build governance into the architecture from the start — retrofitting access controls is far harder than designing them in.

Organizations that execute this transition methodically reduce total cost of ownership by 30-50%, eliminate capacity planning cycles, and deliver analytics insights in minutes instead of hours. The effort is substantial. The payoff is transformative.
