Introduction
Your legacy Teradata appliance is nearing end-of-life, your Oracle Exadata license renewal is due, and your analytics team is waiting hours for queries to finish. You are not alone. Organizations running on-premises data warehouses face mounting pressure to modernize — not because cloud is trendy, but because the gap between what modern analytics demands and what legacy systems deliver has become untenable. This article walks through every dimension of data warehouse modernization: why legacy systems fail, what cloud-native architectures offer, how to migrate safely, and what pitfalls to avoid. You will come away with a concrete migration playbook you can adapt to your organization.
Legacy Data Warehouse Challenges
Teradata: Massive Parallel Processing Limitations
Teradata pioneered massively parallel processing (MPP) for data warehousing. Its shared-nothing architecture distributes data across AMPs (Access Module Processors), each handling a slice of the data. This design works well for predictable, batch-oriented workloads. It breaks down when you need elasticity.
-- Teradata: adding a node requires redistributing all data
-- This operation can take days for multi-terabyte deployments
-- (illustrative pseudo-SQL — in practice node expansion is a hardware
-- and reconfiguration procedure, not a single SQL statement)
ALTER SYSTEM ADD NODE 'node5';
-- After the new node joins, data must be redistributed across all AMPs
-- Downtime is typically required
The procurement cycle for Teradata hardware runs 8-16 weeks. If your data volume grows 40% year-over-year, you must forecast capacity 18 months out and buy for the peak. That is capital expenditure at its most wasteful — you pay for hardware that sits idle 60-70% of the time.
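The capacity math is worth making explicit. A toy Python model (the 40% growth rate and 18-month horizon are the figures above; the rest is assumed) shows how much of a peak-sized purchase sits idle on installation day:

```python
# Toy capacity model: buy for the peak at the end of the procurement
# horizon, and measure how much of that hardware is unused on day one.

def peak_capacity_needed(current_tb: float, annual_growth: float,
                         horizon_months: int) -> float:
    """Capacity required at the end of the procurement horizon."""
    return current_tb * (1 + annual_growth) ** (horizon_months / 12)

def initial_idle_fraction(current_tb: float, annual_growth: float,
                          horizon_months: int) -> float:
    """Fraction of the purchased capacity unused on installation day."""
    peak = peak_capacity_needed(current_tb, annual_growth, horizon_months)
    return 1 - current_tb / peak

idle = initial_idle_fraction(current_tb=100, annual_growth=0.40,
                             horizon_months=18)
print(f"Idle on day one: {idle:.0%}")
```

Even before growth materializes, roughly 40% of the capacity you paid for is dark — and that fraction only shrinks as you approach the next procurement cycle, at which point the cycle repeats.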
Oracle Exadata: Licensing and Lock-In
Oracle Exadata combines specialized hardware with Oracle Database to deliver high-throughput analytics. The catch: licensing costs scale linearly (or worse) with data volume. A typical Exadata deployment with 100 TB of usable storage costs $500,000-$1,000,000 per year in licensing alone, plus hardware maintenance.
-- Oracle: partitioning helps but requires careful design
CREATE TABLE sales (
sale_id NUMBER,
sale_date DATE,
amount NUMBER
)
PARTITION BY RANGE (sale_date) (
PARTITION p2024 VALUES LESS THAN (TO_DATE('2025-01-01','YYYY-MM-DD')),
PARTITION p2025 VALUES LESS THAN (TO_DATE('2026-01-01','YYYY-MM-DD')),
PARTITION p_max VALUES LESS THAN (MAXVALUE)
);
-- Partition pruning helps, but you still pay license fees per core
-- Adding partitions requires DDL locks on the table
Oracle’s pricing model punishes growth. More data means more storage, which means more cores, which means a disproportionately higher license bill. Organizations often find themselves paying more for Oracle licenses than for the hardware or cloud infrastructure itself.
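To see why growth is punished, a back-of-envelope sketch helps (the per-core price and data-per-core ratio below are assumptions for illustration, not Oracle's price list): tie the license bill to core count and core count to data volume, and the bill tracks growth one-for-one before any discount negotiation.

```python
import math

# Assumed figures for illustration only:
PER_CORE_LICENSE = 47_500   # assumed annual cost per processor license
TB_PER_CORE = 2.0           # assumed data volume each core can serve

def annual_license_cost(data_tb: float) -> float:
    """License bill under a simple cores-follow-data model."""
    cores = math.ceil(data_tb / TB_PER_CORE)
    return cores * PER_CORE_LICENSE

for tb in (100, 200, 400):
    print(f"{tb} TB -> ${annual_license_cost(tb):,.0f}/yr")
```

Double the data, double the cores, double the bill — and that is the optimistic linear case, before edition upgrades or add-on options raise the per-core price.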
IBM Netezza: Proprietary Hardware
IBM Netezza used FPGA-accelerated appliances that were fast for their era but impossible to scale incrementally. The TwinFin and Striper appliances were fixed configurations. You bought a unit, bolted it on, and hoped the data redistribution completed before your next batch window.
# Netezza administration: checking appliance health (output illustrative)
nzstate
# System state is 'Online'.
# nzhw lists component status; a degraded SPU or disk
# means the entire appliance runs slower
# Replacing a component requires IBM field service — 4-8 hour SLA minimum
Netezza’s end-of-life announcements drove many organizations to urgent modernization programs. When IBM withdrew the standalone Netezza appliances from sale in the mid-2010s and support windows began to close, companies with petabyte-scale Netezza deployments had to replatform on a deadline, with no in-place upgrade path.
Common Legacy Pain Points
Beyond vendor-specific issues, legacy warehouses share structural problems:
Rigid scaling. All three platforms require adding hardware nodes in fixed increments. You cannot scale compute without scaling storage, even when you only need more query throughput.
Mixed workload contention. Legacy systems cannot isolate ETL workloads from BI queries. A long-running aggregation job slows dashboards for everyone.
Disaster recovery complexity. Netezza and Teradata disaster recovery requires active-passive standby deployments that double infrastructure cost. Oracle Data Guard helps but adds licensing complexity.
Limited semi-structured data support. Legacy systems were designed for relational data. JSON, Parquet, and Avro require complex ETL transformations before loading.
Query performance degradation at scale. As data grows, query performance degrades non-linearly because statistics become stale, partitions grow uneven, and the optimizer cannot keep up.
-- Typical legacy pain point: stale statistics ruin query performance
-- In Teradata:
COLLECT STATISTICS ON sales COLUMN (sale_date);
-- If you forget to recollect after large loads, the optimizer makes bad decisions
-- and queries that ran in 2 minutes now run in 2 hours
Cloud-Native Architecture: Storage-Compute Separation
The defining architectural shift in cloud data warehouses is separating storage from compute. In legacy systems, storage and compute are coupled — each node has disks and CPUs, and adding capacity means adding both. Cloud-native warehouses decouple the two, enabling each to scale independently.
How Storage-Compute Separation Works
Cloud data warehouses store data in object storage (Amazon S3, Google Cloud Storage, Azure Blob Storage) and use a separate compute cluster for query execution. The storage layer is shared and durable; the compute layer is ephemeral and elastic.
┌─────────────────────┐
│     Query Layer     │
│ (Virtual Warehouses)│
│  ┌───┐ ┌───┐ ┌───┐  │
│  │VW1│ │VW2│ │VW3│  │  ← independent scaling
│  └───┘ └───┘ └───┘  │
└─────────┬───────────┘
          │
┌─────────┴───────────┐
│  Metadata & Caching │
│   (Cloud Service)   │
└─────────┬───────────┘
          │
┌─────────┴───────────┐
│    Storage Layer    │
│    (S3/GCS/Blob)    │
│  ┌───┐ ┌───┐ ┌───┐  │  ← compressed, columnar
│  │ P │ │ P │ │ P │  │     Parquet/ORC-style files
│  └───┘ └───┘ └───┘  │
└─────────────────────┘
Benefits include:
- Compute elasticity. Spin up 10X compute for a monthly report, then shut it down. Pay only for the seconds it runs.
- Storage durability. Object storage provides 99.999999999% (eleven nines) durability. No RAID, no replication management.
- Workload isolation. Run ETL on one compute cluster, BI on another, data science on a third. They share the same data without contention.
- Instant clones. Create zero-copy clones of multi-terabyte databases for development, testing, or what-if analysis.
-- Snowflake: create a zero-copy clone for testing
CREATE DATABASE dev_db CLONE prod_db;
-- The clone consumes no additional storage until you modify data
-- Legacy appliances have no equivalent: copying a database there
-- means a full physical copy
-- Snowflake: independent compute scaling
CREATE WAREHOUSE etl_wh
WAREHOUSE_SIZE = 'X-LARGE'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE;
CREATE WAREHOUSE bi_wh
WAREHOUSE_SIZE = 'MEDIUM'
AUTO_SUSPEND = 300
AUTO_RESUME = TRUE;
-- ETL warehouse and BI warehouse share the same data
-- but never compete for compute resources
Cloud Data Warehouse Comparison
| Feature | Snowflake | Google BigQuery | Amazon Redshift | Databricks SQL |
|---|---|---|---|---|
| Architecture | Storage-compute separated | Serverless (Dremel) | Mixed (RA3: separated) | Delta Lake + Photon |
| Scaling | Auto-scale virtual warehouses | Automatic slot-based | Elastic resize + concurrency scaling | Auto-scaling clusters |
| Storage format | Proprietary compressed | Capacitor columnar | RA3 managed storage | Delta Lake (Parquet) |
| Pricing model | Compute + storage (per sec) | Per-slot or on-demand (per TB) | Per-node hourly + storage | DBU per hour |
| Data sharing | Reader accounts + Marketplace | Authorized views + Analytics Hub | Datashare | Delta Sharing |
| Semi-structured | VARIANT type (JSON, Avro, XML) | RECORD/REPEATED + JSON functions | SUPER type | STRUCT/ARRAY + Spark |
| Time travel | 1-90 days | 7 days (default) | 1-2 days (snapshots) | Delta time travel (configurable) |
| Partitioning | Automatic micro-partitions | Declared (ingestion-time or column) | Distribution + sort keys | Delta partitioning + Z-order |
| Minimum compute | Resume from 0 | 0 (serverless) | Minimum 2 nodes | Depends on cluster config |
| Query language | SQL + Snowpark (Python/Java) | SQL + UDFs (JS/SQL/Python) | SQL + Python UDFs | SQL + Python + R + Scala |
| Multi-cloud | AWS, GCP, Azure | GCP only | AWS only | AWS, GCP, Azure |
When Each Platform Excels
Snowflake suits organizations that value simplicity and data sharing. Its separation of storage and compute is the cleanest implementation. The VARIANT data type handles JSON without schema definition, making it ideal for semi-structured data.
BigQuery excels for serverless, high-throughput analytics. There is no infrastructure to manage — no clusters, no warehouses, no nodes. It is the best choice for Google Cloud-native shops and organizations that want zero ops overhead.
Redshift works well for AWS-native organizations with predictable, heavy ETL workloads. RA3 nodes provide storage-compute separation, and Redshift Spectrum queries external data in S3 directly. Redshift can deliver strong price-performance for consistently large, steady workloads.
Databricks SQL (Photon engine) excels when your data pipeline already uses Apache Spark. The Delta Lake format provides ACID transactions, schema enforcement, and time travel on data lake storage. It bridges data warehousing and data science on a single platform.
# Python: querying across cloud warehouses with SQLAlchemy
from sqlalchemy import create_engine

def get_warehouse_connection(warehouse_type: str, config: dict):
    """Return a SQLAlchemy engine for the specified warehouse."""
    connections = {
        "snowflake": f"snowflake://{config['user']}:{config['password']}@{config['account']}/{config['database']}",
        "redshift": f"redshift+psycopg2://{config['user']}:{config['password']}@{config['host']}:5439/{config['database']}",
        "bigquery": f"bigquery://{config['project_id']}"
    }
    return create_engine(connections[warehouse_type])

# Usage: run a monthly revenue rollup
# (note: SQL dialects differ — BigQuery, for example, expects
# DATE_TRUNC(order_date, MONTH) rather than DATE_TRUNC('month', order_date))
query = """
SELECT DATE_TRUNC('month', order_date) AS month,
       COUNT(*) AS order_count,
       SUM(total) AS revenue
FROM orders
WHERE order_date >= '2026-01-01'
GROUP BY 1
ORDER BY 1
"""
Migration Strategies
Organizations choose from four migration approaches, each with different risk, cost, and benefit profiles.
Lift-and-Shift (Rehost)
Copy existing schemas, data, and ETL pipelines to the cloud warehouse with minimal changes. You keep the same data model, same transformation logic, same batch schedule.
Best for: Organizations under time pressure (license expiry, hardware EOL). You need to move fast and plan to optimize later.
-- Lift-and-shift example: a Kimball-style fact table
-- On Snowflake, this works with minimal changes from Teradata
CREATE TABLE fact_sales (
sale_id BIGINT AUTOINCREMENT,
date_key INTEGER NOT NULL,
product_key INTEGER NOT NULL,
customer_key INTEGER NOT NULL,
quantity INTEGER NOT NULL,
unit_price DECIMAL(10,2) NOT NULL,
discount DECIMAL(10,2) DEFAULT 0,
total_amount DECIMAL(12,2) NOT NULL
)
CLUSTER BY (date_key);
Risks: You carry forward legacy inefficiencies. Partitioning strategies tuned for Teradata’s AMP architecture may perform poorly on Snowflake’s micro-partitions. You miss opportunities to simplify the data model.
Replatform
Move to the cloud warehouse with moderate changes — adopt cloud-native file formats (Parquet over CSV), use COPY commands instead of bulk loaders, and adjust partitioning strategies.
-- Replatform: adopt cloud-native staging patterns
-- Instead of loading from Netezza's nzload, dry-run the load first:
COPY INTO stage_orders
FROM 's3://data-landing/orders/2026/'
PATTERN = '.*\.parquet'
FILE_FORMAT = (TYPE = PARQUET)
VALIDATION_MODE = RETURN_ERRORS;
-- VALIDATION_MODE checks the files without loading any data;
-- rerun the COPY without it to perform the actual load
Best for: Organizations that want to start optimizing without a full rewrite. Replatforming typically takes 2-4 months and captures 60-70% of cloud-native benefits.
Re-Architect
Redesign the entire data platform: data models, ingestion pipelines, transformation logic, and consumption patterns. This means adopting ELT over ETL, implementing dbt for transformation, and restructuring star schemas for cloud-native performance.
-- Re-architect: use Snowflake streams and tasks for incremental processing
-- Define a stream to capture changes
CREATE STREAM orders_stream ON TABLE orders;
-- Create a task that processes the stream incrementally
CREATE TASK process_orders
WAREHOUSE = etl_wh
SCHEDULE = '5 MINUTES'
WHEN
SYSTEM$STREAM_HAS_DATA('orders_stream')
AS
MERGE INTO fact_orders f
USING (
    SELECT date_key, product_key, customer_key,
           COUNT(*) AS order_count,
           SUM(amount) AS total_amount
    FROM orders_stream
    -- Streams surface updates as DELETE + INSERT pairs;
    -- 'UPDATE' is not a METADATA$ACTION value
    WHERE METADATA$ACTION = 'INSERT'
    GROUP BY date_key, product_key, customer_key
) s
ON f.date_key = s.date_key
AND f.product_key = s.product_key
AND f.customer_key = s.customer_key
WHEN MATCHED THEN UPDATE SET
    order_count = f.order_count + s.order_count,
    total_amount = f.total_amount + s.total_amount
WHEN NOT MATCHED THEN INSERT
    (date_key, product_key, customer_key, order_count, total_amount)
    VALUES (s.date_key, s.product_key, s.customer_key,
            s.order_count, s.total_amount);
Best for: Organizations starting fresh or facing fundamental limitations in their current architecture. The investment is substantial (6-12 months) but delivers the highest return.
Hybrid / Phased Approach
Maintain some workloads on the legacy system while moving others to the cloud. This is the most common real-world pattern.
# Python: orchestrating a hybrid migration
import asyncio
from datetime import datetime, timezone

class HybridMigrationOrchestrator:
    """Coordinate data flows between legacy and cloud warehouses."""

    def __init__(self, legacy_config: dict, cloud_config: dict):
        self.legacy = legacy_config
        self.cloud = cloud_config

    async def validate_parity(self, table: str, date: str) -> dict:
        """Compare row counts between legacy and cloud."""
        legacy_result, cloud_result = await asyncio.gather(
            self._query_legacy(table, date),
            self._query_cloud(table, date)
        )
        return {
            "table": table,
            "legacy_rows": legacy_result["row_count"],
            "cloud_rows": cloud_result["row_count"],
            "match": legacy_result["row_count"] == cloud_result["row_count"],
            "timestamp": datetime.now(timezone.utc).isoformat()
        }

    async def _query_legacy(self, table: str, date: str) -> dict:
        # Simulate querying Teradata/Oracle
        return {"row_count": 1423050}

    async def _query_cloud(self, table: str, date: str) -> dict:
        # Simulate querying Snowflake/BigQuery
        return {"row_count": 1423050}

async def main():
    orchestrator = HybridMigrationOrchestrator(
        legacy_config={"host": "teradata-01", "user": "etl_user"},
        cloud_config={"account": "my_account", "database": "ANALYTICS"}
    )
    result = await orchestrator.validate_parity("fact_orders", "2026-04-01")
    print(f"Parity check: {result}")

asyncio.run(main())
ETL vs ELT Modernization
Legacy warehouses relied on ETL: extract data, transform it in a dedicated engine, then load the transformed result. Cloud-native warehouses flip this to ELT: load raw data first, transform it in the warehouse using its massive compute capacity.
Why ELT Wins in the Cloud
Cloud warehouses have near-unlimited compute. Running transformations inside the warehouse eliminates the need for a separate transformation server. It also preserves raw data, enabling reprocessing without re-extraction.
# Python: traditional ETL approach — transform before loading
import pandas as pd

def etl_pipeline(extract_path: str, load_path: str):
    """Traditional ETL: transform in Python, then load."""
    # Extract
    df = pd.read_csv(extract_path)
    # Transform (expensive in Python for large datasets)
    df['order_date'] = pd.to_datetime(df['order_date'])
    df['total'] = df['quantity'] * df['unit_price'] * (1 - df['discount'])
    df_aggregated = df.groupby(
        [df['order_date'].dt.strftime('%Y-%m-%d'), 'product_id']
    ).agg({'total': 'sum', 'order_id': 'count'}).reset_index()
    # Load
    df_aggregated.to_csv(load_path, index=False)
    print(f"Loaded {len(df_aggregated)} aggregated rows")
-- Modern ELT: load raw data, transform in the warehouse
-- Step 1: Load raw data (minimal transformation)
COPY INTO raw_orders
FROM @external_stage/orders/
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
-- Step 2: Transform in the warehouse
CREATE OR REPLACE TABLE agg_daily_sales AS
SELECT
DATE_TRUNC('day', order_date) AS sale_date,
product_id,
SUM(quantity * unit_price * (1 - COALESCE(discount, 0))) AS total_revenue,
COUNT(DISTINCT order_id) AS order_count
FROM raw_orders
GROUP BY 1, 2;
-- The raw data remains available for reprocessing or alternative transforms
dbt for Data Transformation
dbt (data build tool) has become the standard for ELT transformations. It enables version-controlled, testable SQL transformations that run inside the warehouse.
-- dbt model: staging layer
-- models/staging/stg_orders.sql
WITH source AS (
SELECT *
FROM {{ source('raw', 'orders') }}
),
renamed AS (
SELECT
id AS order_id,
customer_id,
product_id,
order_date::DATE AS order_date,
quantity::INTEGER AS quantity,
unit_price::DECIMAL(10,2) AS unit_price,
discount::DECIMAL(4,3) AS discount,
CURRENT_TIMESTAMP AS loaded_at
FROM source
WHERE id IS NOT NULL
)
SELECT * FROM renamed
-- dbt model: fact table
-- models/marts/fact_orders.sql
WITH orders AS (
SELECT * FROM {{ ref('stg_orders') }}
),
line_items AS (
SELECT * FROM {{ ref('stg_line_items') }}
),
joined AS (
SELECT
orders.order_id,
orders.customer_id,
orders.order_date,
line_items.product_id,
line_items.quantity,
line_items.unit_price,
line_items.discount,
(line_items.quantity * line_items.unit_price)
* (1 - COALESCE(line_items.discount, 0)) AS total_amount
FROM orders
LEFT JOIN line_items USING (order_id)
)
SELECT * FROM joined
# dbt project configuration
# dbt_project.yml
name: 'warehouse_modernization'
version: '1.0.0'
config-version: 2
profile: 'snowflake'

model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target"
clean-targets:
  - "target"
  - "dbt_packages"

models:
  warehouse_modernization:
    staging:
      +materialized: view
      +schema: staging
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table
      +schema: analytics
Data Modeling Evolution
Legacy warehouses predominantly used Kimball dimensional modeling — star schemas with conformed dimensions. Cloud-native platforms enable more flexible approaches.
Kimball in the Cloud
Star schemas still work well on cloud warehouses. Snowflake’s automatic micro-partitioning eliminates manual distribution key decisions. Redshift’s sort keys map naturally to dimension hierarchies.
-- Kimball-style dimension with Type 2 slowly changing dimension
CREATE TABLE dim_customer (
customer_key INTEGER AUTOINCREMENT PRIMARY KEY,
customer_id VARCHAR(50) NOT NULL,
customer_name VARCHAR(200) NOT NULL,
email VARCHAR(200),
segment VARCHAR(50),
effective_date DATE NOT NULL,
end_date DATE,
is_current BOOLEAN DEFAULT TRUE
);
-- Step 1: expire the current row when attributes change
MERGE INTO dim_customer AS target
USING (
    SELECT
        customer_id,
        customer_name,
        email,
        segment,
        '2026-04-01'::DATE AS effective_date
    FROM raw_customers
    WHERE loaded_at = (SELECT MAX(loaded_at) FROM raw_customers)
) AS source
ON target.customer_id = source.customer_id
AND target.is_current = TRUE
WHEN MATCHED AND (
    target.customer_name != source.customer_name
    OR target.email != source.email
    OR target.segment != source.segment
) THEN UPDATE SET
    end_date = source.effective_date,
    is_current = FALSE;
-- Step 2 (not shown): INSERT the new current versions — a Type 2 load
-- always pairs this expiry MERGE with an insert of the changed rows
Data Vault 2.0
Data Vault modeling has gained traction for large-scale cloud warehouses. It provides auditability and flexibility for enterprises that need to track every data change.
-- Data Vault hub: business keys
CREATE TABLE hub_customer (
customer_hash CHAR(32) PRIMARY KEY,
customer_id VARCHAR(50) NOT NULL UNIQUE,
load_date TIMESTAMP_NTZ NOT NULL,
record_source VARCHAR(100) NOT NULL
);
-- Data Vault satellite: descriptive attributes
CREATE TABLE sat_customer_details (
customer_hash CHAR(32) NOT NULL REFERENCES hub_customer(customer_hash),
load_date TIMESTAMP_NTZ NOT NULL,
customer_name VARCHAR(200),
email VARCHAR(200),
segment VARCHAR(50),
hash_diff CHAR(32),
PRIMARY KEY (customer_hash, load_date)
);
-- Data Vault link: transactions
CREATE TABLE link_order (
order_hash CHAR(32) PRIMARY KEY,
customer_hash CHAR(32) NOT NULL,
order_id VARCHAR(50) NOT NULL UNIQUE,
load_date TIMESTAMP_NTZ NOT NULL,
record_source VARCHAR(100) NOT NULL
);
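The hash columns above are typically populated with a deterministic hash of the business key (and, for hash_diff, the concatenated descriptive attributes) so satellite loads can detect changes with a single comparison. A minimal Python sketch of that convention, assuming MD5 and a `||` delimiter as is common in Data Vault 2.0 practice:

```python
import hashlib

# Data Vault hash-key convention (one common variant): normalize each
# part (trim, upper-case), join with a delimiter, and hash. The result
# fits the CHAR(32) columns in the DDL above.

def dv_hash(*parts: str) -> str:
    """MD5 over upper-cased, trimmed parts joined by '||'."""
    normalized = "||".join(p.strip().upper() for p in parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

customer_hash = dv_hash("CUST-0042")
hash_diff = dv_hash("Ada Lovelace", "ada@example.com", "enterprise")
print(customer_hash, hash_diff)
```

Because the hash is computed from the business key alone, any source system can derive the same hub key independently — no sequence generator or lookup against the hub is needed during parallel loads.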
When to Use Each Approach
| Approach | Best For | Avoid When |
|---|---|---|
| Kimball / Star Schema | BI tools, predictable reporting, business user comprehension | Need full historical audit trail |
| Data Vault 2.0 | Large scale, heterogeneous sources, need auditability | Simple domains, small teams, fast delivery |
| One Big Table (OBT) | Data science, single-table queries, ML feature stores | Complex reporting with multiple fact grains |
| Lakehouse (Delta Lake) | Unified batch + streaming, ML + BI on same data | Pure SQL tooling required |
Cost Optimization
Cloud data warehouses change cost dynamics — you trade fixed capex for variable opex. Without discipline, costs can spiral.
Snowflake Cost Controls
-- Set resource monitors to prevent cost overruns
CREATE RESOURCE MONITOR monthly_budget
WITH CREDIT_QUOTA = 5000
FREQUENCY = MONTHLY
START_TIMESTAMP = '2026-04-01 00:00:00'
TRIGGERS
ON 80 PERCENT DO NOTIFY
ON 100 PERCENT DO NOTIFY
ON 110 PERCENT DO SUSPEND;
ALTER WAREHOUSE etl_wh SET RESOURCE_MONITOR = monthly_budget;
BigQuery Cost Controls
-- Use partitioned tables to limit query bytes
CREATE TABLE analytics.events
PARTITION BY DATE(event_timestamp)
CLUSTER BY event_type, user_id
AS
SELECT * FROM raw_events;
-- Before running expensive queries, estimate cost with a dry run
-- (the CLI equivalent of the UI's query validator):
--   bq query --dry_run --use_legacy_sql=false 'SELECT ... FROM analytics.events'
-- To see how much storage a table holds:
SELECT SUM(total_logical_bytes) / POW(1024, 4) AS table_size_tb
FROM `region-us.INFORMATION_SCHEMA.TABLE_STORAGE`
WHERE table_name = 'events';
Redshift Cost Controls
-- Redshift workload management (WLM) is configured through the cluster
-- parameter group, not SQL DDL. An illustrative wlm_json_configuration:
-- [
--   {"query_group": ["reporting"], "query_concurrency": 5,
--    "memory_percent_to_use": 40, "concurrency_scaling": "auto"},
--   {"query_group": ["etl"], "query_concurrency": 3,
--    "memory_percent_to_use": 60}
-- ]
-- Concurrency scaling ("auto") absorbs unpredictable BI spikes;
-- Auto WLM is the simpler alternative — it manages slots and memory for you.
Terraform for Infrastructure Management
# Terraform: Snowflake infrastructure as code
# (attribute names follow the Snowflake provider loosely and vary by
# provider version — treat this as a sketch, not copy-paste config)
resource "snowflake_warehouse" "etl" {
name = "ETL_WH"
warehouse_size = "x-large"
auto_suspend = 60
auto_resume = true
resource_monitor = snowflake_resource_monitor.monthly_budget.name
}
resource "snowflake_resource_monitor" "monthly_budget" {
name = "MONTHLY_BUDGET"
credit_quota = 5000
frequency = "MONTHLY"
triggers {
threshold_percent = 80
action = "NOTIFY"
}
triggers {
threshold_percent = 110
action = "SUSPEND"
}
}
resource "snowflake_database" "analytics" {
name = "ANALYTICS"
comment = "Production analytics database"
}
Data Sharing and Collaboration
Cloud data warehouses enable data sharing across organizational boundaries without copying data.
Snowflake Data Sharing
-- Create a share and add objects
CREATE SHARE partner_analytics;
GRANT USAGE ON DATABASE analytics TO SHARE partner_analytics;
GRANT USAGE ON SCHEMA analytics.marts TO SHARE partner_analytics;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.marts TO SHARE partner_analytics;
ALTER SHARE partner_analytics ADD ACCOUNTS = partner_org.partner_account;
-- The partner queries the data from their own Snowflake account
-- No data movement, no ETL, always current
Delta Sharing (Databricks)
-- Databricks: create a Delta Share
CREATE SHARE quarterly_report;
ALTER SHARE quarterly_report ADD TABLE analytics.marts.financial_summary;
ALTER SHARE quarterly_report ADD TABLE analytics.marts.customer_metrics;
-- Recipients access the share via their own Databricks workspace
-- or via any Delta Sharing compatible client (Python, Spark, Pandas)
Multi-Cloud Considerations
Organizations increasingly operate across multiple clouds. Cloud data warehouse decisions affect data residency, failover strategy, and vendor leverage.
Snowflake: Multi-Cloud by Design
Snowflake runs on AWS, GCP, and Azure with consistent features and SQL dialect. This enables multi-cloud strategies:
-- Snowflake: replicate databases across clouds for disaster recovery
CREATE REPLICATION GROUP dr_group
    OBJECT_TYPES = DATABASES, SHARES
    ALLOWED_DATABASES = analytics
    ALLOWED_SHARES = partner_analytics
    ALLOWED_ACCOUNTS = myorg.dr_account_azure;
-- The secondary account (on another cloud or region) then creates a
-- replica of the group and schedules refreshes
Data Residency Patterns
# Python: enforce data residency routing
class DataResidencyRouter:
    """Route data to the correct cloud region based on residency requirements."""

    REGION_MAP = {
        "EU": {"cloud": "aws", "region": "eu-central-1", "account": "EU_ACCOUNT"},
        "US": {"cloud": "aws", "region": "us-east-1", "account": "US_ACCOUNT"},
        "APAC": {"cloud": "azure", "region": "southeastasia", "account": "APAC_ACCOUNT"},
    }

    def route_table(self, table: str, residency: str) -> dict:
        config = self.REGION_MAP[residency]
        print(f"Routing {table} to {config['cloud']}/{config['region']}")
        # Execute CREATE TABLE with appropriate location binding
        return config
Migration Step-by-Step Methodology
Phase 1: Assessment (Weeks 1-4)
- Inventory workloads. List every ETL job, dashboard, report, and ad-hoc query. Map them to tables and schemas.
- Profile queries. Identify top-50 queries by execution time and frequency. These drive performance requirements.
- Measure data volumes. Row counts, table sizes, growth rates. Understand what must be migrated immediately and what can wait.
- Document dependencies. Which tables feed which reports? What are SLA expectations?
- Choose migration approach per workload. Lift-and-shift for 80% of tables, re-architect for 20% (the complex or high-value ones).
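The query-profiling step above can be sketched in a few lines. Assuming you have exported a query log with per-query average runtime and run counts (the field names here are hypothetical — adapt them to whatever your legacy system's query log provides), ranking by total time consumed surfaces the top-50 list:

```python
# Rank queries by total execution time (avg runtime x run count) so the
# top-N drive the migration's performance requirements.
from collections import namedtuple

QueryStat = namedtuple("QueryStat", "query_id avg_seconds runs")

def top_queries(log: list, n: int = 50) -> list:
    """Return the n queries consuming the most total execution time."""
    totals = [(q.query_id, q.avg_seconds * q.runs) for q in log]
    return sorted(totals, key=lambda t: t[1], reverse=True)[:n]

log = [
    QueryStat("daily_rollup", 120.0, 30),     # 3,600 s total
    QueryStat("exec_dashboard", 5.0, 2000),   # 10,000 s total
    QueryStat("adhoc_export", 900.0, 2),      # 1,800 s total
]
print(top_queries(log, n=2))
```

Note how ranking by total time, not average time, reorders the list: a fast dashboard query that runs thousands of times a day matters more than a slow export that runs twice.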
Phase 2: Foundation (Weeks 5-8)
- Provision cloud warehouse. Set up accounts, roles, warehouses, resource monitors, network policies.
- Establish data loading pipeline. Set up COPY commands, external stages, and scheduled ingestion.
- Replicate schema. Use tools like Schema Migration Tool (Snowflake), AWS DMS, or manual DDL conversion.
- Validate data fidelity. Row counts, checksums, NULL distributions. Every column must match.
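Row counts alone miss corrupted or reordered values. One way to sketch the column-level fidelity check (an illustrative helper, not a vendor tool) is an order-independent checksum computed the same way on both sides:

```python
import hashlib

# Column fingerprint: row count, NULL count, and an XOR of per-value
# hashes. XOR is order-independent, so it tolerates the two systems
# returning rows in different orders.

def column_fingerprint(values: list) -> dict:
    non_null = [v for v in values if v is not None]
    acc = 0
    for v in non_null:
        acc ^= int(hashlib.md5(str(v).encode()).hexdigest(), 16)
    return {"rows": len(values),
            "nulls": len(values) - len(non_null),
            "checksum": acc}

legacy = column_fingerprint([10, 20, None, 30])
cloud = column_fingerprint([30, 10, None, 20])   # same data, new order
print(legacy == cloud)
```

In practice you would compute the equivalent aggregate inside each warehouse (e.g., a SUM of per-row hashes) rather than pulling values into Python; the principle — compare order-independent aggregates per column — is the same.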
Phase 3: Migration (Weeks 9-20)
- Parallel run. Run legacy and cloud systems simultaneously. Route read-only workloads to the cloud.
- Migrate ETL jobs batch by batch. Start with daily batch jobs, move to real-time streaming last.
- Optimize as you go. Clustering keys, materialized views, and caching. Iterate based on actual performance.
- Retire legacy components. Once a workload is validated in the cloud, shut down the corresponding legacy job.
Phase 4: Optimization (Ongoing)
- Revisit data models. After 30 days of cloud-native operation, identify star schemas that can be denormalized.
- Automate cost monitoring. Set up alerts for anomalous spend.
- Train the team. Help analysts use cloud-native features — zero-copy clones, time travel, query profiling.
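The cost-monitoring item above can start very simply. A sketch of a daily spend-anomaly check — the z-score threshold and trailing window are assumed starting points to tune against your own workload:

```python
from statistics import mean, stdev

# Flag a day whose credit spend sits more than z standard deviations
# above the trailing baseline.

def is_spend_anomaly(history: list, today: float, z: float = 3.0) -> bool:
    """True when today's spend is an outlier versus the trailing window."""
    mu, sigma = mean(history), stdev(history)
    return sigma > 0 and today > mu + z * sigma

baseline = [40.0, 42.0, 38.0, 41.0, 39.0, 43.0, 40.0]
print(is_spend_anomaly(baseline, today=95.0))  # spike
print(is_spend_anomaly(baseline, today=44.0))  # normal variation
```

Feed it from the warehouse's own metering views (Snowflake's WAREHOUSE_METERING_HISTORY, BigQuery's INFORMATION_SCHEMA jobs data) and wire the True branch to a pager or Slack alert.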
Common Pitfalls and How to Avoid Them
Pitfall 1: Ignoring Concurrency Requirements
Legacy warehouses handle 10-50 concurrent queries. Cloud warehouses can handle 500+ — but only if you configure concurrency properly. Many teams migrate a query that ran in 5 seconds on Teradata and find it runs in 5 seconds on Snowflake for one user, but 30 seconds for ten.
Fix: Profile your query concurrency needs before migration. In Snowflake, use multi-cluster warehouses (e.g., MAX_CLUSTER_COUNT = 5). In BigQuery, purchase slot reservations sized for peak concurrency.
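The sizing intuition behind a multi-cluster setting can be captured with back-of-envelope arithmetic. The per-cluster concurrency figure below is an assumption — measure it for your own query mix rather than trusting a constant:

```python
import math

# Assumed: one cluster sustains ~8 concurrent queries of this shape
# before queueing sets in. How many clusters does peak load need?

def clusters_needed(peak_concurrency: int, per_cluster: int = 8) -> int:
    """Smallest cluster count that serves peak concurrency without queueing."""
    return max(1, math.ceil(peak_concurrency / per_cluster))

print(clusters_needed(10))   # 2 clusters
print(clusters_needed(35))   # 5 clusters
```

This is exactly the "5 seconds for one user, 30 seconds for ten" failure mode: ten users on a single cluster sized for eight means queueing, and the fix is a second cluster, not a bigger one.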
Pitfall 2: Keeping the Same Data Model
The Monstrosity Fact Table — a fact table with 47 dimensions and 120 measures — does not perform better in the cloud. It performs worse, because cloud warehouses charge for the bytes they scan (BigQuery on-demand) or for the compute time burned scanning them (Snowflake).
Fix: Split monster fact tables into focused fact tables per business process. Use dbt to manage the refactoring.
Pitfall 3: Missing Cost Visibility
The biggest surprise teams face: their first Snowflake bill is 3X the Teradata maintenance cost. Not because Snowflake is expensive, but because nobody suspended warehouses after loading.
Fix: Implement resource monitors day one. Set AUTO_SUSPEND = 60 (seconds) on all warehouses. Educate every user that a running warehouse costs money.
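The auto-suspend effect is easy to quantify with a toy model. The 16-credits-per-hour figure is typical of an X-Large warehouse, but treat both numbers as assumptions to replace with your own contract rates:

```python
# Compare a warehouse left running 24/7 against one that auto-suspends
# outside a 2-hour daily load window (assumed credit rate).

CREDITS_PER_HOUR = 16   # assumed rate for an X-Large warehouse

def monthly_credits(active_hours_per_day: float, days: int = 30) -> float:
    """Credits consumed for a given daily active window."""
    return active_hours_per_day * days * CREDITS_PER_HOUR

always_on = monthly_credits(24)   # warehouse never suspends
suspended = monthly_credits(2)    # suspends outside the load window
print(f"Savings: {1 - suspended / always_on:.0%}")
```

For a warehouse that only does batch loading, auto-suspend recovers roughly nine-tenths of the bill — which is why the surprise first invoice is almost always an idle-warehouse problem, not a pricing problem.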
Pitfall 4: Not Validating Semi-Structured Data
JSON columns that worked in Oracle’s XMLDB or MongoDB may not translate cleanly to Snowflake’s VARIANT or BigQuery’s RECORD type. Deeply nested JSON arrays with varying schemas cause silent data loss or query failures.
-- Snowflake: flatten nested JSON correctly
SELECT
event_id,
event_timestamp,
payload:user:id::STRING AS user_id,
payload:metadata:page::STRING AS page,
item.value:product_id::STRING AS product_id,
item.value:price::FLOAT AS price
FROM raw_events,
LATERAL FLATTEN(input => payload:items) AS item;
-- Without LATERAL FLATTEN, array items inside JSON are invisible to queries
Fix: Test with your actual JSON structure during the assessment phase. Create a test table, load 1 million records, and validate every field is accessible.
Pitfall 5: Underestimating Reverse ETL
Data warehouses traditionally served BI tools. Modern architectures require reverse ETL — pushing transformed data back into operational systems (CRMs, marketing platforms). Cloud warehouses support this natively through Snowflake Tasks, BigQuery procedures, or dedicated reverse ETL tools (Hightouch, Census).
-- Snowflake: push aggregated metrics back to Salesforce via task
-- (write_to_salesforce_api is a hypothetical stored procedure wrapping
-- an external function or API integration)
CREATE TASK push_customer_metrics_to_salesforce
WAREHOUSE = etl_wh
SCHEDULE = '1 HOUR'
AS
CALL write_to_salesforce_api(
object_type => 'Account',
data_query => $$
SELECT
customer_id AS External_Id__c,
total_revenue,
lifetime_orders
FROM analytics.marts.customer_kpi
WHERE last_updated >= DATEADD(hour, -1, CURRENT_TIMESTAMP())
$$
);
Fix: Include reverse ETL in your architecture from the start. It is not an afterthought — it is a core integration pattern.
Pitfall 6: Neglecting Data Governance
Cloud warehouses make it trivially easy to share data. That is a feature and a risk. Without proper access controls, sensitive data can leak through data shares, zero-copy clones, or even query result caching.
-- Snowflake: implement row-level security
-- (current_user_region() is an illustrative helper — in practice, map
-- CURRENT_ROLE() or a session context to a region via a mapping table,
-- or use a row access policy)
CREATE SECURE VIEW customer_sales AS
SELECT *
FROM fact_sales
WHERE customer_id IN (
    SELECT customer_id
    FROM dim_customer
    WHERE region = current_user_region()
);
-- Mask sensitive columns
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
CASE
WHEN CURRENT_ROLE() IN ('DATA_STEWARD', 'ADMIN') THEN val
ELSE CONCAT(LEFT(val, 2), '****@', SPLIT_PART(val, '@', -1))
END;
ALTER TABLE dim_customer
MODIFY COLUMN email
SET MASKING POLICY email_mask;
Conclusion
Data warehouse modernization is not a technology problem — it is a migration problem. The technology choices are clear: Snowflake, BigQuery, Redshift, and Databricks all deliver storage-compute separation, elastic scaling, and cloud-native economics. The hard part is the transition — inventorying workloads, managing parallel runs, validating data fidelity, and retraining teams.
Start with a thorough assessment. Choose lift-and-shift for the bulk of your tables to move quickly. Re-architect the 20% of workloads that drive 80% of the value. Implement cost controls on day one, not after the first surprise bill. Validate every row of data during migration. And build governance into the architecture from the start — retrofitting access controls is far harder than designing them in.
Organizations that execute this transition methodically reduce total cost of ownership by 30-50%, eliminate capacity planning cycles, and deliver analytics insights in minutes instead of hours. The effort is substantial. The payoff is transformative.
Resources
- Snowflake Modernization Guide
- Google BigQuery Migration Overview
- Amazon Redshift Migration Playbook
- Databricks Lakehouse Migration
- dbt Labs: Data Modeling for Cloud Warehouses
- Data Vault 2.0 Official Site
- Kimball Group Dimensional Modeling Techniques
- Delta Sharing Protocol Specification
- Cost Optimization for Cloud Data Warehouses (AWS)