Introduction
Data warehouse costs can spiral quickly. With petabytes of data and millions of queries, optimization is essential for controlling cloud spend while maintaining performance.
Key Statistics:
- Average data warehouse waste: roughly 40% of spend
- Proper optimization typically reduces costs by 40-60%
- Storage tiering alone can save up to 70% on historical data
- Query optimization can improve performance by up to 10x while also reducing costs
Cost Optimization Strategies
Data Warehouse Cost Layers
├── Compute Costs
│   ├── Query optimization
│   ├── Warehouse sizing and scaling
│   ├── Concurrency management
│   └── Result caching
├── Storage Costs
│   ├── Data compression
│   ├── Tiered storage
│   ├── Partitioning and clustering
│   └── Data lifecycle management
└── Egress Costs
    ├── Result caching
    ├── Federated queries
    └── Data sharing
Snowflake Optimization
Compute Optimization
-- Use result cache (automatic)
-- Snowflake caches results for 24 hours
-- Query below uses cached results if available
SELECT * FROM sales WHERE date >= '2026-01-01';
-- Use materialized views for frequent aggregations
CREATE MATERIALIZED VIEW monthly_sales_mv AS
SELECT
    DATE_TRUNC('month', sale_date) AS month,
    SUM(amount) AS total_sales,
    COUNT(*) AS transaction_count
FROM sales
GROUP BY DATE_TRUNC('month', sale_date);
-- Right-size the warehouse and suspend it when idle
ALTER WAREHOUSE compute_wh SET
    WAREHOUSE_SIZE = 'LARGE'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE;
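Auto-suspend limits idle spend but does not cap runaway usage. A resource monitor can notify and then suspend warehouses once a credit threshold is hit; the sketch below is illustrative (the 500-credit quota and monitor name are assumptions) and requires a role with the appropriate privileges, typically ACCOUNTADMIN.
-- Cap monthly credit consumption with a resource monitor
-- (quota value and monitor name are illustrative)
CREATE RESOURCE MONITOR monthly_credit_cap WITH
  CREDIT_QUOTA = 500
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS
    ON 80 PERCENT DO NOTIFY      -- Warn account admins
    ON 100 PERCENT DO SUSPEND;   -- Let running queries finish, then suspend
-- Attach the monitor to the warehouse from the example above
ALTER WAREHOUSE compute_wh SET RESOURCE_MONITOR = monthly_credit_cap;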
Storage Optimization
-- Column compression
-- Snowflake automatically compresses
-- Monitor compression ratio
SELECT
    table_name,
    ROUND(bytes / 1024 / 1024 / 1024, 2) AS size_gb,
    ROUND(bytes_not_compressed / NULLIF(bytes, 0), 2) AS compression_ratio
FROM information_schema.tables
WHERE table_schema = 'ANALYTICS';
-- Time travel optimization
-- Reduce time travel for old data
ALTER TABLE sales SET DATA_RETENTION_TIME_IN_DAYS = 1;
-- Use transient tables for intermediate data
CREATE TRANSIENT TABLE staging_data AS
SELECT * FROM raw_data;
-- Clone for dev without extra storage
CREATE TABLE dev_sales CLONE production_sales;
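To see where these storage savings actually land, the active, Time Travel, and Fail-safe footprint of each table can be pulled from ACCOUNT_USAGE. This is a sketch: it assumes access to the shared SNOWFLAKE database, and the view lags real time by a couple of hours.
-- Storage breakdown per table: active vs. Time Travel vs. Fail-safe bytes
SELECT
    table_schema,
    table_name,
    ROUND(active_bytes / POWER(1024, 3), 2)      AS active_gb,
    ROUND(time_travel_bytes / POWER(1024, 3), 2) AS time_travel_gb,
    ROUND(failsafe_bytes / POWER(1024, 3), 2)    AS failsafe_gb
FROM snowflake.account_usage.table_storage_metrics
WHERE deleted = FALSE
ORDER BY active_bytes DESC
LIMIT 20;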
Query Optimization
-- Good: Filter early, on both tables
SELECT
    u.name,
    o.order_date,
    o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.order_date >= '2026-01-01'  -- Prunes orders before the join
  AND u.status = 'active';          -- Prunes users before the join
-- Bad: No filter on users, so every user row participates in the join
SELECT
    u.name,
    o.order_date,
    o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.order_date >= '2026-01-01';
-- Use approximate functions for large data
SELECT
    APPROX_COUNT_DISTINCT(user_id) AS unique_users,
    APPROX_PERCENTILE(amount, 0.95) AS p95_amount
FROM events;
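Before tuning individual queries, it helps to know which ones are worth the effort. A rough way to surface the heaviest recent queries, assuming ACCOUNT_USAGE access:
-- Heaviest queries over the last 7 days by bytes scanned
SELECT
    query_id,
    warehouse_name,
    total_elapsed_time / 1000 AS elapsed_seconds,
    bytes_scanned,
    LEFT(query_text, 100) AS query_preview
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
  AND execution_status = 'SUCCESS'
ORDER BY bytes_scanned DESC
LIMIT 25;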
BigQuery Optimization
Cost Controls
-- Estimate before running: the console's query validator shows bytes to be processed
-- Note: LIMIT caps returned rows but does NOT reduce bytes scanned or billed
-- Select only the columns you need instead of SELECT *
SELECT event_type, user_id   -- example columns; project only what you need
FROM large_table
LIMIT 1000;
-- For a hard per-query cap, set the maximum bytes billed limit on the job
-- Prune partitions by filtering on the partition date
-- (_PARTITIONDATE applies to ingestion-time partitioned tables)
SELECT * FROM events
WHERE _PARTITIONDATE = '2026-01-18';
-- Use clustering for common filters
CREATE TABLE events_clustered
PARTITION BY DATE(event_time)
CLUSTER BY event_type, user_id
AS SELECT * FROM events;
Cost Monitoring
-- Query job costs
SELECT
    job_id,
    query,
    total_bytes_processed,
    total_bytes_billed,
    creation_time,
    state
FROM `region-us.INFORMATION_SCHEMA.JOBS`
WHERE creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
                        AND CURRENT_TIMESTAMP()
ORDER BY total_bytes_processed DESC
LIMIT 100;
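Bytes billed can be turned into an approximate dollar figure for on-demand workloads. This is a sketch: the $6.25 per TiB rate is the published US on-demand price at the time of writing and should be checked against current pricing.
-- Approximate on-demand spend per user over the last 30 days
SELECT
    user_email,
    ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2) AS tib_billed,
    ROUND(SUM(total_bytes_billed) / POW(1024, 4) * 6.25, 2) AS approx_cost_usd  -- Verify current per-TiB price
FROM `region-us.INFORMATION_SCHEMA.JOBS`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
GROUP BY user_email
ORDER BY approx_cost_usd DESC;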
Slot Management
# Reservation configuration
resource "google_bigquery_reservation" "reservation" {
  project           = "my-project"
  location          = "US"
  name              = "default-reservation"
  slot_capacity     = 100
  ignore_idle_slots = false
}

# Assign the reservation to a project for query jobs
resource "google_bigquery_reservation_assignment" "assignment" {
  assignee    = "projects/my-project"
  job_type    = "QUERY"
  reservation = google_bigquery_reservation.reservation.id
}
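To size slot_capacity, a rough baseline is the average slot usage over a recent window; total_slot_ms in INFORMATION_SCHEMA.JOBS gives a workable estimate. A sketch, assuming the same region-us project as above:
-- Rough average slot usage over the last 7 days
SELECT
    ROUND(SUM(total_slot_ms) / (1000 * 60 * 60 * 24 * 7), 1) AS avg_slots_used
FROM `region-us.INFORMATION_SCHEMA.JOBS`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY';
If the average sits well below the committed capacity, the reservation is a candidate for downsizing; sustained usage near the cap suggests queueing and a case for more slots.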
Redshift Optimization
Distribution Keys
-- Choose the distribution key wisely
CREATE TABLE sales (
    sale_id   BIGINT,
    user_id   BIGINT,
    amount    DECIMAL(10,2),
    sale_date DATE
)
DISTKEY(user_id)      -- Join key
SORTKEY(sale_date);   -- Common filter
-- Or let Redshift manage distribution automatically
CREATE TABLE sales_auto (
    sale_id BIGINT,
    user_id BIGINT,
    amount  DECIMAL(10,2)
)
DISTSTYLE AUTO;
Compression
-- Analyze compression
ANALYZE COMPRESSION sales;
-- Apply compression encoding
CREATE TABLE sales_compressed (
    sale_id   BIGINT ENCODE RAW,
    user_id   BIGINT ENCODE ZSTD,
    amount    DECIMAL(10,2) ENCODE ZSTD,
    sale_date DATE ENCODE AZ64
)
DISTKEY(user_id)
SORTKEY(sale_date);
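Whether the distribution and encoding choices are paying off can be checked in SVV_TABLE_INFO, which reports row skew, encoding coverage, and the unsorted percentage per table:
-- Spot tables with heavy skew, missing encodings, or large unsorted regions
SELECT
    "table",
    diststyle,
    encoded,
    skew_rows,
    unsorted,
    pct_used
FROM svv_table_info
ORDER BY skew_rows DESC NULLS LAST
LIMIT 20;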
Cost Monitoring Dashboard
#!/usr/bin/env python3
"""Data warehouse cost monitoring."""
import boto3
from datetime import datetime, timedelta


class DataWarehouseCostMonitor:
    """Monitor and alert on data warehouse costs."""

    def __init__(self):
        self.ce = boto3.client('ce')
        self.redshift = boto3.client('redshift')

    def get_daily_costs(self, days=30):
        """Get daily cost breakdown grouped by service."""
        end = datetime.now()
        start = end - timedelta(days=days)
        response = self.ce.get_cost_and_usage(
            TimePeriod={
                'Start': start.strftime('%Y-%m-%d'),
                'End': end.strftime('%Y-%m-%d')
            },
            Granularity='DAILY',
            Metrics=['UnblendedCost'],
            GroupBy=[
                {'Type': 'DIMENSION', 'Key': 'SERVICE'}
            ]
        )
        return response['ResultsByTime']

    def get_query_costs(self):
        """Get the configured Redshift usage limits."""
        response = self.redshift.describe_usage_limits()
        return response['UsageLimits']

    def generate_report(self):
        """Generate a cost optimization report."""
        costs = self.get_daily_costs()
        # With GroupBy set, per-day amounts live in 'Groups', not 'Total'
        total_cost = sum(
            float(group['Metrics']['UnblendedCost']['Amount'])
            for day in costs
            for group in day.get('Groups', [])
        )
        report = {
            'period': f"{costs[0]['TimePeriod']['Start']} to {costs[-1]['TimePeriod']['End']}",
            'total_cost': total_cost,
            'daily_average': total_cost / len(costs) if costs else 0,
            'recommendations': []
        }
        # Add recommendations
        if report['daily_average'] > 100:
            report['recommendations'].append(
                "Consider using on-demand pricing or scaling down warehouses"
            )
        return report
Best Practices Summary
| Strategy | Savings | Effort |
|---|---|---|
| Result Caching | 30-50% | None |
| Storage Tiering | 50-70% | Medium |
| Query Optimization | 20-40% | Medium |
| Warehouse Right-sizing | 30-50% | Low |
| Reservation/Commit | 20-40% | Low |
| Partitioning | 40-60% | Medium |