Data Warehouse Cost Optimization: Storage, Compute, Scaling

Introduction

Data warehouse costs can spiral quickly. With petabytes of data and millions of queries, optimization is essential for controlling cloud spend while maintaining performance.

Key statistics (commonly cited industry figures; your numbers will vary):

  • Wasted spend in a typical data warehouse: around 40%
  • Systematic optimization can reduce costs by 40-60%
  • Storage tiering alone can save up to 70% on historical data
  • Query optimization can speed queries up to 10x while reducing compute costs

Cost Optimization Strategies

┌──────────────────────────────────────────────────────────────┐
│                  Data Warehouse Cost Layers                  │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │                     Compute Costs                      │  │
│  │  ├── Query optimization                                │  │
│  │  ├── Warehouse size scaling                            │  │
│  │  ├── Concurrency management                            │  │
│  │  └── Result caching                                    │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │                     Storage Costs                      │  │
│  │  ├── Data compression                                  │  │
│  │  ├── Tiered storage                                    │  │
│  │  ├── Partitioning and clustering                       │  │
│  │  └── Data lifecycle management                         │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │                     Egress Costs                       │  │
│  │  ├── Result caching                                    │  │
│  │  ├── Federated queries                                 │  │
│  │  └── Data sharing                                      │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                              │
└──────────────────────────────────────────────────────────────┘
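
Before tuning anything, measure where the money goes. Here is a minimal sketch against Snowflake's ACCOUNT_USAGE share (the view name is Snowflake's; the 30-day window and the $3-per-credit conversion are placeholder assumptions, so substitute your contract rate):

-- Credits consumed per warehouse over the last 30 days
-- ($3/credit below is a placeholder; use your actual rate)
SELECT 
    warehouse_name,
    ROUND(SUM(credits_used), 1) AS credits,
    ROUND(SUM(credits_used) * 3, 2) AS est_cost_usd
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits DESC;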

Snowflake Optimization

Compute Optimization

-- Use the result cache (automatic)
-- Snowflake serves repeated identical queries from cache for 24 hours
-- The query below reuses cached results when available
SELECT * FROM sales WHERE sale_date >= '2026-01-01';

-- Use materialized views for frequent aggregations
CREATE MATERIALIZED VIEW monthly_sales_mv AS
SELECT 
    DATE_TRUNC('month', sale_date) AS month,
    SUM(amount) AS total_sales,
    COUNT(*) AS transaction_count
FROM sales
GROUP BY DATE_TRUNC('month', sale_date);

-- Use warehouse scaling
ALTER WAREHOUSE compute_wh SET 
    warehouse_size = 'LARGE',
    auto_suspend = 60,
    auto_resume = TRUE;
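
For spiky concurrency, multi-cluster warehouses (an Enterprise Edition feature) add clusters only while queries queue; a sketch, assuming the same compute_wh warehouse:

-- Scale out under load, scale back in when queues drain
-- ECONOMY favors conserving credits over minimizing queue time
ALTER WAREHOUSE compute_wh SET
    min_cluster_count = 1,
    max_cluster_count = 3,
    scaling_policy = 'ECONOMY';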

Storage Optimization

-- Column compression
-- Snowflake compresses data automatically; BYTES already reports
-- the compressed (billable) size, so monitor table growth directly
SELECT 
    table_name,
    ROUND(bytes / POWER(1024, 3), 2) AS size_gb
FROM information_schema.tables
WHERE table_schema = 'ANALYTICS'
ORDER BY bytes DESC;

-- Time travel optimization
-- Reduce retention for tables that don't need long recovery windows
ALTER TABLE sales SET DATA_RETENTION_TIME_IN_DAYS = 1;

-- Use transient tables for intermediate data
-- (no fail-safe period, so lower storage cost)
CREATE TRANSIENT TABLE staging_data AS
SELECT * FROM raw_data;

-- Zero-copy clone for dev; storage is shared until the data diverges
CREATE TABLE dev_sales CLONE production_sales;
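
For full lifecycle management, cold rows can be unloaded to cheap object storage and purged from the warehouse. A sketch, assuming an external stage named @archive_stage already exists:

-- Unload cold data as Parquet, then delete it from the table
COPY INTO @archive_stage/sales_2020/
FROM (SELECT * FROM sales WHERE sale_date < '2021-01-01')
FILE_FORMAT = (TYPE = PARQUET);

DELETE FROM sales WHERE sale_date < '2021-01-01';

The archived files stay queryable through an external table if occasional access is still needed.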

Query Optimization

-- Good: Filter early so pruning limits the scan
SELECT 
    u.name,
    o.order_date,
    o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.order_date >= '2026-01-01'  -- Prunes partitions before the join
  AND u.status = 'active';

-- Bad: Join everything and filter downstream (or in the client)
SELECT 
    u.name,
    o.order_date,
    o.total
FROM users u
JOIN orders o ON u.id = o.user_id;  -- Full scan of both tables

-- Use approximate functions for large data
SELECT 
    APPROX_COUNT_DISTINCT(user_id) AS unique_users,
    APPROX_PERCENTILE(amount, 0.95) AS p95_amount
FROM events;
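
To target optimization effort, rank queries by what they actually scan. A sketch against ACCOUNT_USAGE.QUERY_HISTORY (the view and columns are Snowflake's; the 7-day window is an assumption):

-- Heaviest queries by bytes scanned over the last 7 days
SELECT 
    query_id,
    LEFT(query_text, 80) AS query_preview,
    warehouse_name,
    total_elapsed_time / 1000 AS elapsed_s,
    ROUND(bytes_scanned / POWER(1024, 3), 2) AS gb_scanned
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
ORDER BY bytes_scanned DESC
LIMIT 20;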

BigQuery Optimization

Cost Controls

-- Estimate before running: the console's query validator
-- (or a dry run) shows the bytes a query will process

-- Note: LIMIT does NOT reduce bytes billed in BigQuery;
-- on-demand cost is driven by the columns scanned, so
-- select only the columns you need instead of SELECT *
SELECT user_id, event_type  -- columns shown are illustrative
FROM large_table;

-- Prune partitions by date (ingestion-time partitioned tables
-- expose the _PARTITIONDATE pseudo-column)
SELECT * FROM events
WHERE _PARTITIONDATE = '2026-01-18';

-- Use clustering for common filters
CREATE TABLE events_clustered
PARTITION BY DATE(event_time)
CLUSTER BY event_type, user_id
AS SELECT * FROM events;
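
Once partitioned and clustered, queries that filter on those columns scan only the matching blocks. A sketch of the query shape that benefits (the 'purchase' value is illustrative):

-- Partition filter plus clustered-column filter = minimal scan
SELECT user_id, COUNT(*) AS purchases
FROM events_clustered
WHERE DATE(event_time) BETWEEN '2026-01-01' AND '2026-01-07'
  AND event_type = 'purchase'
GROUP BY user_id;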

Cost Monitoring

-- Query job costs
SELECT 
    job_id,
    query,
    total_bytes_processed,
    total_bytes_billed,
    creation_time,
    state
FROM `region-us.INFORMATION_SCHEMA.JOBS`
WHERE creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY) AND CURRENT_TIMESTAMP()
ORDER BY total_bytes_processed DESC
LIMIT 100;
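
The same view supports rough spend attribution. A sketch, assuming the US on-demand rate (about $6.25 per TiB at the time of writing; check current pricing):

-- Approximate on-demand spend per user over 30 days
SELECT 
    user_email,
    ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2) AS tib_billed,
    ROUND(SUM(total_bytes_billed) / POW(1024, 4) * 6.25, 2) AS est_cost_usd
FROM `region-us.INFORMATION_SCHEMA.JOBS`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
GROUP BY user_email
ORDER BY est_cost_usd DESC;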

Slot Management

# Reservation configuration
resource "google_bigquery_reservation" "reservation" {
  project           = "my-project"
  location          = "US"
  name              = "default-reservation"
  slot_capacity     = 100
  ignore_idle_slots = false  # false lets queries borrow idle slots from other reservations
}

# Assign the project's query jobs to the reservation
resource "google_bigquery_reservation_assignment" "assignment" {
  assignee    = "projects/my-project"
  job_type    = "QUERY"
  reservation = google_bigquery_reservation.reservation.id
}

Redshift Optimization

Distribution Keys

-- Choose the distribution key wisely
CREATE TABLE sales (
    sale_id BIGINT,
    user_id BIGINT,
    amount DECIMAL(10,2),
    sale_date DATE
)
DISTKEY(user_id)  -- Common join key
SORTKEY(sale_date);  -- Common filter

-- Or let Redshift manage distribution automatically
CREATE TABLE sales_auto (
    sale_id BIGINT,
    user_id BIGINT,
    amount DECIMAL(10,2)
)
DISTSTYLE AUTO;
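
To verify the choice, check for distribution skew with Redshift's SVV_TABLE_INFO system view; a sketch:

-- Tables with the worst row skew across slices
SELECT "table", diststyle, skew_rows, unsorted
FROM svv_table_info
ORDER BY skew_rows DESC
LIMIT 20;

High skew_rows values mean a few slices hold most of the rows, so those slices bottleneck every query that touches the table.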

Compression

-- Analyze compression
ANALYZE COMPRESSION sales;

-- Apply compression encoding
CREATE TABLE sales_compressed (
    sale_id BIGINT ENCODE RAW,
    user_id BIGINT ENCODE ZSTD,
    amount DECIMAL(10,2) ENCODE ZSTD,
    sale_date DATE ENCODE AZ64
)
DISTKEY(user_id)
SORTKEY(sale_date);
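
To confirm which encodings are actually in effect, inspect PG_TABLE_DEF (the table must be on your search_path); a sketch:

-- Column encodings currently applied
SELECT "column", type, encoding
FROM pg_table_def
WHERE tablename = 'sales_compressed';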

Cost Monitoring Dashboard

#!/usr/bin/env python3
"""Data warehouse cost monitoring."""

import boto3
from datetime import datetime, timedelta

class DataWarehouseCostMonitor:
    """Monitor and alert on data warehouse costs."""
    
    def __init__(self):
        # Cost Explorer is served from us-east-1
        self.ce = boto3.client('ce', region_name='us-east-1')
        self.redshift = boto3.client('redshift')
    
    def get_daily_costs(self, days=30):
        """Get daily cost breakdown."""
        
        end = datetime.now()
        start = end - timedelta(days=days)
        
        response = self.ce.get_cost_and_usage(
            TimePeriod={
                'Start': start.strftime('%Y-%m-%d'),
                'End': end.strftime('%Y-%m-%d')
            },
            Granularity='DAILY',
            Metrics=['UnblendedCost'],
            GroupBy=[
                {'Type': 'DIMENSION', 'Key': 'SERVICE'}
            ]
        )
        
        return response['ResultsByTime']
    
    def get_query_costs(self):
        """Get Redshift usage limits (cost-control guardrails)."""
        
        response = self.redshift.describe_usage_limits()
        
        return response['UsageLimits']
    
    def generate_report(self):
        """Generate cost optimization report."""
        
        costs = self.get_daily_costs()
        
        # With GroupBy set, Cost Explorer returns per-day totals
        # under 'Groups'; the 'Total' field is left empty
        total_cost = sum(
            float(group['Metrics']['UnblendedCost']['Amount'])
            for day in costs
            for group in day['Groups']
        )
        
        report = {
            'period': f"{costs[0]['TimePeriod']['Start']} to {costs[-1]['TimePeriod']['End']}",
            'total_cost': total_cost,
            'daily_average': total_cost / len(costs) if costs else 0,
            'recommendations': []
        }
        
        # Add recommendations
        if report['daily_average'] > 100:
            report['recommendations'].append(
                "Consider reserved capacity or scaling down warehouses"
            )
        
        return report

Best Practices Summary

Strategy                 Savings   Effort
Result Caching           30-50%    None
Storage Tiering          50-70%    Medium
Query Optimization       20-40%    Medium
Warehouse Right-sizing   30-50%    Low
Reservation/Commit       20-40%    Low
Partitioning             40-60%    Medium
