Spot Instances: Fault-Tolerant, 80% Cheaper Architecture

Introduction

Spot Instances are AWS’s ultra-discounted compute offering: spare EC2 capacity at a 70-90% discount compared with On-Demand pricing. The tradeoff: AWS can reclaim them with a 2-minute notice.

For fault-tolerant workloads, Spot Instances are the lowest-cost EC2 compute option available. This guide shows you how to architect for Spot, apply best practices, and estimate what you can save.

Spot Pricing Fundamentals

How Spot Pricing Works

AWS has spare compute capacity for each instance type in each Availability Zone. Rather than leave it idle, AWS sells it at steep discounts.

On-Demand price: $0.0965/hour
Spot price (typical): $0.0289/hour (70% discount)
Spot price (variable): $0.0200-$0.0400/hour
Savings: $0.0676/hour = $592/year per instance (8,760 hours)
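
The relationship between the hourly prices and the annual figures can be checked with a few lines of Python. This is a minimal sketch that assumes one instance running all 8,760 hours of a year at the typical Spot price; the function name `annual_comparison` is illustrative and not part of any AWS tool.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760; assumes the instance never stops


def annual_comparison(on_demand_hourly: float, spot_hourly: float) -> dict:
    """Return the discount and annual costs for one instance running all year."""
    hourly_savings = on_demand_hourly - spot_hourly
    return {
        "discount_pct": hourly_savings / on_demand_hourly * 100,
        "on_demand_annual": on_demand_hourly * HOURS_PER_YEAR,
        "spot_annual": spot_hourly * HOURS_PER_YEAR,
        "annual_savings": hourly_savings * HOURS_PER_YEAR,
    }


result = annual_comparison(on_demand_hourly=0.0965, spot_hourly=0.0289)
for name, value in result.items():
    print(f"{name}: {value:,.2f}")
```

Because the Spot price moves, the real annual figure drifts with it; rerunning the function with the low and high ends of the variable range gives a rough band.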

Spot vs On-Demand: Full Comparison

Feature                 On-Demand      Spot            Savings
──────────────────────────────────────────────────────────────
Hourly rate             $0.0965        $0.0289         70%
Cost predictability     High           Variable        N/A
Availability            99.95%         70-95%          N/A
Interruption notice     N/A            2 minutes       N/A
Suitable for            All workloads  Fault-tolerant  N/A

Annual cost per instance:
On-Demand: $845
Spot:      $253
Savings:   $592

Spot Instance Interruptions

Why Spot Gets Interrupted

AWS balances supply and demand:

- Spare capacity in a pool runs low
- On-Demand demand increases
- Spot instances are reclaimed to serve that demand
- You get a 2-minute termination notice

Interruption Rates by Instance Type

General Purpose (t3, m5):
- Interruption rate: 3-5%
- Mean time to interruption: 600+ hours

Compute Optimized (c5, c6):
- Interruption rate: 2-4%
- Mean time to interruption: 1,000+ hours

Memory Optimized (r5, x1):
- Interruption rate: 4-6%
- Mean time to interruption: 500+ hours

GPU (p3, g4):
- Interruption rate: 5-10%
- Mean time to interruption: 200+ hours

Risk Calculation

Example: 100 instances running daily

With a 3% daily interruption rate:
- Expected instances interrupted per day: 3 (100 × 0.03)
- Advance notice: 2 minutes
- Manual recovery: ~5 minutes
- Automated recovery: <1 minute

This is manageable with a fault-tolerant design.
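
The expected-value estimate above can be extended slightly. The sketch below assumes the rate is the chance that any one instance is interrupted on a given day and that interruptions are independent (a simplification, since capacity shortfalls tend to hit a whole pool at once). The function names are illustrative.

```python
from math import comb


def expected_daily_interruptions(instances: int, daily_rate: float) -> float:
    """Average number of instances interrupted per day."""
    return instances * daily_rate


def mean_hours_to_interruption(daily_rate: float) -> float:
    """Average running time of one instance before it is interrupted."""
    return 24 / daily_rate


def prob_at_least(instances: int, daily_rate: float, k: int) -> float:
    """Probability of k or more interruptions in one day (binomial model)."""
    p_fewer = sum(
        comb(instances, i) * daily_rate**i * (1 - daily_rate) ** (instances - i)
        for i in range(k)
    )
    return 1 - p_fewer


print(expected_daily_interruptions(100, 0.03))
print(mean_hours_to_interruption(0.03))
print(f"{prob_at_least(100, 0.03, 1):.3f}")
```

Under these assumptions a 100-instance fleet sees at least one interruption on most days, which is why automated replacement matters more than the low per-instance rate suggests.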

Fault-Tolerant Architecture Patterns

Pattern 1: Auto Scaling Group with Mixed Instances

Architecture:

Mix of On-Demand (20%) and Spot (80%) instances
Auto-scaling handles failures

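# Terraform configuration
resource "aws_autoscaling_group" "web" {
  name = "web-asg"

  # Subnets to launch into; var.subnet_ids is a placeholder variable
  vpc_zone_identifier = var.subnet_ids

  mixed_instances_policy {
    instances_distribution {
      # 20% On-Demand, 80% Spot (2 + 8 at the desired capacity of 10)
      on_demand_base_capacity                  = 0
      on_demand_percentage_above_base_capacity = 20
      # spot_instance_pools only applies to the lowest-price strategy
      spot_allocation_strategy                 = "lowest-price"
      spot_instance_pools                      = 4
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.web.id
        version            = "$Latest"
      }

      # Diversify instance types to reduce interruption risk
      # (each instance type is its own override block)
      override { instance_type = "t3.large" }
      override { instance_type = "t3a.large" }
      override { instance_type = "m5.large" }
      override { instance_type = "m5a.large" }
    }
  }

  min_size         = 4
  max_size         = 20
  desired_capacity = 10
}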

Cost Calculation:

10 instances:
- 2 on-demand: 2 × $0.0965 × 730 = $141/month
- 8 spot: 8 × $0.0289 × 730 = $169/month
Total: $310/month

Equivalent on-demand:
- 10 × $0.0965 × 730 = $705/month

Savings: $395/month (56%)
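
The settings `on_demand_base_capacity` and `on_demand_percentage_above_base_capacity` decide how many instances in the group are On-Demand: the base is filled first, then the percentage applies to whatever capacity remains. The sketch below models that split and the resulting monthly cost. It assumes a fractional On-Demand share rounds up, which should be confirmed against the Auto Scaling documentation; the function names are illustrative.

```python
HOURS_PER_MONTH = 730


def split_capacity(total: int, on_demand_base: int, on_demand_pct_above_base: int):
    """Return (on_demand_count, spot_count) for a group of `total` instances."""
    base = min(on_demand_base, total)
    remaining = total - base
    # Ceiling division: assumes a fractional On-Demand share rounds up.
    extra = -(-remaining * on_demand_pct_above_base // 100)
    on_demand = base + extra
    return on_demand, total - on_demand


def monthly_cost(on_demand: int, spot: int, od_price: float, spot_price: float) -> float:
    """Monthly cost of the mix at fixed hourly prices."""
    return (on_demand * od_price + spot * spot_price) * HOURS_PER_MONTH


for base, pct in [(0, 20), (2, 20)]:
    od, spot = split_capacity(10, base, pct)
    cost = monthly_cost(od, spot, od_price=0.0965, spot_price=0.0289)
    print(f"base={base} pct={pct}: {od} on-demand, {spot} spot, ${cost:,.0f}/month")
```

Combining a non-zero base with a percentage yields more On-Demand instances than the percentage alone suggests, so it is worth running the numbers for the group sizes you expect.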

Pattern 2: Stateless Services with Load Balancing

Architecture:

Stateless containerized services
Load balancer distributes traffic
Lost connections are automatically re-routed

Client Request
     ↓
Load Balancer
  ↙    ↓    ↘
Spot  Spot  On-Demand
Instance interruption?
     ↓
Load Balancer removes from pool
Auto Scaling launches replacement

Pattern 3: Batch Processing with Spot Fleet

Architecture:

Break work into small tasks
Use Spot Fleet to scale wide
Tolerate individual instance loss

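# AWS Batch with Spot (request shapes for the boto3 Batch client)
# If an instance is interrupted, the retry strategy reruns the job on another instance

job_definition = {
    'jobDefinitionName': 'batch-process',
    'type': 'container',
    'containerProperties': {
        'image': 'my-batch-app:latest',
        'vcpus': 1,
        'memory': 2048,
    },
    'retryStrategy': {'attempts': 3},  # allow reruns after an interruption
}

# Spot compute environment with 4 instance types.
# Jobs reach it through a job queue, not through the job definition.
compute_env = {
    'computeEnvironmentName': 'spot-compute-env',
    'type': 'MANAGED',
    'computeResources': {
        'type': 'SPOT',
        'instanceTypes': ['t3.large', 't3a.large', 'm5.large', 'm5a.large'],
        'minvCpus': 0,
        'maxvCpus': 100,  # assumed ceiling; must be >= desiredvCpus
        'desiredvCpus': 100,
        # subnets, securityGroupIds and instanceRole are also required
    },
}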
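
Breaking work into small tasks only pays off if a replacement instance can skip what has already finished. The sketch below shows one way to do that with a checkpoint of completed task IDs. It is an illustration under stated assumptions: a local temporary file stands in for durable shared storage (local disk is lost with the instance), and `process_tasks` and `should_stop` are hypothetical names.

```python
import json
import tempfile
from pathlib import Path


def process_tasks(tasks, checkpoint_path: Path, handler, should_stop=lambda: False):
    """Run tasks in order, recording finished IDs so a restart skips them."""
    done = set()
    if checkpoint_path.exists():
        done = set(json.loads(checkpoint_path.read_text()))
    for task_id, payload in tasks:
        if task_id in done:
            continue  # finished before the previous interruption
        if should_stop():
            break  # interruption notice received; leave the rest for a retry
        handler(payload)
        done.add(task_id)
        checkpoint_path.write_text(json.dumps(sorted(done)))
    return done


tasks = [(f"task-{i}", i) for i in range(5)]
results = []
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "done.json"
    # First run: pretend the interruption notice arrives after two tasks.
    process_tasks(tasks, checkpoint, results.append,
                  should_stop=lambda: len(results) >= 2)
    # Replacement instance: resumes from the checkpoint.
    finished = process_tasks(tasks, checkpoint, results.append)
print(results)
```

Each task runs exactly once across both runs here. A task that is interrupted midway would run again on retry, so handlers should be idempotent.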

Spot Instance Best Practices

Practice 1: Instance Type Diversification

Problem: All Spot instances of same type = higher interruption risk

# BAD: All t3.large
instance_type = "t3.large"
# If t3.large Spot price spikes, all instances vulnerable

# GOOD: Mix of instance types
instances = ["t3.large", "t3a.large", "m5.large", "m5a.large"]
# Even if one type becomes scarce, others still available

Practice 2: Capacity Rebalancing

# Modern AWS ASG feature
# Replaces at-risk Spot instances proactively
# Better than waiting for interruption

resource "aws_autoscaling_group" "web" {
  # ... other arguments (sizes, mixed_instances_policy) as in Pattern 1 ...
  capacity_rebalance = true  # Enable proactive replacement
}

Practice 3: Spot Instance Interruption Handling

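# Shut down cleanly when the process is told to stop
import signal
import sys

def handle_interruption(signum, frame):
    print("Termination signal received")
    # Stop accepting new requests
    # Drain connections
    # Exit cleanly
    sys.exit(0)

# SIGTERM arrives when the OS or an orchestrator stops the process;
# the 2-minute Spot notice itself is published through instance metadata
signal.signal(signal.SIGTERM, handle_interruption)

# Also monitor EC2 instance metadata (IMDSv2)
# PUT http://169.254.169.254/latest/api/token
#     with header X-aws-ec2-metadata-token-ttl-seconds
# THEN GET http://169.254.169.254/latest/meta-data/spot/instance-action
#     with header X-aws-ec2-metadata-token (404 = no interruption scheduled)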
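
The metadata check can be sketched as a small polling loop. This is a minimal illustration using only the standard library, assuming IMDSv2 is enabled on the instance. The names `fetch_instance_action`, `parse_instance_action`, and `wait_for_interruption` are hypothetical, and the 5-second poll interval is an assumption to tune for your workload.

```python
import json
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout: float = 2.0):
    """Return the raw instance-action body, or None if nothing is scheduled."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    action_req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(action_req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled
            return None
        raise


def parse_instance_action(body):
    """Turn the JSON body into (action, time), or None when there is no notice."""
    if not body:
        return None
    data = json.loads(body)
    return data["action"], data["time"]


def wait_for_interruption(on_interrupt, fetch=fetch_instance_action,
                          poll_seconds: float = 5.0, sleep=time.sleep):
    """Poll until a notice appears, then call on_interrupt(action, time)."""
    while True:
        notice = parse_instance_action(fetch())
        if notice:
            on_interrupt(*notice)
            return notice
        sleep(poll_seconds)
```

On an instance, `wait_for_interruption` would typically run in a background thread with a callback that starts the drain steps. The `fetch` and `sleep` parameters exist so the loop can be exercised off-instance with stand-ins.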

Practice 4: Max Price Strategy

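Spot max price:
- AWS no longer uses bidding: you pay the current Spot price, up to your max price
- Default max price = On-Demand price
- At the default, interruption happens only if demand exceeds supply
- Maintain service quality

# Tighter cap:
max_price = on_demand_price * 0.9  # Never pay above 90% of On-Demand; interrupted if the Spot price exceeds it

# Looser cap:
max_price = on_demand_price * 0.95  # Lower guaranteed savings, higher availability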

Spot Use Cases

Ideal Use Cases (Great Savings)

Batch Processing
- MapReduce jobs
- Data analysis
- Image processing
- CI/CD builds
- Savings: 80-90%

Development/Testing
- Dev environment instances
- Load testing
- Staging environments
- Savings: 80-90%

Analytics
- EMR clusters
- Spark jobs
- Data warehousing queries
- Savings: 70-80%

Web Applications (with ASG)
- Stateless APIs
- Microservices
- Backend workers
- Savings: 50-70%

Not Suitable Use Cases

Stateful Services
- Databases with persistent state
- Session storage
- Long-running transactions

Interactive Applications
- User-facing applications
- Real-time streaming
- Live collaboration tools

High-Availability Critical
- Payment processing
- Mission-critical systems
- Compliance-sensitive workloads

Real-World Cost Analysis

Case Study 1: Startup SaaS Platform

Original Setup (All On-Demand):

10x t3.large web servers: $705/month
5x t3.large cache servers: $353/month
20x t3.medium batch workers: $292/month
Total: $1,350/month

Optimized (Mixed On-Demand + Spot):

2x t3.large web (on-demand): $141/month
8x t3.large web (spot): $184/month
2x t3.large cache (on-demand): $141/month
3x t3.large cache (spot): $70/month
20x t3.medium batch (spot): $87/month
Total: $623/month

Savings: $727/month (54% reduction)
Annual savings: $8,724

Case Study 2: Data Processing Pipeline

Original (On-Demand):

100x c5.xlarge instances for 10 hours/day, 260 working days/year
Cost: 100 × $0.17 × 10 × 260 = $44,200/year

Spot Fleet (90% Spot):

100x c5.xlarge (90% spot, 10% on-demand)
90x spot @ $0.051/hour
10x on-demand @ $0.17/hour
Cost: (90 × $0.051 + 10 × $0.17) × 10 × 260 = $16,354/year

Savings: $27,846/year (63% reduction)
Monthly savings: about $2,320

Spot Monitoring and Alerting

CloudWatch Metrics for Spot

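# Monitor instance terminations in the Spot-backed Auto Scaling group
import boto3

cloudwatch = boto3.client('cloudwatch')

# GroupTerminatingInstances covers terminations for any reason, not only
# Spot interruptions, and requires group metrics collection on the ASG
cloudwatch.put_metric_alarm(
    AlarmName='high-spot-interruption-rate',
    MetricName='GroupTerminatingInstances',
    Namespace='AWS/AutoScaling',
    Dimensions=[{'Name': 'AutoScalingGroupName', 'Value': 'web-asg'}],
    Statistic='Sum',
    Period=3600,
    EvaluationPeriods=2,
    Threshold=5,  # Alert when the hourly sum exceeds 5 for 2 hours in a row
    ComparisonOperator='GreaterThanThreshold',
)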
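
Termination metrics only show an interruption after the fact. EC2 also emits an EventBridge event with the detail type "EC2 Spot Instance Interruption Warning" when the 2-minute notice is issued, which gives a more direct signal. The sketch below builds the matching event pattern and a local check for it. The rule name and helper names are hypothetical, and attaching a target (SNS, Lambda, and so on) is left out.

```python
import json


def spot_interruption_event_pattern() -> dict:
    """EventBridge pattern that matches Spot interruption warnings."""
    return {
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Spot Instance Interruption Warning"],
    }


def is_spot_interruption_warning(event: dict) -> bool:
    """Local equivalent of the pattern, useful inside a handler or a test."""
    pattern = spot_interruption_event_pattern()
    return (
        event.get("source") in pattern["source"]
        and event.get("detail-type") in pattern["detail-type"]
    )


def create_rule(rule_name: str = "spot-interruption-warning"):
    """Create the rule; needs AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the helpers above work without it

    events = boto3.client("events")
    return events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(spot_interruption_event_pattern()),
        State="ENABLED",
    )


sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "detail": {"instance-id": "i-0123456789abcdef0", "instance-action": "terminate"},
}
print(is_spot_interruption_warning(sample_event))
```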

Spot Instance Advisor

AWS provides Spot Instance Advisor (https://aws.amazon.com/ec2/spot/instance-advisor/) with:

- Interruption frequency by instance type
- Average savings over On-Demand by instance type
- Filters (Region, OS, vCPU, memory) for choosing instance types to diversify across

Spot Pricing Strategies

Strategy 1: Reserved Capacity for Baseline

Workload pattern:
- Baseline: 20 instances always needed
- Peak: 100 instances 8 hours/day
- Off-peak: 5 instances

Recommendation:
- Reserve: 20 instances (on-demand or 1-year RI)
- Spot Fleet: Scale from 20-100 based on demand
- Off-peak: Auto-scale down to 5

Cost:
- Reserved: $1,461/month (20 × $73)
- Spot scaling: $184/month (savings when not in peak)
- Total: $1,645/month

vs. All on-demand (100 instances average):
- $7,048/month
- Savings: $5,403/month (77%)

Strategy 2: Spot + On-Demand Blend

Tiered approach:
- Critical requests: On-demand instances
- Non-critical requests: Spot instances

Load balancer with weighted target groups:
- Target group 1 (on-demand): weight 20%, requests critical
- Target group 2 (spot): weight 80%, requests non-critical

Cost:
- On-demand tier: Handles 20% peak + redundancy
- Spot tier: Handles 80% demand (interrupted instances auto-replace)

Result: 80% of traffic on cheap Spot, with graceful degradation
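
The tiering and the graceful degradation it aims for can be sketched as a routing decision. This is an application-level illustration, not load balancer configuration: it assumes the caller already knows whether a request is critical and how many healthy targets each tier has, and `choose_tier` is a hypothetical name.

```python
def choose_tier(critical: bool, healthy_spot: int, healthy_on_demand: int) -> str:
    """Pick the target group for one request."""
    if healthy_spot == 0 and healthy_on_demand == 0:
        raise RuntimeError("no healthy targets in either tier")
    if critical:
        # Critical traffic stays on On-Demand whenever it is available.
        return "on-demand" if healthy_on_demand else "spot"
    # Non-critical traffic prefers Spot and degrades to On-Demand.
    return "spot" if healthy_spot else "on-demand"


print(choose_tier(critical=True, healthy_spot=8, healthy_on_demand=2))
print(choose_tier(critical=False, healthy_spot=8, healthy_on_demand=2))
print(choose_tier(critical=False, healthy_spot=0, healthy_on_demand=2))
```

The third call shows the degradation path: when every Spot target has been interrupted, non-critical traffic falls back to the On-Demand tier until replacements launch, so that tier needs headroom for it.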

Spot Savings Calculator

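# Illustrative hourly prices (the figures used earlier in this guide);
# replace them with current prices for your region
ON_DEMAND_PRICES = {'t3.large': 0.0965}
SPOT_PRICES = {'t3.large': 0.0289}

def get_on_demand_price(instance_type):
    return ON_DEMAND_PRICES[instance_type]

def get_spot_price(instance_type):
    return SPOT_PRICES[instance_type]

def calculate_spot_savings(
    instance_type,
    quantity,
    hours_per_day,
    days_per_month,
    on_demand_percentage=20
):
    on_demand_price = get_on_demand_price(instance_type)
    spot_price = get_spot_price(instance_type)
    
    # Everything that is not On-Demand runs on Spot, so the counts always sum to quantity
    on_demand_count = round(quantity * on_demand_percentage / 100)
    spot_count = quantity - on_demand_count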
    
    on_demand_monthly = (
        on_demand_count * on_demand_price * 
        hours_per_day * days_per_month
    )
    
    spot_monthly = (
        spot_count * spot_price * 
        hours_per_day * days_per_month
    )
    
    all_on_demand = (
        quantity * on_demand_price * 
        hours_per_day * days_per_month
    )
    
    savings = all_on_demand - (on_demand_monthly + spot_monthly)
    savings_percent = (savings / all_on_demand) * 100
    
    return {
        'current_cost': on_demand_monthly + spot_monthly,
        'on_demand_cost': all_on_demand,
        'monthly_savings': savings,
        'annual_savings': savings * 12,
        'savings_percent': savings_percent
    }

# Example
result = calculate_spot_savings(
    instance_type='t3.large',
    quantity=20,
    hours_per_day=8,
    days_per_month=22
)

# Output (with on-demand at $0.0965/hour and spot at $0.0289/hour):
# Current cost (20% on-demand, 80% spot): $149/month
# On-demand cost: $340/month
# Monthly savings: $190
# Annual savings: $2,284
# Savings %: 56%

Glossary

Spot Instance: AWS excess capacity at steep discount
Interruption: AWS terminating Spot instance to reclaim capacity
Spot Fleet: Set of Spot instances (optionally with On-Demand) launched from one request to meet a target capacity
Capacity Rebalancing: Proactively replacing at-risk Spot instances
On-Demand: Standard hourly pricing without interruption risk
Fault-Tolerant: System handles component failures gracefully
Stateless: Application without persistent local state
