
Spot Instances: Fault-Tolerant, 80% Cheaper Architecture

Introduction

Spot Instances are AWS’s deeply discounted compute offering: spare EC2 capacity sold at a 70-90% discount compared to On-Demand pricing. The tradeoff: AWS can reclaim an instance with only a 2-minute notice.

For fault-tolerant workloads, Spot Instances are the lowest-cost EC2 compute option available. This guide shows you how to architect for Spot, apply best practices, and estimate the savings for your own fleet.


Spot Pricing Fundamentals

How Spot Pricing Works

AWS has spare compute capacity in each availability zone and instance type. Rather than leave it idle, they sell it at steep discounts.

On-Demand price: $0.0965/hour
Spot price (typical): $0.0289/hour (70% discount)
Spot price (variable): $0.0200-$0.0400/hour
Savings: $0.0676/hour = $592/year per instance
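
The arithmetic behind these figures can be checked with a short sketch. It assumes the instance runs for all 8,760 hours of a year and uses the example prices above; the function name is illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, assuming the instance runs all year


def spot_savings(on_demand_hourly: float, spot_hourly: float) -> dict:
    """Compare one instance's On-Demand and Spot cost (illustrative helper)."""
    hourly_savings = on_demand_hourly - spot_hourly
    return {
        "discount_percent": hourly_savings / on_demand_hourly * 100,
        "hourly_savings": hourly_savings,
        "annual_savings": hourly_savings * HOURS_PER_YEAR,
    }


result = spot_savings(on_demand_hourly=0.0965, spot_hourly=0.0289)
print(f"Discount: {result['discount_percent']:.0f}%")
print(f"Hourly savings: ${result['hourly_savings']:.4f}")
print(f"Annual savings: ${result['annual_savings']:.0f}")
```

With the example prices this reports a 70% discount, $0.0676 saved per hour, and about $592 saved per year.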

Spot vs On-Demand: Full Comparison

Feature               On-Demand       Spot             Savings
──────────────────────────────────────────────────────────────
Hourly rate           $0.0965         $0.0289          70%
Cost predictability   High            Variable         N/A
Availability          99.95%          70-95%           N/A
Interruption notice   N/A             2 minutes        N/A
Suitable for          All workloads   Fault-tolerant   N/A

Annual cost per instance:
On-Demand: $845
Spot:      $253
Savings:   $592

Spot Instance Interruptions

Why Spot Gets Interrupted

AWS reclaims Spot capacity when supply tightens:

  1. Spare capacity runs low
  2. On-Demand demand increases
  3. Spot instances are reclaimed to serve higher-revenue requests
  4. You get a 2-minute termination notice

Interruption Rates by Instance Type

General Purpose (t3, m5):
- Interruption rate: 3-5%
- Mean time to interruption: 600+ hours

Compute Optimized (c5, c6):
- Interruption rate: 2-4%
- Mean time to interruption: 1,000+ hours

Memory Optimized (r5, x1):
- Interruption rate: 4-6%
- Mean time to interruption: 500+ hours

GPU (p3, g4):
- Interruption rate: 5-10%
- Mean time to interruption: 200+ hours

Risk Calculation

Example: 100 instances running daily

With a 3% daily interruption rate:
- Expected instances interrupted per day: 3
- Warning before termination: 2 minutes
- Manual recovery: ~5 minutes
- Automated recovery: <1 minute

This is manageable with a fault-tolerant design.
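
The expected-value reasoning above can be sketched in a few lines. This assumes the 3% figure is a per-instance, per-day probability and that interruptions are independent, which real Spot pools only approximate; the function names are illustrative.

```python
def expected_interruptions(instances: int, daily_rate: float) -> float:
    """Expected number of instances interrupted per day."""
    return instances * daily_rate


def prob_at_least_one(instances: int, daily_rate: float) -> float:
    """Probability that at least one instance is interrupted on a given day."""
    return 1 - (1 - daily_rate) ** instances


def mean_hours_to_interruption(daily_rate: float) -> float:
    """Average running time before a single instance is interrupted."""
    return 24 / daily_rate


print(expected_interruptions(100, 0.03))
print(round(prob_at_least_one(100, 0.03), 3))
print(round(mean_hours_to_interruption(0.03)))
```

Under these assumptions a 100-instance fleet sees at least one interruption on roughly 95% of days, so recovery should be automated rather than treated as an exceptional event.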

Fault-Tolerant Architecture Patterns

Pattern 1: Auto Scaling Group with Mixed Instances

Architecture:

  • Mix of On-Demand (20%) and Spot (80%) instances
  • Auto-scaling handles failures

# Terraform configuration
resource "aws_autoscaling_group" "web" {
  name                = "web-asg"
  vpc_zone_identifier = var.subnet_ids # assumed variable: subnets to launch into

  mixed_instances_policy {
    instances_distribution {
      # 20% On-Demand and 80% Spot across the whole group
      # (2 On-Demand + 8 Spot at the desired capacity of 10)
      on_demand_base_capacity                  = 0
      on_demand_percentage_above_base_capacity = 20
      spot_instance_pools                      = 4
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.web.id
        version            = "$Latest"
      }

      # Diversify instance types to reduce interruption risk
      override {
        instance_type = "t3.large"
      }
      override {
        instance_type = "t3a.large"
      }
      override {
        instance_type = "m5.large"
      }
      override {
        instance_type = "m5a.large"
      }
    }
  }

  min_size         = 4
  max_size         = 20
  desired_capacity = 10
}

Cost Calculation:

10 instances:
- 2 on-demand: 2 × $0.0965 × 730 = $141/month
- 8 spot: 8 × $0.0289 × 730 = $169/month
Total: $310/month

Equivalent on-demand:
- 10 × $0.0965 × 730 = $705/month

Savings: $395/month (56%)
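
How the two distribution settings turn into instance counts can be sketched as follows. This is a simplified model, assuming a fractional On-Demand share is rounded up in favor of On-Demand capacity; the function name is illustrative and actual Auto Scaling behavior should be confirmed against the AWS documentation.

```python
def split_capacity(desired: int, on_demand_base: int, on_demand_pct_above_base: int):
    """Return (on_demand, spot) counts for a mixed-instances group.

    Assumption: a fractional On-Demand share is rounded up, in favor of
    On-Demand capacity.
    """
    base = min(on_demand_base, desired)
    above_base = desired - base
    # Integer ceiling of above_base * pct / 100
    extra_on_demand = -(-above_base * on_demand_pct_above_base // 100)
    on_demand = base + extra_on_demand
    return on_demand, desired - on_demand


print(split_capacity(desired=10, on_demand_base=0, on_demand_pct_above_base=20))
print(split_capacity(desired=10, on_demand_base=2, on_demand_pct_above_base=20))
```

In this model, a base of 0 with 20% above base gives the 2 On-Demand and 8 Spot instances used in the cost calculation, while a base of 2 with 20% above base gives 4 On-Demand and 6 Spot.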

Pattern 2: Stateless Services with Load Balancing

Architecture:

  • Stateless containerized services
  • Load balancer distributes traffic
  • Lost connections are automatically re-routed

Client Request
     ↓
Load Balancer
  ↙    ↓    ↘
Spot  Spot  On-Demand

Instance interruption?
     ↓
Load Balancer removes from pool
Auto Scaling launches replacement
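
The flow in the diagram can be illustrated with a toy model. This is a sketch only: the `LoadBalancer` class and the target names are hypothetical stand-ins for a real load balancer and Auto Scaling group, and health checks are reduced to explicit register and deregister calls.

```python
class LoadBalancer:
    """Toy round-robin balancer that only routes to healthy targets."""

    def __init__(self, targets):
        self.healthy = list(targets)
        self._next = 0

    def route(self, request_id):
        if not self.healthy:
            raise RuntimeError("no healthy targets")
        target = self.healthy[self._next % len(self.healthy)]
        self._next += 1
        return target

    def deregister(self, target):
        self.healthy.remove(target)

    def register(self, target):
        self.healthy.append(target)


lb = LoadBalancer(["spot-1", "spot-2", "on-demand-1"])
print([lb.route(i) for i in range(3)])

# Spot interruption: the balancer drops the instance, Auto Scaling adds a new one
lb.deregister("spot-1")
lb.register("spot-3")
print([lb.route(i) for i in range(3)])
```

Because the services hold no local state, a request that would have gone to the interrupted instance can be served by any remaining target.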

Pattern 3: Batch Processing with Spot Fleet

Architecture:

  • Break work into small tasks
  • Use Spot Fleet to scale wide
  • Tolerate individual instance loss
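
# AWS Batch with Spot (request parameters for the Batch API)
# Jobs run on a Spot-backed compute environment
# With a retry strategy, an interrupted job is retried on another instance

job_definition = {
    'jobDefinitionName': 'batch-process',
    'type': 'container',
    'containerProperties': {
        'image': 'my-batch-app:latest',
        'vcpus': 1,
        'memory': 2048,  # MiB
    },
    'retryStrategy': {'attempts': 3},
}

# Managed Spot compute environment with 4 instance types
# A job queue (not the job definition) links jobs to this environment
compute_env = {
    'computeEnvironmentName': 'spot-compute-env',
    'type': 'MANAGED',
    'computeResources': {
        'type': 'SPOT',
        'instanceTypes': ['t3.large', 't3a.large', 'm5.large', 'm5a.large'],
        'minvCpus': 0,
        'maxvCpus': 100,
        'desiredvCpus': 100,
        # subnets, securityGroupIds, and instanceRole are also required
    },
}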
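
The idea of tolerating individual instance loss can be shown with a small simulation. It is a sketch under stated assumptions: interruptions are modeled as a random failure per task, the squaring step stands in for real work, and `run_batch` is a hypothetical helper rather than part of AWS Batch.

```python
import random
from collections import deque


def run_batch(tasks, interruption_prob=0.2, seed=0):
    """Process tasks; an interrupted task goes back on the queue.

    Tasks must be idempotent so that re-running one is safe.
    Returns (results, number_of_retries).
    """
    rng = random.Random(seed)
    queue = deque(tasks)
    results, retries = {}, 0
    while queue:
        task = queue.popleft()
        if rng.random() < interruption_prob:
            # Simulated Spot interruption: the work is lost, so retry later
            queue.append(task)
            retries += 1
            continue
        results[task] = task * task  # stand-in for real work
    return results, retries


results, retries = run_batch(range(10))
print(f"completed {len(results)} tasks with {retries} retries")
```

Keeping tasks small limits how much work each interruption throws away, which is why the pattern starts by breaking the job into small units.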

Spot Instance Best Practices

Practice 1: Instance Type Diversification

Problem: All Spot instances of same type = higher interruption risk

# BAD: All t3.large
instance_type = "t3.large"
# If t3.large Spot price spikes, all instances vulnerable

# GOOD: Mix of instance types
instances = ["t3.large", "t3a.large", "m5.large", "m5a.large"]
# Even if one type becomes scarce, others still available

Practice 2: Capacity Rebalancing

# Modern AWS ASG feature
# Replaces at-risk Spot instances proactively
# Better than waiting for interruption

resource "aws_autoscaling_group" "web" {
  capacity_rebalance = true  # Enable proactive replacement
}

Practice 3: Spot Instance Interruption Handling

# Handle shutdown gracefully
import signal
import sys

def handle_interruption(signum, frame):
    # SIGTERM arrives when the OS begins shutting down, not at the start of
    # the 2-minute warning, so keep this cleanup short
    print("Termination signal received")
    # Stop accepting new requests
    # Drain connections
    # Exit cleanly
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_interruption)

# The 2-minute warning itself is published in EC2 instance metadata (IMDSv2)
# PUT http://169.254.169.254/latest/api/token
# THEN GET http://169.254.169.254/latest/meta-data/spot/instance-action
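
The metadata check mentioned in the comments can be sketched with the standard library. This is a minimal example, assuming IMDSv2 is enabled and the code runs on the instance itself; the function names and the injectable `fetch` parameter are illustrative choices that make the loop testable off-instance.

```python
import json
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout=2):
    """Return the raw instance-action JSON, or None if no interruption is scheduled."""
    token_request = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_request, timeout=timeout) as resp:
        token = resp.read().decode()
    action_request = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(action_request, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption notice yet
            return None
        raise


def parse_instance_action(body):
    """Extract (action, time) from the metadata response body."""
    notice = json.loads(body)
    return notice["action"], notice["time"]


def wait_for_interruption(fetch=fetch_instance_action, interval=5, max_polls=None):
    """Poll until a notice appears; return (action, time) or None if polls run out."""
    polls = 0
    while max_polls is None or polls < max_polls:
        body = fetch()
        if body is not None:
            return parse_instance_action(body)
        polls += 1
        time.sleep(interval)
    return None
```

One way to use this is to run `wait_for_interruption()` in a background thread and start the same drain logic as the signal handler when it returns.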

Practice 4: Max Price Strategy

Spot maximum price:
- Set max price = On-Demand price (the default when no max price is given)
- Interruption only if demand exceeds supply
- Maintain service quality

# Cost-focused cap:
max_price = on_demand_price * 0.9  # Never pay more than 90% of On-Demand

# Availability-focused cap:
max_price = on_demand_price * 0.95  # Lower savings, higher availability

Spot Use Cases

Ideal Use Cases (Great Savings)

  1. Batch Processing

    • MapReduce jobs
    • Data analysis
    • Image processing
    • CI/CD builds
    • Savings: 80-90%
  2. Development/Testing

    • Dev environment instances
    • Load testing
    • Staging environments
    • Savings: 80-90%
  3. Analytics

    • EMR clusters
    • Spark jobs
    • Data warehousing queries
    • Savings: 70-80%
  4. Web Applications (with ASG)

    • Stateless APIs
    • Microservices
    • Backend workers
    • Savings: 50-70%

Not Suitable Use Cases

  1. Stateful Services

    • Databases with persistent state
    • Session storage
    • Long-running transactions
  2. Interactive Applications

    • User-facing applications
    • Real-time streaming
    • Live collaboration tools
  3. High-Availability Critical

    • Payment processing
    • Mission-critical systems
    • Compliance-sensitive workloads

Real-World Cost Analysis

Case Study 1: Startup SaaS Platform

Original Setup (All On-Demand):

  • 10x t3.large web servers: $705/month
  • 5x t3.large cache servers: $353/month
  • 20x t3.medium batch workers: $292/month
  • Total: $1,350/month

Optimized (Mixed On-Demand + Spot):

  • 2x t3.large web (on-demand): $141/month
  • 8x t3.large web (spot): $184/month
  • 2x t3.large cache (on-demand): $141/month
  • 3x t3.large cache (spot): $70/month
  • 20x t3.medium batch (spot): $87/month
  • Total: $623/month

Savings: $727/month (54% reduction)

Annual Savings: $8,724

Case Study 2: Data Processing Pipeline

Original (On-Demand):

  • 100x c5.xlarge instances for 10 hours/day
  • Cost: 100 × $0.17 × 10 × 260 = $44,200/year (260 working days)

Spot Fleet (90% Spot):

  • 100x c5.xlarge (90% spot, 10% on-demand)
  • 90x spot @ $0.051/hour
  • 10x on-demand @ $0.17/hour
  • Cost: (90 × $0.051 + 10 × $0.17) × 10 × 260 = $16,354/year

Savings: $27,846/year (63% reduction)
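
The case-study formula can be evaluated directly to check the totals. The sketch below uses only the inputs stated in this case study; the constant names are illustrative.

```python
INSTANCES = 100
HOURS_PER_DAY = 10
DAYS = 260  # number of days used in the cost formula
ON_DEMAND_HOURLY = 0.17
SPOT_HOURLY = 0.051
SPOT_SHARE = 0.9

hours = HOURS_PER_DAY * DAYS
all_on_demand = INSTANCES * ON_DEMAND_HOURLY * hours

spot_count = round(INSTANCES * SPOT_SHARE)
on_demand_count = INSTANCES - spot_count
blended = (spot_count * SPOT_HOURLY + on_demand_count * ON_DEMAND_HOURLY) * hours

savings = all_on_demand - blended
print(f"All On-Demand: ${all_on_demand:,.0f}")
print(f"90% Spot blend: ${blended:,.0f}")
print(f"Savings: ${savings:,.0f} ({savings / all_on_demand:.0%})")
```

Changing `SPOT_SHARE` or `SPOT_HOURLY` shows how sensitive the result is to the Spot price and to how much of the fleet stays On-Demand.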


Spot Monitoring and Alerting

CloudWatch Metrics for Spot

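# Monitor instance terminations in the Auto Scaling group
import boto3

cloudwatch = boto3.client('cloudwatch')

# GroupTerminatingInstances tracks instances being terminated for any reason,
# not only Spot interruptions, and needs group metrics enabled on the ASG
cloudwatch.put_metric_alarm(
    AlarmName='high-spot-interruption-rate',
    MetricName='GroupTerminatingInstances',
    Namespace='AWS/AutoScaling',
    Dimensions=[{'Name': 'AutoScalingGroupName', 'Value': 'web-asg'}],
    Statistic='Sum',
    Period=3600,
    EvaluationPeriods=2,
    Threshold=5,  # Alert if the hourly sum exceeds 5 for 2 consecutive hours
    ComparisonOperator='GreaterThanThreshold',
)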
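
To react to Spot interruptions specifically, one option is an EventBridge rule that matches the interruption warning event EC2 emits. The sketch below builds the rule parameters; the rule name and helper functions are assumptions for illustration, and the rule still needs a target (for example an SNS topic or Lambda function) before it notifies anyone.

```python
import json


def interruption_rule_params(rule_name="spot-interruption-warning"):
    """Build keyword arguments for the EventBridge put_rule call."""
    pattern = {
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Spot Instance Interruption Warning"],
    }
    return {
        "Name": rule_name,
        "EventPattern": json.dumps(pattern),
        "State": "ENABLED",
        "Description": "Fires when EC2 issues a Spot interruption notice",
    }


def create_interruption_rule():
    """Create the rule; needs boto3 and AWS credentials."""
    import boto3  # imported here so the module loads without boto3 installed

    events = boto3.client("events")
    return events.put_rule(**interruption_rule_params())


print(interruption_rule_params()["EventPattern"])
```

Counting these events gives a direct interruption count, whereas the Auto Scaling metric also includes scale-in and health-check terminations.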

Spot Instance Advisor

AWS provides Spot Instance Advisor (https://aws.amazon.com/ec2/spot/instance-advisor/) with:

  • Interruption frequency by instance type
  • Historical pricing trends
  • Recommendations for diversification

Spot Pricing Strategies

Strategy 1: Reserved Capacity for Baseline

Workload pattern:
- Baseline: 20 instances always needed
- Peak: 100 instances 8 hours/day
- Off-peak: 5 instances

Recommendation:
- Reserve: 20 instances (on-demand or 1-year RI)
- Spot Fleet: Scale from 20-100 based on demand
- Off-peak: Auto-scale down to 5

Cost:
- Reserved: $1,461/month (20 × ~$73)
- Spot scaling: $184/month (savings when not in peak)
- Total: $1,645/month

vs. All on-demand (100 instances average):
- $7,048/month
- Savings: $5,403/month (77%)

Strategy 2: Spot + On-Demand Blend

Tiered approach:
- Critical requests: On-demand instances
- Non-critical requests: Spot instances

Load balancer with weighted target groups:
- Target group 1 (on-demand): weight 20%, requests critical
- Target group 2 (spot): weight 80%, requests non-critical

Cost:
- On-demand tier: Handles 20% peak + redundancy
- Spot tier: Handles 80% demand (interrupted instances auto-replace)

Result: 80% of traffic on cheap Spot, with graceful degradation

Spot Savings Calculator

# Example prices from this guide (USD/hour); replace with live pricing data
ON_DEMAND_PRICES = {'t3.large': 0.0965}
SPOT_PRICES = {'t3.large': 0.0289}

def get_on_demand_price(instance_type):
    return ON_DEMAND_PRICES[instance_type]

def get_spot_price(instance_type):
    return SPOT_PRICES[instance_type]

def calculate_spot_savings(
    instance_type,
    quantity,
    hours_per_day,
    days_per_month,
    on_demand_percentage=20
):
    on_demand_price = get_on_demand_price(instance_type)
    spot_price = get_spot_price(instance_type)

    # Spot takes whatever is not On-Demand, so the counts always sum to quantity
    on_demand_count = round(quantity * on_demand_percentage / 100)
    spot_count = quantity - on_demand_count

    on_demand_monthly = (
        on_demand_count * on_demand_price *
        hours_per_day * days_per_month
    )

    spot_monthly = (
        spot_count * spot_price *
        hours_per_day * days_per_month
    )

    all_on_demand = (
        quantity * on_demand_price *
        hours_per_day * days_per_month
    )

    savings = all_on_demand - (on_demand_monthly + spot_monthly)
    savings_percent = (savings / all_on_demand) * 100

    return {
        'current_cost': on_demand_monthly + spot_monthly,
        'on_demand_cost': all_on_demand,
        'monthly_savings': savings,
        'annual_savings': savings * 12,
        'savings_percent': savings_percent
    }

# Example
result = calculate_spot_savings(
    instance_type='t3.large',
    quantity=20,
    hours_per_day=8,
    days_per_month=22
)

# Output (rounded):
# Current cost (20% on-demand, 80% spot): $149/month
# On-demand cost: $340/month
# Monthly savings: $190
# Annual savings: $2,284
# Savings %: 56%

Glossary

  • Spot Instance: AWS excess capacity at steep discount
  • Interruption: AWS terminating Spot instance to reclaim capacity
  • Spot Fleet: Group of Spot instances launched together
  • Capacity Rebalancing: Proactively replacing at-risk Spot instances
  • On-Demand: Standard hourly pricing without interruption risk
  • Fault-Tolerant: System handles component failures gracefully
  • Stateless: Application without persistent local state
