Introduction
Spot Instances are AWS's ultra-discounted compute offering: spare EC2 capacity at a 70-90% discount compared to on-demand pricing. The tradeoff: AWS can terminate them with a 2-minute notice.
For fault-tolerant workloads, Spot Instances are typically the lowest-cost EC2 purchasing option. This guide shows you how to architect for Spot, implement best practices, and estimate the savings for your own workload.
Spot Pricing Fundamentals
How Spot Pricing Works
AWS has spare compute capacity for each instance type in each Availability Zone. Rather than leave it idle, AWS sells it at steep discounts.
On-Demand price: $0.0965/hour
Spot price (typical): $0.0289/hour (70% discount)
Spot price (variable): $0.0200-$0.0400/hour
Savings: $0.0676/hour = $592/year per instance
Spot vs On-Demand: Full Comparison
Feature               On-Demand       Spot             Savings
---------------------------------------------------------------
Hourly rate           $0.0965         $0.0289          70%
Cost predictability   High            Variable         N/A
Availability          99.95%          70-95%           N/A
Interruption notice   N/A             2 minutes        N/A
Suitable for          All workloads   Fault-tolerant   N/A
Annual cost per instance:
On-Demand: $845
Spot: $253
Savings: $592
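The annual figures above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the instance runs all 8,760 hours of the year at the example prices quoted in this guide; the function name `annual_cost` is illustrative, not part of any AWS tooling.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, assuming the instance never stops


def annual_cost(hourly_rate: float, hours: int = HOURS_PER_YEAR) -> float:
    """Cost of running one instance for `hours` at a flat hourly rate."""
    return hourly_rate * hours


on_demand = annual_cost(0.0965)   # example On-Demand price from this guide
spot = annual_cost(0.0289)        # example typical Spot price from this guide
savings = on_demand - spot
discount = savings / on_demand

print(f"On-Demand: ${on_demand:,.0f}")
print(f"Spot:      ${spot:,.0f}")
print(f"Savings:   ${savings:,.0f} ({discount:.0%})")
```

Because Spot prices move, treat the result as an estimate rather than a quote.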
Spot Instance Interruptions
Why Spot Gets Interrupted
AWS balances supply and demand:
- Capacity is low
- On-demand demand increases
- Spot instances are reclaimed for higher revenue
- You get 2-minute termination notice
Interruption Rates by Instance Type
General Purpose (t3, m5):
- Interruption rate: 3-5%
- Mean time to interruption: 600+ hours
Compute Optimized (c5, c6):
- Interruption rate: 2-4%
- Mean time to interruption: 1,000+ hours
Memory Optimized (r5, x1):
- Interruption rate: 4-6%
- Mean time to interruption: 500+ hours
GPU (p3, g4):
- Interruption rate: 5-10%
- Mean time to interruption: 200+ hours
Risk Calculation
Example: 100 instances running daily
With a 3% daily interruption rate:
- Expected instances interrupted per day: 100 × 0.03 = 3
- Interruption notice: 2 minutes
- Manual recovery: ~5 minutes
- Automated recovery: <1 minute
This is manageable with a fault-tolerant design.
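To make the risk estimate concrete, here is a small sketch. It assumes the 3% figure is a per-instance, per-day probability and that interruptions are independent of each other; both are simplifications, since real interruptions tend to cluster by instance type and Availability Zone.

```python
def expected_interruptions(fleet_size: int, daily_rate: float) -> float:
    """Average number of instances interrupted per day."""
    return fleet_size * daily_rate


def prob_at_least_one(fleet_size: int, daily_rate: float) -> float:
    """Chance that at least one instance is interrupted on a given day."""
    return 1.0 - (1.0 - daily_rate) ** fleet_size


fleet, rate = 100, 0.03
print(f"Expected interruptions per day: {expected_interruptions(fleet, rate):.1f}")
print(f"Chance of at least one per day: {prob_at_least_one(fleet, rate):.0%}")
```

Under these assumptions a 100-instance fleet sees at least one interruption on roughly 95% of days, which is why recovery should be automated rather than manual.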
Fault-Tolerant Architecture Patterns
Pattern 1: Auto Scaling Group with Mixed Instances
Architecture:
- Mix of On-Demand (20%) and Spot (80%) instances
- Auto-scaling handles failures
# Terraform configuration
resource "aws_autoscaling_group" "web" {
  name                = "web-asg"
  min_size            = 4
  max_size            = 20
  desired_capacity    = 10
  vpc_zone_identifier = var.subnet_ids # assumed variable: subnets across several AZs

  mixed_instances_policy {
    instances_distribution {
      # 20% On-Demand, 80% Spot at any group size
      on_demand_base_capacity                  = 0
      on_demand_percentage_above_base_capacity = 20
      spot_allocation_strategy                 = "lowest-price"
      spot_instance_pools                      = 4
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.web.id
        version            = "$Latest"
      }

      # Diversify instance types to reduce interruption risk
      override {
        instance_type = "t3.large"
      }
      override {
        instance_type = "t3a.large"
      }
      override {
        instance_type = "m5.large"
      }
      override {
        instance_type = "m5a.large"
      }
    }
  }
}
Cost Calculation:
10 instances:
- 2 on-demand: 2 × $0.0965 × 730 = $141/month
- 8 spot: 8 × $0.0289 × 730 = $169/month
Total: $310/month
Equivalent on-demand:
- 10 × $0.0965 × 730 = $705/month
Savings: $395/month (56%)
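The same calculation can be generalized to see how the monthly bill changes with the Spot share. The sketch below uses the example prices from this guide as defaults; `monthly_cost` and its parameters are illustrative names, and it assumes a constant Spot price for the whole month.

```python
HOURS_PER_MONTH = 730  # same monthly hour count used in the calculation above


def monthly_cost(total, spot_share, on_demand_price=0.0965, spot_price=0.0289):
    """Monthly cost of `total` instances when `spot_share` (0.0-1.0) run on Spot."""
    spot_count = round(total * spot_share)
    on_demand_count = total - spot_count
    hourly = on_demand_count * on_demand_price + spot_count * spot_price
    return hourly * HOURS_PER_MONTH


baseline = monthly_cost(10, 0.0)  # all On-Demand
for share in (0.0, 0.5, 0.8, 1.0):
    cost = monthly_cost(10, share)
    print(f"{share:.0%} Spot: ${cost:,.0f}/month, saves {1 - cost / baseline:.0%}")
```

The 80% row corresponds to the 2 On-Demand plus 8 Spot mix described above.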
Pattern 2: Stateless Services with Load Balancing
Architecture:
- Stateless containerized services
- Load balancer distributes traffic
- Lost connections are automatically re-routed
Client Request
      ↓
Load Balancer
  ↓     ↓     ↓
Spot   Spot   On-Demand
Instance interruption?
      ↓
Load Balancer removes from pool
Auto Scaling launches replacement
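The flow above can be modeled in a few lines. This is a toy simulation only: the `TargetPool` class, its method names, and the instance IDs are all hypothetical, and a real deployment relies on the load balancer's health checks and the Auto Scaling group rather than application code.

```python
from itertools import count


class TargetPool:
    """Toy model of a load balancer pool backed by an Auto Scaling group."""

    def __init__(self, instance_ids):
        self.healthy = list(instance_ids)
        self._next = 0
        self._replacement_ids = count(1)

    def route(self):
        """Return the instance that serves the next request (round robin)."""
        if not self.healthy:
            raise RuntimeError("no healthy targets")
        target = self.healthy[self._next % len(self.healthy)]
        self._next += 1
        return target

    def interrupt(self, instance_id):
        """Remove an interrupted instance and register its replacement."""
        self.healthy.remove(instance_id)
        replacement = f"replacement-{next(self._replacement_ids)}"
        self.healthy.append(replacement)
        return replacement


pool = TargetPool(["spot-a", "spot-b", "on-demand-a"])
pool.interrupt("spot-a")                    # Spot instance reclaimed
served = [pool.route() for _ in range(6)]   # traffic keeps flowing
print(served)
```

Because the service is stateless, no request depends on the instance that disappeared; capacity stays at three targets throughout.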
Pattern 3: Batch Processing with Spot Fleet
Architecture:
- Break work into small tasks
- Use Spot Fleet to scale wide
- Tolerate individual instance loss
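# AWS Batch with Spot (request parameters for boto3's Batch client)
# Jobs run on whichever compute environment backs their job queue.
# If an instance is interrupted, the retry strategy restarts the job elsewhere.
job_definition = {
    'jobDefinitionName': 'batch-process',
    'type': 'container',
    'containerProperties': {
        'image': 'my-batch-app:latest',
        'vcpus': 1,
        'memory': 2048,  # MiB
    },
    'retryStrategy': {'attempts': 3},
}
# Spot compute environment with 4 instance types
# (subnets, security groups, and the instance role are omitted here)
compute_env = {
    'computeEnvironmentName': 'spot-compute-env',
    'type': 'MANAGED',
    'computeResources': {
        'type': 'SPOT',
        'instanceTypes': ['t3.large', 't3a.large', 'm5.large', 'm5a.large'],
        'minvCpus': 0,
        'maxvCpus': 100,
        'desiredvCpus': 100,
    },
}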
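The reason small tasks tolerate instance loss is that an interrupted task can simply go back on the queue. The sketch below illustrates that idea in isolation; `run_batch` and the `interrupted_on` parameter are hypothetical names used to simulate interruptions, not part of AWS Batch.

```python
from collections import deque


def run_batch(tasks, process, interrupted_on=frozenset()):
    """Process every task, re-queuing any task whose worker is interrupted.

    `interrupted_on` holds attempt numbers (1-based, counted across the whole
    run) on which we pretend the Spot instance was reclaimed mid-task.
    """
    queue = deque(tasks)
    results = {}
    attempt = 0
    while queue:
        task = queue.popleft()
        attempt += 1
        if attempt in interrupted_on:
            queue.append(task)  # instance lost: put the task back for a retry
            continue
        results[task] = process(task)
    return results, attempt


results, attempts = run_batch(range(5), lambda n: n * n, interrupted_on={2, 4})
print(results, attempts)
```

Two simulated interruptions cost two extra attempts, but every task still completes. This only works if tasks are idempotent, so a retried task does not duplicate side effects.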
Spot Instance Best Practices
Practice 1: Instance Type Diversification
Problem: All Spot instances of same type = higher interruption risk
# BAD: All t3.large
instance_type = "t3.large"
# If t3.large Spot price spikes, all instances vulnerable
# GOOD: Mix of instance types
instances = ["t3.large", "t3a.large", "m5.large", "m5a.large"]
# Even if one type becomes scarce, others still available
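A rough way to quantify the benefit: if capacity is spread evenly across instance types and one type is reclaimed entirely, the share of the fleet lost is one divided by the number of types. The function below is a sketch under that even-spread assumption; its name is illustrative.

```python
def capacity_lost_if_one_type_reclaimed(instance_types):
    """Fraction of the fleet lost when one instance type disappears.

    Assumes instances are spread evenly across the distinct types given.
    """
    distinct = set(instance_types)
    if not distinct:
        raise ValueError("need at least one instance type")
    return 1 / len(distinct)


single = capacity_lost_if_one_type_reclaimed(["t3.large"])
diversified = capacity_lost_if_one_type_reclaimed(
    ["t3.large", "t3a.large", "m5.large", "m5a.large"]
)
print(f"Single type: {single:.0%} lost, four types: {diversified:.0%} lost")
```

Spreading across Availability Zones as well multiplies the number of independent capacity pools, though this sketch does not model that.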
Practice 2: Capacity Rebalancing
# Modern AWS ASG feature
# Replaces at-risk Spot instances proactively
# Better than waiting for interruption
resource "aws_autoscaling_group" "web" {
capacity_rebalance = true # Enable proactive replacement
}
Practice 3: Spot Instance Interruption Handling
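# Handle shutdown gracefully
import signal
import sys

def handle_shutdown(signum, frame):
    print("SIGTERM received, shutting down")
    # Stop accepting new requests
    # Drain connections
    # Exit cleanly
    sys.exit(0)

# SIGTERM is sent when the OS begins shutting down, which can be late in the 2-minute window
signal.signal(signal.SIGTERM, handle_shutdown)
# To react as soon as the notice is issued, also poll EC2 instance metadata (IMDSv2):
# PUT http://169.254.169.254/latest/api/token (header X-aws-ec2-metadata-token-ttl-seconds)
# THEN GET http://169.254.169.254/latest/meta-data/spot/instance-action (header X-aws-ec2-metadata-token; 404 until a notice exists)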
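The metadata check can be sketched with only the standard library. The endpoint paths and the two token headers are the documented IMDSv2 ones; the function names, the 60-second token TTL, and the timeout are choices made for this example. The fetch function only works on an EC2 instance, so parsing is kept separate and can be exercised anywhere.

```python
import json
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout=2):
    """Return the raw instance-action body, or None if no notice is pending."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    action_req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(action_req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption notice has been issued
            return None
        raise


def parse_instance_action(body):
    """Turn the JSON body into (action, time), or None when there is no notice."""
    if body is None:
        return None
    doc = json.loads(body)
    return doc["action"], doc["time"]


sample = '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
print(parse_instance_action(sample))
```

On an instance, a worker would call `parse_instance_action(fetch_instance_action())` every few seconds and start draining as soon as it returns something other than `None`.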
Practice 4: Max Price Strategy
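Spot max price (formerly "bidding"):
- Set max price = On-Demand price (the default when none is specified)
- Interruptions then happen only when AWS needs the capacity back, not because of price
- Maintain service quality
# Cost-focused approach:
max_price = on_demand_price * 0.9 # Never pay more than 90% of On-Demand
# Availability-focused approach:
max_price = on_demand_price * 0.95 # Lower guaranteed savings, fewer price-driven interruptions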
Spot Use Cases
Ideal Use Cases (Great Savings)
- Batch Processing
  - MapReduce jobs
  - Data analysis
  - Image processing
  - CI/CD builds
  - Savings: 80-90%
- Development/Testing
  - Dev environment instances
  - Load testing
  - Staging environments
  - Savings: 80-90%
- Analytics
  - EMR clusters
  - Spark jobs
  - Data warehousing queries
  - Savings: 70-80%
- Web Applications (with ASG)
  - Stateless APIs
  - Microservices
  - Backend workers
  - Savings: 50-70%
Not Suitable Use Cases
- Stateful Services
  - Databases with persistent state
  - Session storage
  - Long-running transactions
- Interactive Applications
  - User-facing applications
  - Real-time streaming
  - Live collaboration tools
- High-Availability Critical
  - Payment processing
  - Mission-critical systems
  - Compliance-sensitive workloads
Real-World Cost Analysis
Case Study 1: Startup SaaS Platform
Original Setup (All On-Demand):
- 10x t3.large web servers: $705/month
- 5x t3.large cache servers: $353/month
- 20x t3.medium batch workers: $292/month
- Total: $1,350/month
Optimized (Mixed On-Demand + Spot):
- 2x t3.large web (on-demand): $141/month
- 8x t3.large web (spot): $184/month
- 2x t3.large cache (on-demand): $141/month
- 3x t3.large cache (spot): $70/month
- 20x t3.medium batch (spot): $87/month
- Total: $623/month
Savings: $727/month (54% reduction)
Annual Savings: $8,724
Case Study 2: Data Processing Pipeline
Original (On-Demand):
- 100x c5.xlarge instances for 10 hours/day, 260 days/year
- Cost: 100 × $0.17 × 10 × 260 = $44,200/year
Spot Fleet (90% Spot):
- 100x c5.xlarge (90% spot, 10% on-demand)
- 90x spot @ $0.051/hour
- 10x on-demand @ $0.17/hour
- Cost: (90 × $0.051 + 10 × $0.17) × 10 × 260 = $16,354/year
Savings: $27,846/year (63% reduction)
Monthly Savings: about $2,320
Spot Monitoring and Alerting
CloudWatch Metrics for Spot
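# Monitor Spot interruption rate
import boto3

cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_alarm(
    AlarmName='high-spot-interruption-rate',
    # Counts every terminating instance in the group, not only Spot interruptions;
    # requires group metrics collection to be enabled on the Auto Scaling group
    MetricName='GroupTerminatingInstances',
    Namespace='AWS/AutoScaling',
    Dimensions=[{'Name': 'AutoScalingGroupName', 'Value': 'web-asg'}],
    Statistic='Sum',
    Period=3600,
    EvaluationPeriods=2,
    Threshold=5,  # Alert if the hourly sum exceeds 5 for two consecutive hours
    ComparisonOperator='GreaterThanThreshold',
)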
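For per-instance alerting, AWS also publishes an EventBridge event with source `aws.ec2` and detail type `EC2 Spot Instance Interruption Warning` when a notice is issued. The sketch below shows the event pattern and a simplified local matcher; the `matches` and `interrupted_instance` helpers and the sample event values are illustrative, and the matcher covers only exact-value matching on top-level keys, not the full EventBridge pattern syntax.

```python
import json

# Event pattern you would attach to an EventBridge rule
INTERRUPTION_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Spot Instance Interruption Warning"],
}


def matches(event, pattern=INTERRUPTION_PATTERN):
    """Simplified matcher: each pattern key must equal one of the listed values."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


def interrupted_instance(event):
    """Return the instance ID from an interruption warning, else None."""
    return event["detail"]["instance-id"] if matches(event) else None


# Hypothetical sample event, trimmed to the fields used here
sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "detail": {"instance-id": "i-1234567890abcdef0", "instance-action": "terminate"},
}

print(json.dumps(INTERRUPTION_PATTERN))
print(interrupted_instance(sample_event))
```

A rule with this pattern can target an SNS topic or Lambda function, which gives a count of actual Spot interruptions rather than all terminations.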
Spot Instance Advisor
AWS provides Spot Instance Advisor (https://aws.amazon.com/ec2/spot/instance-advisor/) with:
- Interruption frequency by instance type
- Historical pricing trends
- Recommendations for diversification
Spot Pricing Strategies
Strategy 1: Reserved Capacity for Baseline
Workload pattern:
- Baseline: 20 instances always needed
- Peak: 100 instances 8 hours/day
- Off-peak: 5 instances
Recommendation:
- Reserve: 20 instances (on-demand or 1-year RI)
- Spot Fleet: Scale from 20-100 based on demand
- Off-peak: Auto-scale down to 5
Cost:
- Reserved: $1,461/month (20 × ~$73)
- Spot scaling: $184/month (savings when not in peak)
- Total: $1,645/month
vs. All on-demand (100 instances average):
- $7,048/month
- Savings: $5,403/month (77%)
Strategy 2: Spot + On-Demand Blend
Tiered approach:
- Critical requests: On-demand instances
- Non-critical requests: Spot instances
Load balancer with weighted target groups:
- Target group 1 (on-demand): weight 20%, requests critical
- Target group 2 (spot): weight 80%, requests non-critical
Cost:
- On-demand tier: Handles 20% peak + redundancy
- Spot tier: Handles 80% demand (interrupted instances auto-replace)
Result: 80% of traffic on cheap Spot, with graceful degradation
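The tiering logic can be sketched as a small routing function. This is an application-level illustration with hypothetical names (`choose_tier`, the `healthy` dictionary); with a real load balancer the split is configured through weighted target groups and routing rules instead. The fallback to the other tier when the preferred one has no healthy targets is an assumption about what graceful degradation means here.

```python
def choose_tier(critical: bool, healthy: dict) -> str:
    """Pick the tier for a request, falling back if the preferred tier is empty.

    `healthy` maps tier name ("on-demand" or "spot") to its healthy target count.
    """
    preferred = "on-demand" if critical else "spot"
    fallback = "spot" if critical else "on-demand"
    if healthy.get(preferred, 0) > 0:
        return preferred
    if healthy.get(fallback, 0) > 0:
        return fallback
    raise RuntimeError("no healthy targets in either tier")


print(choose_tier(critical=True, healthy={"on-demand": 2, "spot": 8}))
print(choose_tier(critical=False, healthy={"on-demand": 2, "spot": 8}))
print(choose_tier(critical=False, healthy={"on-demand": 2, "spot": 0}))
```

The last call shows the degraded case: when every Spot target is interrupted at once, non-critical traffic lands on the On-Demand tier until replacements launch.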
Spot Savings Calculator
# Example prices from this guide; replace these lookups with live pricing data
ON_DEMAND_PRICES = {'t3.large': 0.0965}
SPOT_PRICES = {'t3.large': 0.0289}

def get_on_demand_price(instance_type):
    return ON_DEMAND_PRICES[instance_type]

def get_spot_price(instance_type):
    return SPOT_PRICES[instance_type]

def calculate_spot_savings(
    instance_type,
    quantity,
    hours_per_day,
    days_per_month,
    on_demand_percentage=20
):
    on_demand_price = get_on_demand_price(instance_type)
    spot_price = get_spot_price(instance_type)
    # Everything that is not On-Demand runs on Spot, so the counts always sum to quantity
    on_demand_count = round(quantity * on_demand_percentage / 100)
    spot_count = quantity - on_demand_count
    hours_per_month = hours_per_day * days_per_month
    on_demand_monthly = on_demand_count * on_demand_price * hours_per_month
    spot_monthly = spot_count * spot_price * hours_per_month
    all_on_demand = quantity * on_demand_price * hours_per_month
    savings = all_on_demand - (on_demand_monthly + spot_monthly)
    savings_percent = (savings / all_on_demand) * 100
    return {
        'current_cost': on_demand_monthly + spot_monthly,
        'on_demand_cost': all_on_demand,
        'monthly_savings': savings,
        'annual_savings': savings * 12,
        'savings_percent': savings_percent
    }

# Example
result = calculate_spot_savings(
    instance_type='t3.large',
    quantity=20,
    hours_per_day=8,
    days_per_month=22
)
# Result (rounded):
# Current cost (20% on-demand, 80% spot): $149/month
# On-demand cost: $340/month
# Monthly savings: $190
# Annual savings: $2,284
# Savings %: 56%
Glossary
- Spot Instance: AWS excess capacity at steep discount
- Interruption: AWS terminating Spot instance to reclaim capacity
- Spot Fleet: Group of Spot instances launched together
- Capacity Rebalancing: Proactively replacing at-risk Spot instances
- On-Demand: Standard hourly pricing without interruption risk
- Fault-Tolerant: System handles component failures gracefully
- Stateless: Application without persistent local state