Introduction
Disaster recovery is not a question of if, but when. Industry estimates put the average cost of IT downtime at around $300,000 per hour, with some industries seeing costs exceed $1 million per hour. Building automated disaster recovery isn't just good practice; it's essential for business survival.
Key Statistics:
- 60% of companies that lose data close within 6 months
- Average recovery time: 23 hours without automation, 4 hours with automation
- 95% of companies with DR plans survive ransomware attacks
- Multi-region deployments reduce outage risk by 85%
Understanding RTO and RPO
Definitions
| Metric | Definition | Target Examples |
|---|---|---|
| RTO (Recovery Time Objective) | Maximum acceptable time to restore service | 15 minutes, 1 hour, 4 hours |
| RPO (Recovery Point Objective) | Maximum acceptable data loss (time-based) | 0 (synchronous), 5 minutes, 1 hour |
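These targets only matter if real incidents are measured against them. A small sketch (function name and timestamps are illustrative) that checks whether a recovery met its RTO/RPO:

```python
from datetime import datetime, timedelta

def meets_objectives(outage_start, service_restored, last_good_backup,
                     rto: timedelta, rpo: timedelta) -> dict:
    """Compare an incident's actual recovery against RTO/RPO targets."""
    downtime = service_restored - outage_start    # actual time to restore
    data_loss = outage_start - last_good_backup   # worst-case data loss window
    return {
        "downtime": downtime,
        "data_loss": data_loss,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss <= rpo,
    }

# Example: a "Critical" tier incident (RTO 1 hour, RPO 5 minutes)
result = meets_objectives(
    outage_start=datetime(2024, 1, 1, 12, 0),
    service_restored=datetime(2024, 1, 1, 12, 45),
    last_good_backup=datetime(2024, 1, 1, 11, 58),
    rto=timedelta(hours=1),
    rpo=timedelta(minutes=5),
)
print(result["rto_met"], result["rpo_met"])  # 45 min downtime, 2 min data loss
```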
Choosing Targets by Tier
Recovery Objectives by Tier

| Tier | RTO | RPO | Use Case |
|---|---|---|---|
| Mission-Critical | 15 min | 0 | Financial transactions |
| Critical | 1 hour | 5 min | Core business applications |
| Important | 4 hours | 1 hr | Internal tools, CRM |
| Standard | 24 hours | 4 hrs | Development, testing |
Multi-Region Architecture
Active-Active Architecture
```yaml
# Kubernetes multi-region deployment
# One such Service is deployed per region; traffic steering between
# regions is handled by DNS failover (see below).
apiVersion: v1
kind: Service
metadata:
  name: app-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
spec:
  type: LoadBalancer
  selector:
    app: myapp
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  DATABASE_REPLICA_HOST: "db.us-east-1.rds.amazonaws.com"
  DATABASE_REPLICA_HOST_EU: "db.eu-west-1.rds.amazonaws.com"
  REDIS_ENDPOINT: "cluster.cfg.use1.cache.amazonaws.com"
  REDIS_ENDPOINT_EU: "cluster.cfg.euw1.cache.amazonaws.com"
```
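The ConfigMap keeps per-region endpoints under suffixed keys, so the application has to pick the region-local replica at startup. A minimal sketch of that selection, assuming the region arrives via an AWS_REGION-style value and that unknown regions fall back to the primary (unsuffixed) keys:

```python
import os

# Suffix convention mirrors the ConfigMap keys (no suffix = us-east-1)
REGION_SUFFIX = {"us-east-1": "", "eu-west-1": "_EU"}

def endpoint_for(base_key: str, region: str, env=os.environ) -> str:
    """Resolve e.g. DATABASE_REPLICA_HOST for the current region,
    falling back to the primary-region key if no regional key exists."""
    suffix = REGION_SUFFIX.get(region, "")
    return env.get(base_key + suffix) or env[base_key]

# Usage with the ConfigMap values above:
env = {
    "DATABASE_REPLICA_HOST": "db.us-east-1.rds.amazonaws.com",
    "DATABASE_REPLICA_HOST_EU": "db.eu-west-1.rds.amazonaws.com",
}
print(endpoint_for("DATABASE_REPLICA_HOST", "eu-west-1", env))
# db.eu-west-1.rds.amazonaws.com
```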
Global Database Replication
```sql
-- PostgreSQL multi-region setup: logical replication
-- Primary in US East, replica in EU West

-- On the primary (US East)
CREATE PUBLICATION db_publication FOR ALL TABLES;

-- On the replica (EU West)
CREATE SUBSCRIPTION db_subscription
    CONNECTION 'host=primary.us-east-1.rds.amazonaws.com
                port=5432
                dbname=mydb
                user=repl_user
                password=xxx'
    PUBLICATION db_publication;

-- Verify replication status (run on the primary)
SELECT * FROM pg_stat_replication;
```
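Logical replication can fall behind silently, so a periodic lag check closes the loop on the setup above. A minimal sketch, assuming the psycopg2 driver and a monitoring-capable user; the 16 MB threshold is an arbitrary assumption:

```python
LAG_QUERY = """
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
"""

def lagging_replicas(rows, threshold_bytes=16 * 1024 * 1024):
    """Return subscriber names whose replay lag exceeds the threshold (16 MB here)."""
    return [name for name, lag in rows if lag is not None and lag > threshold_bytes]

def check_replication(dsn):
    """Run the lag query on the primary and return lagging subscribers."""
    import psycopg2  # lazy import: the pure helper above needs no driver
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(LAG_QUERY)
        return lagging_replicas(cur.fetchall())
```

Alerting on the returned list (rather than failing over blindly) keeps humans in the loop for slow-but-healthy replicas.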
DNS Failover
Route 53 health check and failover record sets (primary and secondary):

```json
{
  "Name": "app.example.com",
  "Type": "A",
  "SetIdentifier": "primary",
  "HealthCheckId": "abc123",
  "Failover": "PRIMARY",
  "AliasTarget": {
    "HostedZoneId": "Z2FDTNDATAQYW2",
    "DNSName": "dualstack.app-primary-123456789.us-east-1.elb.amazonaws.com",
    "EvaluateTargetHealth": true
  }
}
{
  "Name": "app.example.com",
  "Type": "A",
  "SetIdentifier": "secondary",
  "Failover": "SECONDARY",
  "AliasTarget": {
    "HostedZoneId": "Z2FDTNDATAQYW2",
    "DNSName": "dualstack.app-secondary-987654321.eu-west-1.elb.amazonaws.com",
    "EvaluateTargetHealth": true
  }
}
```
```python
#!/usr/bin/env python3
"""DNS health check and failover automation."""
import boto3
import requests
from datetime import datetime

ROUTE53 = boto3.client('route53')
HEALTH_CHECK_TAG = 'auto-failover-enabled'

def check_application_health(url):
    """Check if the application responds with HTTP 200 on /health."""
    try:
        response = requests.get(f"{url}/health", timeout=5)
        return response.status_code == 200
    except requests.RequestException:
        return False

def update_dns_record(hosted_zone_id, record_name, health_check_id, failover_type):
    """Upsert a failover record. Failover records require a SetIdentifier,
    and the PRIMARY record must carry the HealthCheckId."""
    record = {
        'Name': record_name,
        'Type': 'A',
        'SetIdentifier': failover_type.lower(),
        'Failover': failover_type,
        'TTL': 30,
        'ResourceRecords': [{'Value': '1.2.3.4'}],
    }
    if failover_type == 'PRIMARY':
        record['HealthCheckId'] = health_check_id
    return ROUTE53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={'Changes': [{'Action': 'UPSERT', 'ResourceRecordSet': record}]},
    )

def main():
    primary_healthy = check_application_health('https://app-primary.example.com')
    secondary_healthy = check_application_health('https://app-secondary.example.com')
    print(f"[{datetime.now()}] Primary: {primary_healthy}, Secondary: {secondary_healthy}")
    if not primary_healthy and secondary_healthy:
        print("Promoting secondary region...")
        # Trigger failover logic

if __name__ == '__main__':
    main()
```
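The PRIMARY record above references a HealthCheckId; that health check has to be provisioned somewhere. A sketch of creating it with boto3 (the domain, path, and thresholds are assumptions, not values from the records above):

```python
import uuid

def health_check_config(fqdn, path="/health", interval=30, failures=3):
    """Build a Route 53 HealthCheckConfig for an HTTPS endpoint."""
    return {
        "Type": "HTTPS",
        "FullyQualifiedDomainName": fqdn,
        "ResourcePath": path,
        "Port": 443,
        "RequestInterval": interval,   # seconds between checks (30 or 10)
        "FailureThreshold": failures,  # consecutive failures before unhealthy
    }

def create_health_check(fqdn):
    """Create the health check and return its Id for use in record sets."""
    import boto3  # lazy import keeps the config builder testable offline
    client = boto3.client("route53")
    resp = client.create_health_check(
        CallerReference=str(uuid.uuid4()),  # idempotency token
        HealthCheckConfig=health_check_config(fqdn),
    )
    return resp["HealthCheck"]["Id"]
```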
Automated Backup Strategies
Database Backups
```yaml
# PostgreSQL automated backup
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
spec:
  schedule: "0 */4 * * *"  # Every 4 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: postgres:15-alpine
              env:
                - name: PGHOST
                  value: "postgres.default.svc.cluster.local"
                - name: PGDATABASE
                  value: "mydb"
                - name: PGUSER
                  value: "backup_user"
                - name: PGPASSWORD  # keep credentials in a Secret, not the manifest
                  valueFrom:
                    secretKeyRef:
                      name: postgres-backup-credentials
                      key: password
                - name: AWS_REGION
                  value: "us-east-1"
              command:
                - /bin/sh
                - -c
                - |
                  apk add --no-cache aws-cli  # postgres:15-alpine does not ship the AWS CLI
                  DATE=$(date +%Y%m%d_%H%M%S)
                  pg_dump -U "$PGUSER" "$PGDATABASE" | gzip | \
                    aws s3 cp - "s3://backups-bucket/postgres/${PGDATABASE}_${DATE}.sql.gz"
              volumeMounts:
                - name: aws-credentials
                  mountPath: /root/.aws
          volumes:
            - name: aws-credentials
              secret:
                secretName: aws-backup-credentials
          restartPolicy: OnFailure
```
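A backup job that silently stops running is a classic DR failure mode, so verifying freshness matters as much as taking the dump. A sketch that checks the newest object under the CronJob's postgres/ prefix; the 5-hour threshold (one schedule interval plus slack) is an assumption:

```python
from datetime import datetime, timedelta, timezone

def newest_backup(objects):
    """Pick the most recently modified object from an S3 listing."""
    return max(objects, key=lambda o: o["LastModified"], default=None)

def backup_is_fresh(objects, max_age=timedelta(hours=5), now=None):
    """True if the newest backup object is younger than max_age."""
    latest = newest_backup(objects)
    if latest is None:
        return False  # no backups at all: definitely stale
    now = now or datetime.now(timezone.utc)
    return now - latest["LastModified"] <= max_age

def check_bucket():
    """Check the bucket used by the CronJob above."""
    import boto3  # lazy import keeps the helpers testable offline
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket="backups-bucket", Prefix="postgres/")
    return backup_is_fresh(resp.get("Contents", []))
```

Wiring the check into the same alerting path as the health checks turns a missed backup into a page instead of a surprise during restore.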
Kubernetes Persistent Volume Snapshots
```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-snapshot-daily
spec:
  volumeSnapshotClassName: aws-ebs-snapshot-class
  source:
    persistentVolumeClaimName: data-pvc
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: snapshot-backup
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: snapshot-creator
          containers:
            - name: create-snapshot
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - |
                  kubectl apply -f - <<EOF
                  apiVersion: snapshot.storage.k8s.io/v1
                  kind: VolumeSnapshot
                  metadata:
                    name: snapshot-$(date +%Y%m%d)
                  spec:
                    volumeSnapshotClassName: aws-ebs-snapshot-class
                    source:
                      persistentVolumeClaimName: data-pvc
                  EOF
          restartPolicy: OnFailure
```
Application-Level Backups
```python
#!/usr/bin/env python3
"""Application state backup automation."""
import boto3
from datetime import datetime, timedelta, timezone

S3 = boto3.client('s3')
DYNAMODB = boto3.resource('dynamodb')
BACKUP_BUCKET = 'app-backups'
RETENTION_DAYS = 30

def backup_dynamodb_table(table_name, backup_name):
    """Create an on-demand DynamoDB backup and return its ARN."""
    response = DYNAMODB.meta.client.create_backup(
        TableName=table_name,
        BackupName=f"{backup_name}-{datetime.now().strftime('%Y%m%d%H%M%S')}",
    )
    return response['BackupDetails']['BackupArn']

def backup_s3_bucket(source_bucket, prefix=''):
    """Copy an S3 bucket's objects into the dated backup location."""
    paginator = S3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=source_bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            copy_source = {'Bucket': source_bucket, 'Key': obj['Key']}
            dest_key = f"backups/{datetime.now().strftime('%Y%m%d')}/{obj['Key']}"
            S3.copy_object(CopySource=copy_source, Bucket=BACKUP_BUCKET, Key=dest_key)

def cleanup_old_backups():
    """Remove backups older than the retention period."""
    cutoff_date = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)
    paginator = S3.get_paginator('list_objects_v2')  # paginate past 1000 objects
    for page in paginator.paginate(Bucket=BACKUP_BUCKET, Prefix='backups/'):
        for obj in page.get('Contents', []):
            if obj['LastModified'] < cutoff_date:
                S3.delete_object(Bucket=BACKUP_BUCKET, Key=obj['Key'])
                print(f"Deleted: {obj['Key']}")

if __name__ == '__main__':
    backup_dynamodb_table('app-state', 'daily-backup')
    backup_s3_bucket('app-data')
    cleanup_old_backups()
```
Automated Failover
Database Failover
```python
#!/usr/bin/env python3
"""Database failover automation."""
import json
import boto3
from datetime import datetime

RDS = boto3.client('rds')
PRIMARY_REGION = 'us-east-1'
SECONDARY_REGION = 'eu-west-1'

def trigger_rds_failover(db_cluster_id):
    """Fail an Aurora cluster over to its replica instance."""
    print(f"[{datetime.now()}] Initiating failover for {db_cluster_id}")
    RDS.failover_db_cluster(
        DBClusterIdentifier=db_cluster_id,
        TargetDBInstanceIdentifier=f"{db_cluster_id}-replica",
    )
    # Wait for the cluster to become available again
    waiter = RDS.get_waiter('db_cluster_available')
    waiter.wait(DBClusterIdentifier=db_cluster_id)
    print(f"[{datetime.now()}] Failover completed")

def update_connection_string(new_host):
    """Point the application at the new primary."""
    ssm = boto3.client('ssm')
    ssm.put_parameter(
        Name='/app/database/host',
        Value=new_host,
        Type='SecureString',
        Overwrite=True,
    )
    # Trigger secret rotation if using Secrets Manager
    secrets = boto3.client('secretsmanager')
    secrets.rotate_secret(SecretId='app/database-credentials')

def notify_team(failover_type, status):
    """Send notifications to the on-call team."""
    sns = boto3.client('sns')
    message = {
        'failover_type': failover_type,
        'status': status,
        'timestamp': datetime.now().isoformat(),
    }
    sns.publish(
        TopicArn='arn:aws:sns:us-east-1:123456789:oncall-alerts',
        Subject=f'Database Failover: {status}',
        Message=json.dumps(message),
    )
```
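Promoting a replica that is far behind the primary converts an availability incident into data loss, so it is worth checking lag before triggering the failover above. A sketch using the RDS ReplicaLag CloudWatch metric; the 30-second ceiling and the region are assumptions:

```python
from datetime import datetime, timedelta, timezone

def safe_to_promote(datapoints, max_lag_seconds=30.0):
    """True if the most recent ReplicaLag datapoint is under the ceiling."""
    if not datapoints:
        return False  # no metrics: assume unsafe
    latest = max(datapoints, key=lambda d: d["Timestamp"])
    return latest["Average"] <= max_lag_seconds

def replica_lag_ok(replica_id):
    """Fetch the last 10 minutes of ReplicaLag and evaluate them."""
    import boto3  # lazy import keeps safe_to_promote testable offline
    cw = boto3.client("cloudwatch", region_name="eu-west-1")
    now = datetime.now(timezone.utc)
    resp = cw.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="ReplicaLag",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": replica_id}],
        StartTime=now - timedelta(minutes=10),
        EndTime=now,
        Period=60,
        Statistics=["Average"],
    )
    return safe_to_promote(resp["Datapoints"])
```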
Kubernetes Failover
```yaml
# External DNS with health checks
apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
  name: app-endpoint
spec:
  endpoints:
    - dnsName: app.example.com
      recordTTL: 60
      recordType: A
      targets:
        - 1.2.3.4
      providerSpecific:
        - name: aws/weight
          value: "100"
---
# Ingress with failover
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "10"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "120"
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-service
                port:
                  number: 80
```
Testing DR Procedures
Chaos Engineering for DR
```yaml
# Litmus chaos experiment for failover testing
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-failover-test
  namespace: litmus
spec:
  appinfo:
    appns: production
    applabel: "app=myapp"
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete  # Litmus ships this experiment as "pod-delete"
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'
```
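A chaos run is most useful when it produces a number to compare against the RTO target, not just a pass/fail. A sketch that polls a health endpoint during the experiment and reports the longest observed outage (the URL, cadence, and duration are assumptions):

```python
import time

def longest_outage(samples):
    """samples: time-ordered (timestamp_seconds, healthy) pairs.
    Returns the longest continuous unhealthy span, in seconds."""
    longest, start = 0.0, None
    for t, healthy in samples:
        if not healthy and start is None:
            start = t                          # outage begins
        elif healthy and start is not None:
            longest = max(longest, t - start)  # outage ends
            start = None
    return longest

def record_samples(url, duration=120, interval=1.0):
    """Poll url/health for `duration` seconds and collect samples."""
    import requests  # lazy import keeps longest_outage testable offline
    samples, t0 = [], time.monotonic()
    while time.monotonic() - t0 < duration:
        try:
            ok = requests.get(f"{url}/health", timeout=2).status_code == 200
        except requests.RequestException:
            ok = False
        samples.append((time.monotonic() - t0, ok))
        time.sleep(interval)
    return samples
```

Running `longest_outage(record_samples("https://app.example.com"))` alongside the ChaosEngine gives an empirical RTO for the pod-failure scenario.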
DR Test Playbook
```bash
#!/bin/bash
# dr-test.sh - Automated DR testing
set -euo pipefail

RED='\033[0;31m'
GREEN='\033[0;32m'
NC='\033[0m'
log_info() { echo -e "${GREEN}[INFO]${NC} $1"; }
log_error() { echo -e "${RED}[ERROR]${NC} $1"; }

TEST_CONTEXT="us-west-2"       # kubectl context for the test cluster
BACKUP_REGION="eu-central-1"

log_info "Starting DR test..."

# 1. Verify backups exist
log_info "Checking backup availability..."
aws s3 ls s3://backups-bucket/ || { log_error "No backups found"; exit 1; }

# 2. Restore from backup in the backup region
log_info "Restoring database from snapshot..."
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier dr-test-restore \
  --db-snapshot-identifier latest-backup \
  --db-instance-class db.t3.medium \
  --region "$BACKUP_REGION"

# 3. Wait for restoration
log_info "Waiting for restoration..."
aws rds wait db-instance-available \
  --db-instance-identifier dr-test-restore \
  --region "$BACKUP_REGION"

# 4. Deploy test infrastructure
log_info "Deploying test infrastructure..."
kubectl config use-context "$TEST_CONTEXT"
kubectl apply -f k8s/dr-test/

# 5. Run smoke tests
log_info "Running smoke tests..."
curl -f https://dr-test.example.com/health || { log_error "Health check failed"; exit 1; }

# 6. Verify data integrity
log_info "Verifying data integrity..."
# Add data verification scripts here

log_info "DR test completed successfully!"

# Cleanup
log_info "Cleaning up test resources..."
aws rds delete-db-instance \
  --db-instance-identifier dr-test-restore \
  --skip-final-snapshot \
  --region "$BACKUP_REGION"
```
Cost Optimization
Backup Storage Tiers
```python
#!/usr/bin/env python3
"""S3 lifecycle policy for backup cost optimization."""
import boto3

s3 = boto3.client('s3')

def configure_lifecycle_policy(bucket_name):
    """Configure S3 lifecycle rules for backup prefixes.
    Each rule may transition to a given storage class only once."""
    lifecycle_rules = [
        {
            'ID': 'daily-backups-30days',
            'Status': 'Enabled',
            'Filter': {'Prefix': 'backups/daily/'},
            'Transitions': [
                {'Days': 1, 'StorageClass': 'GLACIER'},
                {'Days': 30, 'StorageClass': 'DEEP_ARCHIVE'},
            ],
            'Expiration': {'Days': 365},
        },
        {
            'ID': 'weekly-backups-90days',
            'Status': 'Enabled',
            'Filter': {'Prefix': 'backups/weekly/'},
            'Transitions': [
                {'Days': 7, 'StorageClass': 'GLACIER'},
                {'Days': 90, 'StorageClass': 'DEEP_ARCHIVE'},
            ],
            'Expiration': {'Days': 730},
        },
    ]
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration={'Rules': lifecycle_rules},
    )

if __name__ == '__main__':
    configure_lifecycle_policy('app-backups')
```
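To see what the tiering buys, here is a rough monthly-cost comparison for 1 TB of backups. The per-GB prices are illustrative us-east-1 list prices and an assumption; check current S3 pricing before relying on them:

```python
PRICE_PER_GB_MONTH = {          # assumed USD prices, not authoritative
    "STANDARD": 0.023,
    "GLACIER": 0.0036,          # S3 Glacier Flexible Retrieval
    "DEEP_ARCHIVE": 0.00099,
}

def monthly_cost(gb_by_class):
    """Sum monthly storage cost for a {storage_class: GB} holding."""
    return sum(PRICE_PER_GB_MONTH[cls] * gb for cls, gb in gb_by_class.items())

# 1 TB of backups: everything in STANDARD vs. a lifecycle-managed mix
print(monthly_cost({"STANDARD": 1024}))                     # roughly $23.55
print(monthly_cost({"GLACIER": 256, "DEEP_ARCHIVE": 768}))  # roughly $1.68
```

Retrieval fees and minimum-storage-duration charges for Glacier tiers are not modeled here and matter during an actual restore.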
External Resources
Related Articles
- High Availability & Disaster Recovery
- Kubernetes Cost Optimization
- Spot Instances: Fault-Tolerant Architecture