Introduction
Disaster recovery is not a question of if, but when. Industry estimates put the average cost of IT downtime at around $300,000 per hour, with some industries seeing costs exceed $1 million per hour. Building automated disaster recovery isn't just good practice; it's essential for business survival.
Key Statistics:
- 60% of companies that lose data close within 6 months
- Average recovery time: 23 hours without automation, 4 hours with automation
- 95% of companies with DR plans survive ransomware attacks
- Multi-region deployments reduce outage risk by 85%
Understanding RTO and RPO
Definitions
| Metric | Definition | Target Examples |
|---|---|---|
| RTO (Recovery Time Objective) | Maximum acceptable time to restore service | 15 minutes, 1 hour, 4 hours |
| RPO (Recovery Point Objective) | Maximum acceptable data loss (time-based) | 0 (synchronous), 5 minutes, 1 hour |
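These targets only matter if real incidents are measured against them. A small sketch (function name and timestamps are illustrative) that checks whether a recovery met its RTO/RPO:

```python
from datetime import datetime, timedelta

def meets_objectives(outage_start, service_restored, last_good_backup,
                     rto: timedelta, rpo: timedelta) -> dict:
    """Compare an incident's actual recovery against RTO/RPO targets."""
    downtime = service_restored - outage_start    # actual time to restore
    data_loss = outage_start - last_good_backup   # worst-case data loss window
    return {
        "downtime": downtime,
        "data_loss": data_loss,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss <= rpo,
    }

# Example: a "Critical" tier incident (RTO 1 hour, RPO 5 minutes)
result = meets_objectives(
    outage_start=datetime(2024, 1, 1, 12, 0),
    service_restored=datetime(2024, 1, 1, 12, 45),
    last_good_backup=datetime(2024, 1, 1, 11, 58),
    rto=timedelta(hours=1),
    rpo=timedelta(minutes=5),
)
print(result["rto_met"], result["rpo_met"])  # 45 min downtime, 2 min data loss
```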
Choosing Targets by Tier
Recovery Objectives by Tier

| Tier | RTO | RPO | Use Case |
|---|---|---|---|
| Mission-Critical | 15 min | 0 | Financial transactions |
| Critical | 1 hour | 5 min | Core business applications |
| Important | 4 hours | 1 hr | Internal tools, CRM |
| Standard | 24 hours | 4 hrs | Development, testing |
Multi-Region Architecture
Active-Active Architecture
```yaml
# Kubernetes multi-region deployment
# One such Service is deployed per region; traffic steering between
# regions is handled by DNS failover (see below).
apiVersion: v1
kind: Service
metadata:
  name: app-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
spec:
  type: LoadBalancer
  selector:
    app: myapp
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  DATABASE_REPLICA_HOST: "db.us-east-1.rds.amazonaws.com"
  DATABASE_REPLICA_HOST_EU: "db.eu-west-1.rds.amazonaws.com"
  REDIS_ENDPOINT: "cluster.cfg.use1.cache.amazonaws.com"
  REDIS_ENDPOINT_EU: "cluster.cfg.euw1.cache.amazonaws.com"
```
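The ConfigMap keeps per-region endpoints under suffixed keys, so the application has to pick the region-local replica at startup. A minimal sketch of that selection, assuming the region arrives via an AWS_REGION-style value and that unknown regions fall back to the primary (unsuffixed) keys:

```python
import os

# Suffix convention mirrors the ConfigMap keys (no suffix = us-east-1)
REGION_SUFFIX = {"us-east-1": "", "eu-west-1": "_EU"}

def endpoint_for(base_key: str, region: str, env=os.environ) -> str:
    """Resolve e.g. DATABASE_REPLICA_HOST for the current region,
    falling back to the primary-region key if no regional key exists."""
    suffix = REGION_SUFFIX.get(region, "")
    return env.get(base_key + suffix) or env[base_key]

# Usage with the ConfigMap values above:
env = {
    "DATABASE_REPLICA_HOST": "db.us-east-1.rds.amazonaws.com",
    "DATABASE_REPLICA_HOST_EU": "db.eu-west-1.rds.amazonaws.com",
}
print(endpoint_for("DATABASE_REPLICA_HOST", "eu-west-1", env))
# db.eu-west-1.rds.amazonaws.com
```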
Global Database Replication
```sql
-- PostgreSQL multi-region setup: logical replication
-- Primary in US East, replica in EU West

-- On the primary (US East)
CREATE PUBLICATION db_publication FOR ALL TABLES;

-- On the replica (EU West)
CREATE SUBSCRIPTION db_subscription
    CONNECTION 'host=primary.us-east-1.rds.amazonaws.com
                port=5432
                dbname=mydb
                user=repl_user
                password=xxx'
    PUBLICATION db_publication;

-- Verify replication status (run on the primary)
SELECT * FROM pg_stat_replication;
```
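Logical replication can fall behind silently, so a periodic lag check closes the loop on the setup above. A minimal sketch, assuming the psycopg2 driver and a monitoring-capable user; the 16 MB threshold is an arbitrary assumption:

```python
LAG_QUERY = """
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
"""

def lagging_replicas(rows, threshold_bytes=16 * 1024 * 1024):
    """Return subscriber names whose replay lag exceeds the threshold (16 MB here)."""
    return [name for name, lag in rows if lag is not None and lag > threshold_bytes]

def check_replication(dsn):
    """Run the lag query on the primary and return lagging subscribers."""
    import psycopg2  # lazy import: the pure helper above needs no driver
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(LAG_QUERY)
        return lagging_replicas(cur.fetchall())
```

Alerting on the returned list (rather than failing over blindly) keeps humans in the loop for slow-but-healthy replicas.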
DNS Failover
Route 53 health check and failover record sets (primary and secondary):

```json
{
  "Name": "app.example.com",
  "Type": "A",
  "SetIdentifier": "primary",
  "HealthCheckId": "abc123",
  "Failover": "PRIMARY",
  "AliasTarget": {
    "HostedZoneId": "Z2FDTNDATAQYW2",
    "DNSName": "dualstack.app-primary-123456789.us-east-1.elb.amazonaws.com",
    "EvaluateTargetHealth": true
  }
}
{
  "Name": "app.example.com",
  "Type": "A",
  "SetIdentifier": "secondary",
  "Failover": "SECONDARY",
  "AliasTarget": {
    "HostedZoneId": "Z2FDTNDATAQYW2",
    "DNSName": "dualstack.app-secondary-987654321.eu-west-1.elb.amazonaws.com",
    "EvaluateTargetHealth": true
  }
}
```
```python
#!/usr/bin/env python3
"""DNS health check and failover automation."""
import boto3
import requests
from datetime import datetime

ROUTE53 = boto3.client('route53')
HEALTH_CHECK_TAG = 'auto-failover-enabled'

def check_application_health(url):
    """Check if the application responds with HTTP 200 on /health."""
    try:
        response = requests.get(f"{url}/health", timeout=5)
        return response.status_code == 200
    except requests.RequestException:
        return False

def update_dns_record(hosted_zone_id, record_name, health_check_id, failover_type):
    """Upsert a failover record. Failover records require a SetIdentifier,
    and the PRIMARY record must carry the HealthCheckId."""
    record = {
        'Name': record_name,
        'Type': 'A',
        'SetIdentifier': failover_type.lower(),
        'Failover': failover_type,
        'TTL': 30,
        'ResourceRecords': [{'Value': '1.2.3.4'}],
    }
    if failover_type == 'PRIMARY':
        record['HealthCheckId'] = health_check_id
    return ROUTE53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={'Changes': [{'Action': 'UPSERT', 'ResourceRecordSet': record}]},
    )

def main():
    primary_healthy = check_application_health('https://app-primary.example.com')
    secondary_healthy = check_application_health('https://app-secondary.example.com')
    print(f"[{datetime.now()}] Primary: {primary_healthy}, Secondary: {secondary_healthy}")
    if not primary_healthy and secondary_healthy:
        print("Promoting secondary region...")
        # Trigger failover logic

if __name__ == '__main__':
    main()
```
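The PRIMARY record above references a HealthCheckId; that health check has to be provisioned somewhere. A sketch of creating it with boto3 (the domain, path, and thresholds are assumptions, not values from the records above):

```python
import uuid

def health_check_config(fqdn, path="/health", interval=30, failures=3):
    """Build a Route 53 HealthCheckConfig for an HTTPS endpoint."""
    return {
        "Type": "HTTPS",
        "FullyQualifiedDomainName": fqdn,
        "ResourcePath": path,
        "Port": 443,
        "RequestInterval": interval,   # seconds between checks (30 or 10)
        "FailureThreshold": failures,  # consecutive failures before unhealthy
    }

def create_health_check(fqdn):
    """Create the health check and return its Id for use in record sets."""
    import boto3  # lazy import keeps the config builder testable offline
    client = boto3.client("route53")
    resp = client.create_health_check(
        CallerReference=str(uuid.uuid4()),  # idempotency token
        HealthCheckConfig=health_check_config(fqdn),
    )
    return resp["HealthCheck"]["Id"]
```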
Automated Backup Strategies
Database Backups
```yaml
# PostgreSQL automated backup
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
spec:
  schedule: "0 */4 * * *"  # Every 4 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: postgres:15-alpine
              env:
                - name: PGHOST
                  value: "postgres.default.svc.cluster.local"
                - name: PGDATABASE
                  value: "mydb"
                - name: PGUSER
                  value: "backup_user"
                - name: PGPASSWORD  # keep credentials in a Secret, not the manifest
                  valueFrom:
                    secretKeyRef:
                      name: postgres-backup-credentials
                      key: password
                - name: AWS_REGION
                  value: "us-east-1"
              command:
                - /bin/sh
                - -c
                - |
                  apk add --no-cache aws-cli  # postgres:15-alpine does not ship the AWS CLI
                  DATE=$(date +%Y%m%d_%H%M%S)
                  pg_dump -U "$PGUSER" "$PGDATABASE" | gzip | \
                    aws s3 cp - "s3://backups-bucket/postgres/${PGDATABASE}_${DATE}.sql.gz"
              volumeMounts:
                - name: aws-credentials
                  mountPath: /root/.aws
          volumes:
            - name: aws-credentials
              secret:
                secretName: aws-backup-credentials
          restartPolicy: OnFailure
```
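A backup job that silently stops running is a classic DR failure mode, so verifying freshness matters as much as taking the dump. A sketch that checks the newest object under the CronJob's postgres/ prefix; the 5-hour threshold (one schedule interval plus slack) is an assumption:

```python
from datetime import datetime, timedelta, timezone

def newest_backup(objects):
    """Pick the most recently modified object from an S3 listing."""
    return max(objects, key=lambda o: o["LastModified"], default=None)

def backup_is_fresh(objects, max_age=timedelta(hours=5), now=None):
    """True if the newest backup object is younger than max_age."""
    latest = newest_backup(objects)
    if latest is None:
        return False  # no backups at all: definitely stale
    now = now or datetime.now(timezone.utc)
    return now - latest["LastModified"] <= max_age

def check_bucket():
    """Check the bucket used by the CronJob above."""
    import boto3  # lazy import keeps the helpers testable offline
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket="backups-bucket", Prefix="postgres/")
    return backup_is_fresh(resp.get("Contents", []))
```

Wiring the check into the same alerting path as the health checks turns a missed backup into a page instead of a surprise during restore.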
Kubernetes Persistent Volume Snapshots
```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-snapshot-daily
spec:
  volumeSnapshotClassName: aws-ebs-snapshot-class
  source:
    persistentVolumeClaimName: data-pvc
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: snapshot-backup
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: snapshot-creator
          containers:
            - name: create-snapshot
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - |
                  kubectl apply -f - <<EOF
                  apiVersion: snapshot.storage.k8s.io/v1
                  kind: VolumeSnapshot
                  metadata:
                    name: snapshot-$(date +%Y%m%d)
                  spec:
                    volumeSnapshotClassName: aws-ebs-snapshot-class
                    source:
                      persistentVolumeClaimName: data-pvc
                  EOF
          restartPolicy: OnFailure
```
Application-Level Backups
```python
#!/usr/bin/env python3
"""Application state backup automation."""
import boto3
from datetime import datetime, timedelta, timezone

S3 = boto3.client('s3')
DYNAMODB = boto3.resource('dynamodb')
BACKUP_BUCKET = 'app-backups'
RETENTION_DAYS = 30

def backup_dynamodb_table(table_name, backup_name):
    """Create an on-demand DynamoDB backup and return its ARN."""
    response = DYNAMODB.meta.client.create_backup(
        TableName=table_name,
        BackupName=f"{backup_name}-{datetime.now().strftime('%Y%m%d%H%M%S')}",
    )
    return response['BackupDetails']['BackupArn']

def backup_s3_bucket(source_bucket, prefix=''):
    """Copy an S3 bucket's objects into the dated backup location."""
    paginator = S3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=source_bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            copy_source = {'Bucket': source_bucket, 'Key': obj['Key']}
            dest_key = f"backups/{datetime.now().strftime('%Y%m%d')}/{obj['Key']}"
            S3.copy_object(CopySource=copy_source, Bucket=BACKUP_BUCKET, Key=dest_key)

def cleanup_old_backups():
    """Remove backups older than the retention period."""
    cutoff_date = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)
    paginator = S3.get_paginator('list_objects_v2')  # paginate past 1000 objects
    for page in paginator.paginate(Bucket=BACKUP_BUCKET, Prefix='backups/'):
        for obj in page.get('Contents', []):
            if obj['LastModified'] < cutoff_date:
                S3.delete_object(Bucket=BACKUP_BUCKET, Key=obj['Key'])
                print(f"Deleted: {obj['Key']}")

if __name__ == '__main__':
    backup_dynamodb_table('app-state', 'daily-backup')
    backup_s3_bucket('app-data')
    cleanup_old_backups()
```
Automated Failover
Database Failover
```python
#!/usr/bin/env python3
"""Database failover automation."""
import json
import boto3
from datetime import datetime

RDS = boto3.client('rds')
PRIMARY_REGION = 'us-east-1'
SECONDARY_REGION = 'eu-west-1'

def trigger_rds_failover(db_cluster_id):
    """Fail an Aurora cluster over to its replica instance."""
    print(f"[{datetime.now()}] Initiating failover for {db_cluster_id}")
    RDS.failover_db_cluster(
        DBClusterIdentifier=db_cluster_id,
        TargetDBInstanceIdentifier=f"{db_cluster_id}-replica",
    )
    # Wait for the cluster to become available again
    waiter = RDS.get_waiter('db_cluster_available')
    waiter.wait(DBClusterIdentifier=db_cluster_id)
    print(f"[{datetime.now()}] Failover completed")

def update_connection_string(new_host):
    """Point the application at the new primary."""
    ssm = boto3.client('ssm')
    ssm.put_parameter(
        Name='/app/database/host',
        Value=new_host,
        Type='SecureString',
        Overwrite=True,
    )
    # Trigger secret rotation if using Secrets Manager
    secrets = boto3.client('secretsmanager')
    secrets.rotate_secret(SecretId='app/database-credentials')

def notify_team(failover_type, status):
    """Send notifications to the on-call team."""
    sns = boto3.client('sns')
    message = {
        'failover_type': failover_type,
        'status': status,
        'timestamp': datetime.now().isoformat(),
    }
    sns.publish(
        TopicArn='arn:aws:sns:us-east-1:123456789:oncall-alerts',
        Subject=f'Database Failover: {status}',
        Message=json.dumps(message),
    )
```
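Promoting a replica that is far behind the primary converts an availability incident into data loss, so it is worth checking lag before triggering the failover above. A sketch using the RDS ReplicaLag CloudWatch metric; the 30-second ceiling and the region are assumptions:

```python
from datetime import datetime, timedelta, timezone

def safe_to_promote(datapoints, max_lag_seconds=30.0):
    """True if the most recent ReplicaLag datapoint is under the ceiling."""
    if not datapoints:
        return False  # no metrics: assume unsafe
    latest = max(datapoints, key=lambda d: d["Timestamp"])
    return latest["Average"] <= max_lag_seconds

def replica_lag_ok(replica_id):
    """Fetch the last 10 minutes of ReplicaLag and evaluate them."""
    import boto3  # lazy import keeps safe_to_promote testable offline
    cw = boto3.client("cloudwatch", region_name="eu-west-1")
    now = datetime.now(timezone.utc)
    resp = cw.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="ReplicaLag",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": replica_id}],
        StartTime=now - timedelta(minutes=10),
        EndTime=now,
        Period=60,
        Statistics=["Average"],
    )
    return safe_to_promote(resp["Datapoints"])
```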
Kubernetes Failover
```yaml
# External DNS with health checks
apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
  name: app-endpoint
spec:
  endpoints:
    - dnsName: app.example.com
      recordTTL: 60
      recordType: A
      targets:
        - 1.2.3.4
      providerSpecific:
        - name: aws/weight
          value: "100"
---
# Ingress with failover
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "10"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "120"
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-service
                port:
                  number: 80
```
Testing DR Procedures
Chaos Engineering for DR
```yaml
# Litmus chaos experiment for failover testing
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-failover-test
  namespace: litmus
spec:
  appinfo:
    appns: production
    applabel: "app=myapp"
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete  # Litmus ships this experiment as "pod-delete"
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'
```
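A chaos run is most useful when it produces a number to compare against the RTO target, not just a pass/fail. A sketch that polls a health endpoint during the experiment and reports the longest observed outage (the URL, cadence, and duration are assumptions):

```python
import time

def longest_outage(samples):
    """samples: time-ordered (timestamp_seconds, healthy) pairs.
    Returns the longest continuous unhealthy span, in seconds."""
    longest, start = 0.0, None
    for t, healthy in samples:
        if not healthy and start is None:
            start = t                          # outage begins
        elif healthy and start is not None:
            longest = max(longest, t - start)  # outage ends
            start = None
    return longest

def record_samples(url, duration=120, interval=1.0):
    """Poll url/health for `duration` seconds and collect samples."""
    import requests  # lazy import keeps longest_outage testable offline
    samples, t0 = [], time.monotonic()
    while time.monotonic() - t0 < duration:
        try:
            ok = requests.get(f"{url}/health", timeout=2).status_code == 200
        except requests.RequestException:
            ok = False
        samples.append((time.monotonic() - t0, ok))
        time.sleep(interval)
    return samples
```

Running `longest_outage(record_samples("https://app.example.com"))` alongside the ChaosEngine gives an empirical RTO for the pod-failure scenario.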
DR Test Playbook
```bash
#!/bin/bash
# dr-test.sh - Automated DR testing
set -euo pipefail

RED='\033[0;31m'
GREEN='\033[0;32m'
NC='\033[0m'
log_info() { echo -e "${GREEN}[INFO]${NC} $1"; }
log_error() { echo -e "${RED}[ERROR]${NC} $1"; }

TEST_CONTEXT="us-west-2"       # kubectl context for the test cluster
BACKUP_REGION="eu-central-1"

log_info "Starting DR test..."

# 1. Verify backups exist
log_info "Checking backup availability..."
aws s3 ls s3://backups-bucket/ || { log_error "No backups found"; exit 1; }

# 2. Restore from backup in the backup region
log_info "Restoring database from snapshot..."
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier dr-test-restore \
  --db-snapshot-identifier latest-backup \
  --db-instance-class db.t3.medium \
  --region "$BACKUP_REGION"

# 3. Wait for restoration
log_info "Waiting for restoration..."
aws rds wait db-instance-available \
  --db-instance-identifier dr-test-restore \
  --region "$BACKUP_REGION"

# 4. Deploy test infrastructure
log_info "Deploying test infrastructure..."
kubectl config use-context "$TEST_CONTEXT"
kubectl apply -f k8s/dr-test/

# 5. Run smoke tests
log_info "Running smoke tests..."
curl -f https://dr-test.example.com/health || { log_error "Health check failed"; exit 1; }

# 6. Verify data integrity
log_info "Verifying data integrity..."
# Add data verification scripts here

log_info "DR test completed successfully!"

# Cleanup
log_info "Cleaning up test resources..."
aws rds delete-db-instance \
  --db-instance-identifier dr-test-restore \
  --skip-final-snapshot \
  --region "$BACKUP_REGION"
```
Cost Optimization
Backup Storage Tiers
```python
#!/usr/bin/env python3
"""S3 lifecycle policy for backup cost optimization."""
import boto3

s3 = boto3.client('s3')

def configure_lifecycle_policy(bucket_name):
    """Configure S3 lifecycle rules for backup prefixes.
    Each rule may transition to a given storage class only once."""
    lifecycle_rules = [
        {
            'ID': 'daily-backups-30days',
            'Status': 'Enabled',
            'Filter': {'Prefix': 'backups/daily/'},
            'Transitions': [
                {'Days': 1, 'StorageClass': 'GLACIER'},
                {'Days': 30, 'StorageClass': 'DEEP_ARCHIVE'},
            ],
            'Expiration': {'Days': 365},
        },
        {
            'ID': 'weekly-backups-90days',
            'Status': 'Enabled',
            'Filter': {'Prefix': 'backups/weekly/'},
            'Transitions': [
                {'Days': 7, 'StorageClass': 'GLACIER'},
                {'Days': 90, 'StorageClass': 'DEEP_ARCHIVE'},
            ],
            'Expiration': {'Days': 730},
        },
    ]
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration={'Rules': lifecycle_rules},
    )

if __name__ == '__main__':
    configure_lifecycle_policy('app-backups')
```
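To see what the tiering buys, here is a rough monthly-cost comparison for 1 TB of backups. The per-GB prices are illustrative us-east-1 list prices and an assumption; check current S3 pricing before relying on them:

```python
PRICE_PER_GB_MONTH = {          # assumed USD prices, not authoritative
    "STANDARD": 0.023,
    "GLACIER": 0.0036,          # S3 Glacier Flexible Retrieval
    "DEEP_ARCHIVE": 0.00099,
}

def monthly_cost(gb_by_class):
    """Sum monthly storage cost for a {storage_class: GB} holding."""
    return sum(PRICE_PER_GB_MONTH[cls] * gb for cls, gb in gb_by_class.items())

# 1 TB of backups: everything in STANDARD vs. a lifecycle-managed mix
print(monthly_cost({"STANDARD": 1024}))                     # roughly $23.55
print(monthly_cost({"GLACIER": 256, "DEEP_ARCHIVE": 768}))  # roughly $1.68
```

Retrieval fees and minimum-storage-duration charges for Glacier tiers are not modeled here and matter during an actual restore.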
External Resources
Related Articles
- High Availability & Disaster Recovery
- Kubernetes Cost Optimization
- Spot Instances: Fault-Tolerant Architecture