Introduction
Database failures are inevitable. What matters is how quickly you recover. High availability (HA) strategies minimize downtime through replication, failover, and redundancy. However, implementing HA correctly requires understanding trade-offs between consistency, availability, and partition tolerance (CAP theorem).
This comprehensive guide covers database HA strategies and failover mechanisms.
Core Concepts
High Availability (HA)
System’s ability to continue operating despite failures.
Replication
Copying data across multiple servers.
Failover
Automatic switching to backup system.
Recovery Time Objective (RTO)
Maximum acceptable downtime.
Recovery Point Objective (RPO)
Maximum acceptable data loss.
Synchronous Replication
Data written to all replicas before confirming.
Asynchronous Replication
Data written to primary, replicated to secondaries.
Split Brain
Situation where multiple nodes think they’re primary.
Quorum
Majority of nodes required for decision.
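These concepts interact: quorum is what prevents split brain, because a node may only act as primary while a strict majority of the cluster agrees. A minimal sketch of the majority rule, assuming a fixed, known cluster size:

```python
def has_quorum(votes: int, cluster_size: int) -> bool:
    """True if `votes` is a strict majority of the cluster.

    A node that cannot reach a majority must step down rather than
    keep accepting writes -- that is what prevents split brain.
    """
    return votes > cluster_size // 2

# In a 3-node cluster, 2 votes form a quorum; 1 does not.
print(has_quorum(2, 3))  # True
print(has_quorum(1, 3))  # False
# With an even cluster size, a 2-of-4 tie is NOT a quorum --
# one reason odd cluster sizes are preferred.
print(has_quorum(2, 4))  # False
```

This is also why quorum-based clusters are deployed with an odd number of nodes: a 4-node cluster tolerates no more failures than a 3-node one.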
HA Architectures
Active-Passive (Master-Slave)
┌───────────────────────────────────────┐
│  Primary (Active)                     │
│  - Accepts reads and writes           │
│  - Replicates to secondary            │
└───────────────────────────────────────┘
                  │
                  │ Replication
                  ▼
┌───────────────────────────────────────┐
│  Secondary (Passive)                  │
│  - Accepts reads only                 │
│  - Ready for failover                 │
└───────────────────────────────────────┘
Failover: Manual or automatic promotion of secondary
Active-Active (Multi-Master)
┌───────────────────────────────────────┐
│  Primary 1 (Active)                   │
│  - Accepts reads and writes           │
│  - Replicates to Primary 2            │
└───────────────────────────────────────┘
                  ▲
                  │ Bidirectional Replication
                  ▼
┌───────────────────────────────────────┐
│  Primary 2 (Active)                   │
│  - Accepts reads and writes           │
│  - Replicates to Primary 1            │
└───────────────────────────────────────┘
Failover: Automatic, no promotion needed
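The catch with active-active is write conflicts: both primaries can modify the same row concurrently, so every multi-master system needs a conflict-resolution policy. A hedged sketch of last-write-wins by timestamp, one common policy (the record layout here is illustrative, not any particular database's format; MySQL Group Replication, for instance, uses transaction certification instead):

```python
def resolve_last_write_wins(version_a, version_b):
    """Pick the version with the later timestamp.

    Ties are broken by node id so both primaries converge on the
    same winner regardless of which one runs the comparison.
    """
    key_a = (version_a['ts'], version_a['node'])
    key_b = (version_b['ts'], version_b['node'])
    return version_a if key_a >= key_b else version_b

# The same row was updated on both primaries; the later write wins.
a = {'value': 'alice', 'node': 1, 'ts': 100}
b = {'value': 'bob',   'node': 2, 'ts': 105}
print(resolve_last_write_wins(a, b)['value'])  # bob
```

Note that last-write-wins silently discards the losing update, which is why many deployments prefer active-passive unless their workload is naturally partitioned across primaries.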
PostgreSQL HA Setup
Streaming Replication
# Primary server configuration
# postgresql.conf
wal_level = replica
max_wal_senders = 10
wal_keep_size = 1GB
# pg_hba.conf
host replication replication secondary_ip/32 md5
# Create replication user
createuser -U postgres --replication replication
Standby Setup
# On standby server
pg_basebackup -h primary_ip -D /var/lib/postgresql/data -U replication -v -P -W
# PostgreSQL 12+ no longer uses recovery.conf:
# create an empty standby.signal file instead
# (pg_basebackup -R writes both of these for you)
touch /var/lib/postgresql/data/standby.signal
# postgresql.conf on the standby
primary_conninfo = 'host=primary_ip port=5432 user=replication password=password'
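Once the standby is streaming, lag can be measured by comparing pg_current_wal_lsn() on the primary with pg_last_wal_replay_lsn() on the standby. Both return a pg_lsn printed as two hex halves of a 64-bit WAL position ('X/XXXXXXXX'). A sketch of the arithmetic, assuming you fetch the two LSN strings yourself:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a pg_lsn string like '0/3000060' to a byte offset."""
    hi, lo = lsn.split('/')
    return (int(hi, 16) << 32) + int(lo, 16)

def replication_lag_bytes(primary_lsn: str, standby_lsn: str) -> int:
    """Bytes of WAL the standby still has to replay."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(standby_lsn)

print(replication_lag_bytes('0/3000060', '0/3000000'))  # 96
```

In practice pg_wal_lsn_diff() does this subtraction server-side, and pg_stat_replication on the primary exposes per-standby positions directly.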
Automatic Failover with Patroni
# patroni.yml
scope: postgres-cluster
namespace: /patroni/
name: postgres-1

restapi:
  listen: 0.0.0.0:8008
  connect_address: node1:8008

etcd:
  host: etcd-node:2379

postgresql:
  data_dir: /var/lib/postgresql/data
  parameters:
    wal_level: replica
    max_wal_senders: 10
    wal_keep_size: 1GB

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      parameters:
        wal_level: replica
MySQL HA Setup
Master-Slave Replication
-- On the replica (CHANGE MASTER TO is run on the replica, not the master)
CHANGE MASTER TO
  MASTER_HOST='master_ip',
  MASTER_USER='replication',
  MASTER_PASSWORD='password',
  MASTER_LOG_FILE='mysql-bin.000001',
  MASTER_LOG_POS=154;
START SLAVE;
-- Check replication status on the replica
SHOW SLAVE STATUS\G
Group Replication
-- Install plugin
INSTALL PLUGIN group_replication SONAME 'group_replication.so';
-- Configure
SET GLOBAL group_replication_group_name="aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee";
SET GLOBAL group_replication_start_on_boot=ON;
SET GLOBAL group_replication_local_address="node1:33061";
SET GLOBAL group_replication_group_seeds="node1:33061,node2:33061,node3:33061";
-- Start replication (on the first node only, bootstrap the group first)
SET GLOBAL group_replication_bootstrap_group=ON;
START GROUP_REPLICATION;
SET GLOBAL group_replication_bootstrap_group=OFF;
-- Check status
SELECT * FROM performance_schema.replication_group_members;
Failover Mechanisms
Automatic Failover
import time

import psycopg2

class AutomaticFailover:
    def __init__(self, primary, secondaries, health_check_interval=5):
        self.primary = primary
        self.secondaries = secondaries
        self.health_check_interval = health_check_interval

    def monitor_primary(self):
        """Monitor primary health and fail over when it goes down."""
        while True:
            if not self.is_healthy(self.primary):
                self.trigger_failover()
            time.sleep(self.health_check_interval)

    def is_healthy(self, server):
        """Check that a server accepts connections and answers queries."""
        try:
            conn = psycopg2.connect(
                host=server['host'],
                port=server['port'],
                database='postgres',
                user='postgres',
                password='password',
                connect_timeout=5,
            )
            cursor = conn.cursor()
            cursor.execute('SELECT 1')
            conn.close()
            return True
        except Exception as e:
            print(f"Health check failed for {server['host']}: {e}")
            return False

    def trigger_failover(self):
        """Promote the best secondary and repoint clients at it."""
        print(f"Primary {self.primary['host']} is down, triggering failover")
        best_secondary = self.select_best_secondary()
        self.promote_secondary(best_secondary)
        self.primary = best_secondary
        self.notify_applications(best_secondary)

    def select_best_secondary(self):
        """Select the secondary with the least replication lag."""
        best = None
        min_lag = float('inf')
        for secondary in self.secondaries:
            # get_replication_lag (not shown) would query
            # pg_stat_replication or compare WAL positions
            lag = self.get_replication_lag(secondary)
            if lag < min_lag:
                min_lag = lag
                best = secondary
        return best

    def promote_secondary(self, secondary):
        """Promote a secondary to primary (PostgreSQL 12+)."""
        conn = psycopg2.connect(
            host=secondary['host'],
            port=secondary['port'],
            database='postgres',
            user='postgres',
            password='password',
        )
        cursor = conn.cursor()
        cursor.execute('SELECT pg_promote()')
        conn.close()

    def notify_applications(self, new_primary):
        """Notify applications of the new primary."""
        # update_service_discovery (not shown) updates DNS
        # or the service registry
        update_service_discovery(new_primary)
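The monitor above triggers failover after a single failed check, which invites flapping on a transient network blip. A common refinement is to require several consecutive failures before acting; a sketch, with the actual probe passed in as a callable (the psycopg2 connection test above would be one such probe):

```python
def confirmed_down(probe, failures_required=3):
    """Return True only if `probe` fails `failures_required` times in a row.

    `probe` is any zero-argument callable returning True when healthy.
    A real monitor would also sleep between retries.
    """
    for _ in range(failures_required):
        if probe():
            return False  # a single success resets the verdict
    return True

# A flaky probe that fails twice, then recovers: not confirmed down.
results = iter([False, False, True])
print(confirmed_down(lambda: next(results)))  # False

# A dead probe: confirmed down after 3 straight failures.
print(confirmed_down(lambda: False))  # True
```

Purpose-built tools like Patroni go further and gate promotion behind a quorum in the DCS, so a monitor that is itself partitioned from the network cannot trigger a spurious failover.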
Backup and Recovery
Backup Strategy
import subprocess
from datetime import datetime

class BackupStrategy:
    def __init__(self, backup_dir):
        self.backup_dir = backup_dir

    def full_backup(self, database):
        """Create a full backup in pg_dump custom format."""
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        backup_file = f"{self.backup_dir}/full_{timestamp}.dump"
        subprocess.run([
            'pg_dump',
            '-h', database['host'],
            '-U', database['user'],
            '-d', database['name'],
            '-F', 'c',  # custom format, restorable with pg_restore
            '-f', backup_file,
        ], check=True)
        return backup_file

    def incremental_backup(self, database):
        """Create incremental backup using WAL."""
        # Continuous WAL archiving (archive_command or pg_receivewal)
        # provides the incremental part of the strategy
        pass

    def restore_backup(self, backup_file, target_database):
        """Restore from a full backup."""
        subprocess.run([
            'pg_restore',
            '-h', target_database['host'],
            '-U', target_database['user'],
            '-d', target_database['name'],
            backup_file,
        ], check=True)

    def point_in_time_recovery(self, database, target_time):
        """Recover to a specific point in time."""
        # Restore the last base backup, then replay archived WAL
        # with recovery_target_time = target_time
        pass
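Backups are only useful alongside a retention policy: keeping every full backup forever exhausts disk, while pruning too aggressively raises your effective RPO for older data. A sketch of a keep-last-N policy over the timestamped filenames full_backup() produces:

```python
def backups_to_delete(backup_files, keep_last=7):
    """Return the backups to prune, keeping the `keep_last` most recent.

    Relies on the full_YYYYMMDD_HHMMSS naming above, which sorts
    chronologically as plain strings.
    """
    ordered = sorted(backup_files)
    return ordered[:-keep_last] if keep_last else ordered

files = [f"full_2024010{d}_000000.dump" for d in range(1, 10)]
print(backups_to_delete(files, keep_last=7))  # the two oldest backups
```

Production schemes usually tier this (e.g. keep dailies for a week, weeklies for a few months), but the same sort-and-slice idea applies per tier.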
Best Practices
- Plan RTO/RPO: Define acceptable downtime and data loss
- Test Failover: Regularly test failover procedures
- Monitor Replication: Track replication lag
- Backup Regularly: Maintain recent backups
- Document Procedures: Document failover steps
- Automate Failover: Minimize manual intervention
- Avoid Split Brain: Use quorum-based decisions
- Monitor Lag: Alert on high replication lag
- Capacity Planning: Ensure secondaries can handle load
- Regular Testing: Test recovery procedures
Conclusion
Database high availability is essential for production systems. By implementing replication, automatic failover, and regular backups, you minimize downtime and data loss.
Start with master-slave replication, add automatic failover, and gradually implement more sophisticated HA strategies as needed.
Database HA is the foundation of reliable systems.