Introduction
Database failures are inevitable. What matters is how quickly you recover. High availability (HA) strategies minimize downtime through replication, failover, and redundancy. However, implementing HA correctly requires understanding trade-offs between consistency, availability, and partition tolerance (CAP theorem).
This comprehensive guide covers database HA strategies and failover mechanisms.
Core Concepts
High Availability (HA)
System’s ability to continue operating despite failures.
Replication
Copying data across multiple servers.
Failover
Automatic switching to backup system.
Recovery Time Objective (RTO)
Maximum acceptable downtime.
Recovery Point Objective (RPO)
Maximum acceptable data loss.
Synchronous Replication
Data written to all replicas before confirming.
Asynchronous Replication
Data written to primary, replicated to secondaries.
Split Brain
Situation where multiple nodes think they’re primary.
Quorum
Majority of nodes required for decision.
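These concepts interact: quorum is what prevents split brain, because a node may only act as primary while a strict majority of the cluster agrees. A minimal sketch of the majority rule, assuming a fixed, known cluster size:

```python
def has_quorum(votes: int, cluster_size: int) -> bool:
    """True if `votes` is a strict majority of the cluster.

    A node that cannot reach a majority must step down rather than
    keep accepting writes -- that is what prevents split brain.
    """
    return votes > cluster_size // 2

# In a 3-node cluster, 2 votes form a quorum; 1 does not.
print(has_quorum(2, 3))  # True
print(has_quorum(1, 3))  # False
# With an even cluster size, a 2-of-4 tie is NOT a quorum --
# one reason odd cluster sizes are preferred.
print(has_quorum(2, 4))  # False
```

This is also why quorum-based clusters are deployed with an odd number of nodes: a 4-node cluster tolerates no more failures than a 3-node one.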
HA Architectures
Active-Passive (Master-Slave)
┌───────────────────────────────────────┐
│  Primary (Active)                     │
│  - Accepts reads and writes           │
│  - Replicates to secondary            │
└───────────────────────────────────────┘
                  │
                  │ Replication
                  ▼
┌───────────────────────────────────────┐
│  Secondary (Passive)                  │
│  - Accepts reads only                 │
│  - Ready for failover                 │
└───────────────────────────────────────┘
Failover: Manual or automatic promotion of secondary
Active-Active (Multi-Master)
┌───────────────────────────────────────┐
│  Primary 1 (Active)                   │
│  - Accepts reads and writes           │
│  - Replicates to Primary 2            │
└───────────────────────────────────────┘
                  ▲
                  │ Bidirectional Replication
                  ▼
┌───────────────────────────────────────┐
│  Primary 2 (Active)                   │
│  - Accepts reads and writes           │
│  - Replicates to Primary 1            │
└───────────────────────────────────────┘
Failover: Automatic, no promotion needed
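The catch with active-active is write conflicts: both primaries can modify the same row concurrently, so every multi-master system needs a conflict-resolution policy. A hedged sketch of last-write-wins by timestamp, one common policy (the record layout here is illustrative, not any particular database's format; MySQL Group Replication, for instance, uses transaction certification instead):

```python
def resolve_last_write_wins(version_a, version_b):
    """Pick the version with the later timestamp.

    Ties are broken by node id so both primaries converge on the
    same winner regardless of which one runs the comparison.
    """
    key_a = (version_a['ts'], version_a['node'])
    key_b = (version_b['ts'], version_b['node'])
    return version_a if key_a >= key_b else version_b

# The same row was updated on both primaries; the later write wins.
a = {'value': 'alice', 'node': 1, 'ts': 100}
b = {'value': 'bob',   'node': 2, 'ts': 105}
print(resolve_last_write_wins(a, b)['value'])  # bob
```

Note that last-write-wins silently discards the losing update, which is why many deployments prefer active-passive unless their workload is naturally partitioned across primaries.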
PostgreSQL HA Setup
Streaming Replication
# Primary server configuration
# postgresql.conf
wal_level = replica
max_wal_senders = 10
wal_keep_size = 1GB
# pg_hba.conf
host replication replication secondary_ip/32 md5
# Create replication user
createuser -U postgres --replication replication
Standby Setup
# On standby server
pg_basebackup -h primary_ip -D /var/lib/postgresql/data -U replication -v -P -W
# PostgreSQL 12+ no longer uses recovery.conf:
# create an empty standby.signal file instead
# (pg_basebackup -R writes both of these for you)
touch /var/lib/postgresql/data/standby.signal
# postgresql.conf on the standby
primary_conninfo = 'host=primary_ip port=5432 user=replication password=password'
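Once the standby is streaming, lag can be measured by comparing pg_current_wal_lsn() on the primary with pg_last_wal_replay_lsn() on the standby. Both return a pg_lsn printed as two hex halves of a 64-bit WAL position ('X/XXXXXXXX'). A sketch of the arithmetic, assuming you fetch the two LSN strings yourself:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a pg_lsn string like '0/3000060' to a byte offset."""
    hi, lo = lsn.split('/')
    return (int(hi, 16) << 32) + int(lo, 16)

def replication_lag_bytes(primary_lsn: str, standby_lsn: str) -> int:
    """Bytes of WAL the standby still has to replay."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(standby_lsn)

print(replication_lag_bytes('0/3000060', '0/3000000'))  # 96
```

In practice pg_wal_lsn_diff() does this subtraction server-side, and pg_stat_replication on the primary exposes per-standby positions directly.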
Automatic Failover with Patroni
# patroni.yml
scope: postgres-cluster
namespace: /patroni/
name: postgres-1

restapi:
  listen: 0.0.0.0:8008
  connect_address: node1:8008

etcd:
  host: etcd-node:2379

postgresql:
  data_dir: /var/lib/postgresql/data
  parameters:
    wal_level: replica
    max_wal_senders: 10
    wal_keep_size: 1GB

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      parameters:
        wal_level: replica
MySQL HA Setup
Master-Slave Replication
-- On the replica (CHANGE MASTER TO is run on the replica, not the master)
CHANGE MASTER TO
  MASTER_HOST='master_ip',
  MASTER_USER='replication',
  MASTER_PASSWORD='password',
  MASTER_LOG_FILE='mysql-bin.000001',
  MASTER_LOG_POS=154;
START SLAVE;
-- Check replication status on the replica
SHOW SLAVE STATUS\G
Group Replication
-- Install plugin
INSTALL PLUGIN group_replication SONAME 'group_replication.so';
-- Configure
SET GLOBAL group_replication_group_name="aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee";
SET GLOBAL group_replication_start_on_boot=ON;
SET GLOBAL group_replication_local_address="node1:33061";
SET GLOBAL group_replication_group_seeds="node1:33061,node2:33061,node3:33061";
-- Start replication (on the first node only, bootstrap the group first)
SET GLOBAL group_replication_bootstrap_group=ON;
START GROUP_REPLICATION;
SET GLOBAL group_replication_bootstrap_group=OFF;
-- Check status
SELECT * FROM performance_schema.replication_group_members;
Failover Mechanisms
Automatic Failover
import time

import psycopg2

class AutomaticFailover:
    def __init__(self, primary, secondaries, health_check_interval=5):
        self.primary = primary
        self.secondaries = secondaries
        self.health_check_interval = health_check_interval

    def monitor_primary(self):
        """Monitor primary health and fail over when it goes down."""
        while True:
            if not self.is_healthy(self.primary):
                self.trigger_failover()
            time.sleep(self.health_check_interval)

    def is_healthy(self, server):
        """Check that a server accepts connections and answers queries."""
        try:
            conn = psycopg2.connect(
                host=server['host'],
                port=server['port'],
                database='postgres',
                user='postgres',
                password='password',
                connect_timeout=5,
            )
            cursor = conn.cursor()
            cursor.execute('SELECT 1')
            conn.close()
            return True
        except Exception as e:
            print(f"Health check failed for {server['host']}: {e}")
            return False

    def trigger_failover(self):
        """Promote the best secondary and repoint clients at it."""
        print(f"Primary {self.primary['host']} is down, triggering failover")
        best_secondary = self.select_best_secondary()
        self.promote_secondary(best_secondary)
        self.primary = best_secondary
        self.notify_applications(best_secondary)

    def select_best_secondary(self):
        """Select the secondary with the least replication lag."""
        best = None
        min_lag = float('inf')
        for secondary in self.secondaries:
            # get_replication_lag (not shown) would query
            # pg_stat_replication or compare WAL positions
            lag = self.get_replication_lag(secondary)
            if lag < min_lag:
                min_lag = lag
                best = secondary
        return best

    def promote_secondary(self, secondary):
        """Promote a secondary to primary (PostgreSQL 12+)."""
        conn = psycopg2.connect(
            host=secondary['host'],
            port=secondary['port'],
            database='postgres',
            user='postgres',
            password='password',
        )
        cursor = conn.cursor()
        cursor.execute('SELECT pg_promote()')
        conn.close()

    def notify_applications(self, new_primary):
        """Notify applications of the new primary."""
        # update_service_discovery (not shown) updates DNS
        # or the service registry
        update_service_discovery(new_primary)
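The monitor above triggers failover after a single failed check, which invites flapping on a transient network blip. A common refinement is to require several consecutive failures before acting; a sketch, with the actual probe passed in as a callable (the psycopg2 connection test above would be one such probe):

```python
def confirmed_down(probe, failures_required=3):
    """Return True only if `probe` fails `failures_required` times in a row.

    `probe` is any zero-argument callable returning True when healthy.
    A real monitor would also sleep between retries.
    """
    for _ in range(failures_required):
        if probe():
            return False  # a single success resets the verdict
    return True

# A flaky probe that fails twice, then recovers: not confirmed down.
results = iter([False, False, True])
print(confirmed_down(lambda: next(results)))  # False

# A dead probe: confirmed down after 3 straight failures.
print(confirmed_down(lambda: False))  # True
```

Purpose-built tools like Patroni go further and gate promotion behind a quorum in the DCS, so a monitor that is itself partitioned from the network cannot trigger a spurious failover.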
Backup and Recovery
Backup Strategy
import subprocess
from datetime import datetime

class BackupStrategy:
    def __init__(self, backup_dir):
        self.backup_dir = backup_dir

    def full_backup(self, database):
        """Create a full backup in pg_dump custom format."""
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        backup_file = f"{self.backup_dir}/full_{timestamp}.dump"
        subprocess.run([
            'pg_dump',
            '-h', database['host'],
            '-U', database['user'],
            '-d', database['name'],
            '-F', 'c',  # custom format, restorable with pg_restore
            '-f', backup_file,
        ], check=True)
        return backup_file

    def incremental_backup(self, database):
        """Create incremental backup using WAL."""
        # Continuous WAL archiving (archive_command or pg_receivewal)
        # provides the incremental part of the strategy
        pass

    def restore_backup(self, backup_file, target_database):
        """Restore from a full backup."""
        subprocess.run([
            'pg_restore',
            '-h', target_database['host'],
            '-U', target_database['user'],
            '-d', target_database['name'],
            backup_file,
        ], check=True)

    def point_in_time_recovery(self, database, target_time):
        """Recover to a specific point in time."""
        # Restore the last base backup, then replay archived WAL
        # with recovery_target_time = target_time
        pass
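Backups are only useful alongside a retention policy: keeping every full backup forever exhausts disk, while pruning too aggressively raises your effective RPO for older data. A sketch of a keep-last-N policy over the timestamped filenames full_backup() produces:

```python
def backups_to_delete(backup_files, keep_last=7):
    """Return the backups to prune, keeping the `keep_last` most recent.

    Relies on the full_YYYYMMDD_HHMMSS naming above, which sorts
    chronologically as plain strings.
    """
    ordered = sorted(backup_files)
    return ordered[:-keep_last] if keep_last else ordered

files = [f"full_2024010{d}_000000.dump" for d in range(1, 10)]
print(backups_to_delete(files, keep_last=7))  # the two oldest backups
```

Production schemes usually tier this (e.g. keep dailies for a week, weeklies for a few months), but the same sort-and-slice idea applies per tier.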
Best Practices
- Plan RTO/RPO: Define acceptable downtime and data loss
- Test Failover: Regularly test failover procedures
- Monitor Replication: Track replication lag
- Backup Regularly: Maintain recent backups
- Document Procedures: Document failover steps
- Automate Failover: Minimize manual intervention
- Avoid Split Brain: Use quorum-based decisions
- Monitor Lag: Alert on high replication lag
- Capacity Planning: Ensure secondaries can handle load
- Regular Testing: Test recovery procedures
Conclusion
Database high availability is essential for production systems. By implementing replication, automatic failover, and regular backups, you minimize downtime and data loss.
Start with master-slave replication, add automatic failover, and gradually implement more sophisticated HA strategies as needed.
Database HA is the foundation of reliable systems.