Databases are the backbone of most applications, yet they’re often the hardest part to automate and operate. Database DevOps bridges the gap between rapid development and stable operations - enabling teams to ship features faster while maintaining data integrity and system reliability.
The Database Challenge
Application development has embraced CI/CD, microservices, and agile methodologies, yet databases remain the bottleneck:
- Schema changes require careful planning
- Data migration is risky and complex
- Downtime for migrations affects users
- Rollbacks are difficult and dangerous
- Performance tuning requires expertise
Database DevOps addresses these challenges through automation, testing, and operational excellence.
Schema Migration Automation
Version Control for Database Schema
Treat database schema like application code:
# Directory structure for migrations
migrations/
├── 001_create_users.sql
├── 002_add_orders_table.sql
├── 003_add_user_email_index.sql
└── down/
    ├── 003_add_user_email_index.sql
    ├── 002_add_orders_table.sql
    └── 001_create_users.sql
Migration Tools
Flyway (Java-based, works everywhere):
<!-- pom.xml configuration -->
<configuration>
  <url>jdbc:postgresql://localhost:5432/mydb</url>
  <user>dbuser</user>
  <password>secret</password>
  <locations>
    <location>filesystem:src/main/resources/db/migration</location>
  </locations>
</configuration>
-- V1__Create_users_table.sql
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255) NOT NULL UNIQUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- V2__Add_orders.sql
CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    total DECIMAL(10,2) NOT NULL,
    status VARCHAR(50) DEFAULT 'pending'
);
Liquibase (XML/YAML-based, rollback support):
# databasechangelog.yaml
databaseChangeLog:
  - changeSet:
      id: create-users-table
      author: devops
      changes:
        - createTable:
            tableName: users
            columns:
              - column:
                  name: id
                  type: INT
                  autoIncrement: true
                  constraints:
                    primaryKey: true
              - column:
                  name: email
                  type: VARCHAR(255)
                  constraints:
                    unique: true
                    nullable: false
Sqitch (Perl-based, database-agnostic):
# Initialize sqitch
sqitch init mydb --engine pg
# Add a migration
sqitch add v1 -n "Create users table"
# Deploy
sqitch deploy db:pg://user:pass@localhost/mydb
# Revert
sqitch revert db:pg://user:pass@localhost/mydb
Zero-Downtime Migrations
Traditional migrations require a maintenance window. Zero-downtime migrations use patterns like:
-- Adding a NOT NULL column with a default
-- (a metadata-only change on PostgreSQL 11+, so it's fast)
ALTER TABLE users
ADD COLUMN status VARCHAR(20) DEFAULT 'active' NOT NULL;
-- Adding a column with no default (requires multi-step)
-- Step 1: Add the column as nullable
ALTER TABLE users ADD COLUMN nickname VARCHAR(100);
-- Step 2: Backfill existing rows (batch this on large tables)
UPDATE users SET nickname = SPLIT_PART(email, '@', 1);
-- Step 3: Enforce NOT NULL once the backfill is complete
ALTER TABLE users
ALTER COLUMN nickname SET NOT NULL;
-- Renaming safely (create new, sync, drop old)
-- Step 1: Create new column
ALTER TABLE orders ADD COLUMN total_amount DECIMAL(10,2);
-- Step 2: Create trigger to keep in sync
CREATE OR REPLACE FUNCTION sync_order_amount()
RETURNS TRIGGER AS $$
BEGIN
NEW.total_amount = NEW.total;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER sync_order_amount
BEFORE INSERT OR UPDATE ON orders
FOR EACH ROW EXECUTE FUNCTION sync_order_amount();
-- Step 3: Backfill rows that existed before the trigger
UPDATE orders SET total_amount = total WHERE total_amount IS NULL;
-- Step 4: After verification, drop old column
ALTER TABLE orders DROP COLUMN total;
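A single-statement backfill like the one in step 2 above can hold locks and generate heavy WAL traffic on large tables. A minimal batched version, sketched in Python (assuming psycopg2 and an integer `id` primary key; `backfill_nicknames` is an illustrative helper, not part of any migration tool):

```python
def batch_ranges(min_id, max_id, size):
    """Yield inclusive (lo, hi) id ranges covering [min_id, max_id]."""
    lo = min_id
    while lo <= max_id:
        hi = min(lo + size - 1, max_id)
        yield lo, hi
        lo = hi + 1

def backfill_nicknames(dsn, batch_size=10_000):
    # psycopg2 is imported lazily so the pure helper above stays usable
    # without database dependencies installed.
    import psycopg2
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT MIN(id), MAX(id) FROM users")
            min_id, max_id = cur.fetchone()
        if min_id is None:
            return  # empty table, nothing to backfill
        for lo, hi in batch_ranges(min_id, max_id, batch_size):
            with conn.cursor() as cur:
                # Each batch commits separately, so locks stay short-lived
                # and replication lag remains bounded.
                cur.execute(
                    "UPDATE users SET nickname = SPLIT_PART(email, '@', 1) "
                    "WHERE id BETWEEN %s AND %s AND nickname IS NULL",
                    (lo, hi),
                )
            conn.commit()
    finally:
        conn.close()
```

The `nickname IS NULL` predicate makes the backfill safe to re-run if it is interrupted partway through.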
Database CI/CD Pipeline
Testing Database Changes
# GitHub Actions - Database CI/CD
name: Database CI
on:
  push:
    paths:
      - 'migrations/**'
jobs:
  test-migrations:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Start PostgreSQL
        uses: supercharge/postgres-action@v1
        with:
          postgres-version: '15'
      - name: Run migrations
        run: |
          for migration in migrations/*.sql; do
            psql -U postgres -d testdb -f "$migration" || exit 1
          done
      - name: Smoke test
        run: |
          psql -U postgres -d testdb -c "SELECT 1" || exit 1
          psql -U postgres -d testdb -c "SELECT COUNT(*) FROM users" || exit 1
      - name: Lint SQL
        run: |
          docker run --rm -v "$PWD":/sql -w /sql ghcr.io/ericlagergren/validatesql \
            -schema postgres migrations/*.sql
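Filename ordering is all that sequences plain-SQL migrations, so a duplicated or skipped version prefix fails silently. A small pre-flight check (a hypothetical helper, not part of the workflow above) can catch both problems before psql runs anything:

```python
import re

# Matches versioned filenames like 001_create_users.sql
MIGRATION_RE = re.compile(r"^(\d+)_.+\.sql$")

def check_migration_order(filenames):
    """Return a list of problems: unversioned files, duplicates, gaps."""
    problems = []
    versions = []
    for name in filenames:
        m = MIGRATION_RE.match(name)
        if m:
            versions.append(int(m.group(1)))
        else:
            problems.append(f"unversioned file: {name}")
    seen = set()
    for v in versions:
        if v in seen:
            problems.append(f"duplicate version: {v:03d}")
        seen.add(v)
    # A gap means a migration was deleted or never committed
    expected = range(min(versions), max(versions) + 1) if versions else []
    for v in sorted(set(expected) - seen):
        problems.append(f"gap in versions: {v:03d} missing")
    return problems
```

Run it against the contents of `migrations/` and fail the job if the returned list is non-empty.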
Integration Testing with Testcontainers
@Testcontainers
class UserRepositoryTest {

    @Container
    PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>("postgres:15")
            .withDatabaseName("testdb")
            .withUsername("test")
            .withPassword("test");

    @BeforeEach
    void setUp() {
        Flyway flyway = Flyway.configure()
                .dataSource(
                        postgres.getJdbcUrl(),
                        postgres.getUsername(),
                        postgres.getPassword())
                .load();
        flyway.migrate();
    }

    @Test
    void shouldCreateUser() {
        User user = new User("test@example.com");
        User saved = userRepository.save(user);

        assertThat(saved.getId()).isNotNull();
        assertThat(userRepository.findByEmail("test@example.com"))
                .isPresent();
    }
}
Backup and Recovery Automation
Automated Backup Strategies
# Kubernetes CronJob for PostgreSQL backup
# NOTE: the image must provide both pg_dump and the aws CLI
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: database
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  successfulJobsHistoryLimit: 7
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: postgres:15
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: postgres-credentials
                      key: password
              command:
                - /bin/sh
                - -c
                - |
                  DATE=$(date +%Y%m%d_%H%M%S)
                  pg_dump -h postgres.default -U postgres mydb | gzip | \
                    aws s3 cp - s3://backups-db/postgres/mydb_${DATE}.sql.gz
              volumeMounts:
                - name: aws-credentials
                  mountPath: /root/.aws
                  readOnly: true
          volumes:
            - name: aws-credentials
              secret:
                secretName: aws-backup-credentials
          restartPolicy: OnFailure
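The CronJob above uploads backups but never expires them, so the bucket grows without bound. A retention sketch (the `prune_backups` helper and its defaults are my own assumptions; the key format mirrors the job's `mydb_${DATE}.sql.gz` naming). The selection logic is kept pure so it can be tested without AWS:

```python
from datetime import datetime, timedelta

def keys_to_delete(keys, now, keep_days=14):
    """Select backup keys older than keep_days.

    Keys are expected to look like 'postgres/mydb_YYYYmmdd_HHMMSS.sql.gz',
    matching the naming scheme used by the backup CronJob.
    """
    cutoff = now - timedelta(days=keep_days)
    stale = []
    for key in keys:
        stamp = key.rsplit("mydb_", 1)[-1].removesuffix(".sql.gz")
        try:
            taken = datetime.strptime(stamp, "%Y%m%d_%H%M%S")
        except ValueError:
            continue  # skip keys that don't match the naming scheme
        if taken < cutoff:
            stale.append(key)
    return stale

def prune_backups(bucket="backups-db", prefix="postgres/", keep_days=14):
    # boto3 is imported lazily so the pure helper above stays importable
    # without AWS dependencies installed.
    import boto3
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    keys = [obj["Key"] for obj in resp.get("Contents", [])]
    for key in keys_to_delete(keys, datetime.utcnow(), keep_days):
        s3.delete_object(Bucket=bucket, Key=key)
```

An S3 lifecycle rule on the bucket achieves the same thing declaratively; a script like this is useful when retention depends on logic a lifecycle rule can't express.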
Point-in-Time Recovery
#!/bin/bash
# restore-pitr.sh - Point-in-Time Recovery Script
set -e
BACKUP_DIR="s3://backups-db/postgres"
TARGET_TIME="2024-01-15 14:30:00"
# NOTE: PITR needs a physical base backup (pg_basebackup) plus archived
# WAL, not a logical pg_dump; the tarball fetched below is assumed to be one.
# Get the most recent base backup taken before the target time
LATEST_BACKUP=$(aws s3 ls $BACKUP_DIR/ | grep "mydb_" | sort | tail -n 1 | awk '{print $4}')
echo "Restoring from backup: $LATEST_BACKUP"
# Stop PostgreSQL
pg_ctl stop -D /var/lib/postgresql/data
# Clear data directory
rm -rf /var/lib/postgresql/data/*
# Restore base backup
aws s3 cp $BACKUP_DIR/$LATEST_BACKUP - | tar -xzf - -C /var/lib/postgresql/data
# Configure PITR
cat >> /var/lib/postgresql/data/postgresql.conf << EOF
restore_command = 'aws s3 cp $BACKUP_DIR/wal/%f - > %p'
recovery_target_time = '$TARGET_TIME'
recovery_target_action = 'promote'
EOF
# Create recovery signal file
touch /var/lib/postgresql/data/recovery.signal
# Start PostgreSQL
pg_ctl start -D /var/lib/postgresql/data
echo "PITR restoration initiated. Database will be available when recovery completes."
Backup Testing
# backup_test.py - Automated backup verification
import subprocess
import tempfile

import psycopg2


def test_backup_restore():
    backup_file = "s3://backups-db/postgres/mydb_latest.sql.gz"

    with tempfile.TemporaryDirectory() as tmpdir:
        # Download backup
        subprocess.run(
            ["aws", "s3", "cp", backup_file, f"{tmpdir}/backup.sql.gz"],
            check=True,
        )

        # Extract
        subprocess.run(["gunzip", f"{tmpdir}/backup.sql.gz"], check=True)

        # Create test database
        conn = psycopg2.connect(
            host="postgres-test",
            database="postgres",
            user="test",
            password="test",
        )
        conn.autocommit = True
        cur = conn.cursor()
        cur.execute("DROP DATABASE IF EXISTS restore_test")
        cur.execute("CREATE DATABASE restore_test")
        cur.close()
        conn.close()

        # Restore to test database
        subprocess.run(
            ["psql", "-h", "postgres-test", "-d", "restore_test",
             "-f", f"{tmpdir}/backup.sql"],
            check=True,
        )

        # Verify critical tables exist
        conn = psycopg2.connect(
            host="postgres-test",
            database="restore_test",
            user="test",
            password="test",
        )
        cur = conn.cursor()

        # Check tables
        cur.execute("""
            SELECT table_name FROM information_schema.tables
            WHERE table_schema = 'public'
        """)
        tables = [row[0] for row in cur.fetchall()]
        assert 'users' in tables, "users table missing"
        assert 'orders' in tables, "orders table missing"

        # Check row counts
        cur.execute("SELECT COUNT(*) FROM users")
        user_count = cur.fetchone()[0]
        assert user_count > 0, "No users in backup"

        print(f"✓ Backup verified: {len(tables)} tables, {user_count} users")
Replication and High Availability
PostgreSQL Streaming Replication
# postgresql.conf - Primary
wal_level = replica
max_wal_senders = 3
wal_keep_size = 1GB
# postgresql.conf - Replica (also create an empty standby.signal file
# in the data directory so the server starts in standby mode)
hot_standby = on
primary_conninfo = 'host=primary-db port=5432 user=replica password=secret'
-- Create replication user
CREATE USER replica WITH REPLICATION PASSWORD 'secret';
-- Create replication slot
SELECT * FROM pg_create_physical_replication_slot('replication_slot_1');
-- Check replication status
SELECT * FROM pg_stat_replication;
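`pg_stat_replication` reports positions as LSNs like `0/16B3748`. On the server, `pg_wal_lsn_diff()` converts a pair of LSNs into a byte count; the same arithmetic can be done client-side when collecting lag metrics (the parsing helper below is illustrative, not a PostgreSQL API):

```python
def lsn_to_bytes(lsn):
    """Convert a PostgreSQL LSN like '0/16B3748' to an absolute byte offset.

    The format is two hex numbers: the high 32 bits and the low 32 bits
    of a 64-bit WAL position.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def replication_lag_bytes(current_lsn, replay_lsn):
    """Bytes of WAL a replica still has to replay.

    current_lsn comes from pg_current_wal_lsn() on the primary;
    replay_lsn is the replay_lsn column of pg_stat_replication.
    """
    return lsn_to_bytes(current_lsn) - lsn_to_bytes(replay_lsn)
```

Feeding this number into a metrics pipeline gives a lag gauge in bytes, which is often more actionable than a time-based lag on write-heavy systems.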
Automatic Failover with Patroni
# patroni.yml - HA configuration
scope: postgres-cluster
namespace: /service
name: postgresql-0

restapi:
  listen: 0.0.0.0:8008
  connect_address: postgresql-0.example.com:8008

etcd:
  hosts: etcd-0.example.com:2379,etcd-1.example.com:2379,etcd-2.example.com:2379

postgresql:
  listen: 0.0.0.0:5432
  connect_address: postgresql-0.example.com:5432
  data_dir: /data/postgresql
  parameters:
    max_connections: 100
    shared_buffers: 128MB
    wal_level: replica
    hot_standby: "on"
    max_wal_senders: 5
    wal_keep_size: 1GB

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    primary_start_timeout: 30
    synchronous_mode: true

# Replicas (postgresql-1, postgresql-2) run the same file with their own
# name and connect_address; Patroni elects the leader through etcd, so
# members are never declared statically.
Performance Optimization
Connection Pooling
# PgBouncer configuration
[databases]
mydb = host=postgres.default port=5432 dbname=mydb
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 200
default_pool_size = 25
min_pool_size = 10
reserve_pool_size = 5
reserve_pool_timeout = 5
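A quick sanity check relates these settings to the server's `max_connections`: each database/user pool can open up to `default_pool_size + reserve_pool_size` server connections, and the worst case across all pools must fit under `max_connections`. A sketch of that rule of thumb (the helper names are my own):

```python
def max_server_connections(num_pools, default_pool_size, reserve_pool_size):
    """Worst-case server-side connections PgBouncer may open."""
    return num_pools * (default_pool_size + reserve_pool_size)

def pool_config_fits(num_pools, default_pool_size, reserve_pool_size,
                     max_connections, superuser_reserved=3):
    """True if the worst case leaves headroom under Postgres max_connections.

    superuser_reserved mirrors the superuser_reserved_connections setting,
    which PgBouncer cannot use.
    """
    budget = max_connections - superuser_reserved
    worst = max_server_connections(num_pools, default_pool_size,
                                   reserve_pool_size)
    return worst <= budget
```

With the values above (one pool, 25 + 5), PgBouncer opens at most 30 server connections while serving up to 200 clients.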
Query Optimization
-- Analyze slow queries (requires the pg_stat_statements extension;
-- column names shown are for PostgreSQL 13+)
SELECT
    query,
    calls,
    mean_exec_time,
    total_exec_time,
    rows
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
-- Create a partial index concurrently to avoid blocking writes
CREATE INDEX CONCURRENTLY idx_orders_user_id
ON orders(user_id)
WHERE status = 'pending';
-- Partial indexes can also exclude soft-deleted rows
CREATE INDEX idx_active_users ON users(email)
WHERE deleted_at IS NULL;
Monitoring and Observability
Database Metrics
# Prometheus PostgreSQL exporter
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-exporter
data:
  config.yaml: |
    datasources:
      - url: postgres://postgres:secret@postgres.default:5432/mydb?sslmode=disable
    queries:
      - pg_stat_database:
          metrics:
            - blks_read
            - blks_hit
            - tup_inserted
            - tup_updated
            - tup_deleted
            - conflicts
            - temp_files
            - temp_bytes
            - deadlocks
      - pg_stat_user_tables:
          metrics:
            - seq_scan
            - seq_tup_read
            - idx_scan
            - idx_tup_fetch
            - n_tup_ins
            - n_tup_upd
            - n_tup_del
Alerting Rules
groups:
  - name: postgres
    rules:
      - alert: HighConnections
        expr: pg_stat_database_numbackends / pg_settings_max_connections > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL approaching connection limit"
      - alert: SlowQueries
        expr: rate(pg_stat_statements_mean_time[5m]) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow queries detected"
      - alert: ReplicationLag
        expr: pg_replication_lag > 30
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Replication lag is critical"
Database Migration Patterns
Blue-Green Database Deployment
#!/bin/bash
# blue-green-migration.sh
set -e

BLUE_DB="mydb-blue"
GREEN_DB="mydb-green"

ACTIVE_DB=$(aws rds describe-db-instances \
  --db-instance-identifier mydb \
  --query 'DBInstances[0].TagList[?Key==`active`].Value' \
  --output text)

# Deploy to whichever side is currently inactive
if [ "$ACTIVE_DB" = "$BLUE_DB" ]; then
  NEW_DB="$GREEN_DB"
else
  NEW_DB="$BLUE_DB"
fi

# Create new database
aws rds create-db-instance \
  --db-instance-identifier "$NEW_DB" \
  --db-instance-class db.t3.medium \
  --engine postgres \
  --allocated-storage 100

# Run migrations on new database
echo "Running migrations on $NEW_DB..."
flyway migrate -url="jdbc:postgresql://$NEW_DB...."

# Test new database
echo "Running smoke tests..."
./smoke-test.sh "$NEW_DB"

# Mark the new side active (ARN components assumed set in the environment)
aws rds add-tags-to-resource \
  --resource-name "arn:aws:rds:$AWS_REGION:$AWS_ACCOUNT_ID:db:mydb" \
  --tags Key=active,Value="$NEW_DB"

echo "Migration complete. Active database: $NEW_DB"
Conclusion
Database DevOps transforms databases from deployment bottlenecks into reliable, automatable components. The key practices:
- Version control all schema changes
- Automate migrations with proper testing
- Implement robust backup and recovery
- Configure high availability with automatic failover
- Monitor performance and set alerts
- Practice migrations regularly
Start small - pick one database, implement proper migrations and backups, then expand to other databases.
Related Articles
- Kubernetes at Scale - Running databases on Kubernetes
- Observability Stack - Database monitoring
- Disaster Recovery Automation - DR strategies