Databases are the backbone of most applications, yet they’re often the hardest part to automate and operate. Database DevOps bridges the gap between rapid development and stable operations - enabling teams to ship features faster while maintaining data integrity and system reliability.
The Database Challenge
Application development has embraced CI/CD, microservices, and agile methodologies, yet databases remain the bottleneck:
- Schema changes require careful planning
- Data migration is risky and complex
- Downtime for migrations affects users
- Rollbacks are difficult and dangerous
- Performance tuning requires expertise
Database DevOps addresses these challenges through automation, testing, and operational excellence.
Schema Migration Automation
Version Control for Database Schema
Treat database schema like application code:
# Directory structure for migrations
migrations/
├── 001_create_users.sql
├── 002_add_orders_table.sql
├── 003_add_user_email_index.sql
└── down/
    ├── 003_add_user_email_index.sql
    ├── 002_add_orders_table.sql
    └── 001_create_users.sql
Migration Tools
Flyway (Java-based, works everywhere):
<!-- pom.xml configuration -->
<configuration>
  <url>jdbc:postgresql://localhost:5432/mydb</url>
  <user>dbuser</user>
  <password>secret</password>
  <locations>
    <location>filesystem:src/main/resources/db/migration</location>
  </locations>
</configuration>
-- V1__Create_users_table.sql
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255) NOT NULL UNIQUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- V2__Add_orders.sql
CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    total DECIMAL(10,2) NOT NULL,
    status VARCHAR(50) DEFAULT 'pending'
);
Liquibase (XML/YAML-based, rollback support):
# databasechangelog.yaml
databaseChangeLog:
  - changeSet:
      id: create-users-table
      author: devops
      changes:
        - createTable:
            tableName: users
            columns:
              - column:
                  name: id
                  type: INT
                  autoIncrement: true
                  constraints:
                    primaryKey: true
              - column:
                  name: email
                  type: VARCHAR(255)
                  constraints:
                    unique: true
                    nullable: false
Sqitch (Perl-based, database-agnostic):
# Initialize sqitch
sqitch init mydb --engine pg
# Add a migration
sqitch add v1 -n "Create users table"
# Deploy
sqitch deploy db:pg://user:pass@localhost/mydb
# Revert
sqitch revert db:pg://user:pass@localhost/mydb
Zero-Downtime Migrations
Traditional migrations require a maintenance window. Zero-downtime migrations use patterns like:
-- Adding a NOT NULL column with a default
-- (a metadata-only change on PostgreSQL 11+, so it's fast)
ALTER TABLE users
ADD COLUMN status VARCHAR(20) DEFAULT 'active' NOT NULL;
-- Adding a column with no default (requires multi-step)
-- Step 1: Add the column as nullable
ALTER TABLE users ADD COLUMN nickname VARCHAR(100);
-- Step 2: Backfill existing rows (batch this on large tables)
UPDATE users SET nickname = SPLIT_PART(email, '@', 1);
-- Step 3: Enforce NOT NULL once the backfill is complete
ALTER TABLE users
ALTER COLUMN nickname SET NOT NULL;
-- Renaming safely (create new, sync, drop old)
-- Step 1: Create new column
ALTER TABLE orders ADD COLUMN total_amount DECIMAL(10,2);
-- Step 2: Create trigger to keep in sync
CREATE OR REPLACE FUNCTION sync_order_amount()
RETURNS TRIGGER AS $$
BEGIN
NEW.total_amount = NEW.total;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER sync_order_amount
BEFORE INSERT OR UPDATE ON orders
FOR EACH ROW EXECUTE FUNCTION sync_order_amount();
-- Step 3: Backfill rows that existed before the trigger
UPDATE orders SET total_amount = total WHERE total_amount IS NULL;
-- Step 4: After verification, drop old column
ALTER TABLE orders DROP COLUMN total;
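A single-statement backfill like the one in step 2 above can hold locks and generate heavy WAL traffic on large tables. A minimal batched version, sketched in Python (assuming psycopg2 and an integer `id` primary key; `backfill_nicknames` is an illustrative helper, not part of any migration tool):

```python
def batch_ranges(min_id, max_id, size):
    """Yield inclusive (lo, hi) id ranges covering [min_id, max_id]."""
    lo = min_id
    while lo <= max_id:
        hi = min(lo + size - 1, max_id)
        yield lo, hi
        lo = hi + 1

def backfill_nicknames(dsn, batch_size=10_000):
    # psycopg2 is imported lazily so the pure helper above stays usable
    # without database dependencies installed.
    import psycopg2
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT MIN(id), MAX(id) FROM users")
            min_id, max_id = cur.fetchone()
        if min_id is None:
            return  # empty table, nothing to backfill
        for lo, hi in batch_ranges(min_id, max_id, batch_size):
            with conn.cursor() as cur:
                # Each batch commits separately, so locks stay short-lived
                # and replication lag remains bounded.
                cur.execute(
                    "UPDATE users SET nickname = SPLIT_PART(email, '@', 1) "
                    "WHERE id BETWEEN %s AND %s AND nickname IS NULL",
                    (lo, hi),
                )
            conn.commit()
    finally:
        conn.close()
```

The `nickname IS NULL` predicate makes the backfill safe to re-run if it is interrupted partway through.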
Database CI/CD Pipeline
Testing Database Changes
# GitHub Actions - Database CI/CD
name: Database CI
on:
  push:
    paths:
      - 'migrations/**'
jobs:
  test-migrations:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Start PostgreSQL
        uses: supercharge/postgres-action@v1
        with:
          postgres-version: '15'
      - name: Run migrations
        run: |
          for migration in migrations/*.sql; do
            psql -U postgres -d testdb -f "$migration" || exit 1
          done
      - name: Smoke test
        run: |
          psql -U postgres -d testdb -c "SELECT 1" || exit 1
          psql -U postgres -d testdb -c "SELECT COUNT(*) FROM users" || exit 1
      - name: Lint SQL
        run: |
          docker run --rm -v "$PWD":/sql -w /sql ghcr.io/ericlagergren/validatesql \
            -schema postgres migrations/*.sql
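Filename ordering is all that sequences plain-SQL migrations, so a duplicated or skipped version prefix fails silently. A small pre-flight check (a hypothetical helper, not part of the workflow above) can catch both problems before psql runs anything:

```python
import re

# Matches versioned filenames like 001_create_users.sql
MIGRATION_RE = re.compile(r"^(\d+)_.+\.sql$")

def check_migration_order(filenames):
    """Return a list of problems: unversioned files, duplicates, gaps."""
    problems = []
    versions = []
    for name in filenames:
        m = MIGRATION_RE.match(name)
        if m:
            versions.append(int(m.group(1)))
        else:
            problems.append(f"unversioned file: {name}")
    seen = set()
    for v in versions:
        if v in seen:
            problems.append(f"duplicate version: {v:03d}")
        seen.add(v)
    # A gap means a migration was deleted or never committed
    expected = range(min(versions), max(versions) + 1) if versions else []
    for v in sorted(set(expected) - seen):
        problems.append(f"gap in versions: {v:03d} missing")
    return problems
```

Run it against the contents of `migrations/` and fail the job if the returned list is non-empty.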
Integration Testing with Testcontainers
@Testcontainers
class UserRepositoryTest {

    @Container
    PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>("postgres:15")
            .withDatabaseName("testdb")
            .withUsername("test")
            .withPassword("test");

    @BeforeEach
    void setUp() {
        Flyway flyway = Flyway.configure()
                .dataSource(
                        postgres.getJdbcUrl(),
                        postgres.getUsername(),
                        postgres.getPassword())
                .load();
        flyway.migrate();
    }

    @Test
    void shouldCreateUser() {
        User user = new User("test@example.com");
        User saved = userRepository.save(user);

        assertThat(saved.getId()).isNotNull();
        assertThat(userRepository.findByEmail("test@example.com"))
                .isPresent();
    }
}
Backup and Recovery Automation
Automated Backup Strategies
# Kubernetes CronJob for PostgreSQL backup
# NOTE: the image must provide both pg_dump and the aws CLI
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: database
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  successfulJobsHistoryLimit: 7
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: postgres:15
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: postgres-credentials
                      key: password
              command:
                - /bin/sh
                - -c
                - |
                  DATE=$(date +%Y%m%d_%H%M%S)
                  pg_dump -h postgres.default -U postgres mydb | gzip | \
                    aws s3 cp - s3://backups-db/postgres/mydb_${DATE}.sql.gz
              volumeMounts:
                - name: aws-credentials
                  mountPath: /root/.aws
                  readOnly: true
          volumes:
            - name: aws-credentials
              secret:
                secretName: aws-backup-credentials
          restartPolicy: OnFailure
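The CronJob above uploads backups but never expires them, so the bucket grows without bound. A retention sketch (the `prune_backups` helper and its defaults are my own assumptions; the key format mirrors the job's `mydb_${DATE}.sql.gz` naming). The selection logic is kept pure so it can be tested without AWS:

```python
from datetime import datetime, timedelta

def keys_to_delete(keys, now, keep_days=14):
    """Select backup keys older than keep_days.

    Keys are expected to look like 'postgres/mydb_YYYYmmdd_HHMMSS.sql.gz',
    matching the naming scheme used by the backup CronJob.
    """
    cutoff = now - timedelta(days=keep_days)
    stale = []
    for key in keys:
        stamp = key.rsplit("mydb_", 1)[-1].removesuffix(".sql.gz")
        try:
            taken = datetime.strptime(stamp, "%Y%m%d_%H%M%S")
        except ValueError:
            continue  # skip keys that don't match the naming scheme
        if taken < cutoff:
            stale.append(key)
    return stale

def prune_backups(bucket="backups-db", prefix="postgres/", keep_days=14):
    # boto3 is imported lazily so the pure helper above stays importable
    # without AWS dependencies installed.
    import boto3
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    keys = [obj["Key"] for obj in resp.get("Contents", [])]
    for key in keys_to_delete(keys, datetime.utcnow(), keep_days):
        s3.delete_object(Bucket=bucket, Key=key)
```

An S3 lifecycle rule on the bucket achieves the same thing declaratively; a script like this is useful when retention depends on logic a lifecycle rule can't express.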
Point-in-Time Recovery
#!/bin/bash
# restore-pitr.sh - Point-in-Time Recovery Script
set -e
BACKUP_DIR="s3://backups-db/postgres"
TARGET_TIME="2024-01-15 14:30:00"
# NOTE: PITR needs a physical base backup (pg_basebackup) plus archived
# WAL, not a logical pg_dump; the tarball fetched below is assumed to be one.
# Get the most recent base backup taken before the target time
LATEST_BACKUP=$(aws s3 ls $BACKUP_DIR/ | grep "mydb_" | sort | tail -n 1 | awk '{print $4}')
echo "Restoring from backup: $LATEST_BACKUP"
# Stop PostgreSQL
pg_ctl stop -D /var/lib/postgresql/data
# Clear data directory
rm -rf /var/lib/postgresql/data/*
# Restore base backup
aws s3 cp $BACKUP_DIR/$LATEST_BACKUP - | tar -xzf - -C /var/lib/postgresql/data
# Configure PITR
cat >> /var/lib/postgresql/data/postgresql.conf << EOF
restore_command = 'aws s3 cp $BACKUP_DIR/wal/%f - > %p'
recovery_target_time = '$TARGET_TIME'
recovery_target_action = 'promote'
EOF
# Create recovery signal file
touch /var/lib/postgresql/data/recovery.signal
# Start PostgreSQL
pg_ctl start -D /var/lib/postgresql/data
echo "PITR restoration initiated. Database will be available when recovery completes."
Backup Testing
# backup_test.py - Automated backup verification
import subprocess
import tempfile

import psycopg2


def test_backup_restore():
    backup_file = "s3://backups-db/postgres/mydb_latest.sql.gz"

    with tempfile.TemporaryDirectory() as tmpdir:
        # Download backup
        subprocess.run(
            ["aws", "s3", "cp", backup_file, f"{tmpdir}/backup.sql.gz"],
            check=True,
        )

        # Extract
        subprocess.run(["gunzip", f"{tmpdir}/backup.sql.gz"], check=True)

        # Create test database
        conn = psycopg2.connect(
            host="postgres-test",
            database="postgres",
            user="test",
            password="test",
        )
        conn.autocommit = True
        cur = conn.cursor()
        cur.execute("DROP DATABASE IF EXISTS restore_test")
        cur.execute("CREATE DATABASE restore_test")
        cur.close()
        conn.close()

        # Restore to test database
        subprocess.run(
            ["psql", "-h", "postgres-test", "-d", "restore_test",
             "-f", f"{tmpdir}/backup.sql"],
            check=True,
        )

        # Verify critical tables exist
        conn = psycopg2.connect(
            host="postgres-test",
            database="restore_test",
            user="test",
            password="test",
        )
        cur = conn.cursor()

        # Check tables
        cur.execute("""
            SELECT table_name FROM information_schema.tables
            WHERE table_schema = 'public'
        """)
        tables = [row[0] for row in cur.fetchall()]
        assert 'users' in tables, "users table missing"
        assert 'orders' in tables, "orders table missing"

        # Check row counts
        cur.execute("SELECT COUNT(*) FROM users")
        user_count = cur.fetchone()[0]
        assert user_count > 0, "No users in backup"

        print(f"✓ Backup verified: {len(tables)} tables, {user_count} users")
Replication and High Availability
PostgreSQL Streaming Replication
# postgresql.conf - Primary
wal_level = replica
max_wal_senders = 3
wal_keep_size = 1GB
# postgresql.conf - Replica (also create an empty standby.signal file
# in the data directory so the server starts in standby mode)
hot_standby = on
primary_conninfo = 'host=primary-db port=5432 user=replica password=secret'
-- Create replication user
CREATE USER replica WITH REPLICATION PASSWORD 'secret';
-- Create replication slot
SELECT * FROM pg_create_physical_replication_slot('replication_slot_1');
-- Check replication status
SELECT * FROM pg_stat_replication;
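`pg_stat_replication` reports positions as LSNs like `0/16B3748`. On the server, `pg_wal_lsn_diff()` converts a pair of LSNs into a byte count; the same arithmetic can be done client-side when collecting lag metrics (the parsing helper below is illustrative, not a PostgreSQL API):

```python
def lsn_to_bytes(lsn):
    """Convert a PostgreSQL LSN like '0/16B3748' to an absolute byte offset.

    The format is two hex numbers: the high 32 bits and the low 32 bits
    of a 64-bit WAL position.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def replication_lag_bytes(current_lsn, replay_lsn):
    """Bytes of WAL a replica still has to replay.

    current_lsn comes from pg_current_wal_lsn() on the primary;
    replay_lsn is the replay_lsn column of pg_stat_replication.
    """
    return lsn_to_bytes(current_lsn) - lsn_to_bytes(replay_lsn)
```

Feeding this number into a metrics pipeline gives a lag gauge in bytes, which is often more actionable than a time-based lag on write-heavy systems.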
Automatic Failover with Patroni
# patroni.yml - HA configuration
scope: postgres-cluster
namespace: /service
name: postgresql-0

restapi:
  listen: 0.0.0.0:8008
  connect_address: postgresql-0.example.com:8008

etcd:
  hosts: etcd-0.example.com:2379,etcd-1.example.com:2379,etcd-2.example.com:2379

postgresql:
  listen: 0.0.0.0:5432
  connect_address: postgresql-0.example.com:5432
  data_dir: /data/postgresql
  parameters:
    max_connections: 100
    shared_buffers: 128MB
    wal_level: replica
    hot_standby: "on"
    max_wal_senders: 5
    wal_keep_size: 1GB

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    primary_start_timeout: 30
    synchronous_mode: true

# Replicas (postgresql-1, postgresql-2) run the same file with their own
# name and connect_address; Patroni elects the leader through etcd, so
# members are never declared statically.
Performance Optimization
Connection Pooling
# PgBouncer configuration
[databases]
mydb = host=postgres.default port=5432 dbname=mydb
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 200
default_pool_size = 25
min_pool_size = 10
reserve_pool_size = 5
reserve_pool_timeout = 5
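A quick sanity check relates these settings to the server's `max_connections`: each database/user pool can open up to `default_pool_size + reserve_pool_size` server connections, and the worst case across all pools must fit under `max_connections`. A sketch of that rule of thumb (the helper names are my own):

```python
def max_server_connections(num_pools, default_pool_size, reserve_pool_size):
    """Worst-case server-side connections PgBouncer may open."""
    return num_pools * (default_pool_size + reserve_pool_size)

def pool_config_fits(num_pools, default_pool_size, reserve_pool_size,
                     max_connections, superuser_reserved=3):
    """True if the worst case leaves headroom under Postgres max_connections.

    superuser_reserved mirrors the superuser_reserved_connections setting,
    which PgBouncer cannot use.
    """
    budget = max_connections - superuser_reserved
    worst = max_server_connections(num_pools, default_pool_size,
                                   reserve_pool_size)
    return worst <= budget
```

With the values above (one pool, 25 + 5), PgBouncer opens at most 30 server connections while serving up to 200 clients.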
Query Optimization
-- Analyze slow queries (requires the pg_stat_statements extension;
-- column names shown are for PostgreSQL 13+)
SELECT
    query,
    calls,
    mean_exec_time,
    total_exec_time,
    rows
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
-- Create a partial index concurrently to avoid blocking writes
CREATE INDEX CONCURRENTLY idx_orders_user_id
ON orders(user_id)
WHERE status = 'pending';
-- Partial indexes can also exclude soft-deleted rows
CREATE INDEX idx_active_users ON users(email)
WHERE deleted_at IS NULL;
Monitoring and Observability
Database Metrics
# Prometheus PostgreSQL exporter
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-exporter
data:
  config.yaml: |
    datasources:
      - url: postgres://postgres:secret@postgres.default:5432/mydb?sslmode=disable
    queries:
      - pg_stat_database:
          metrics:
            - blks_read
            - blks_hit
            - tup_inserted
            - tup_updated
            - tup_deleted
            - conflicts
            - temp_files
            - temp_bytes
            - deadlocks
      - pg_stat_user_tables:
          metrics:
            - seq_scan
            - seq_tup_read
            - idx_scan
            - idx_tup_fetch
            - n_tup_ins
            - n_tup_upd
            - n_tup_del
Alerting Rules
groups:
  - name: postgres
    rules:
      - alert: HighConnections
        expr: pg_stat_database_numbackends / pg_settings_max_connections > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL approaching connection limit"
      - alert: SlowQueries
        expr: rate(pg_stat_statements_mean_time[5m]) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow queries detected"
      - alert: ReplicationLag
        expr: pg_replication_lag > 30
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Replication lag is critical"
Database Migration Patterns
Blue-Green Database Deployment
#!/bin/bash
# blue-green-migration.sh
set -e

BLUE_DB="mydb-blue"
GREEN_DB="mydb-green"

ACTIVE_DB=$(aws rds describe-db-instances \
  --db-instance-identifier mydb \
  --query 'DBInstances[0].TagList[?Key==`active`].Value' \
  --output text)

# Deploy to whichever side is currently inactive
if [ "$ACTIVE_DB" = "$BLUE_DB" ]; then
  NEW_DB="$GREEN_DB"
else
  NEW_DB="$BLUE_DB"
fi

# Create new database
aws rds create-db-instance \
  --db-instance-identifier "$NEW_DB" \
  --db-instance-class db.t3.medium \
  --engine postgres \
  --allocated-storage 100

# Run migrations on new database
echo "Running migrations on $NEW_DB..."
flyway migrate -url="jdbc:postgresql://$NEW_DB...."

# Test new database
echo "Running smoke tests..."
./smoke-test.sh "$NEW_DB"

# Mark the new side active (ARN components assumed set in the environment)
aws rds add-tags-to-resource \
  --resource-name "arn:aws:rds:$AWS_REGION:$AWS_ACCOUNT_ID:db:mydb" \
  --tags Key=active,Value="$NEW_DB"

echo "Migration complete. Active database: $NEW_DB"
Conclusion
Database DevOps transforms databases from deployment bottlenecks into reliable, automatable components. The key practices:
- Version control all schema changes
- Automate migrations with proper testing
- Implement robust backup and recovery
- Configure high availability with automatic failover
- Monitor performance and set alerts
- Practice migrations regularly
Start small - pick one database, implement proper migrations and backups, then expand to other databases.
Related Articles
- Kubernetes at Scale - Running databases on Kubernetes
- Observability Stack - Database monitoring
- Disaster Recovery Automation - DR strategies