
Database DevOps: Automation, Migration, and Operations

Databases are the backbone of most applications, yet they’re often the hardest part to automate and operate. Database DevOps bridges the gap between rapid development and stable operations - enabling teams to ship features faster while maintaining data integrity and system reliability.

The Database Challenge

Application development has embraced CI/CD, microservices, and agile methodologies, yet databases remain the bottleneck:

  • Schema changes require careful planning
  • Data migration is risky and complex
  • Downtime for migrations affects users
  • Rollbacks are difficult and dangerous
  • Performance tuning requires expertise

Database DevOps addresses these challenges through automation, testing, and operational excellence.

Schema Migration Automation

Version Control for Database Schema

Treat database schema like application code:

# Directory structure for migrations
migrations/
├── 001_create_users.sql
├── 002_add_orders_table.sql
├── 003_add_user_email_index.sql
└── down/
    ├── 003_add_user_email_index.sql
    ├── 002_add_orders_table.sql
    └── 001_create_users.sql
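The ordered-file convention above is what every migration tool automates: apply pending files in order, record what has run, and skip it next time. A minimal sketch of that loop (using Python's built-in sqlite3 for portability; the `apply_migrations` function and `schema_migrations` table name are illustrative, and a production runner would target your real engine and wrap each file in a transaction):

```python
import os
import sqlite3

def apply_migrations(conn, migrations_dir):
    """Apply pending .sql migrations in filename order, recording each one."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (version TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_migrations")}
    pending = sorted(
        f for f in os.listdir(migrations_dir)
        if f.endswith(".sql") and f not in applied
    )
    for filename in pending:
        with open(os.path.join(migrations_dir, filename)) as fh:
            conn.executescript(fh.read())  # run the migration itself
        # Record the version so reruns skip this file
        conn.execute("INSERT INTO schema_migrations (version) VALUES (?)", (filename,))
        conn.commit()
    return pending  # migrations applied during this run
```

Running it twice is a no-op for already-applied files, which is the idempotency property the tools below guarantee.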

Migration Tools

Flyway (Java-based, works everywhere):

<!-- pom.xml: Flyway Maven plugin configuration -->
<plugin>
    <groupId>org.flywaydb</groupId>
    <artifactId>flyway-maven-plugin</artifactId>
    <configuration>
        <url>jdbc:postgresql://localhost:5432/mydb</url>
        <user>dbuser</user>
        <password>secret</password>
        <locations>
            <location>filesystem:src/main/resources/db/migration</location>
        </locations>
    </configuration>
</plugin>

-- V1__Create_users_table.sql
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255) NOT NULL UNIQUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- V2__Add_orders.sql
CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    total DECIMAL(10,2) NOT NULL,
    status VARCHAR(50) DEFAULT 'pending'
);

Liquibase (XML/YAML-based, rollback support):

# databasechangelog.yaml
databaseChangeLog:
  - changeSet:
      id: create-users-table
      author: devops
      changes:
        - createTable:
            tableName: users
            columns:
              - column:
                  name: id
                  type: INT
                  autoIncrement: true
                  constraints:
                    primaryKey: true
              - column:
                  name: email
                  type: VARCHAR(255)
                  constraints:
                    unique: true
                    nullable: false

Sqitch (Perl-based, database-agnostic):

# Initialize sqitch
sqitch init mydb --engine pg

# Add a migration
sqitch add v1 -n "Create users table"

# Deploy
sqitch deploy db:pg://user:pass@localhost/mydb

# Revert
sqitch revert db:pg://user:pass@localhost/mydb

Zero-Downtime Migrations

Traditional migrations often take long locks or require downtime. Zero-downtime migrations use patterns like:

-- Adding a NOT NULL column with a default (metadata-only and fast in PostgreSQL 11+;
-- older versions rewrite the whole table)
ALTER TABLE users 
ADD COLUMN status VARCHAR(20) DEFAULT 'active' NOT NULL;

-- Adding a column with no default (requires multi-step)
-- Step 1: Add nullable column
ALTER TABLE users ADD COLUMN nickname VARCHAR(100);

-- Step 2: Backfill data
UPDATE users SET nickname = SUBSTRING(name, 1, 20);

-- Step 3: Add the NOT NULL constraint once every row has a value
ALTER TABLE users 
ALTER COLUMN nickname SET NOT NULL;

-- Renaming safely (create new, sync, drop old)
-- Step 1: Create new column
ALTER TABLE orders ADD COLUMN total_amount DECIMAL(10,2);

-- Step 2: Create trigger to keep in sync
CREATE OR REPLACE FUNCTION sync_order_amount()
RETURNS TRIGGER AS $$
BEGIN
    NEW.total_amount = NEW.total;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sync_order_amount
BEFORE INSERT OR UPDATE ON orders
FOR EACH ROW EXECUTE FUNCTION sync_order_amount();

-- Step 3: After verification, drop old column
ALTER TABLE orders DROP COLUMN total;
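The single UPDATE in the backfill step holds row locks for the entire statement and can block writers on a large table. A safer pattern is to backfill in small batches with a commit between each one. A sketch of the batching loop (sqlite3 here for portability; the same id-subquery technique works in PostgreSQL, and table/column names follow the example above):

```python
import sqlite3

def backfill_in_batches(conn, batch_size=1000):
    """Backfill users.nickname from users.name, batch_size rows per transaction."""
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE users SET nickname = substr(name, 1, 20) "
            "WHERE id IN (SELECT id FROM users "
            "             WHERE nickname IS NULL AND name IS NOT NULL "
            "             LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # short transactions: locks are released between batches
        if cur.rowcount == 0:
            break  # nothing left to backfill
        total += cur.rowcount
    return total
```

The `name IS NOT NULL` guard matters: rows that would backfill to NULL would otherwise be selected again on every pass and the loop would never terminate.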

Database CI/CD Pipeline

Testing Database Changes

# GitHub Actions - Database CI/CD
name: Database CI

on:
  push:
    paths:
      - 'migrations/**'

jobs:
  test-migrations:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Start PostgreSQL
        uses: supercharge/postgres-action@v1
        with:
          postgres-version: '15'
      
      - name: Run migrations
        run: |
          for migration in migrations/*.sql; do
            psql -U postgres -d testdb -f "$migration" || exit 1
          done
      
      - name: Smoke test
        run: |
          psql -U postgres -d testdb -c "SELECT 1" || exit 1
          psql -U postgres -d testdb -c "SELECT COUNT(*) FROM users" || exit 1
      
      - name: Lint SQL
        run: |
          docker run --rm -v $PWD:/sql ghcr.io/ericlagergren/validatesql \
            -schema postgres migrations/*.sql
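Beyond running the migrations, CI can catch ordering mistakes before they reach a shared branch: duplicate sequence numbers (two developers both adding 004_...) or gaps in the sequence. A hypothetical lint along these lines, matching the NNN_name.sql convention used earlier:

```python
import re

# Expect names like 001_create_users.sql
MIGRATION_RE = re.compile(r"^(\d{3})_[a-z0-9_]+\.sql$")

def lint_migration_names(filenames):
    """Return a list of problems: bad names, duplicate or non-contiguous numbers."""
    problems = []
    numbers = []
    for name in sorted(filenames):
        m = MIGRATION_RE.match(name)
        if not m:
            problems.append(f"bad name: {name}")
            continue
        numbers.append(int(m.group(1)))
    # Compare each adjacent pair of sequence numbers
    for prev, cur in zip(numbers, numbers[1:]):
        if cur == prev:
            problems.append(f"duplicate sequence number: {cur:03d}")
        elif cur != prev + 1:
            problems.append(f"gap between {prev:03d} and {cur:03d}")
    return problems
```

Failing the build on a non-empty result turns merge-order accidents into a visible CI error instead of a broken deploy.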

Integration Testing with Testcontainers

@Testcontainers
class UserRepositoryTest {
    
    @Container
    PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>("postgres:15")
        .withDatabaseName("testdb")
        .withUsername("test")
        .withPassword("test");
    
    @BeforeEach
    void setUp() {
        Flyway flyway = Flyway.configure()
            .dataSource(
                postgres.getJdbcUrl(),
                postgres.getUsername(),
                postgres.getPassword()
            )
            .load();
        flyway.migrate();
    }
    
    @Test
    void shouldCreateUser() {
        User user = new User("[email protected]");
        User saved = userRepository.save(user);
        
        assertThat(saved.getId()).isNotNull();
        assertThat(userRepository.findByEmail("[email protected]"))
            .isPresent();
    }
}

Backup and Recovery Automation

Automated Backup Strategies

# Kubernetes CronJob for PostgreSQL backup
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: database
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  successfulJobsHistoryLimit: 7
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: postgres:15
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: postgres-credentials
                      key: password
              command:
                - /bin/sh
                - -c
                - |
                  DATE=$(date +%Y%m%d_%H%M%S)
                  pg_dump -h postgres.default -U postgres mydb | gzip | \
                    aws s3 cp - s3://backups-db/postgres/mydb_${DATE}.sql.gz
              volumeMounts:
                - name: aws-credentials
                  mountPath: /root/.aws
                  readOnly: true
          volumes:
            - name: aws-credentials
              secret:
                secretName: aws-backup-credentials
          restartPolicy: OnFailure
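A backup job that silently stops running is worse than no backup job, so monitoring should verify that a recent backup actually exists. The CronJob above names files mydb_YYYYMMDD_HHMMSS.sql.gz, so freshness can be checked by parsing the newest key. A sketch of that check (listing the bucket itself would use your S3 client of choice; the 26-hour threshold assumes the daily schedule above plus slack):

```python
from datetime import datetime, timedelta

def newest_backup_age(keys, now):
    """Given S3 keys like 'postgres/mydb_20240115_020000.sql.gz', return the age
    (a timedelta) of the most recent backup, or None if no key parses."""
    timestamps = []
    for key in keys:
        name = key.rsplit("/", 1)[-1]
        if name.startswith("mydb_") and name.endswith(".sql.gz"):
            stamp = name[len("mydb_"):-len(".sql.gz")]
            try:
                timestamps.append(datetime.strptime(stamp, "%Y%m%d_%H%M%S"))
            except ValueError:
                continue  # ignore keys that don't follow the naming scheme
    if not timestamps:
        return None
    return now - max(timestamps)

def backup_is_stale(keys, now, max_age=timedelta(hours=26)):
    """Alert if the newest backup is older than one daily cycle plus slack."""
    age = newest_backup_age(keys, now)
    return age is None or age > max_age
```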

Point-in-Time Recovery

#!/bin/bash
# restore-pitr.sh - Point-in-Time Recovery script
# Note: PITR replays archived WAL on top of a physical base backup
# (pg_basebackup); a logical pg_dump cannot be restored to a point in time.

set -e

BACKUP_DIR="s3://backups-db/postgres"
TARGET_TIME="2024-01-15 14:30:00"

# Pick the latest base backup (a stricter script would take the last one before TARGET_TIME)
LATEST_BACKUP=$(aws s3 ls $BACKUP_DIR/ | grep "mydb_" | sort | tail -n 1 | awk '{print $4}')

echo "Restoring from backup: $LATEST_BACKUP"

# Stop PostgreSQL
pg_ctl stop -D /var/lib/postgresql/data

# Clear data directory
rm -rf /var/lib/postgresql/data/*

# Restore base backup
aws s3 cp $BACKUP_DIR/$LATEST_BACKUP - | tar -xzf - -C /var/lib/postgresql/data

# Configure PITR
cat >> /var/lib/postgresql/data/postgresql.conf << EOF
restore_command = 'aws s3 cp $BACKUP_DIR/wal/%f - > %p'
recovery_target_time = '$TARGET_TIME'
recovery_target_action = 'promote'
EOF

# Create recovery signal file
touch /var/lib/postgresql/data/recovery.signal

# Start PostgreSQL
pg_ctl start -D /var/lib/postgresql/data

echo "PITR restoration initiated. Database will be available when recovery completes."

Backup Testing

# backup_test.py - Automated backup verification
import subprocess
import tempfile
import psycopg2

def test_backup_restore():
    backup_file = "s3://backups-db/postgres/mydb_latest.sql.gz"
    
    with tempfile.TemporaryDirectory() as tmpdir:
        # Download backup
        subprocess.run([
            "aws", "s3", "cp", backup_file, f"{tmpdir}/backup.sql.gz"
        ], check=True)
        
        # Extract
        subprocess.run([
            "gunzip", f"{tmpdir}/backup.sql.gz"
        ], check=True)
        
        # Create test database
        conn = psycopg2.connect(
            host="postgres-test",
            database="postgres",
            user="test",
            password="test"
        )
        conn.autocommit = True
        cur = conn.cursor()
        cur.execute("DROP DATABASE IF EXISTS restore_test")
        cur.execute("CREATE DATABASE restore_test")
        cur.close()
        conn.close()
        
        # Restore to test database
        subprocess.run([
            "psql", "-h", "postgres-test", "-d", "restore_test",
            "-f", f"{tmpdir}/backup.sql"
        ], check=True)
        
        # Verify critical tables exist
        conn = psycopg2.connect(
            host="postgres-test",
            database="restore_test",
            user="test",
            password="test"
        )
        cur = conn.cursor()
        
        # Check tables
        cur.execute("""
            SELECT table_name FROM information_schema.tables 
            WHERE table_schema = 'public'
        """)
        tables = [row[0] for row in cur.fetchall()]
        
        assert 'users' in tables, "users table missing"
        assert 'orders' in tables, "orders table missing"
        
        # Check row counts
        cur.execute("SELECT COUNT(*) FROM users")
        user_count = cur.fetchone()[0]
        assert user_count > 0, "No users in backup"
        
        print(f"✓ Backup verified: {len(tables)} tables, {user_count} users")

Replication and High Availability

PostgreSQL Streaming Replication

# postgresql.conf - Primary
wal_level = replica
max_wal_senders = 3
wal_keep_size = 1GB

# postgresql.conf - Replica
hot_standby = on
primary_conninfo = 'host=primary-db port=5432 user=replica password=secret'

-- Create replication user
CREATE USER replica WITH REPLICATION PASSWORD 'secret';

-- Create replication slot
SELECT * FROM pg_create_physical_replication_slot('replication_slot_1');

-- Check replication status
SELECT * FROM pg_stat_replication;
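pg_stat_replication reports positions as LSNs (e.g. 0/3000060): a 64-bit WAL byte position written as two hex halves separated by a slash. Replication lag in bytes is simply the difference between the primary's current LSN and the replica's replayed LSN. A small helper (the example values in the test are illustrative):

```python
def lsn_to_bytes(lsn):
    """Convert a PostgreSQL LSN like '0/3000060' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def replication_lag_bytes(primary_lsn, replica_lsn):
    """Bytes of WAL the replica has not yet replayed."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(replica_lsn)
```

On the primary, the first argument would come from pg_current_wal_lsn() and the second from the replay_lsn column of pg_stat_replication.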

Automatic Failover with Patroni

# patroni.yml - HA configuration
scope: postgres-cluster
namespace: /service
name: postgresql-0

restapi:
  listen: 0.0.0.0:8008
  connect_address: postgresql-0.example.com:8008

etcd:
  hosts: etcd-0.example.com:2379,etcd-1.example.com:2379,etcd-2.example.com:2379

postgresql:
  listen: 0.0.0.0:5432
  connect_address: postgresql-0.example.com:5432
  data_dir: /data/postgresql
  parameters:
    max_connections: 100
    shared_buffers: 128MB
    wal_level: replica
    hot_standby: "on"
    max_wal_senders: 5
    wal_keep_size: 1GB

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    primary_start_timeout: 30
    synchronous_mode: true
    # Cluster members register themselves in the DCS (etcd above) and
    # Patroni elects the leader; no static member list is declared here.

Performance Optimization

Connection Pooling

# PgBouncer configuration
[databases]
mydb = host=postgres.default port=5432 dbname=mydb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 200
default_pool_size = 25
min_pool_size = 10
reserve_pool_size = 5
reserve_pool_timeout = 5
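It is worth checking the arithmetic behind these settings: each database/user pair can open up to default_pool_size + reserve_pool_size server connections, and the worst-case total across pools must stay under PostgreSQL's max_connections (minus a few superuser slots). A quick sanity check (the function name and the reserved-slot default are illustrative):

```python
def pgbouncer_fits(num_pools, default_pool_size, reserve_pool_size,
                   max_connections, superuser_reserved=3):
    """True if worst-case server connections stay within PostgreSQL's limit."""
    worst_case = num_pools * (default_pool_size + reserve_pool_size)
    return worst_case <= max_connections - superuser_reserved
```

With the values above (one pool, 25 + 5 = 30 server connections), a default max_connections of 100 leaves headroom: 200 client connections are multiplexed onto at most 30 server connections because pool_mode is transaction.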

Query Optimization

-- Analyze slow queries
SELECT 
    query,
    calls,
    mean_time,
    total_time,
    rows
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;

-- Create index for frequently queried columns
CREATE INDEX CONCURRENTLY idx_orders_user_id 
ON orders(user_id) 
WHERE status = 'pending';

-- Use partial indexes
CREATE INDEX idx_active_users ON users(email) 
WHERE deleted_at IS NULL;

Monitoring and Observability

Database Metrics

# Prometheus PostgreSQL exporter
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-exporter
data:
  config.yaml: |
    datasources:
      - url: postgres://postgres:secret@postgres.default:5432/mydb?sslmode=disable
        queries:
          - pg_stat_database:
              metrics:
                - blks_read
                - blks_hit
                - tup_inserted
                - tup_updated
                - tup_deleted
                - conflicts
                - temp_files
                - temp_bytes
                - deadlocks
          - pg_stat_user_tables:
              metrics:
                - seq_scan
                - seq_tup_read
                - idx_scan
                - idx_tup_fetch
                - n_tup_ins
                - n_tup_upd
                - n_tup_del

Alerting Rules

groups:
  - name: postgres
    rules:
      - alert: HighConnections
        expr: pg_stat_database_numbackends / pg_settings_max_connections > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL approaching connection limit"
          
      - alert: SlowQueries
        expr: rate(pg_stat_statements_mean_time[5m]) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow queries detected"
          
      - alert: ReplicationLag
        expr: pg_replication_lag > 30
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Replication lag is critical"
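The HighConnections expression above fires when connection usage exceeds 80% of max_connections. The same check, expressed as a plain function, is handy in smoke tests or a standalone watchdog where Prometheus is not available (names here are illustrative):

```python
def connection_saturation(numbackends, max_connections):
    """Fraction of available connections in use, as the alert expression computes it."""
    return numbackends / max_connections

def high_connections_alert(numbackends, max_connections, threshold=0.8):
    """Mirror of the HighConnections rule: fire strictly above the threshold."""
    return connection_saturation(numbackends, max_connections) > threshold
```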

Database Migration Patterns

Blue-Green Database Deployment

#!/bin/bash
# blue-green-migration.sh

BLUE_DB="mydb-blue"
GREEN_DB="mydb-green"
ACTIVE_DB=$(aws rds describe-db-instances \
    --db-instance-identifier mydb \
    --query 'DBInstances[0].TagList[?Key==`active`].Value' \
    --output text)

# Flip colors: deploy to whichever instance is not currently active
if [ "$ACTIVE_DB" = "$BLUE_DB" ]; then
    NEW_DB="$GREEN_DB"
else
    NEW_DB="$BLUE_DB"
fi

# Create new database
aws rds create-db-instance \
    --db-instance-identifier "$NEW_DB" \
    --db-instance-class db.t3.medium \
    --engine postgres \
    --allocated-storage 100

# Run migrations on new database
echo "Running migrations on $NEW_DB..."
flyway migrate -url="jdbc:postgresql://$NEW_DB...."

# Test new database
echo "Running smoke tests..."
./smoke-test.sh "$NEW_DB"

# Switch the active tag (add-tags-to-resource requires the instance ARN)
aws rds add-tags-to-resource \
    --resource-name "$(aws rds describe-db-instances \
        --db-instance-identifier mydb \
        --query 'DBInstances[0].DBInstanceArn' --output text)" \
    --tags Key=active,Value="$NEW_DB"

echo "Migration complete. Active database: $NEW_DB"

Conclusion

Database DevOps transforms databases from deployment bottlenecks into reliable, automatable components. The key practices:

  • Version control all schema changes
  • Automate migrations with proper testing
  • Implement robust backup and recovery
  • Configure high availability with automatic failover
  • Monitor performance and set alerts
  • Practice migrations regularly

Start small - pick one database, implement proper migrations and backups, then expand to other databases.
