
Monitoring and Logging in Production: Building Observable Systems


In production environments, things go wrong. Services fail, performance degrades, and users experience issues. The difference between a team that responds to problems in minutes versus hours often comes down to one thing: observability.

Observability, the ability to understand what’s happening inside your systems, depends on three pillars: metrics, logs, and traces. Without proper monitoring and logging, you’re flying blind. You’ll spend hours debugging issues that could have been identified in seconds with the right visibility.

In this comprehensive guide, we’ll explore how to build observable systems through effective monitoring and logging strategies.


Monitoring vs Logging: Understanding the Difference

While often mentioned together, monitoring and logging serve distinct purposes:

Monitoring

Monitoring is the continuous collection and analysis of system metrics to understand health and performance.

Characteristics:

  • Focuses on quantitative data (numbers, measurements)
  • Real-time or near-real-time
  • Aggregated and summarized
  • Used for alerting and dashboards
  • Examples: CPU usage, response time, error rate

Example:

CPU Usage: 75%
Memory Usage: 82%
Request Latency (p95): 245ms
Error Rate: 0.5%
Requests Per Second: 1,250

Logging

Logging is the recording of discrete events that occur in your system.

Characteristics:

  • Focuses on qualitative data (events, messages)
  • Detailed and verbose
  • Stored for later analysis
  • Used for debugging and auditing
  • Examples: Application errors, user actions, system events

Example:

2025-01-15 10:30:45 ERROR [auth-service] Failed login attempt for user [email protected]: Invalid password
2025-01-15 10:30:46 INFO [payment-service] Payment processed: order_id=12345, amount=$99.99, status=success
2025-01-15 10:30:47 WARN [database] Slow query detected: SELECT * FROM users WHERE status='active' took 2.5s

Why You Need Both

  • Monitoring tells you that something is wrong
  • Logging tells you why something is wrong

Together, they enable rapid problem identification and resolution.


Types of Monitoring

1. Infrastructure Monitoring

Monitor the underlying systems and resources:

# Example: Prometheus configuration for infrastructure monitoring
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'docker'
    static_configs:
      - targets: ['localhost:9323']

Key metrics:

  • CPU usage and load
  • Memory usage and swap
  • Disk space and I/O
  • Network bandwidth
  • Container/process health

2. Application Monitoring

Monitor application-level performance and behavior:

# Example: Application metrics with Prometheus
from flask import Flask, jsonify
from prometheus_client import Counter, Histogram, Gauge
import time

app = Flask(__name__)

# Counter: Total requests
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Histogram: Request latency
request_latency = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)

# Gauge: Active connections
active_connections = Gauge(
    'active_connections',
    'Number of active connections'
)

@app.route('/api/users')
def get_users():
    start_time = time.time()

    try:
        users = fetch_users()
        request_count.labels(method='GET', endpoint='/api/users', status=200).inc()
        return jsonify(users)
    except Exception:
        request_count.labels(method='GET', endpoint='/api/users', status=500).inc()
        raise
    finally:
        # Record latency on both the success and failure paths
        request_latency.labels(method='GET', endpoint='/api/users').observe(
            time.time() - start_time
        )

Key metrics:

  • Request rate and latency
  • Error rates
  • Business metrics (conversions, transactions)
  • Resource usage (database connections, cache hits)

3. Synthetic Monitoring

Proactively test your systems from external locations:

# Example: Synthetic monitoring with requests
import time

import requests
import schedule

def synthetic_test():
    """Test critical user journeys"""
    
    # Test 1: Homepage loads
    start = time.time()
    response = requests.get('https://example.com')
    latency = time.time() - start
    
    assert response.status_code == 200, "Homepage returned non-200 status"
    assert latency < 2.0, f"Homepage took {latency}s, expected < 2s"
    
    # Test 2: Login flow
    response = requests.post('https://example.com/login', json={
        'email': '[email protected]',
        'password': 'test123'
    })
    assert response.status_code == 200, "Login failed"
    
    # Test 3: API endpoint
    response = requests.get('https://api.example.com/users')
    assert response.status_code == 200, "API endpoint failed"
    
    print("All synthetic tests passed")

# Run every 5 minutes; schedule needs a loop to drive it
schedule.every(5).minutes.do(synthetic_test)

while True:
    schedule.run_pending()
    time.sleep(1)

Benefits:

  • Detect issues before users do
  • Monitor from multiple geographic locations
  • Test critical user journeys
  • Validate third-party dependencies

4. Real User Monitoring (RUM)

Collect performance data from actual users:

// Example: Real User Monitoring with web vitals
import {getCLS, getFID, getFCP, getLCP, getTTFB} from 'web-vitals';

getCLS(console.log);  // Cumulative Layout Shift
getFID(console.log);  // First Input Delay
getFCP(console.log);  // First Contentful Paint
getLCP(console.log);  // Largest Contentful Paint
getTTFB(console.log); // Time to First Byte

// Send to monitoring service
function sendMetric(metric) {
    fetch('/api/metrics', {
        method: 'POST',
        body: JSON.stringify(metric)
    });
}

getCLS(sendMetric);
getFID(sendMetric);

Benefits:

  • Real performance data from actual users
  • Identify geographic or device-specific issues
  • Understand user experience impact
  • Detect issues synthetic monitoring misses

Logging Best Practices

Structured Logging

Use structured, machine-readable log formats instead of unstructured text:

# ✗ Bad: Unstructured logging
logger.info(f"User {user_id} logged in from {ip_address} at {timestamp}")

# ✓ Good: Structured logging (JSON)
import json
import logging

logger = logging.getLogger(__name__)

logger.info(json.dumps({
    'event': 'user_login',
    'user_id': user_id,
    'ip_address': ip_address,
    'timestamp': timestamp,
    'session_id': session_id
}))

# Or use a structured logging library
from pythonjsonlogger import jsonlogger

logHandler = logging.StreamHandler()
formatter = jsonlogger.JsonFormatter()
logHandler.setFormatter(formatter)
logger.addHandler(logHandler)

logger.info('User login', extra={
    'user_id': user_id,
    'ip_address': ip_address,
    'session_id': session_id
})

Benefits:

  • Easy to parse and search
  • Enables correlation across systems
  • Supports rich querying
  • Works well with log aggregation tools
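The payoff comes at query time: JSON log lines can be filtered programmatically instead of with brittle regexes. A minimal sketch of the idea (the `find_events` helper and the sample records are invented for illustration):

```python
import json

def find_events(log_lines, event, **filters):
    """Yield parsed JSON log records matching an event name and field values."""
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON lines mixed into the stream
        if record.get("event") != event:
            continue
        if all(record.get(key) == value for key, value in filters.items()):
            yield record

logs = [
    '{"event": "user_login", "user_id": 42, "ip_address": "10.0.0.5"}',
    '{"event": "user_login", "user_id": 7, "ip_address": "10.0.0.9"}',
    'not json at all',
]
matches = list(find_events(logs, "user_login", user_id=42))
```

The same query against unstructured text would need a regex per field and would silently break the first time the message wording changed.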

Log Levels

Use appropriate log levels to control verbosity:

import logging

# DEBUG: Detailed information for debugging
logger.debug(f"Processing request: {request_id}")

# INFO: General informational messages
logger.info(f"User {user_id} logged in successfully")

# WARNING: Warning messages for potentially problematic situations
logger.warning(f"High memory usage detected: {memory_percent}%")

# ERROR: Error messages for serious problems
logger.error(f"Failed to connect to database: {error}")

# CRITICAL: Critical messages for very serious problems
logger.critical(f"System is shutting down due to critical error: {error}")

Best practices:

  • Use DEBUG for development only
  • Use INFO for important business events
  • Use WARNING for recoverable issues
  • Use ERROR for failures that need attention
  • Use CRITICAL sparingly for system-critical issues
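One common way to apply these levels is to drive the threshold from the environment, so production runs at INFO while a developer can flip on DEBUG locally. A sketch using the standard `logging` module (the `LOG_LEVEL` variable name is an assumption for this example, not a built-in convention):

```python
import logging
import os

def configure_logger(name: str) -> logging.Logger:
    """Create a logger whose verbosity is controlled by the LOG_LEVEL env var."""
    level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, logging.INFO)  # fall back to INFO

    logger = logging.getLogger(name)
    logger.setLevel(level)

    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s [%(name)s] %(message)s"
    ))
    logger.addHandler(handler)
    return logger

logger = configure_logger("payment-service")
logger.debug("Suppressed when LOG_LEVEL=INFO")
logger.info("Emitted at INFO and below")
```

With this in place, turning up verbosity during an incident is a config change, not a deploy.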

What to Log

# ✓ Good: Log important events and context
logger.info('Payment processed', extra={
    'order_id': order_id,
    'amount': amount,
    'currency': currency,
    'payment_method': payment_method,
    'status': 'success',
    'duration_ms': duration
})

# ✓ Good: Log errors with context
try:
    process_payment(order)
except PaymentError as e:
    logger.error('Payment failed', extra={
        'order_id': order_id,
        'error_code': e.code,
        'error_message': str(e),
        'retry_count': retry_count,
        'user_id': user_id
    })

# ✗ Bad: Don't log sensitive data
logger.info(f"User password: {password}")  # Never!
logger.info(f"Credit card: {credit_card}")  # Never!

# ✗ Bad: Don't log too much
logger.debug(f"Processing item {i} of {total}")  # Too verbose

What to log:

  • Business events (purchases, logins, errors)
  • Performance metrics (latency, duration)
  • Error conditions with context
  • Security events (failed auth, permission denied)
  • System state changes

What NOT to log:

  • Passwords or API keys
  • Credit card numbers
  • Personal identification information
  • Excessive debug information in production

Key Metrics to Track

The Four Golden Signals

Google’s SRE book identifies four key metrics:

1. Latency

- Request latency (p50, p95, p99)
- Database query time
- API response time

2. Traffic

- Requests per second
- Bytes served
- Concurrent connections

3. Errors

- Error rate (5xx responses)
- Failed requests
- Exception rate

4. Saturation

- CPU usage
- Memory usage
- Disk usage
- Database connection pool utilization
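To make the four signals concrete, the sketch below derives latency percentiles, traffic, and error rate from raw `(status_code, duration_seconds)` samples. In practice a metrics backend computes these for you; the record format and function name here are invented for illustration:

```python
def golden_signals(requests, window_seconds):
    """Compute latency percentiles, traffic, and error rate
    from (status_code, duration_seconds) request samples."""
    durations = sorted(duration for _, duration in requests)

    def percentile(p):
        # Nearest-rank percentile over the sorted durations
        index = max(0, round(p / 100 * len(durations)) - 1)
        return durations[index]

    errors = sum(1 for status, _ in requests if status >= 500)
    return {
        "p50_latency": percentile(50),
        "p95_latency": percentile(95),
        "requests_per_second": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
    }
```

Saturation is the one signal you cannot derive from request samples alone; it comes from the infrastructure metrics covered earlier.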

Business Metrics

Track metrics that matter to your business:

# Example: E-commerce business metrics
business_metrics = {
    'conversion_rate': 0.032,           # 3.2% of visitors convert
    'average_order_value': 87.50,       # Average order value
    'cart_abandonment_rate': 0.68,      # 68% of carts abandoned
    'customer_acquisition_cost': 25.00, # Cost to acquire customer
    'customer_lifetime_value': 450.00,  # Expected lifetime value
    'payment_success_rate': 0.98,       # 98% of payments succeed
}

Alerting Strategies

Avoid Alert Fatigue

Alert fatigue occurs when too many alerts cause teams to ignore them:

# ✗ Bad: Too many alerts
alerts:
  - name: cpu_usage_above_50
    condition: cpu_usage > 50
    severity: critical
  
  - name: memory_usage_above_60
    condition: memory_usage > 60
    severity: critical
  
  - name: disk_usage_above_70
    condition: disk_usage > 70
    severity: critical

# ✓ Good: Meaningful alerts with appropriate thresholds
alerts:
  - name: cpu_usage_critical
    condition: cpu_usage > 90 for 5 minutes
    severity: critical
    action: page_oncall
  
  - name: error_rate_high
    condition: error_rate > 5% for 2 minutes
    severity: critical
    action: page_oncall
  
  - name: disk_usage_warning
    condition: disk_usage > 85%
    severity: warning
    action: send_slack_notification

Alert Best Practices

  • Alert on symptoms, not causes: Alert on error rate, not CPU usage
  • Use appropriate thresholds: Avoid alerting on normal fluctuations
  • Include context: Provide information to help debugging
  • Route to right team: Send alerts to teams that can act on them
  • Implement escalation: Escalate if not acknowledged
  • Review regularly: Remove alerts that never fire or always fire
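The `for 5 minutes` clauses in the rules above are what suppress flapping: the alert fires only once the condition has held continuously for the whole window. A toy evaluator showing the idea (this is not the API of any real alerting engine):

```python
class SustainedCondition:
    """Fire only after `condition` has been true for `hold_seconds` straight."""

    def __init__(self, hold_seconds):
        self.hold_seconds = hold_seconds
        self.true_since = None  # timestamp when the condition became true

    def update(self, condition_is_true, now):
        if not condition_is_true:
            self.true_since = None  # any dip resets the clock
            return False
        if self.true_since is None:
            self.true_since = now
        return now - self.true_since >= self.hold_seconds

alert = SustainedCondition(hold_seconds=300)  # "for 5 minutes"
```

A brief CPU spike never pages anyone; a spike that refuses to go away does.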

Correlation: Connecting Logs, Metrics, and Traces

Effective debugging requires correlating data across systems:

# Example: Correlation IDs for tracing requests
import uuid
from flask import request, g

@app.before_request
def set_correlation_id():
    """Set correlation ID for request tracing"""
    g.correlation_id = request.headers.get('X-Correlation-ID', str(uuid.uuid4()))

@app.after_request
def add_correlation_id_header(response):
    """Add correlation ID to response"""
    response.headers['X-Correlation-ID'] = g.correlation_id
    return response

# Use correlation ID in logs
logger.info('Processing request', extra={
    'correlation_id': g.correlation_id,
    'user_id': user_id,
    'endpoint': request.path
})

# Keep correlation IDs out of metric labels: an unbounded label
# like correlation_id would explode metric cardinality. Use the ID
# in logs and traces; keep metric labels low-cardinality.
request_count.labels(
    method=request.method,
    endpoint=request.path
).inc()

Benefits:

  • Track requests across multiple services
  • Correlate logs with metrics
  • Identify bottlenecks and failures
  • Understand end-to-end flow
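Propagation is the other half of correlation: every outbound call a service makes must forward the same `X-Correlation-ID` header it received, minting a fresh ID only at the edge. A stdlib-only sketch (the helper names are illustrative):

```python
import uuid

CORRELATION_HEADER = "X-Correlation-ID"

def ensure_correlation_id(incoming_headers):
    """Reuse the caller's correlation ID, or mint one at the edge."""
    return incoming_headers.get(CORRELATION_HEADER) or str(uuid.uuid4())

def outbound_headers(correlation_id, extra=None):
    """Build headers for a downstream call, propagating the ID."""
    headers = {CORRELATION_HEADER: correlation_id}
    if extra:
        headers.update(extra)
    return headers

# An incoming request that already carries an ID keeps it end to end
cid = ensure_correlation_id({"X-Correlation-ID": "abc-123"})
downstream = outbound_headers(cid, {"Accept": "application/json"})
```

If any hop drops the header, the trail goes cold there, which is why propagation is usually centralized in an HTTP client wrapper or middleware.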

Common Tools and Technologies

Monitoring Tools

  • Prometheus: Time-series database for metrics
  • Grafana: Visualization and dashboarding
  • Datadog: Cloud-based monitoring platform
  • New Relic: Application performance monitoring
  • CloudWatch: AWS monitoring service

Logging Tools

  • ELK Stack: Elasticsearch, Logstash, Kibana
  • Splunk: Enterprise logging platform
  • Loki: Log aggregation system
  • Papertrail: Cloud-based log management
  • CloudWatch Logs: AWS logging service

Distributed Tracing

  • Jaeger: Open-source distributed tracing
  • Zipkin: Distributed tracing system
  • Datadog APM: Application performance monitoring
  • AWS X-Ray: AWS distributed tracing

Cost Considerations and Data Retention

Managing Costs

# Example: Log retention policy
retention_policies:
  debug_logs:
    retention_days: 7
    storage: local
    cost_per_gb: $0.10
  
  info_logs:
    retention_days: 30
    storage: cloud
    cost_per_gb: $0.50
  
  error_logs:
    retention_days: 90
    storage: cloud
    cost_per_gb: $0.50
  
  audit_logs:
    retention_days: 365
    storage: archive
    cost_per_gb: $0.05

Cost optimization strategies:

  • Implement log sampling for high-volume services
  • Use appropriate retention periods
  • Archive old logs to cheaper storage
  • Filter unnecessary logs
  • Use compression
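Log sampling is easy to retrofit with a `logging.Filter` that passes every WARNING-and-above record but only a fraction of lower-level ones. A sketch (the 10% rate is an arbitrary example):

```python
import logging
import random

class SamplingFilter(logging.Filter):
    """Pass all WARNING+ records; sample lower levels at `rate`."""

    def __init__(self, rate=0.1):
        super().__init__()
        self.rate = rate

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True  # never drop warnings or errors
        return random.random() < self.rate

logger = logging.getLogger("high-volume-service")
handler = logging.StreamHandler()
handler.addFilter(SamplingFilter(rate=0.1))  # keep ~10% of INFO/DEBUG
logger.addHandler(handler)
```

Because the filter never touches WARNING and above, the records you most need during an incident are always retained; only the high-volume routine chatter is thinned.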

Security and Compliance

Protecting Log Data

# Example: Redacting sensitive data from logs
import re

def redact_sensitive_data(log_message):
    """Remove sensitive information from logs"""
    
    # Redact email addresses
    log_message = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
                         '[REDACTED_EMAIL]', log_message)
    
    # Redact credit card numbers
    log_message = re.sub(r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b', 
                         '[REDACTED_CC]', log_message)
    
    # Redact API keys
    log_message = re.sub(r'api[_-]?key["\']?\s*[:=]\s*["\']?[A-Za-z0-9_-]+', 
                         'api_key=[REDACTED]', log_message)
    
    return log_message

Security best practices:

  • Encrypt logs in transit and at rest
  • Implement access controls
  • Redact sensitive data
  • Audit log access
  • Comply with regulations (GDPR, HIPAA, etc.)

Conclusion

Effective monitoring and logging are essential for building reliable production systems. By implementing the strategies outlined in this guide, you’ll gain the visibility needed to detect issues quickly, debug efficiently, and maintain system reliability.

Key Takeaways

  • Implement both monitoring and logging: They serve complementary purposes
  • Use structured logging: Machine-readable logs enable better analysis
  • Track the right metrics: Focus on business and system health metrics
  • Avoid alert fatigue: Alert on symptoms, not causes
  • Correlate data: Use correlation IDs to connect logs, metrics, and traces
  • Manage costs: Implement retention policies and sampling strategies
  • Protect sensitive data: Redact and encrypt logs appropriately
  • Iterate and improve: Regularly review and refine your observability strategy

Building observable systems is an ongoing process. Start with the basics, measure what matters, and continuously improve your monitoring and logging infrastructure. Your future self, debugging a production issue at 3 AM, will thank you.
