
Monitoring and Logging in Production: Building Observable Systems


In production environments, things go wrong. Services fail, performance degrades, and users experience issues. The difference between a team that responds to problems in minutes versus hours often comes down to one thing: observability.

Observability, the ability to understand what’s happening inside your systems, depends on three pillars: metrics, logs, and traces. Without proper monitoring and logging, you’re flying blind. You’ll spend hours debugging issues that could have been identified in seconds with the right visibility.

In this comprehensive guide, we’ll explore how to build observable systems through effective monitoring and logging strategies.


Monitoring vs Logging: Understanding the Difference

While often mentioned together, monitoring and logging serve distinct purposes:

Monitoring

Monitoring is the continuous collection and analysis of system metrics to understand health and performance.

Characteristics:

  • Focuses on quantitative data (numbers, measurements)
  • Real-time or near-real-time
  • Aggregated and summarized
  • Used for alerting and dashboards
  • Examples: CPU usage, response time, error rate

Example:

CPU Usage: 75%
Memory Usage: 82%
Request Latency (p95): 245ms
Error Rate: 0.5%
Requests Per Second: 1,250

Logging

Logging is the recording of discrete events that occur in your system.

Characteristics:

  • Focuses on qualitative data (events, messages)
  • Detailed and verbose
  • Stored for later analysis
  • Used for debugging and auditing
  • Examples: Application errors, user actions, system events

Example:

2025-01-15 10:30:45 ERROR [auth-service] Failed login attempt for user [email protected]: Invalid password
2025-01-15 10:30:46 INFO [payment-service] Payment processed: order_id=12345, amount=$99.99, status=success
2025-01-15 10:30:47 WARN [database] Slow query detected: SELECT * FROM users WHERE status='active' took 2.5s

Why You Need Both

  • Monitoring tells you that something is wrong
  • Logging tells you why something is wrong

Together, they enable rapid problem identification and resolution.


Types of Monitoring

1. Infrastructure Monitoring

Monitor the underlying systems and resources:

# Example: Prometheus configuration for infrastructure monitoring
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'docker'
    static_configs:
      - targets: ['localhost:9323']

Key metrics:

  • CPU usage and load
  • Memory usage and swap
  • Disk space and I/O
  • Network bandwidth
  • Container/process health

2. Application Monitoring

Monitor application-level performance and behavior:

# Example: Application metrics with Prometheus
from flask import Flask, jsonify
from prometheus_client import Counter, Histogram, Gauge
import time

app = Flask(__name__)

# Counter: Total requests
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Histogram: Request latency
request_latency = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)

# Gauge: Active connections
active_connections = Gauge(
    'active_connections',
    'Number of active connections'
)

@app.route('/api/users')
def get_users():
    start_time = time.time()

    try:
        users = fetch_users()
        request_count.labels(method='GET', endpoint='/api/users', status=200).inc()
        return jsonify(users)
    except Exception:
        request_count.labels(method='GET', endpoint='/api/users', status=500).inc()
        raise
    finally:
        # Record latency on both the success and failure paths
        request_latency.labels(method='GET', endpoint='/api/users').observe(
            time.time() - start_time
        )

Key metrics:

  • Request rate and latency
  • Error rates
  • Business metrics (conversions, transactions)
  • Resource usage (database connections, cache hits)

3. Synthetic Monitoring

Proactively test your systems from external locations:

# Example: Synthetic monitoring with requests
import time

import requests
import schedule

def synthetic_test():
    """Test critical user journeys"""
    
    # Test 1: Homepage loads
    start = time.time()
    response = requests.get('https://example.com')
    latency = time.time() - start
    
    assert response.status_code == 200, "Homepage returned non-200 status"
    assert latency < 2.0, f"Homepage took {latency}s, expected < 2s"
    
    # Test 2: Login flow
    response = requests.post('https://example.com/login', json={
        'email': '[email protected]',
        'password': 'test123'
    })
    assert response.status_code == 200, "Login failed"
    
    # Test 3: API endpoint
    response = requests.get('https://api.example.com/users')
    assert response.status_code == 200, "API endpoint failed"
    
    print("All synthetic tests passed")

# Run every 5 minutes; schedule needs a loop to drive it
schedule.every(5).minutes.do(synthetic_test)

while True:
    schedule.run_pending()
    time.sleep(1)

Benefits:

  • Detect issues before users do
  • Monitor from multiple geographic locations
  • Test critical user journeys
  • Validate third-party dependencies

4. Real User Monitoring (RUM)

Collect performance data from actual users:

// Example: Real User Monitoring with web vitals
import {getCLS, getFID, getFCP, getLCP, getTTFB} from 'web-vitals';

getCLS(console.log);  // Cumulative Layout Shift
getFID(console.log);  // First Input Delay
getFCP(console.log);  // First Contentful Paint
getLCP(console.log);  // Largest Contentful Paint
getTTFB(console.log); // Time to First Byte

// Send to monitoring service
function sendMetric(metric) {
    fetch('/api/metrics', {
        method: 'POST',
        body: JSON.stringify(metric)
    });
}

getCLS(sendMetric);
getFID(sendMetric);

Benefits:

  • Real performance data from actual users
  • Identify geographic or device-specific issues
  • Understand user experience impact
  • Detect issues synthetic monitoring misses

Logging Best Practices

Structured Logging

Use structured, machine-readable log formats instead of unstructured text:

# ✗ Bad: Unstructured logging
logger.info(f"User {user_id} logged in from {ip_address} at {timestamp}")

# ✓ Good: Structured logging (JSON)
import json
import logging

logger = logging.getLogger(__name__)

logger.info(json.dumps({
    'event': 'user_login',
    'user_id': user_id,
    'ip_address': ip_address,
    'timestamp': timestamp,
    'session_id': session_id
}))

# Or use a structured logging library
from pythonjsonlogger import jsonlogger

logHandler = logging.StreamHandler()
formatter = jsonlogger.JsonFormatter()
logHandler.setFormatter(formatter)
logger.addHandler(logHandler)

logger.info('User login', extra={
    'user_id': user_id,
    'ip_address': ip_address,
    'session_id': session_id
})

Benefits:

  • Easy to parse and search
  • Enables correlation across systems
  • Supports rich querying
  • Works well with log aggregation tools
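The payoff comes at query time: JSON log lines can be filtered programmatically instead of with brittle regexes. A minimal sketch of the idea (the `find_events` helper and the sample records are invented for illustration):

```python
import json

def find_events(log_lines, event, **filters):
    """Yield parsed JSON log records matching an event name and field values."""
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON lines mixed into the stream
        if record.get("event") != event:
            continue
        if all(record.get(key) == value for key, value in filters.items()):
            yield record

logs = [
    '{"event": "user_login", "user_id": 42, "ip_address": "10.0.0.5"}',
    '{"event": "user_login", "user_id": 7, "ip_address": "10.0.0.9"}',
    'not json at all',
]
matches = list(find_events(logs, "user_login", user_id=42))
```

The same query against unstructured text would need a regex per field and would silently break the first time the message wording changed.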

Log Levels

Use appropriate log levels to control verbosity:

import logging

# DEBUG: Detailed information for debugging
logger.debug(f"Processing request: {request_id}")

# INFO: General informational messages
logger.info(f"User {user_id} logged in successfully")

# WARNING: Warning messages for potentially problematic situations
logger.warning(f"High memory usage detected: {memory_percent}%")

# ERROR: Error messages for serious problems
logger.error(f"Failed to connect to database: {error}")

# CRITICAL: Critical messages for very serious problems
logger.critical(f"System is shutting down due to critical error: {error}")

Best practices:

  • Use DEBUG for development only
  • Use INFO for important business events
  • Use WARNING for recoverable issues
  • Use ERROR for failures that need attention
  • Use CRITICAL sparingly for system-critical issues
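One common way to apply these levels is to drive the threshold from the environment, so production runs at INFO while a developer can flip on DEBUG locally. A sketch using the standard `logging` module (the `LOG_LEVEL` variable name is an assumption for this example, not a built-in convention):

```python
import logging
import os

def configure_logger(name: str) -> logging.Logger:
    """Create a logger whose verbosity is controlled by the LOG_LEVEL env var."""
    level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, logging.INFO)  # fall back to INFO

    logger = logging.getLogger(name)
    logger.setLevel(level)

    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s [%(name)s] %(message)s"
    ))
    logger.addHandler(handler)
    return logger

logger = configure_logger("payment-service")
logger.debug("Suppressed when LOG_LEVEL=INFO")
logger.info("Emitted at INFO and below")
```

With this in place, turning up verbosity during an incident is a config change, not a deploy.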

What to Log

# ✓ Good: Log important events and context
logger.info('Payment processed', extra={
    'order_id': order_id,
    'amount': amount,
    'currency': currency,
    'payment_method': payment_method,
    'status': 'success',
    'duration_ms': duration
})

# ✓ Good: Log errors with context
try:
    process_payment(order)
except PaymentError as e:
    logger.error('Payment failed', extra={
        'order_id': order_id,
        'error_code': e.code,
        'error_message': str(e),
        'retry_count': retry_count,
        'user_id': user_id
    })

# ✗ Bad: Don't log sensitive data
logger.info(f"User password: {password}")  # Never!
logger.info(f"Credit card: {credit_card}")  # Never!

# ✗ Bad: Don't log too much
logger.debug(f"Processing item {i} of {total}")  # Too verbose

What to log:

  • Business events (purchases, logins, errors)
  • Performance metrics (latency, duration)
  • Error conditions with context
  • Security events (failed auth, permission denied)
  • System state changes

What NOT to log:

  • Passwords or API keys
  • Credit card numbers
  • Personal identification information
  • Excessive debug information in production

Key Metrics to Track

The Four Golden Signals

Google’s SRE book identifies four key metrics:

1. Latency

- Request latency (p50, p95, p99)
- Database query time
- API response time

2. Traffic

- Requests per second
- Bytes served
- Concurrent connections

3. Errors

- Error rate (5xx responses)
- Failed requests
- Exception rate

4. Saturation

- CPU usage
- Memory usage
- Disk usage
- Database connection pool utilization
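To make the four signals concrete, the sketch below derives latency percentiles, traffic, and error rate from raw `(status_code, duration_seconds)` samples. In practice a metrics backend computes these for you; the record format and function name here are invented for illustration:

```python
def golden_signals(requests, window_seconds):
    """Compute latency percentiles, traffic, and error rate
    from (status_code, duration_seconds) request samples."""
    durations = sorted(duration for _, duration in requests)

    def percentile(p):
        # Nearest-rank percentile over the sorted durations
        index = max(0, round(p / 100 * len(durations)) - 1)
        return durations[index]

    errors = sum(1 for status, _ in requests if status >= 500)
    return {
        "p50_latency": percentile(50),
        "p95_latency": percentile(95),
        "requests_per_second": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
    }
```

Saturation is the one signal you cannot derive from request samples alone; it comes from the infrastructure metrics covered earlier.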

Business Metrics

Track metrics that matter to your business:

# Example: E-commerce business metrics
business_metrics = {
    'conversion_rate': 0.032,           # 3.2% of visitors convert
    'average_order_value': 87.50,       # Average order value
    'cart_abandonment_rate': 0.68,      # 68% of carts abandoned
    'customer_acquisition_cost': 25.00, # Cost to acquire customer
    'customer_lifetime_value': 450.00,  # Expected lifetime value
    'payment_success_rate': 0.98,       # 98% of payments succeed
}

Alerting Strategies

Avoid Alert Fatigue

Alert fatigue occurs when too many alerts cause teams to ignore them:

# ✗ Bad: Too many alerts
alerts:
  - name: cpu_usage_above_50
    condition: cpu_usage > 50
    severity: critical
  
  - name: memory_usage_above_60
    condition: memory_usage > 60
    severity: critical
  
  - name: disk_usage_above_70
    condition: disk_usage > 70
    severity: critical

# ✓ Good: Meaningful alerts with appropriate thresholds
alerts:
  - name: cpu_usage_critical
    condition: cpu_usage > 90 for 5 minutes
    severity: critical
    action: page_oncall
  
  - name: error_rate_high
    condition: error_rate > 5% for 2 minutes
    severity: critical
    action: page_oncall
  
  - name: disk_usage_warning
    condition: disk_usage > 85%
    severity: warning
    action: send_slack_notification

Alert Best Practices

  • Alert on symptoms, not causes: Alert on error rate, not CPU usage
  • Use appropriate thresholds: Avoid alerting on normal fluctuations
  • Include context: Provide information to help debugging
  • Route to right team: Send alerts to teams that can act on them
  • Implement escalation: Escalate if not acknowledged
  • Review regularly: Remove alerts that never fire or always fire
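The `for 5 minutes` clauses in the rules above are what suppress flapping: the alert fires only once the condition has held continuously for the whole window. A toy evaluator showing the idea (this is not the API of any real alerting engine):

```python
class SustainedCondition:
    """Fire only after `condition` has been true for `hold_seconds` straight."""

    def __init__(self, hold_seconds):
        self.hold_seconds = hold_seconds
        self.true_since = None  # timestamp when the condition became true

    def update(self, condition_is_true, now):
        if not condition_is_true:
            self.true_since = None  # any dip resets the clock
            return False
        if self.true_since is None:
            self.true_since = now
        return now - self.true_since >= self.hold_seconds

alert = SustainedCondition(hold_seconds=300)  # "for 5 minutes"
```

A brief CPU spike never pages anyone; a spike that refuses to go away does.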

Correlation: Connecting Logs, Metrics, and Traces

Effective debugging requires correlating data across systems:

# Example: Correlation IDs for tracing requests
import uuid
from flask import request, g

@app.before_request
def set_correlation_id():
    """Set correlation ID for request tracing"""
    g.correlation_id = request.headers.get('X-Correlation-ID', str(uuid.uuid4()))

@app.after_request
def add_correlation_id_header(response):
    """Add correlation ID to response"""
    response.headers['X-Correlation-ID'] = g.correlation_id
    return response

# Use correlation ID in logs
logger.info('Processing request', extra={
    'correlation_id': g.correlation_id,
    'user_id': user_id,
    'endpoint': request.path
})

# Keep correlation IDs out of metric labels: an unbounded label
# like correlation_id would explode metric cardinality. Use the ID
# in logs and traces; keep metric labels low-cardinality.
request_count.labels(
    method=request.method,
    endpoint=request.path
).inc()

Benefits:

  • Track requests across multiple services
  • Correlate logs with metrics
  • Identify bottlenecks and failures
  • Understand end-to-end flow
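Propagation is the other half of correlation: every outbound call a service makes must forward the same `X-Correlation-ID` header it received, minting a fresh ID only at the edge. A stdlib-only sketch (the helper names are illustrative):

```python
import uuid

CORRELATION_HEADER = "X-Correlation-ID"

def ensure_correlation_id(incoming_headers):
    """Reuse the caller's correlation ID, or mint one at the edge."""
    return incoming_headers.get(CORRELATION_HEADER) or str(uuid.uuid4())

def outbound_headers(correlation_id, extra=None):
    """Build headers for a downstream call, propagating the ID."""
    headers = {CORRELATION_HEADER: correlation_id}
    if extra:
        headers.update(extra)
    return headers

# An incoming request that already carries an ID keeps it end to end
cid = ensure_correlation_id({"X-Correlation-ID": "abc-123"})
downstream = outbound_headers(cid, {"Accept": "application/json"})
```

If any hop drops the header, the trail goes cold there, which is why propagation is usually centralized in an HTTP client wrapper or middleware.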

Common Tools and Technologies

Monitoring Tools

  • Prometheus: Time-series database for metrics
  • Grafana: Visualization and dashboarding
  • Datadog: Cloud-based monitoring platform
  • New Relic: Application performance monitoring
  • CloudWatch: AWS monitoring service

Logging Tools

  • ELK Stack: Elasticsearch, Logstash, Kibana
  • Splunk: Enterprise logging platform
  • Loki: Log aggregation system
  • Papertrail: Cloud-based log management
  • CloudWatch Logs: AWS logging service

Distributed Tracing

  • Jaeger: Open-source distributed tracing
  • Zipkin: Distributed tracing system
  • Datadog APM: Application performance monitoring
  • AWS X-Ray: AWS distributed tracing

Cost Considerations and Data Retention

Managing Costs

# Example: Log retention policy
retention_policies:
  debug_logs:
    retention_days: 7
    storage: local
    cost_per_gb: $0.10
  
  info_logs:
    retention_days: 30
    storage: cloud
    cost_per_gb: $0.50
  
  error_logs:
    retention_days: 90
    storage: cloud
    cost_per_gb: $0.50
  
  audit_logs:
    retention_days: 365
    storage: archive
    cost_per_gb: $0.05

Cost optimization strategies:

  • Implement log sampling for high-volume services
  • Use appropriate retention periods
  • Archive old logs to cheaper storage
  • Filter unnecessary logs
  • Use compression
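Log sampling is easy to retrofit with a `logging.Filter` that passes every WARNING-and-above record but only a fraction of lower-level ones. A sketch (the 10% rate is an arbitrary example):

```python
import logging
import random

class SamplingFilter(logging.Filter):
    """Pass all WARNING+ records; sample lower levels at `rate`."""

    def __init__(self, rate=0.1):
        super().__init__()
        self.rate = rate

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True  # never drop warnings or errors
        return random.random() < self.rate

logger = logging.getLogger("high-volume-service")
handler = logging.StreamHandler()
handler.addFilter(SamplingFilter(rate=0.1))  # keep ~10% of INFO/DEBUG
logger.addHandler(handler)
```

Because the filter never touches WARNING and above, the records you most need during an incident are always retained; only the high-volume routine chatter is thinned.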

Security and Compliance

Protecting Log Data

# Example: Redacting sensitive data from logs
import re

def redact_sensitive_data(log_message):
    """Remove sensitive information from logs"""
    
    # Redact email addresses
    log_message = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
                         '[REDACTED_EMAIL]', log_message)
    
    # Redact credit card numbers
    log_message = re.sub(r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b', 
                         '[REDACTED_CC]', log_message)
    
    # Redact API keys
    log_message = re.sub(r'api[_-]?key["\']?\s*[:=]\s*["\']?[A-Za-z0-9_-]+', 
                         'api_key=[REDACTED]', log_message)
    
    return log_message

Security best practices:

  • Encrypt logs in transit and at rest
  • Implement access controls
  • Redact sensitive data
  • Audit log access
  • Comply with regulations (GDPR, HIPAA, etc.)

Conclusion

Effective monitoring and logging are essential for building reliable production systems. By implementing the strategies outlined in this guide, you’ll gain the visibility needed to detect issues quickly, debug efficiently, and maintain system reliability.

Key Takeaways

  • Implement both monitoring and logging: They serve complementary purposes
  • Use structured logging: Machine-readable logs enable better analysis
  • Track the right metrics: Focus on business and system health metrics
  • Avoid alert fatigue: Alert on symptoms, not causes
  • Correlate data: Use correlation IDs to connect logs, metrics, and traces
  • Manage costs: Implement retention policies and sampling strategies
  • Protect sensitive data: Redact and encrypt logs appropriately
  • Iterate and improve: Regularly review and refine your observability strategy

Building observable systems is an ongoing process. Start with the basics, measure what matters, and continuously improve your monitoring and logging infrastructure. Your future self, debugging a production issue at 3 AM, will thank you.
