Introduction
Modern cloud applications are distributed, dynamic, and complex. Traditional monitoring approaches—checking if services are up and responding—are no longer sufficient. Observability goes beyond monitoring to help understand why systems behave as they do, enabling faster troubleshooting and deeper insights.
Observability encompasses three pillars: metrics, logs, and traces. Together, they provide comprehensive visibility into system behavior. Understanding how to implement effective monitoring and observability is essential for operating reliable cloud applications.
This guide examines monitoring and observability across cloud platforms: metrics collection, log aggregation, distributed tracing, alerting strategies, and building observable systems. Whether you are establishing monitoring from scratch or improving an existing implementation, it covers the practices needed to operate reliably.
Understanding Observability
Observability is the ability to understand a system's internal state from the telemetry it emits.
The Three Pillars
Metrics: Numerical measurements collected over time (CPU usage, request count, latency percentiles)
Logs: Timestamped records of events (application events, errors, access logs)
Traces: End-to-end request paths across distributed systems (request flow through microservices)
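The three pillars are most useful when they share a correlation ID, so a metric spike can be tied to the logs and spans for the same request. A minimal stdlib-only sketch (all names here are illustrative, not from any specific library):

```python
# One request produces a metric sample, a structured log line, and a trace
# span, all joined by the same request_id.
import json
import time
import uuid

metrics = []   # (name, value) samples
logs = []      # JSON log lines
spans = []     # (span_name, request_id, duration_seconds) records

def handle_request(path):
    request_id = str(uuid.uuid4())
    start = time.perf_counter()
    # ... real handler work would go here ...
    duration = time.perf_counter() - start
    # Metric: a numeric sample aggregated over time
    metrics.append(("http_requests_total", 1))
    # Log: a timestamped, structured event carrying the correlation ID
    logs.append(json.dumps({"request_id": request_id, "path": path, "msg": "handled"}))
    # Trace: a span recording where time was spent, keyed by the same ID
    spans.append(("handle_request", request_id, duration))
    return request_id

rid = handle_request("/orders")
```

Real instrumentation libraries propagate this ID automatically (as a trace ID), but the principle is the same: every signal a request emits should be joinable.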
Moving Beyond Monitoring
Traditional monitoring asks “is something wrong?” Observability asks “why is it happening?” This shift is crucial for complex distributed systems where failures can be subtle and root causes non-obvious.
Cloud-Native Monitoring Services
Amazon CloudWatch
# CloudWatch Dashboard
resource "aws_cloudwatch_dashboard" "main" {
  dashboard_name = "main-dashboard"

  dashboard_body = jsonencode({
    widgets = [
      {
        type = "metric"
        properties = {
          metrics = [
            ["AWS/EC2", "CPUUtilization", { stat = "Average" }],
            [".", "NetworkIn", { stat = "Sum" }]
          ]
          period = 300
          stat   = "Average"
          region = "us-east-1"
          title  = "EC2 Metrics"
        }
      },
      {
        type = "log"
        properties = {
          region = "us-east-1"
          title  = "Error Logs"
          query  = "fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20"
        }
      }
    ]
  })
}
Custom Metrics
# Publishing custom metrics to CloudWatch
import boto3
from datetime import datetime, timezone

cw = boto3.client('cloudwatch')

def publish_metric(namespace, metric_name, value, unit='Count', dimensions=None):
    cw.put_metric_data(
        Namespace=namespace,
        MetricData=[
            {
                'MetricName': metric_name,
                'Value': value,
                'Unit': unit,
                'Timestamp': datetime.now(timezone.utc),  # datetime.utcnow() is deprecated
                'Dimensions': dimensions or []
            }
        ]
    )

# Usage
publish_metric(
    namespace='MyApplication',
    metric_name='OrderProcessingTime',
    value=1.5,
    unit='Seconds',
    dimensions=[{'Name': 'Service', 'Value': 'orders'}]
)
Distributed Tracing
AWS X-Ray
# AWS X-Ray integration
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.ext.flask.middleware import XRayMiddleware

# Instrument the Flask app (the X-Ray SDK uses its own middleware,
# not OpenTelemetry's FlaskInstrumentor)
xray_recorder.configure(service='order-service')
XRayMiddleware(app, xray_recorder)

# Custom subsegments
@xray_recorder.capture('database_query')
def get_user(user_id):
    return db.query(User).filter(User.id == user_id).first()
OpenTelemetry
# OpenTelemetry setup
from opentelemetry import trace
# The Jaeger Thrift exporter is deprecated; newer setups use the OTLP exporter
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure tracer
trace.set_tracer_provider(TracerProvider())
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)
tracer = trace.get_tracer(__name__)

# Use in code
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    process_order(order_id)
Log Aggregation
Structured Logging
# JSON structured logging
import json
import logging
from datetime import datetime, timezone

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_data = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": "order-service",
            "environment": "production",
        }
        # Add exception info if present
        if record.exc_info:
            log_data["exception"] = self.formatException(record.exc_info)
        # Add custom fields passed via `extra`
        for field in ("user_id", "request_id", "order_id"):
            if hasattr(record, field):
                log_data[field] = getattr(record, field)
        return json.dumps(log_data)

# Usage
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)

# Log with context
logger.info("Order processed", extra={"order_id": "12345", "user_id": "user1"})
Log Analysis
-- Athena query for CloudWatch Logs
SELECT
    timestamp,
    level,
    message,
    json_extract_scalar(message, '$.order_id') AS order_id
FROM cloudwatch_logs
WHERE
    timestamp BETWEEN TIMESTAMP '2026-01-01' AND TIMESTAMP '2026-01-02'
    AND message LIKE '%ERROR%'
ORDER BY timestamp DESC
LIMIT 100
Alerting Strategies
Alert Design Principles
- Actionable: Every alert should require action
- No Alert Fatigue: Avoid noisy alerts
- Context: Provide context for debugging
- Severity Levels: Clear severity classification
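One way to act on the "no alert fatigue" principle is to suppress repeat notifications for the same alert within a per-severity interval. A minimal sketch, with illustrative intervals (critical alerts always page, warnings re-notify hourly, info daily):

```python
import time

class AlertDeduper:
    """Suppress repeat notifications for an alert within a per-severity interval."""
    REPEAT = {"critical": 0, "warning": 3600, "info": 86400}  # seconds; 0 = always send

    def __init__(self):
        self._last_sent = {}  # alert name -> timestamp of last notification

    def should_send(self, name, severity, now=None):
        now = time.time() if now is None else now
        interval = self.REPEAT.get(severity, 86400)  # default to low-noise
        last = self._last_sent.get(name)
        if last is not None and interval and now - last < interval:
            return False  # still inside the repeat window: stay quiet
        self._last_sent[name] = now
        return True

deduper = AlertDeduper()
sent = [deduper.should_send("high-latency", "warning", now=t) for t in (0, 100, 4000)]
# sent == [True, False, True]: the repeat at t=100 is suppressed
```

The class name and intervals are assumptions for illustration; real alert managers implement the same idea as grouping and repeat intervals.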
CloudWatch Alarms
# CloudWatch Alarm
resource "aws_cloudwatch_metric_alarm" "high_latency" {
  alarm_name          = "high-latency-alarm"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "TargetResponseTime"   # the AWS/ApplicationELB latency metric
  namespace           = "AWS/ApplicationELB"
  period              = 300
  extended_statistic  = "p95"                  # percentiles use extended_statistic, not statistic
  threshold           = 2                      # TargetResponseTime is reported in seconds
  alarm_description   = "This alarm triggers when p95 latency exceeds 2 seconds"

  dimensions = {
    LoadBalancer = "app/myalb/50dc6c495c0c9188"
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}
Alert Routing
# Alert routing rules
rules:
  - name: critical
    match:
      severity: critical
    route:
      channel: "#critical"   # quoted: a bare # starts a YAML comment
      repeat_interval: 0
  - name: warning
    match:
      severity: warning
    route:
      channel: "#warnings"
      repeat_interval: 3600
  - name: info
    match:
      severity: info
    route:
      channel: "#info"
      repeat_interval: 86400
Prometheus and Grafana
Prometheus Configuration
# Prometheus scrape configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
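On the application side, Prometheus pulls metrics over HTTP in a plain-text exposition format. This stdlib-only sketch (function and metric names are illustrative) renders a counter the way a client library would expose it at /metrics; in practice a library such as prometheus_client does this for you:

```python
# Render a counter in the Prometheus text exposition format:
# HELP/TYPE header lines followed by one labeled sample per series.
def render_counter(name, help_text, samples):
    """samples: list of (label_dict, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)

body = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [({"method": "GET", "path": "/orders", "status": "200"}, 42)],
)
```

This is the same `http_requests_total` series the Grafana queries below aggregate with `rate()`.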
Grafana Dashboards
{
  "dashboard": {
    "title": "Application Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{path}}"
          }
        ]
      },
      {
        "title": "Latency p95",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "p95"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))",
            "legendFormat": "Error Rate"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 0.01},
                {"color": "red", "value": 0.05}
              ]
            }
          }
        }
      }
    ]
  }
}
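The `histogram_quantile` expression in the latency panel deserves a word: Prometheus histograms expose cumulative bucket counts, and the quantile is linearly interpolated within the bucket where the target rank falls. A stdlib-only sketch with illustrative bucket bounds:

```python
# Approximate a quantile from cumulative histogram buckets, the way
# PromQL's histogram_quantile() does.
def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs, ending at +Inf."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # fall back to the last finite bound
            # Linear interpolation within the bucket containing the target rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# p95 of 100 observations: rank 95 falls in the (0.5, 1.0] bucket
p95 = histogram_quantile(0.95, [(0.1, 60), (0.5, 90), (1.0, 98), (2.0, 100), (float("inf"), 100)])
```

This is why bucket boundaries matter: the reported p95 is an interpolation, only as precise as the buckets around it.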
Building Observable Systems
Key Principles
# Adding observability to a service
class OrderService:
    def __init__(self):
        self.metrics = MetricsClient()
        self.tracer = Tracer()
        self.logger = Logger()

    def process_order(self, order_data):
        # Trace the operation
        with self.tracer.span("process_order") as span:
            span.set_tag("order_id", order_data["id"])
            try:
                # Validate order
                with self.tracer.span("validate_order"):
                    self._validate(order_data)

                # Save to database
                with self.tracer.span("save_order"):
                    order = self._save(order_data)

                # Emit metric
                self.metrics.increment("orders.processed", 1)
                self.logger.info("Order processed",
                                 order_id=order.id,
                                 user_id=order.user_id)
                return order
            except ValidationError as e:
                self.metrics.increment("orders.validation_errors", 1)
                self.logger.warning("Order validation failed",
                                    order_id=order_data.get("id"),
                                    error=str(e))
                raise
            except Exception as e:
                self.metrics.increment("orders.errors", 1)
                self.logger.error("Order processing failed",
                                  order_id=order_data.get("id"),
                                  error=str(e),
                                  traceback=True)
                raise
SLO Implementation
# Service Level Objectives
slo:
  name: Order Processing
  description: "Reliable order processing service"
  indicators:
    - name: "availability"
      description: "API availability"
      target: 99.9
      measurement:
        method: "ratio"
        good: "http_requests_total{status=~'2..'}"
        total: "http_requests_total"
    - name: "latency"
      description: "p95 latency"
      target: 95.0
      measurement:
        method: "histogram"
        threshold: "2s"
        metric: "http_request_duration_seconds"
    - name: "errors"
      description: "Error rate"
      target: 99.0
      measurement:
        method: "ratio"
        good: "http_requests_total{status!~'5..'}"
        total: "http_requests_total"
  alerts:
    - name: slo_violation
      condition: "indicator.budget_burn_rate > 1.0"
      severity: critical
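The `budget_burn_rate` condition above is simple arithmetic: for a 99.9% availability SLO, the error budget is the allowed 0.1% of requests, and the burn rate is the observed error rate divided by that budget. A burn rate of 1.0 means the budget is being consumed exactly as fast as the SLO window replenishes it. A minimal sketch with illustrative numbers:

```python
# Error-budget burn rate: observed error fraction relative to the
# fraction of failures the SLO allows.
def burn_rate(good, total, slo_target):
    """slo_target as a fraction, e.g. 0.999 for a 99.9% availability SLO."""
    error_budget = 1.0 - slo_target           # allowed failure fraction
    observed_error_rate = (total - good) / total
    return observed_error_rate / error_budget

# 50 failures out of 10,000 requests against a 99.9% target:
# observed error rate 0.005 against a 0.001 budget, i.e. burning 5x too fast
rate = burn_rate(good=9950, total=10000, slo_target=0.999)
```

Alerting on burn rate rather than raw error rate ties pages directly to SLO impact, which is what makes the `severity: critical` routing above defensible.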
Conclusion
Observability is essential for operating reliable cloud applications. The three pillars—metrics, logs, and traces—provide comprehensive visibility into system behavior. Implementing effective observability requires thoughtful instrumentation, appropriate tooling, and actionable alerting.
Key practices include instrumenting code for metrics and traces, implementing structured logging, creating meaningful alerts based on SLOs, and building dashboards that provide operational visibility. The investment in observability pays dividends through faster troubleshooting and better understanding of system behavior.
As cloud applications grow in complexity, observability becomes increasingly critical. Start with the basics, instrument progressively, and continuously improve based on operational needs.