Introduction
Traditional monitoring asks “is the system up?” Observability asks “why is it behaving this way?” The shift matters because modern distributed systems fail in ways that simple up/down checks can’t detect: a service is running but returning wrong data, latency is high only for specific users, or a cascade of small degradations is building toward an outage.
Observability 2.0 in 2026 means: OpenTelemetry as the universal instrumentation standard, AI-assisted root cause analysis, tail-based sampling to capture what matters, and treating observability configuration as code.
The Four Signals (Beyond Three Pillars)
The classic “three pillars” (metrics, logs, traces) have grown to four signals:
| Signal | Question it answers | Tool |
|---|---|---|
| Metrics | Is the system healthy? | Prometheus, Datadog |
| Logs | What happened at 14:32:05? | Loki, Elasticsearch |
| Traces | Why did this request take 3s? | Jaeger, Tempo |
| Profiles | Which function is burning CPU? | Pyroscope, Parca |
Continuous profiling is the new addition: it answers performance questions that metrics can’t.
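Conceptually, a continuous profiler samples call stacks on a timer and aggregates the counts. The sketch below shows that idea in plain Python; it is illustrative only (Pyroscope and Parca sample out-of-process with far lower overhead), and `busy_work` plus the 1 ms interval are invented for the demo:

```python
import collections
import sys
import threading
import time

def sample_stacks(stop, interval, counts):
    """Periodically snapshot the main thread's stack and count the
    function on top -- the core loop of a sampling profiler."""
    main_id = threading.main_thread().ident
    while not stop.is_set():
        frame = sys._current_frames().get(main_id)
        if frame is not None:
            counts[frame.f_code.co_name] += 1
        time.sleep(interval)

def busy_work(n):
    """Deliberately CPU-heavy function for the profiler to catch."""
    total = 0
    for i in range(n):
        total += i * i
    return total

counts = collections.Counter()
stop = threading.Event()
sampler = threading.Thread(target=sample_stacks, args=(stop, 0.001, counts))
sampler.start()
busy_work(5_000_000)
stop.set()
sampler.join()

# The function with the most samples is where the CPU time went;
# on an idle machine this is almost always busy_work.
hottest = counts.most_common(1)[0][0]
print(hottest)
```

Real profilers record the whole stack, not just the top frame, which is what makes flame graphs possible.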
OpenTelemetry: Instrument Once, Export Anywhere
OpenTelemetry (OTel) is the CNCF standard for instrumentation. Instrument your code once, send to any backend.
Auto-Instrumentation (Zero Code Changes)
```bash
# Python: auto-instrument a Flask app
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

# Run with auto-instrumentation
opentelemetry-instrument \
  --traces_exporter otlp \
  --metrics_exporter otlp \
  --logs_exporter otlp \
  --exporter_otlp_endpoint http://otel-collector:4317 \
  python app.py
```
```bash
# Node.js: auto-instrument Express
npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-http
```

```javascript
// tracing.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: 'http://otel-collector:4318/v1/traces' }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
```

Start with `node -r ./tracing.js app.js` so the SDK loads before your application code.
Manual Instrumentation for Business Events
Auto-instrumentation captures infrastructure calls. Add manual spans for business logic:
```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("order-service")

def process_order(order_id: str, user_id: str):
    with tracer.start_as_current_span("process_order") as span:
        # Add business context as attributes
        span.set_attribute("order.id", order_id)
        span.set_attribute("user.id", user_id)
        span.set_attribute("order.source", "web")
        try:
            inventory = check_inventory(order_id)
            span.set_attribute("inventory.available", inventory.available)
            if not inventory.available:
                span.set_status(Status(StatusCode.ERROR, "Out of stock"))
                span.add_event("inventory_check_failed", {
                    "product_id": inventory.product_id,
                    "requested": inventory.requested,
                    "available": inventory.available,
                })
                raise OutOfStockError(order_id)

            payment = charge_payment(order_id)
            span.set_attribute("payment.transaction_id", payment.transaction_id)
            span.add_event("payment_processed")
            return {"status": "success", "order_id": order_id}
        except Exception as e:
            span.record_exception(e)
            span.set_status(Status(StatusCode.ERROR, str(e)))
            raise
```
OTel Collector: The Central Hub
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }
  # Scrape Prometheus metrics
  prometheus:
    config:
      scrape_configs:
        - job_name: 'myapp'
          static_configs:
            - targets: ['app:8080']

processors:
  # Add environment context to all telemetry
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert
  # Batch for efficiency
  batch:
    timeout: 1s
    send_batch_size: 1024
  # Tail-based sampling (see below)
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: always-sample-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: sample-slow-traces
        type: latency
        latency: { threshold_ms: 1000 }
      - name: probabilistic-5pct
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

exporters:
  # Traces -> Tempo
  otlp/tempo:
    endpoint: tempo:4317
    tls: { insecure: true }
  # Metrics -> Prometheus
  prometheus:
    endpoint: 0.0.0.0:8889
  # Logs -> Loki
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [resource, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [loki]
```
Tail-Based Sampling: Capture What Matters
Head-based sampling (random 10%) misses most errors and slow requests. Tail-based sampling decides after the trace completes, so it always captures errors and slow traces:
Head-based (10% random):

- keeps a random 10% of normal requests
- might miss the one 5-second request
- might miss the one error

Tail-based (smart):

- keeps 100% of errors
- keeps 100% of requests > 1 second
- keeps 5% of normal requests
- much better signal-to-noise ratio
The OTel Collector config above implements this. The `decision_wait: 10s` setting means the collector buffers spans for 10 seconds before deciding whether to keep the trace.
AI-Powered Root Cause Analysis
Correlation-Based Detection
```python
from scipy import stats

class CorrelationAnalyzer:
    """Find metrics that correlate with error rate spikes."""

    def __init__(self, prometheus_client):
        self.prom = prometheus_client

    def find_correlated_metrics(self, error_spike_time: str, window: str = "30m") -> list[dict]:
        """
        When error rate spikes, find which other metrics changed at the same time.
        Returns metrics sorted by correlation strength.
        """
        # Get error rate time series
        error_rate = self.prom.query_range(
            'rate(http_requests_total{status=~"5.."}[5m])',
            start=error_spike_time,
            end=f"{error_spike_time}+{window}",
        )

        # Candidate metrics to check
        candidates = [
            'container_cpu_usage_seconds_total',
            'container_memory_usage_bytes',
            'pg_stat_activity_count',
            'redis_connected_clients',
            'http_request_duration_seconds',
        ]

        correlations = []
        for metric in candidates:
            series = self.prom.query_range(metric, start=error_spike_time,
                                           end=f"{error_spike_time}+{window}")
            if series:
                corr, p_value = stats.pearsonr(error_rate, series)
                if p_value < 0.05:  # statistically significant
                    correlations.append({
                        'metric': metric,
                        'correlation': corr,
                        'p_value': p_value,
                        'direction': 'positive' if corr > 0 else 'negative',
                    })
        return sorted(correlations, key=lambda x: abs(x['correlation']), reverse=True)
```
LLM-Assisted Incident Analysis
```python
from openai import OpenAI

client = OpenAI()

def analyze_incident(metrics_summary: dict, recent_logs: list[str], traces: list[dict]) -> str:
    """Use an LLM to suggest a root cause from observability data."""
    prompt = f"""
You are an SRE analyzing a production incident. Here is the observability data:

METRICS (last 30 minutes):
- Error rate: {metrics_summary['error_rate']}% (baseline: {metrics_summary['baseline_error_rate']}%)
- P99 latency: {metrics_summary['p99_latency_ms']}ms (baseline: {metrics_summary['baseline_p99']}ms)
- CPU usage: {metrics_summary['cpu_percent']}%
- Memory usage: {metrics_summary['memory_percent']}%
- DB connections: {metrics_summary['db_connections']} / {metrics_summary['db_max_connections']}

RECENT ERROR LOGS (last 10):
{chr(10).join(recent_logs[:10])}

SLOW TRACES (top 3 by duration):
{chr(10).join([f"- {t['duration_ms']}ms: {t['root_span']} -> {t['slowest_span']}" for t in traces[:3]])}

Based on this data:
1. What is the most likely root cause?
2. What immediate mitigation steps would you take?
3. What additional data would help confirm the diagnosis?
"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,  # low temperature for analytical tasks
    )
    return response.choices[0].message.content
```
Observability as Code
Define dashboards, alerts, and recording rules in version-controlled files:
```yaml
# alerts/api-service.yml - Prometheus alert rules
groups:
  - name: api-service
    rules:
      # Error budget burn rate (SLO-based alerting)
      - alert: ErrorBudgetBurnRateFast
        expr: |
          (
            rate(http_requests_total{status=~"5.."}[1h]) /
            rate(http_requests_total[1h])
          ) > 14.4 * 0.001  # 14.4x burn rate on a 99.9% SLO
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Fast error budget burn: {{ $value | humanizePercentage }} error rate"
          runbook: "https://runbooks.example.com/api-high-error-rate"
      # Latency SLO
      - alert: LatencySLOBreach
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency {{ $value | humanizeDuration }} exceeds 1s SLO"
```
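The `14.4 * 0.001` threshold in the fast-burn rule comes from standard error-budget arithmetic: a burn rate of 1 consumes the budget exactly over the SLO period, and 14.4x corresponds to spending 2% of a 30-day budget in one hour. A hypothetical helper (`burn_rate_threshold` is not a library function) to derive such thresholds:

```python
def burn_rate_threshold(slo: float, budget_consumed: float,
                        window_hours: float, period_days: int = 30) -> float:
    """Error-rate threshold that consumes `budget_consumed` of the
    error budget within `window_hours` of a `period_days` SLO period."""
    error_budget = 1.0 - slo  # e.g. 0.001 for a 99.9% SLO
    burn_rate = budget_consumed * (period_days * 24) / window_hours
    return burn_rate * error_budget

# Fast burn: 2% of the budget in 1 hour -> 14.4x burn rate, i.e. alert
# when the error rate exceeds 1.44% on a 99.9% SLO.
print(round(burn_rate_threshold(0.999, 0.02, 1) * 100, 2))  # prints 1.44
```

The same helper reproduces the common slow-burn rule as well: 5% of the budget over 6 hours gives a 6x burn rate, a 0.6% error-rate threshold.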
```python
# Generate Grafana dashboards programmatically
import requests

def create_service_dashboard(service_name: str) -> dict:
    """Generate a standard service dashboard."""
    return {
        "title": f"{service_name} Service Dashboard",
        "panels": [
            {
                "title": "Request Rate",
                "type": "timeseries",
                "targets": [{
                    "expr": f'sum(rate(http_requests_total{{service="{service_name}"}}[5m])) by (status_code)',
                    "legendFormat": "{{status_code}}",
                }],
            },
            {
                "title": "Error Rate %",
                "type": "stat",
                "targets": [{
                    "expr": f'100 * sum(rate(http_requests_total{{service="{service_name}",status=~"5.."}}[5m])) / sum(rate(http_requests_total{{service="{service_name}"}}[5m]))',
                }],
                "thresholds": {"steps": [
                    {"color": "green", "value": 0},
                    {"color": "yellow", "value": 1},
                    {"color": "red", "value": 5},
                ]},
            },
            {
                "title": "Latency Percentiles",
                "type": "timeseries",
                "targets": [
                    {"expr": f'histogram_quantile(0.50, rate(http_request_duration_seconds_bucket{{service="{service_name}"}}[5m]))', "legendFormat": "p50"},
                    {"expr": f'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{{service="{service_name}"}}[5m]))', "legendFormat": "p95"},
                    {"expr": f'histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{{service="{service_name}"}}[5m]))', "legendFormat": "p99"},
                ],
            },
        ],
    }

# Deploy via the Grafana API
dashboard = create_service_dashboard("payment-service")
requests.post(
    "http://grafana:3000/api/dashboards/db",
    json={"dashboard": dashboard, "overwrite": True},
    headers={"Authorization": "Bearer your-api-key"},
)
```
Continuous Verification with Synthetic Monitoring
Don’t wait for users to report problems; simulate them:
```python
# synthetic_monitor.py - runs the login-flow check every minute
import asyncio
import time

import httpx
from prometheus_client import Counter, Histogram, start_http_server

synthetic_duration = Histogram('synthetic_check_duration_seconds',
                               'Duration of synthetic checks',
                               ['check_name', 'status'])
synthetic_failures = Counter('synthetic_check_failures_total',
                             'Total synthetic check failures',
                             ['check_name'])

async def check_user_login_flow():
    """Simulate a complete user login and data fetch."""
    async with httpx.AsyncClient(base_url="https://api.example.com") as client:
        start = time.time()
        try:
            # Step 1: Login
            login = await client.post("/auth/login",
                                      json={"email": "[email protected]", "password": "test-password"})
            assert login.status_code == 200, f"Login failed: {login.status_code}"
            token = login.json()["token"]

            # Step 2: Fetch user data
            profile = await client.get("/api/me",
                                       headers={"Authorization": f"Bearer {token}"})
            assert profile.status_code == 200
            assert "email" in profile.json()

            # Step 3: Fetch recent orders
            orders = await client.get("/api/orders?limit=5",
                                      headers={"Authorization": f"Bearer {token}"})
            assert orders.status_code == 200

            duration = time.time() - start
            synthetic_duration.labels("user_login_flow", "success").observe(duration)
        except Exception:
            duration = time.time() - start
            synthetic_duration.labels("user_login_flow", "failure").observe(duration)
            synthetic_failures.labels("user_login_flow").inc()
            # Alert on the failure counter via Prometheus
            raise

async def main():
    start_http_server(9109)  # expose the metrics for Prometheus to scrape
    while True:
        try:
            await check_user_login_flow()
        except Exception:
            pass  # the failure is already recorded in the metrics above
        await asyncio.sleep(60)

if __name__ == "__main__":
    asyncio.run(main())
```
Resources
- OpenTelemetry Documentation
- OTel Collector Configuration
- Grafana Tempo (distributed tracing)
- Pyroscope (continuous profiling)
- Google SRE: Monitoring Distributed Systems