Introduction
Observability is the ability to understand what’s happening inside a system by examining its external outputs. For distributed systems, where a single user request might touch dozens of services, observability is not optional. Without it, debugging production issues becomes guesswork.
The three pillars of observability are metrics, logs, and traces. Together they answer different questions:
- Metrics: “Is the system healthy right now?” (CPU at 90%, error rate 2%)
- Logs: “What exactly happened at 14:32:05?” (specific events with context)
- Traces: “Why did this request take 3 seconds?” (full request path across services)
Metrics
Metrics are numerical measurements collected at regular intervals. They’re efficient to store and query, making them ideal for dashboards and alerting.
Types of Metrics
| Type | Description | Example |
|---|---|---|
| Counter | Monotonically increasing | Total requests, errors |
| Gauge | Can go up or down | Current connections, memory usage |
| Histogram | Distribution of values | Request duration, response size |
| Summary | Similar to histogram, pre-calculated percentiles | p50, p95, p99 latency |
Prometheus: The Standard
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'myapp'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['localhost:8080']
// Go: expose metrics with the Prometheus client library
import "github.com/prometheus/client_golang/prometheus"

var (
	requestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total HTTP requests",
		},
		[]string{"method", "path", "status"},
	)
	requestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request duration",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "path"},
	)
)

func init() {
	prometheus.MustRegister(requestsTotal, requestDuration)
}
// Middleware to record metrics (needs "net/http", "strconv", "time").
// responseWriter wraps http.ResponseWriter to capture the status code.
type responseWriter struct {
	http.ResponseWriter
	status int
}

func (rw *responseWriter) WriteHeader(code int) {
	rw.status = code
	rw.ResponseWriter.WriteHeader(code)
}

func metricsMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rw := &responseWriter{ResponseWriter: w, status: 200}
		next.ServeHTTP(rw, r)
		duration := time.Since(start).Seconds()
		requestsTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(rw.status)).Inc()
		requestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
	})
}
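To see what a scrape actually returns, here is a stdlib-only sketch that hand-writes the Prometheus text exposition format for a single counter. It is purely illustrative: `renderMetrics` is a hypothetical helper, `requestsTotal` here is a plain atomic stand-in for the real CounterVec above, and a real service would simply mount `promhttp.Handler()` from the client library.

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// requestsTotal is a plain atomic stand-in for the real CounterVec above.
var requestsTotal atomic.Int64

// renderMetrics hand-writes one counter in the Prometheus text exposition
// format, just to show what a scrape of /metrics sees.
func renderMetrics(count int64) string {
	return fmt.Sprintf("# TYPE http_requests_total counter\n"+
		"http_requests_total{method=%q,path=%q,status=%q} %d\n",
		"GET", "/", "200", count)
}

// metricsHandler serves the exposition text; in practice, use promhttp.Handler().
func metricsHandler(w http.ResponseWriter, r *http.Request) {
	fmt.Fprint(w, renderMetrics(requestsTotal.Load()))
}

func main() {
	requestsTotal.Add(3)
	fmt.Print(renderMetrics(requestsTotal.Load()))
}
```

Prometheus scrapes this plain-text endpoint on the `scrape_interval` configured above and stores each labeled line as a separate time series.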
Key Metrics to Track
# Application metrics
http_requests_total{method, path, status}
http_request_duration_seconds{method, path}
active_connections
error_rate
# Business metrics
orders_created_total
revenue_total
active_users
# Infrastructure metrics
cpu_usage_percent
memory_usage_bytes
disk_io_bytes
network_bytes_total
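These raw series become dashboard numbers via PromQL. A few illustrative queries over the metrics above (metric names as listed; exact label sets are assumptions):

```promql
# Request rate per second over the last 5 minutes
sum(rate(http_requests_total[5m]))

# Error ratio: 5xx responses as a fraction of all requests
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# p95 latency from the histogram's buckets
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))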
Logs
Logs capture discrete events with timestamps and context. Structured logging (JSON) makes logs machine-parseable.
Structured Logging
// Go: structured logging with slog (Go 1.21+)
import "log/slog"

logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
	Level: slog.LevelInfo,
}))

// Structured log entry
logger.Info("request completed",
	"method", r.Method,
	"path", r.URL.Path,
	"status", 200,
	"duration_ms", 45,
	"user_id", userID,
	"trace_id", traceID,
)
{
  "time": "2026-03-30T10:00:00Z",
  "level": "INFO",
  "msg": "request completed",
  "method": "GET",
  "path": "/api/users/42",
  "status": 200,
  "duration_ms": 45,
  "user_id": "user-123",
  "trace_id": "abc123def456"
}
Log Levels
DEBUG - detailed diagnostic info (dev only)
INFO  - normal operations, key events
WARN  - unexpected but handled situations
ERROR - failures that need attention
FATAL - unrecoverable errors (app exits)
Log Aggregation: ELK Stack
# docker-compose.yml for ELK
services:
  elasticsearch:
    image: elasticsearch:8.12.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports: ["9200:9200"]
  logstash:
    image: logstash:8.12.0
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
  kibana:
    image: kibana:8.12.0
    ports: ["5601:5601"]
    depends_on: [elasticsearch]
Distributed Tracing
Traces follow a request across service boundaries, showing the full execution path and timing of each operation.
Concepts
Trace: the complete journey of a request
└── Span: a single operation within the trace
    ├── Span: database query (50ms)
    ├── Span: cache lookup (2ms)
    └── Span: external API call (120ms)
Each span has:
- trace_id: shared across all spans in a trace
- span_id: unique to this span
- parent_span_id: links to the parent span
- start_time, end_time: when the operation began and ended
- attributes: key-value metadata
- events: timestamped log entries within the span
- status: OK or ERROR
OpenTelemetry: The Standard
OpenTelemetry (OTel) is the vendor-neutral standard for instrumentation. It replaces vendor-specific SDKs (Jaeger, Zipkin, Datadog) with a single API.
// Go: OpenTelemetry setup
import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	"go.opentelemetry.io/otel/sdk/resource"
	"go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

func initTracer(ctx context.Context) (*trace.TracerProvider, error) {
	// WithEndpoint takes host:port (no scheme); use WithEndpointURL for a full URL
	exporter, err := otlptracehttp.New(ctx,
		otlptracehttp.WithEndpoint("otel-collector:4318"),
	)
	if err != nil {
		return nil, err
	}
	tp := trace.NewTracerProvider(
		trace.WithBatcher(exporter),
		trace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceName("my-service"),
			semconv.ServiceVersion("1.0.0"),
		)),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}
// Instrument a function. Here "trace" is the API package
// go.opentelemetry.io/otel/trace (not sdk/trace), and "attribute" and
// "codes" come from go.opentelemetry.io/otel/attribute and .../codes.
func processOrder(ctx context.Context, orderID string) error {
	tracer := otel.Tracer("order-service")
	ctx, span := tracer.Start(ctx, "processOrder",
		trace.WithAttributes(attribute.String("order.id", orderID)),
	)
	defer span.End()

	// Database call creates a child span (assuming db is instrumented)
	if err := db.SaveOrder(ctx, orderID); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return err
	}
	span.AddEvent("order saved to database")
	return nil
}
Context Propagation
Traces work across services by propagating context via HTTP headers:
// Outgoing HTTP request: inject trace context into headers
req, _ := http.NewRequestWithContext(ctx, "GET", url, nil)
otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
// Adds: traceparent: 00-<trace-id>-<span-id>-01

// Incoming HTTP request: extract trace context from headers
ctx = otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
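Under the hood, the default W3C propagator encodes the context as a single `traceparent` header with four hyphen-separated fields. A minimal stdlib sketch of pulling those fields apart (`parseTraceparent` is hypothetical; OpenTelemetry's propagation package does this, plus validation, for you):

```go
package main

import (
	"fmt"
	"strings"
)

// parseTraceparent splits a W3C traceparent header value into its four
// fields: version, trace-id (32 hex chars), parent-id / span-id (16 hex
// chars), and trace flags.
func parseTraceparent(h string) (version, traceID, spanID, flags string, ok bool) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 {
		return "", "", "", "", false
	}
	return parts[0], parts[1], parts[2], parts[3], true
}

func main() {
	v, tid, sid, f, ok := parseTraceparent(
		"00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(ok, v, tid, sid, f)
}
```

Because the whole context fits in one header, any service along the path can continue the trace without coordinating with the others.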
The OpenTelemetry Collector
The OTel Collector receives telemetry from your apps and routes it to backends:
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }

processors:
  batch:
    timeout: 1s
  memory_limiter:
    check_interval: 1s   # required by the memory_limiter processor
    limit_mib: 512

exporters:
  # Note: recent Collector releases removed the dedicated jaeger exporter;
  # Jaeger now accepts OTLP directly
  jaeger:
    endpoint: jaeger:14250
  prometheus:
    endpoint: 0.0.0.0:8889
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
Alerting Strategies
SLO-Based Alerting
Service Level Objectives define target reliability. Alert on burn rate, the speed at which you’re consuming your error budget:
# Prometheus alerting rules
groups:
  - name: slo-alerts
    rules:
      # Alert if error rate > 1% for 5 minutes
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate: {{ $value | humanizePercentage }}"
      # Alert if p99 latency > 1 second
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 1.0
        for: 5m
        labels:
          severity: warning
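For true burn-rate alerting, rather than a fixed error-rate threshold, a common pattern is multiwindow rules. A sketch for a 99.9% availability SLO (error budget 0.1%), where burn rate is the observed error rate divided by the budget; the 14.4x factor and window sizes follow the widely used SRE-workbook pattern, and the metric names are the ones assumed above:

```yaml
# Multiwindow burn-rate alert (sketch): fire when the error budget for a
# 99.9% SLO burns 14.4x faster than sustainable, on both a long and a
# short window, to be fast without flapping
- alert: ErrorBudgetBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
        / sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
        / sum(rate(http_requests_total[5m]))
    ) > (14.4 * 0.001)
  labels:
    severity: critical
```

The long window confirms the burn is sustained; the short window lets the alert clear quickly once the problem stops.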
Grafana Dashboard
// Key panels for a service dashboard
{
  "panels": [
    "Request rate (req/s)",
    "Error rate (%)",
    "P50/P95/P99 latency",
    "Active connections",
    "CPU and memory usage",
    "Upstream service latency"
  ]
}
Sampling Strategies
High-volume systems can’t trace every request. Sampling reduces data volume:
// Head-based sampling: decide at trace start
tp := trace.NewTracerProvider(
	trace.WithSampler(trace.TraceIDRatioBased(0.1)), // sample 10% of traces
)

// Tail-based sampling: decide after the trace completes
// (requires the OTel Collector's tail_sampling processor)
# OTel Collector tail sampling
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: { status_codes: [ERROR] }    # always sample errors
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 1000 }           # always sample slow traces
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 5 } # 5% of the rest
Observability Stack Options
| Component | Open Source | Managed |
|---|---|---|
| Metrics | Prometheus + Grafana | Datadog, New Relic, Grafana Cloud |
| Logs | ELK Stack, Loki | Datadog, Splunk, CloudWatch |
| Traces | Jaeger, Tempo | Datadog, Honeycomb, Lightstep |
| All-in-one | Grafana Stack | Datadog, New Relic, Dynatrace |
Cost Management
Observability data grows fast. Control costs:
High-cardinality labels cause combinatorial metric growth: the series count is the product of each label's distinct values.
Bad: http_requests_total{user_id="12345"} (one series per user; millions of series)
Good: http_requests_total{status="200"} (a handful of series)
Log retention:
Hot (recent, fast): 7-30 days
Cold (archive): 90-365 days
Trace sampling:
100% in dev/staging
1-10% in production (always sample errors)
Resources
- OpenTelemetry Documentation
- Prometheus Documentation
- Grafana Documentation
- Jaeger Distributed Tracing
- Google SRE Book: Monitoring