
Observability Architecture: Metrics, Logs, Traces, and OpenTelemetry

Introduction

Observability is the ability to understand what’s happening inside a system by examining its external outputs. For distributed systems, where a single user request might touch dozens of services, observability is not optional. Without it, debugging production issues becomes guesswork.

The three pillars of observability are metrics, logs, and traces. Together they answer different questions:

  • Metrics: “Is the system healthy right now?” (CPU at 90%, error rate 2%)
  • Logs: “What exactly happened at 14:32:05?” (specific events with context)
  • Traces: “Why did this request take 3 seconds?” (full request path across services)

Metrics

Metrics are numerical measurements collected at regular intervals. They’re efficient to store and query, making them ideal for dashboards and alerting.

Types of Metrics

Type        Description                                         Example
Counter     Monotonically increasing                            Total requests, errors
Gauge       Can go up or down                                   Current connections, memory usage
Histogram   Distribution of values                              Request duration, response size
Summary     Like a histogram, with pre-calculated percentiles   p50, p95, p99 latency
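To see how histogram buckets become percentiles, here is a self-contained sketch of the linear interpolation behind PromQL's histogram_quantile(). It is simplified (the real function also handles the +Inf bucket and rate-adjusted counts), and the bucket values are made up for illustration:

```go
package main

import "fmt"

// bucket mirrors a Prometheus histogram bucket: the cumulative count of
// observations with value <= upperBound.
type bucket struct {
	upperBound float64
	count      float64
}

// quantile linearly interpolates within the bucket that contains the
// q-th observation, the same idea behind histogram_quantile().
func quantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * total
	lowerBound, lowerCount := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank {
			return lowerBound + (b.upperBound-lowerBound)*(rank-lowerCount)/(b.count-lowerCount)
		}
		lowerBound, lowerCount = b.upperBound, b.count
	}
	return buckets[len(buckets)-1].upperBound
}

func main() {
	// 100 requests: 60 under 0.1s, 30 between 0.1-0.5s, 10 between 0.5-1s
	buckets := []bucket{{0.1, 60}, {0.5, 90}, {1.0, 100}}
	fmt.Printf("p50=%.3fs p99=%.3fs\n", quantile(0.50, buckets), quantile(0.99, buckets))
	// → p50=0.083s p99=0.950s
}
```

The interpolation is why bucket boundaries matter: a p99 that falls inside a wide bucket is only an estimate.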

Prometheus: The Standard

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'myapp'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'
// Go: expose metrics with the Prometheus client library
import (
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

var (
    requestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total HTTP requests",
        },
        []string{"method", "path", "status"},
    )

    requestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "path"},
    )
)

func init() {
    prometheus.MustRegister(requestsTotal, requestDuration)
}

// responseWriter wraps http.ResponseWriter to capture the status code
type responseWriter struct {
    http.ResponseWriter
    status int
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.status = code
    rw.ResponseWriter.WriteHeader(code)
}

// Middleware to record metrics
func metricsMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        rw := &responseWriter{ResponseWriter: w, status: 200}

        next.ServeHTTP(rw, r)

        duration := time.Since(start).Seconds()
        requestsTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(rw.status)).Inc()
        requestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
    })
}

// Expose the endpoint Prometheus scrapes (from client_golang/prometheus/promhttp):
// http.Handle("/metrics", promhttp.Handler())

Key Metrics to Track

# Application metrics
http_requests_total{method, path, status}
http_request_duration_seconds{method, path}
active_connections
error_rate

# Business metrics
orders_created_total
revenue_total
active_users

# Infrastructure metrics
cpu_usage_percent
memory_usage_bytes
disk_io_bytes
network_bytes_total

Logs

Logs capture discrete events with timestamps and context. Structured logging (JSON) makes logs machine-parseable.

Structured Logging

// Go: structured logging with slog (Go 1.21+)
import (
    "log/slog"
    "os"
)

logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
    Level: slog.LevelInfo,
}))

// Structured log entry
logger.Info("request completed",
    "method", r.Method,
    "path", r.URL.Path,
    "status", 200,
    "duration_ms", 45,
    "user_id", userID,
    "trace_id", traceID,
)
// The resulting log line:
{
  "time": "2026-03-30T10:00:00Z",
  "level": "INFO",
  "msg": "request completed",
  "method": "GET",
  "path": "/api/users/42",
  "status": 200,
  "duration_ms": 45,
  "user_id": "user-123",
  "trace_id": "abc123def456"
}

Log Levels

DEBUG   - detailed diagnostic info (dev only)
INFO    - normal operations, key events
WARN    - unexpected but handled situations
ERROR   - failures that need attention
FATAL   - unrecoverable errors (app exits)

Log Aggregation: ELK Stack

# docker-compose.yml for ELK
services:
  elasticsearch:
    image: elasticsearch:8.12.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports: ["9200:9200"]

  logstash:
    image: logstash:8.12.0
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf

  kibana:
    image: kibana:8.12.0
    ports: ["5601:5601"]
    depends_on: [elasticsearch]
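The compose file mounts a ./logstash.conf that is not shown. A minimal pipeline sketch, assuming the app ships the JSON logs from the slog example over TCP (the port, field names, and index pattern are assumptions to adapt):

```conf
# logstash.conf - receive JSON log lines over TCP, ship to Elasticsearch
input {
  tcp {
    port  => 5000
    codec => json_lines
  }
}

filter {
  # Use the log's own "time" field instead of ingest time
  date {
    match => ["time", "ISO8601"]
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }
}
```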

Distributed Tracing

Traces follow a request across service boundaries, showing the full execution path and timing of each operation.

Concepts

Trace: the complete journey of a request
  └── Span: a single operation within the trace
        ├── Span: database query (50ms)
        ├── Span: cache lookup (2ms)
        └── Span: external API call (120ms)

Each span has:

  • trace_id - shared across all spans in a trace
  • span_id - unique to this span
  • parent_span_id - links to the parent span
  • start_time, end_time
  • attributes - key-value metadata
  • events - timestamped log entries within the span
  • status - Unset, OK, or ERROR

OpenTelemetry: The Standard

OpenTelemetry (OTel) is the vendor-neutral standard for instrumentation. It replaces vendor-specific SDKs (Jaeger, Zipkin, Datadog) with a single API.

// Go: OpenTelemetry setup
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

func initTracer(ctx context.Context) (*trace.TracerProvider, error) {
    // WithEndpoint takes host:port without a scheme
    // (use WithEndpointURL to pass a full URL)
    exporter, err := otlptracehttp.New(ctx,
        otlptracehttp.WithEndpoint("otel-collector:4318"),
        otlptracehttp.WithInsecure(), // plain HTTP inside the cluster
    )
    if err != nil {
        return nil, err
    }

    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceName("my-service"),
            semconv.ServiceVersion("1.0.0"),
        )),
    )

    otel.SetTracerProvider(tp)
    return tp, nil
}

// Instrument a function (also needs go.opentelemetry.io/otel/attribute,
// go.opentelemetry.io/otel/codes, and go.opentelemetry.io/otel/trace)
func processOrder(ctx context.Context, orderID string) error {
    tracer := otel.Tracer("order-service")
    ctx, span := tracer.Start(ctx, "processOrder",
        trace.WithAttributes(attribute.String("order.id", orderID)),
    )
    defer span.End()

    // Database call; an instrumented driver creates a child span from ctx
    if err := db.SaveOrder(ctx, orderID); err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return err
    }

    span.AddEvent("order saved to database")
    return nil
}

Context Propagation

Traces work across services by propagating context via HTTP headers:

// Outgoing HTTP request - inject trace context
req, _ := http.NewRequestWithContext(ctx, "GET", url, nil)
otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
// Adds a header like: traceparent: 00-<trace-id>-<span-id>-01

// Incoming HTTP request - extract trace context
ctx = otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
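The traceparent header follows the W3C Trace Context format: version-traceid-spanid-flags. Real code should rely on the propagator as above, but a stdlib-only sketch of parsing the header makes the wire format concrete (the helper name is illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// traceContext holds the four fields of a W3C traceparent header.
type traceContext struct {
	Version, TraceID, SpanID, Flags string
}

// parseTraceparent does minimal validation: four dash-separated parts,
// a 32-hex-char trace ID, and a 16-hex-char span ID.
func parseTraceparent(h string) (traceContext, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return traceContext{}, fmt.Errorf("malformed traceparent: %q", h)
	}
	return traceContext{parts[0], parts[1], parts[2], parts[3]}, nil
}

func main() {
	tc, err := parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		panic(err)
	}
	fmt.Println("trace_id:", tc.TraceID)
	fmt.Println("sampled:", tc.Flags == "01") // the 01 flag marks the trace as sampled
}
```

Because every service forwards the same trace_id, the backend can stitch spans from all services into one trace.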

The OpenTelemetry Collector

The OTel Collector receives telemetry from your apps and routes it to backends:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }

processors:
  batch:
    timeout: 1s
  memory_limiter:
    check_interval: 1s
    limit_mib: 512

exporters:
  # Recent Collector releases removed the dedicated jaeger exporter;
  # Jaeger accepts traces over OTLP instead
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]

Alerting Strategies

SLO-Based Alerting

Service Level Objectives define target reliability. Alert on burn rate: how fast you’re consuming your error budget:

# Prometheus alerting rules
groups:
  - name: slo-alerts
    rules:
      # Alert if error rate > 1% for 5 minutes
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
          / rate(http_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate: {{ $value | humanizePercentage }}"

      # Alert if p99 latency > 1 second
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 1.0
        for: 5m
        labels:
          severity: warning
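The fixed-threshold rules above fire at the same urgency no matter how fast the budget burns. A sketch of a multi-window burn-rate rule in the style of the Google SRE Workbook, assuming a 99.9% availability SLO over 30 days (a 14.4x burn rate would exhaust that budget in about two days; both windows must agree, so short blips don't page):

```yaml
      # Page when burning the 30-day error budget 14.4x too fast,
      # confirmed over both a long and a short window
      - alert: FastErrorBudgetBurn
        expr: |
          (
            rate(http_requests_total{status=~"5.."}[1h])
            / rate(http_requests_total[1h]) > 14.4 * 0.001
          )
          and
          (
            rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 14.4 * 0.001
          )
        labels:
          severity: critical
```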

Grafana Dashboard

// Key panels for a service dashboard
{
  "panels": [
    "Request rate (req/s)",
    "Error rate (%)",
    "P50/P95/P99 latency",
    "Active connections",
    "CPU and memory usage",
    "Upstream service latency"
  ]
}
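The first three panels map directly onto PromQL over the metrics defined earlier; a sketch of the queries:

```promql
# Request rate (req/s)
sum(rate(http_requests_total[5m]))

# Error rate (%)
100 * sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m]))

# P99 latency
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```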

Sampling Strategies

High-volume systems can’t trace every request. Sampling reduces data volume:

// Head-based sampling: decide at trace start
tp := trace.NewTracerProvider(
    // ParentBased keeps child spans consistent with the caller's decision
    trace.WithSampler(trace.ParentBased(trace.TraceIDRatioBased(0.1))), // sample 10% of new traces
)

// Tail-based sampling: decide after trace completes
// (requires OTel Collector with tail sampling processor)
# OTel Collector tail sampling
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: { status_codes: [ERROR] }  # always sample errors
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 1000 }  # always sample slow traces
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }  # 5% of the rest

Observability Stack Options

Component    Open Source             Managed
Metrics      Prometheus + Grafana    Datadog, New Relic, Grafana Cloud
Logs         ELK Stack, Loki         Datadog, Splunk, CloudWatch
Traces       Jaeger, Tempo           Datadog, Honeycomb, Lightstep
All-in-one   Grafana Stack           Datadog, New Relic, Dynatrace

Cost Management

Observability data grows fast. Control costs:

High cardinality labels → exponential metric growth
  Bad:  http_requests_total{user_id="12345"}  (millions of series)
  Good: http_requests_total{status="200"}     (few series)

Log retention:
  Hot (recent, fast): 7-30 days
  Cold (archive): 90-365 days

Trace sampling:
  100% in dev/staging
  1-10% in production (always sample errors)
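One practical way to keep the path label low-cardinality is to normalize IDs out of the URL before using it as a label value. A stdlib-only sketch (the :id placeholder is an arbitrary convention; routers like chi or gin can hand you the route pattern directly):

```go
package main

import (
	"fmt"
	"regexp"
)

// Numeric IDs and UUIDs in URL paths explode label cardinality;
// replace them with a placeholder before labeling metrics.
var (
	uuidSeg    = regexp.MustCompile(`/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}`)
	numericSeg = regexp.MustCompile(`/\d+`)
)

func normalizePath(p string) string {
	p = uuidSeg.ReplaceAllString(p, "/:id") // UUIDs first, so their digits aren't half-matched
	return numericSeg.ReplaceAllString(p, "/:id")
}

func main() {
	fmt.Println(normalizePath("/api/users/12345/orders/987"))
	// → /api/users/:id/orders/:id
	fmt.Println(normalizePath("/api/sessions/550e8400-e29b-41d4-a716-446655440000"))
	// → /api/sessions/:id
}
```

With this in the metrics middleware, a million users produce one series per route instead of one per user.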
