Introduction
Observability is the ability to understand what’s happening inside a system by examining its external outputs. For distributed systems, where a single user request might touch dozens of services, observability is not optional. Without it, debugging production issues becomes guesswork.
The three pillars of observability are metrics, logs, and traces. Together they answer different questions:
- Metrics: “Is the system healthy right now?” (CPU at 90%, error rate 2%)
- Logs: “What exactly happened at 14:32:05?” (specific events with context)
- Traces: “Why did this request take 3 seconds?” (full request path across services)
Metrics
Metrics are numerical measurements collected at regular intervals. They’re efficient to store and query, making them ideal for dashboards and alerting.
Types of Metrics
| Type | Description | Example |
|---|---|---|
| Counter | Monotonically increasing | Total requests, errors |
| Gauge | Can go up or down | Current connections, memory usage |
| Histogram | Distribution of values | Request duration, response size |
| Summary | Similar to histogram, pre-calculated percentiles | p50, p95, p99 latency |
Prometheus: The Standard
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'myapp'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['localhost:8080']
// Go: expose metrics with the Prometheus client library
import "github.com/prometheus/client_golang/prometheus"

var (
	requestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total HTTP requests",
		},
		[]string{"method", "path", "status"},
	)
	requestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request duration",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "path"},
	)
)

func init() {
	prometheus.MustRegister(requestsTotal, requestDuration)
}
// Middleware to record metrics (needs "net/http", "strconv", "time").
// responseWriter wraps http.ResponseWriter to capture the status code.
type responseWriter struct {
	http.ResponseWriter
	status int
}

func (rw *responseWriter) WriteHeader(code int) {
	rw.status = code
	rw.ResponseWriter.WriteHeader(code)
}

func metricsMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rw := &responseWriter{ResponseWriter: w, status: 200}
		next.ServeHTTP(rw, r)
		duration := time.Since(start).Seconds()
		requestsTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(rw.status)).Inc()
		requestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
	})
}
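To see what a scrape actually returns, here is a stdlib-only sketch that hand-writes the Prometheus text exposition format for a single counter. It is purely illustrative: `renderMetrics` is a hypothetical helper, `requestsTotal` here is a plain atomic stand-in for the real CounterVec above, and a real service would simply mount `promhttp.Handler()` from the client library.

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// requestsTotal is a plain atomic stand-in for the real CounterVec above.
var requestsTotal atomic.Int64

// renderMetrics hand-writes one counter in the Prometheus text exposition
// format, just to show what a scrape of /metrics sees.
func renderMetrics(count int64) string {
	return fmt.Sprintf("# TYPE http_requests_total counter\n"+
		"http_requests_total{method=%q,path=%q,status=%q} %d\n",
		"GET", "/", "200", count)
}

// metricsHandler serves the exposition text; in practice, use promhttp.Handler().
func metricsHandler(w http.ResponseWriter, r *http.Request) {
	fmt.Fprint(w, renderMetrics(requestsTotal.Load()))
}

func main() {
	requestsTotal.Add(3)
	fmt.Print(renderMetrics(requestsTotal.Load()))
}
```

Prometheus scrapes this plain-text endpoint on the `scrape_interval` configured above and stores each labeled line as a separate time series.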
Key Metrics to Track
# Application metrics
http_requests_total{method, path, status}
http_request_duration_seconds{method, path}
active_connections
error_rate
# Business metrics
orders_created_total
revenue_total
active_users
# Infrastructure metrics
cpu_usage_percent
memory_usage_bytes
disk_io_bytes
network_bytes_total
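These raw series become dashboard numbers via PromQL. A few illustrative queries over the metrics above (metric names as listed; exact label sets are assumptions):

```promql
# Request rate per second over the last 5 minutes
sum(rate(http_requests_total[5m]))

# Error ratio: 5xx responses as a fraction of all requests
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# p95 latency from the histogram's buckets
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))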
Logs
Logs capture discrete events with timestamps and context. Structured logging (JSON) makes logs machine-parseable.
Structured Logging
// Go: structured logging with slog (Go 1.21+)
import "log/slog"

logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
	Level: slog.LevelInfo,
}))

// Structured log entry
logger.Info("request completed",
	"method", r.Method,
	"path", r.URL.Path,
	"status", 200,
	"duration_ms", 45,
	"user_id", userID,
	"trace_id", traceID,
)
{
  "time": "2026-03-30T10:00:00Z",
  "level": "INFO",
  "msg": "request completed",
  "method": "GET",
  "path": "/api/users/42",
  "status": 200,
  "duration_ms": 45,
  "user_id": "user-123",
  "trace_id": "abc123def456"
}
Log Levels
DEBUG - detailed diagnostic info (dev only)
INFO  - normal operations, key events
WARN  - unexpected but handled situations
ERROR - failures that need attention
FATAL - unrecoverable errors (app exits)
Log Aggregation: ELK Stack
# docker-compose.yml for ELK
services:
  elasticsearch:
    image: elasticsearch:8.12.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports: ["9200:9200"]
  logstash:
    image: logstash:8.12.0
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
  kibana:
    image: kibana:8.12.0
    ports: ["5601:5601"]
    depends_on: [elasticsearch]
Distributed Tracing
Traces follow a request across service boundaries, showing the full execution path and timing of each operation.
Concepts
Trace: the complete journey of a request
└── Span: a single operation within the trace
    ├── Span: database query (50ms)
    ├── Span: cache lookup (2ms)
    └── Span: external API call (120ms)
Each span has:
- trace_id: shared across all spans in a trace
- span_id: unique to this span
- parent_span_id: links to the parent span
- start_time, end_time: when the operation began and ended
- attributes: key-value metadata
- events: timestamped log entries within the span
- status: OK or ERROR
OpenTelemetry: The Standard
OpenTelemetry (OTel) is the vendor-neutral standard for instrumentation. It replaces vendor-specific SDKs (Jaeger, Zipkin, Datadog) with a single API.
// Go: OpenTelemetry setup
import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	"go.opentelemetry.io/otel/sdk/resource"
	"go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

func initTracer(ctx context.Context) (*trace.TracerProvider, error) {
	// WithEndpoint takes host:port (no scheme); use WithEndpointURL for a full URL
	exporter, err := otlptracehttp.New(ctx,
		otlptracehttp.WithEndpoint("otel-collector:4318"),
	)
	if err != nil {
		return nil, err
	}
	tp := trace.NewTracerProvider(
		trace.WithBatcher(exporter),
		trace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceName("my-service"),
			semconv.ServiceVersion("1.0.0"),
		)),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}
// Instrument a function. Here "trace" is the API package
// go.opentelemetry.io/otel/trace (not sdk/trace), and "attribute" and
// "codes" come from go.opentelemetry.io/otel/attribute and .../codes.
func processOrder(ctx context.Context, orderID string) error {
	tracer := otel.Tracer("order-service")
	ctx, span := tracer.Start(ctx, "processOrder",
		trace.WithAttributes(attribute.String("order.id", orderID)),
	)
	defer span.End()

	// Database call creates a child span (assuming db is instrumented)
	if err := db.SaveOrder(ctx, orderID); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return err
	}
	span.AddEvent("order saved to database")
	return nil
}
Context Propagation
Traces work across services by propagating context via HTTP headers:
// Outgoing HTTP request: inject trace context into headers
req, _ := http.NewRequestWithContext(ctx, "GET", url, nil)
otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
// Adds: traceparent: 00-<trace-id>-<span-id>-01

// Incoming HTTP request: extract trace context from headers
ctx = otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
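Under the hood, the default W3C propagator encodes the context as a single `traceparent` header with four hyphen-separated fields. A minimal stdlib sketch of pulling those fields apart (`parseTraceparent` is hypothetical; OpenTelemetry's propagation package does this, plus validation, for you):

```go
package main

import (
	"fmt"
	"strings"
)

// parseTraceparent splits a W3C traceparent header value into its four
// fields: version, trace-id (32 hex chars), parent-id / span-id (16 hex
// chars), and trace flags.
func parseTraceparent(h string) (version, traceID, spanID, flags string, ok bool) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 {
		return "", "", "", "", false
	}
	return parts[0], parts[1], parts[2], parts[3], true
}

func main() {
	v, tid, sid, f, ok := parseTraceparent(
		"00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(ok, v, tid, sid, f)
}
```

Because the whole context fits in one header, any service along the path can continue the trace without coordinating with the others.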
The OpenTelemetry Collector
The OTel Collector receives telemetry from your apps and routes it to backends:
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }

processors:
  batch:
    timeout: 1s
  memory_limiter:
    check_interval: 1s   # required by the memory_limiter processor
    limit_mib: 512

exporters:
  # Note: recent Collector releases removed the dedicated jaeger exporter;
  # Jaeger now accepts OTLP directly
  jaeger:
    endpoint: jaeger:14250
  prometheus:
    endpoint: 0.0.0.0:8889
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
Alerting Strategies
SLO-Based Alerting
Service Level Objectives define target reliability. Alert on burn rate, the speed at which you’re consuming your error budget:
# Prometheus alerting rules
groups:
  - name: slo-alerts
    rules:
      # Alert if error rate > 1% for 5 minutes
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate: {{ $value | humanizePercentage }}"
      # Alert if p99 latency > 1 second
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 1.0
        for: 5m
        labels:
          severity: warning
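For true burn-rate alerting, rather than a fixed error-rate threshold, a common pattern is multiwindow rules. A sketch for a 99.9% availability SLO (error budget 0.1%), where burn rate is the observed error rate divided by the budget; the 14.4x factor and window sizes follow the widely used SRE-workbook pattern, and the metric names are the ones assumed above:

```yaml
# Multiwindow burn-rate alert (sketch): fire when the error budget for a
# 99.9% SLO burns 14.4x faster than sustainable, on both a long and a
# short window, to be fast without flapping
- alert: ErrorBudgetBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
        / sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
        / sum(rate(http_requests_total[5m]))
    ) > (14.4 * 0.001)
  labels:
    severity: critical
```

The long window confirms the burn is sustained; the short window lets the alert clear quickly once the problem stops.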
Grafana Dashboard
// Key panels for a service dashboard
{
  "panels": [
    "Request rate (req/s)",
    "Error rate (%)",
    "P50/P95/P99 latency",
    "Active connections",
    "CPU and memory usage",
    "Upstream service latency"
  ]
}
Sampling Strategies
High-volume systems can’t trace every request. Sampling reduces data volume:
// Head-based sampling: decide at trace start
tp := trace.NewTracerProvider(
	trace.WithSampler(trace.TraceIDRatioBased(0.1)), // sample 10% of traces
)

// Tail-based sampling: decide after the trace completes
// (requires the OTel Collector's tail_sampling processor)
# OTel Collector tail sampling
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: { status_codes: [ERROR] }    # always sample errors
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 1000 }           # always sample slow traces
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 5 } # 5% of the rest
Observability Stack Options
| Component | Open Source | Managed |
|---|---|---|
| Metrics | Prometheus + Grafana | Datadog, New Relic, Grafana Cloud |
| Logs | ELK Stack, Loki | Datadog, Splunk, CloudWatch |
| Traces | Jaeger, Tempo | Datadog, Honeycomb, Lightstep |
| All-in-one | Grafana Stack | Datadog, New Relic, Dynatrace |
Cost Management
Observability data grows fast. Control costs:
High-cardinality labels cause combinatorial metric growth: the series count is the product of each label's distinct values.
Bad: http_requests_total{user_id="12345"} (one series per user; millions of series)
Good: http_requests_total{status="200"} (a handful of series)
Log retention:
Hot (recent, fast): 7-30 days
Cold (archive): 90-365 days
Trace sampling:
100% in dev/staging
1-10% in production (always sample errors)
Resources
- OpenTelemetry Documentation
- Prometheus Documentation
- Grafana Documentation
- Jaeger Distributed Tracing
- Google SRE Book: Monitoring