Distributed Tracing in Go

Introduction

Distributed tracing is a critical observability tool for understanding how requests flow through complex microservice architectures. When a single user request touches dozens of services, traditional logging becomes insufficient. Distributed tracing provides end-to-end visibility into request paths, latencies, and failures across your entire system.

In this guide, you’ll learn how to implement distributed tracing in Go with OpenTelemetry and export traces to a backend such as Jaeger or Zipkin. We’ll cover instrumentation patterns, context propagation, and practical examples that you can apply to production systems.

Core Concepts

What is Distributed Tracing?

Distributed tracing tracks a single request as it flows through multiple services. Each request gets a unique trace ID, and each operation within that request gets a span ID. This creates a hierarchical view of how your system processes requests.

Key Components:

Trace: A complete request journey through your system
Span: A single operation within a trace (database query, HTTP call, etc.)
Trace ID: Unique identifier for the entire request
Span ID: Unique identifier for a specific operation
Parent Span ID: Links child spans to their parent operations
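To make the relationship between these identifiers concrete, here is a minimal sketch that rebuilds the span hierarchy of one trace from parent links. The `Span` struct and `RenderTrace` function are hypothetical, simplified stand-ins for illustration; they are not OpenTelemetry types.

```go
package main

import (
	"fmt"
	"strings"
)

// Span is a simplified, hypothetical span record. Real tracing libraries
// also store timestamps, attributes, and status.
type Span struct {
	TraceID      string
	SpanID       string // assumed non-empty
	ParentSpanID string // empty for the root span
	Name         string
}

// RenderTrace returns the spans of one trace as an indented tree, built by
// following ParentSpanID links from the root downward.
func RenderTrace(spans []Span) string {
	// Group spans by the ID of their parent.
	children := make(map[string][]Span)
	for _, s := range spans {
		children[s.ParentSpanID] = append(children[s.ParentSpanID], s)
	}

	var b strings.Builder
	var walk func(parentID string, depth int)
	walk = func(parentID string, depth int) {
		for _, s := range children[parentID] {
			fmt.Fprintf(&b, "%s%s (span %s)\n", strings.Repeat("  ", depth), s.Name, s.SpanID)
			walk(s.SpanID, depth+1)
		}
	}
	walk("", 0) // root spans have no parent
	return b.String()
}
```

Given a root span with two children, one of which has its own child, `RenderTrace` yields a three-level tree. This is essentially the view a tracing UI draws for each trace.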

Why Distributed Tracing Matters

In microservice architectures, a single user request might:

Hit an API gateway
Call an authentication service
Query a user service
Access a database
Call multiple downstream services

Without tracing, debugging latency issues or failures becomes a nightmare. Distributed tracing gives you:

Request visibility: See the complete path a request takes
Performance analysis: Identify bottlenecks and slow services
Error tracking: Understand where and why failures occur
Dependency mapping: Discover service relationships automatically
Root cause analysis: Quickly identify the source of problems

Good: Implementing Distributed Tracing with OpenTelemetry

OpenTelemetry is the modern standard for observability in Go. It provides a vendor-neutral API for tracing, metrics, and logs.

Basic Setup with OpenTelemetry

package main

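import (
	"context"
	"fmt"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	"go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

// InitializeTracer sets up OpenTelemetry with an OTLP exporter.
// Jaeger accepts OTLP natively, so no Jaeger-specific exporter is needed.
func InitializeTracer() (*trace.TracerProvider, error) {
	// Create OTLP/gRPC exporter. Configure the endpoint with options such as
	// otlptracegrpc.WithEndpoint or the OTEL_EXPORTER_OTLP_ENDPOINT variable.
	exporter, err := otlptracegrpc.New(context.Background())
	if err != nil {
		return nil, fmt.Errorf("failed to create exporter: %w", err)
	}

	// Create resource
	res, err := resource.New(context.Background(),
		resource.WithAttributes(
			semconv.ServiceNameKey.String("my-service"),
			semconv.ServiceVersionKey.String("1.0.0"),
		),
	)
	if err != nil {
		return nil, fmt.Errorf("failed to create resource: %w", err)
	}

	// Create tracer provider
	tp := trace.NewTracerProvider(
		trace.WithBatcher(exporter),
		trace.WithResource(res),
	)

	// Set global tracer provider
	otel.SetTracerProvider(tp)

	// Register the propagators used by Inject and Extract. Without this,
	// the global propagator is a no-op and no trace headers are sent.
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{},
		propagation.Baggage{},
	))

	return tp, nil
}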

func main() {
	// Initialize tracer
	tp, err := InitializeTracer()
	if err != nil {
		log.Fatal(err)
	}
	defer func() {
		if err := tp.Shutdown(context.Background()); err != nil {
			log.Printf("Error shutting down tracer: %v", err)
		}
	}()

	// Get tracer
	tracer := otel.Tracer("my-service")

	// Create a span (the returned context is unused here, so discard it)
	_, span := tracer.Start(context.Background(), "main-operation")
	defer span.End()

	// Do work
	fmt.Println("Processing request...")
}

Creating and Managing Spans

package main

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

// ProcessUserRequest demonstrates span creation and management
func ProcessUserRequest(ctx context.Context, userID string) error {
	tracer := otel.Tracer("user-service")

	// Create a span for the entire operation
	ctx, span := tracer.Start(ctx, "process-user-request")
	defer span.End()

	// Add attributes to the span
	span.SetAttributes(
		attribute.String("user.id", userID),
		attribute.String("operation", "process"),
	)

	// Simulate fetching user
	if err := fetchUser(ctx, userID); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "failed to fetch user")
		return err
	}

	// Simulate updating user
	if err := updateUser(ctx, userID); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "failed to update user")
		return err
	}

	// The description argument is only recorded for error statuses
	span.SetStatus(codes.Ok, "")
	return nil
}

// fetchUser creates a child span
func fetchUser(ctx context.Context, userID string) error {
	tracer := otel.Tracer("user-service")
	ctx, span := tracer.Start(ctx, "fetch-user")
	defer span.End()

	span.SetAttributes(attribute.String("user.id", userID))

	// Simulate database query
	fmt.Printf("Fetching user %s from database\n", userID)
	return nil
}

// updateUser creates another child span
func updateUser(ctx context.Context, userID string) error {
	tracer := otel.Tracer("user-service")
	ctx, span := tracer.Start(ctx, "update-user")
	defer span.End()

	span.SetAttributes(attribute.String("user.id", userID))

	// Simulate database update
	fmt.Printf("Updating user %s in database\n", userID)
	return nil
}

Context Propagation Across Services

package main

import (
	"context"
	"fmt"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

// SetupHTTPClient creates an HTTP client with tracing
func SetupHTTPClient() *http.Client {
	return &http.Client{
		Transport: otelhttp.NewTransport(http.DefaultTransport),
	}
}

// CallDownstreamService demonstrates context propagation
func CallDownstreamService(ctx context.Context, serviceURL string) error {
	tracer := otel.Tracer("api-gateway")
	ctx, span := tracer.Start(ctx, "call-downstream-service")
	defer span.End()

	// Create HTTP request
	req, err := http.NewRequestWithContext(ctx, "GET", serviceURL, nil)
	if err != nil {
		return err
	}

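	// Propagate trace context to the downstream service. The otelhttp
	// transport used below performs this injection itself; the explicit call
	// shows what happens under the hood. Both rely on a propagator having
	// been registered with otel.SetTextMapPropagator.
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))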

	// Make request
	client := SetupHTTPClient()
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	fmt.Printf("Response status: %d\n", resp.StatusCode)
	return nil
}

// HTTPHandler demonstrates server-side tracing
func HTTPHandler(w http.ResponseWriter, r *http.Request) {
	tracer := otel.Tracer("user-service")

	// Extract trace context from incoming request
	// (wrapping the handler with otelhttp.NewHandler does this automatically)
	ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
	ctx, span := tracer.Start(ctx, "handle-user-request")
	defer span.End()

	// Process request
	w.WriteHeader(http.StatusOK)
	w.Write([]byte("User processed"))
}
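When the W3C Trace Context propagator (`propagation.TraceContext`) is registered, `Inject` writes a `traceparent` header of the form `version-traceid-parentid-flags`, and `Extract` reads it back. The sketch below parses that header by hand to show what actually crosses the wire. `ParseTraceParent` and the `TraceParent` struct are hypothetical helpers for illustration, and the sketch handles version `00` only; production code should leave this to the propagator.

```go
package main

import (
	"encoding/hex"
	"errors"
	"strings"
)

// TraceParent holds the fields of a version-00 W3C traceparent header.
type TraceParent struct {
	TraceID  string // 32 lowercase hex characters
	ParentID string // 16 lowercase hex characters: the caller's span ID
	Sampled  bool   // lowest bit of the trace-flags byte
}

// ParseTraceParent is a hypothetical helper for illustration; in real code
// the registered propagator does this work inside Extract.
func ParseTraceParent(header string) (TraceParent, error) {
	parts := strings.Split(header, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return TraceParent{}, errors.New("traceparent: want 00-<trace-id>-<parent-id>-<flags>")
	}
	traceID, parentID, flagsHex := parts[1], parts[2], parts[3]
	if len(traceID) != 32 || len(parentID) != 16 || len(flagsHex) != 2 {
		return TraceParent{}, errors.New("traceparent: wrong field length")
	}
	for _, field := range parts[1:] {
		if field != strings.ToLower(field) {
			return TraceParent{}, errors.New("traceparent: hex must be lowercase")
		}
		if _, err := hex.DecodeString(field); err != nil {
			return TraceParent{}, errors.New("traceparent: fields must be hex")
		}
	}
	if traceID == strings.Repeat("0", 32) || parentID == strings.Repeat("0", 16) {
		return TraceParent{}, errors.New("traceparent: all-zero IDs are invalid")
	}
	flags, _ := hex.DecodeString(flagsHex) // validated as hex in the loop above
	return TraceParent{
		TraceID:  traceID,
		ParentID: parentID,
		Sampled:  flags[0]&0x01 == 1,
	}, nil
}
```

The downstream service starts its span with the same trace ID and uses the received parent ID as its span's parent, which is how spans from separate processes end up in one trace.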

Bad: Manual Tracing Without Standards

package main

import (
	"fmt"
	"time"
)

// ❌ BAD: Manual tracing without standards
type ManualTrace struct {
	TraceID string
	Spans   []ManualSpan
}

type ManualSpan struct {
	Name      string
	StartTime time.Time
	EndTime   time.Time
	Duration  time.Duration
}

// This approach has many problems:
// 1. No context propagation between services
// 2. Manual span management is error-prone
// 3. No standard format for trace data
// 4. Difficult to correlate traces across services
// 5. No integration with observability platforms

func (t *ManualTrace) RecordSpan(name string, duration time.Duration) {
	span := ManualSpan{
		Name:     name,
		Duration: duration,
	}
	t.Spans = append(t.Spans, span)
}

func main() {
	trace := ManualTrace{TraceID: "manual-123"}

	// Manual timing
	start := time.Now()
	// Do work
	duration := time.Since(start)
	trace.RecordSpan("operation", duration)

	fmt.Printf("Trace: %v\n", trace)
}

Advanced Patterns

Sampling Strategies

package main

import (
	"go.opentelemetry.io/otel/sdk/trace"
)

// ConfigureSampling sets up different sampling strategies
func ConfigureSampling() trace.Sampler {
	// Always sample (development)
	// return trace.AlwaysSample()

	// Never sample (disable tracing)
	// return trace.NeverSample()

	// Sample 10% of traces (production)
	return trace.TraceIDRatioBased(0.1)

	// Honor the parent's sampling decision; sample 10% of new root traces
	// return trace.ParentBased(trace.TraceIDRatioBased(0.1))
}

// Adaptive sampling based on error rate.
// Note: errorRate must be updated elsewhere (with proper synchronization).
type AdaptiveSampler struct {
	baseRate  float64
	errorRate float64
}

func (s *AdaptiveSampler) ShouldSample(parameters trace.SamplingParameters) trace.SamplingResult {
	// Sample every trace while the error rate is high
	if s.errorRate > 0.05 {
		return trace.AlwaysSample().ShouldSample(parameters)
	}

	// Otherwise fall back to ratio-based sampling at the base rate
	return trace.TraceIDRatioBased(s.baseRate).ShouldSample(parameters)
}

func (s *AdaptiveSampler) Description() string {
	return "AdaptiveSampler"
}
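Ratio-based sampling works across services because the decision is derived from the trace ID itself: every service that sees the same trace ID and uses the same ratio reaches the same verdict without coordination. The following is a simplified sketch of that idea using only the standard library. `ShouldSampleRatio` is a hypothetical function, not the SDK's API, and the exact arithmetic inside the SDK may differ.

```go
package main

import "encoding/binary"

// ShouldSampleRatio is a hypothetical, simplified decision function. It maps
// the last 8 bytes of the trace ID to a number and samples the trace when
// that number falls below ratio * 2^63.
func ShouldSampleRatio(traceID [16]byte, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	// Drop the top bit so the value fits in 63 bits.
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	threshold := uint64(ratio * (1 << 63))
	return x < threshold
}
```

Because the function is deterministic, a trace is either recorded by every participating service or by none, which avoids traces with missing spans in the middle.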

Baggage for Cross-Cutting Concerns

package main

import (
	"context"

	"go.opentelemetry.io/otel/baggage"
)

// AddBaggageToContext adds metadata that propagates across services
func AddBaggageToContext(ctx context.Context, userID, tenantID string) (context.Context, error) {
	// Create baggage members
	userMember, err := baggage.NewMember("user.id", userID)
	if err != nil {
		return ctx, err
	}

	tenantMember, err := baggage.NewMember("tenant.id", tenantID)
	if err != nil {
		return ctx, err
	}

	// Create baggage
	bag, err := baggage.New(userMember, tenantMember)
	if err != nil {
		return ctx, err
	}

	// Add to context
	return baggage.ContextWithBaggage(ctx, bag), nil
}

// RetrieveBaggageFromContext extracts metadata from context
func RetrieveBaggageFromContext(ctx context.Context) map[string]string {
	bag := baggage.FromContext(ctx)
	result := make(map[string]string)

	for _, member := range bag.Members() {
		result[member.Key()] = member.Value()
	}

	return result
}

Metrics Integration with Tracing

package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// TraceWithMetrics combines tracing and metrics
func TraceWithMetrics(ctx context.Context, operationName string) error {
	tracer := otel.Tracer("service")
	meter := otel.Meter("service")

	// Create span
	ctx, span := tracer.Start(ctx, operationName)
	defer span.End()

	// Create counter (in real code, create instruments once at startup and reuse them)
	counter, err := meter.Int64Counter("operations.total")
	if err != nil {
		return err
	}
	counter.Add(ctx, 1, metric.WithAttributes(
		attribute.String("operation", operationName),
	))

	// Create histogram for duration, recorded in seconds
	histogram, err := meter.Float64Histogram("operation.duration", metric.WithUnit("s"))
	if err != nil {
		return err
	}

	start := time.Now()
	// Do work
	duration := time.Since(start).Seconds()

	histogram.Record(ctx, duration, metric.WithAttributes(
		attribute.String("operation", operationName),
	))

	return nil
}
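In `TraceWithMetrics`, the duration is recorded at the end of the function, so a variant that returns early on failure would skip the measurement. One way to avoid that is to record from a deferred function. The sketch below shows the pattern with the standard library only; `measure`, its `record` callback, and `errDemo` are hypothetical names, and in practice `record` would call `histogram.Record`.

```go
package main

import (
	"errors"
	"time"
)

// errDemo is a placeholder failure used to exercise the error path.
var errDemo = errors.New("demo failure")

// measure runs op and reports how long it took, whether or not op failed.
// The deferred call reads the named result err after op has returned.
func measure(record func(seconds float64, failed bool), op func() error) (err error) {
	start := time.Now()
	defer func() {
		record(time.Since(start).Seconds(), err != nil)
	}()
	return op()
}
```

Passing the failure flag to `record` also makes it easy to attach an outcome attribute to the metric, so slow failures and slow successes can be told apart.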

Best Practices

1. Always Propagate Context

// ✅ GOOD: Always pass context through function calls
func ProcessRequest(ctx context.Context, data string) error {
	// Context flows through the call chain
	return validateData(ctx, data)
}

func validateData(ctx context.Context, data string) error {
	// Context is available for tracing
	return nil
}

// ❌ BAD: Losing context
func ProcessRequestBad(data string) error {
	// No context passed - tracing breaks
	return validateDataBad(data)
}

func validateDataBad(data string) error {
	return nil
}
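The effect of dropping the context can be shown without any tracing library, because the active span travels as a value inside `context.Context`. In this sketch a request ID plays the role of the span; all names here (`requestIDKey`, `handle`, `handleBroken`, `demoPropagation`) are hypothetical.

```go
package main

import "context"

// requestIDKey is an unexported key type, which avoids collisions with
// values stored by other packages.
type requestIDKey struct{}

// lookup reads the request ID from ctx, or "" if none is present.
func lookup(ctx context.Context) string {
	id, _ := ctx.Value(requestIDKey{}).(string)
	return id
}

// handle passes ctx on, so the callee sees the caller's value.
func handle(ctx context.Context) string {
	return lookup(ctx)
}

// handleBroken ignores its ctx and starts from context.Background(),
// so everything the caller attached is lost.
func handleBroken(_ context.Context) string {
	return lookup(context.Background())
}

// demoPropagation returns what each handler sees for the same request.
func demoPropagation() (kept, lost string) {
	ctx := context.WithValue(context.Background(), requestIDKey{}, "req-42")
	return handle(ctx), handleBroken(ctx)
}
```

`demoPropagation` returns `"req-42"` for the first handler and an empty string for the second. With tracing, the equivalent symptom is a child span that shows up as the root of a new, unrelated trace.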

2. Use Meaningful Span Names

// ✅ GOOD: Descriptive span names
tracer.Start(ctx, "fetch-user-from-database")
tracer.Start(ctx, "validate-email-format")
tracer.Start(ctx, "send-confirmation-email")

// ❌ BAD: Vague span names
tracer.Start(ctx, "do-work")
tracer.Start(ctx, "process")
tracer.Start(ctx, "execute")

3. Add Relevant Attributes

// ✅ GOOD: Rich attributes for debugging
span.SetAttributes(
	attribute.String("user.id", userID),
	attribute.String("email", email),
	attribute.Int("retry.count", retries),
	attribute.Bool("is.admin", isAdmin),
)

// ❌ BAD: Vague attributes that carry no useful information
span.SetAttributes(
	attribute.String("data", "some data"),
)

4. Handle Errors Properly

// ✅ GOOD: Record errors in spans
if err != nil {
	span.RecordError(err)
	span.SetStatus(codes.Error, err.Error())
	return err
}

// ❌ BAD: Ignore errors in tracing
if err != nil {
	return err
}

5. Configure Appropriate Sampling

// ✅ GOOD: Use sampling in production
sampler := trace.TraceIDRatioBased(0.1) // 10% sampling

// ❌ BAD: Sample everything in production
sampler := trace.AlwaysSample() // High overhead
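Hard-coding the ratio means a redeploy for every adjustment. A common alternative is to read it from configuration. The helper below is a minimal sketch under that assumption; `sampleRatioFromEnv`, `parseRatio`, and the variable name `TRACE_SAMPLE_RATIO` used in the note after the code are all hypothetical.

```go
package main

import (
	"os"
	"strconv"
)

// sampleRatioFromEnv reads a sampling ratio from the named environment
// variable. It returns fallback when the variable is unset, not a number,
// or outside the range [0, 1].
func sampleRatioFromEnv(name string, fallback float64) float64 {
	return parseRatio(os.Getenv(name), fallback)
}

// parseRatio converts raw to a ratio in [0, 1], or returns fallback.
func parseRatio(raw string, fallback float64) float64 {
	ratio, err := strconv.ParseFloat(raw, 64)
	// The negated range check also rejects NaN.
	if err != nil || !(ratio >= 0 && ratio <= 1) {
		return fallback
	}
	return ratio
}
```

The result can then be passed to the sampler, for example `trace.TraceIDRatioBased(sampleRatioFromEnv("TRACE_SAMPLE_RATIO", 0.1))`.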

Common Pitfalls

1. Context Leaks

// ❌ BAD: Context not passed to the work function
go func() {
	// doWork receives no ctx, so any span it starts begins a new, unrelated trace
	doWork()
}()

// ✅ GOOD: Pass context to goroutine
go func(ctx context.Context) {
	// ctx is available
	doWork(ctx)
}(ctx)
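Passing the context has one side effect worth knowing: if `ctx` belongs to an HTTP request, it is canceled once the handler returns, which can abort background work that was meant to outlive the request. A possible approach, assuming Go 1.21 or later, is `context.WithoutCancel`, which keeps the context's values (where the active span lives) but drops cancellation. The names `traceKey` and `detachDemo` below are hypothetical, and a plain string stands in for the span.

```go
package main

import "context"

// traceKey is a hypothetical key; a string value stands in for the span.
type traceKey struct{}

// detachDemo cancels a parent context and reports what a detached copy
// still sees afterwards.
func detachDemo() (valueKept bool, parentErr, detachedErr error) {
	base := context.WithValue(context.Background(), traceKey{}, "trace-abc")
	parent, cancel := context.WithCancel(base)

	// Keep the values, drop the cancellation signal (Go 1.21+).
	detached := context.WithoutCancel(parent)

	cancel() // simulates the HTTP handler returning

	id, _ := detached.Value(traceKey{}).(string)
	return id == "trace-abc", parent.Err(), detached.Err()
}
```

After `cancel`, the parent reports `context.Canceled` while the detached context reports no error and still carries the value. Detached work no longer has a deadline, so it usually needs its own timeout.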

2. Span Leaks

// ❌ BAD: Span not ended
ctx, span := tracer.Start(ctx, "operation")
// Forgot to defer span.End()

// ✅ GOOD: Always defer span.End()
ctx, span := tracer.Start(ctx, "operation")
defer span.End()

3. Over-Instrumentation

// ❌ BAD: Too many spans, and the deferred End calls pile up until the function returns
for i := 0; i < 1000; i++ {
	_, span := tracer.Start(ctx, "loop-iteration")
	defer span.End()
}

// ✅ GOOD: Batch operations
ctx, span := tracer.Start(ctx, "process-batch")
defer span.End()
for i := 0; i < 1000; i++ {
	// Process item
}
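The loop in the bad example has a second problem besides span count: `defer` runs when the enclosing function returns, not at the end of each iteration, so every span would stay open until the whole loop is done. When per-item spans really are needed, wrapping the loop body in a function gives each iteration its own `defer` scope. The sketch below demonstrates the ordering with a hypothetical `recorder` type standing in for the tracer.

```go
package main

import (
	"fmt"
	"strings"
)

// recorder is a hypothetical stand-in for a tracer: it logs start and end events.
type recorder struct{ events []string }

// start logs a start event and returns the function that logs the matching end.
func (r *recorder) start(name string) func() {
	r.events = append(r.events, "start "+name)
	return func() { r.events = append(r.events, "end "+name) }
}

func (r *recorder) String() string { return strings.Join(r.events, ", ") }

// deferredInLoop mirrors the bad pattern: nothing ends until the function returns.
func deferredInLoop(r *recorder, n int) {
	for i := 0; i < n; i++ {
		end := r.start(fmt.Sprintf("item-%d", i))
		defer end()
	}
	r.events = append(r.events, "loop done")
}

// endedPerIteration wraps the body in a function so each end runs per iteration.
func endedPerIteration(r *recorder, n int) {
	for i := 0; i < n; i++ {
		func() {
			end := r.start(fmt.Sprintf("item-%d", i))
			defer end()
			// Process item
		}()
	}
	r.events = append(r.events, "loop done")
}
```

With two items, `deferredInLoop` records both starts, then "loop done", then the ends in reverse order, while `endedPerIteration` closes each item before the next one starts.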

Resources

OpenTelemetry Go Documentation: https://opentelemetry.io/docs/instrumentation/go/
Jaeger Documentation: https://www.jaegertracing.io/docs/
Zipkin Documentation: https://zipkin.io/
OpenTelemetry Specification: https://opentelemetry.io/docs/reference/specification/
Distributed Tracing Best Practices: https://opentelemetry.io/docs/concepts/observability-primer/

Summary

Distributed tracing is essential for understanding and debugging microservice architectures. By implementing OpenTelemetry in your Go applications, you gain:

Complete visibility into request flows across services
Performance insights to identify bottlenecks
Error tracking to quickly resolve issues
Automatic dependency mapping of your system
Production-ready observability with industry standards

Start with basic span creation and context propagation, then gradually add more sophisticated patterns like sampling, baggage, and metrics integration. Remember to always propagate context, use meaningful span names, and add relevant attributes for effective debugging.