Distributed Tracing in Go
Introduction
Distributed tracing is a critical observability tool for understanding how requests flow through complex microservice architectures. When a single user request touches dozens of services, traditional logging becomes insufficient. Distributed tracing provides end-to-end visibility into request paths, latencies, and failures across your entire system.
This guide shows how to implement distributed tracing in Go with OpenTelemetry, exporting traces to backends such as Jaeger or Zipkin. It covers instrumentation patterns, context propagation, and practical examples that you can adapt for production systems.
Core Concepts
What is Distributed Tracing?
Distributed tracing tracks a single request as it flows through multiple services. Each request gets a unique trace ID, and each operation within that request gets a span ID. This creates a hierarchical view of how your system processes requests.
Key Components:
- Trace: A complete request journey through your system
- Span: A single operation within a trace (database query, HTTP call, etc.)
- Trace ID: Unique identifier for the entire request
- Span ID: Unique identifier for a specific operation
- Parent Span ID: Links child spans to their parent operations
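To make the relationship between these identifiers concrete, here is a minimal sketch that uses only the standard library. The `Span` struct and the `newID`, `StartRoot`, and `StartChild` helpers are hypothetical names chosen for illustration; they are not the OpenTelemetry API, which generates and links these IDs for you. The ID sizes (16-byte trace ID, 8-byte span ID) follow the W3C Trace Context convention.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// Span is a hypothetical, simplified span record used only for illustration.
type Span struct {
	Name         string
	TraceID      string // shared by every span in the same trace
	SpanID       string // unique per operation
	ParentSpanID string // empty for the root span
}

// newID returns n random bytes encoded as lowercase hex.
func newID(n int) string {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		panic(err) // a failing random source is unrecoverable in this sketch
	}
	return hex.EncodeToString(b)
}

// StartRoot begins a new trace: fresh trace ID, no parent.
func StartRoot(name string) Span {
	return Span{Name: name, TraceID: newID(16), SpanID: newID(8)}
}

// StartChild creates a span in the same trace, linked to its parent.
func StartChild(parent Span, name string) Span {
	return Span{
		Name:         name,
		TraceID:      parent.TraceID,
		SpanID:       newID(8),
		ParentSpanID: parent.SpanID,
	}
}

func main() {
	root := StartRoot("GET /users/42")
	auth := StartChild(root, "authenticate")
	query := StartChild(root, "db.query")

	for _, s := range []Span{root, auth, query} {
		fmt.Printf("%-14s trace=%s span=%s parent=%q\n", s.Name, s.TraceID, s.SpanID, s.ParentSpanID)
	}
}
```

All three spans print the same trace ID, while the two children carry the root's span ID as their parent. A tracing backend uses exactly this link to rebuild the call hierarchy.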
Why Distributed Tracing Matters
In microservice architectures, a single user request might:
- Hit an API gateway
- Call an authentication service
- Query a user service
- Access a database
- Call multiple downstream services
Without tracing, debugging latency issues or failures becomes a nightmare. Distributed tracing gives you:
- Request visibility: See the complete path a request takes
- Performance analysis: Identify bottlenecks and slow services
- Error tracking: Understand where and why failures occur
- Dependency mapping: Discover service relationships automatically
- Root cause analysis: Quickly identify the source of problems
Good: Implementing Distributed Tracing with OpenTelemetry
OpenTelemetry is the modern standard for observability in Go. It provides a vendor-neutral API for tracing, metrics, and logs.
Basic Setup with OpenTelemetry
package main
import (
"context"
"fmt"
"log"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
"go.opentelemetry.io/otel/sdk/resource"
"go.opentelemetry.io/otel/sdk/trace"
semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)
// InitializeTracer sets up OpenTelemetry with an OTLP exporter.
// Jaeger accepts OTLP directly, so no Jaeger-specific exporter is needed.
func InitializeTracer() (*trace.TracerProvider, error) {
// Create an OTLP/gRPC exporter (defaults to localhost:4317; override with OTEL_EXPORTER_OTLP_ENDPOINT)
exporter, err := otlptracegrpc.New(context.Background())
if err != nil {
return nil, fmt.Errorf("failed to create exporter: %w", err)
}
// Create resource
res, err := resource.New(context.Background(),
resource.WithAttributes(
semconv.ServiceNameKey.String("my-service"),
semconv.ServiceVersionKey.String("1.0.0"),
),
)
if err != nil {
return nil, fmt.Errorf("failed to create resource: %w", err)
}
// Create tracer provider
tp := trace.NewTracerProvider(
trace.WithBatcher(exporter),
trace.WithResource(res),
)
// Set global tracer provider
otel.SetTracerProvider(tp)
return tp, nil
}
func main() {
// Initialize tracer
tp, err := InitializeTracer()
if err != nil {
log.Fatal(err)
}
defer func() {
if err := tp.Shutdown(context.Background()); err != nil {
log.Printf("Error shutting down tracer: %v", err)
}
}()
// Get tracer
tracer := otel.Tracer("my-service")
// Create a span (the returned context is unused here, so discard it to keep the code compiling)
_, span := tracer.Start(context.Background(), "main-operation")
defer span.End()
// Do work
fmt.Println("Processing request...")
}
Creating and Managing Spans
package main
import (
"context"
"fmt"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/codes"
)
// ProcessUserRequest demonstrates span creation and management
func ProcessUserRequest(ctx context.Context, userID string) error {
tracer := otel.Tracer("user-service")
// Create a span for the entire operation
ctx, span := tracer.Start(ctx, "process-user-request")
defer span.End()
// Add attributes to the span
span.SetAttributes(
attribute.String("user.id", userID),
attribute.String("operation", "process"),
)
// Simulate fetching user
if err := fetchUser(ctx, userID); err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, "failed to fetch user")
return err
}
// Simulate updating user
if err := updateUser(ctx, userID); err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, "failed to update user")
return err
}
// Mark success; the description is only kept for error statuses, so leave it empty
span.SetStatus(codes.Ok, "")
return nil
}
// fetchUser creates a child span
func fetchUser(ctx context.Context, userID string) error {
tracer := otel.Tracer("user-service")
ctx, span := tracer.Start(ctx, "fetch-user")
defer span.End()
span.SetAttributes(attribute.String("user.id", userID))
// Simulate database query
fmt.Printf("Fetching user %s from database\n", userID)
return nil
}
// updateUser creates another child span
func updateUser(ctx context.Context, userID string) error {
tracer := otel.Tracer("user-service")
ctx, span := tracer.Start(ctx, "update-user")
defer span.End()
span.SetAttributes(attribute.String("user.id", userID))
// Simulate database update
fmt.Printf("Updating user %s in database\n", userID)
return nil
}
Context Propagation Across Services
package main
import (
"context"
"fmt"
"net/http"
"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/propagation"
)
// SetupHTTPClient creates an HTTP client with tracing
func SetupHTTPClient() *http.Client {
return &http.Client{
Transport: otelhttp.NewTransport(http.DefaultTransport),
}
}
// CallDownstreamService demonstrates context propagation
func CallDownstreamService(ctx context.Context, serviceURL string) error {
tracer := otel.Tracer("api-gateway")
ctx, span := tracer.Start(ctx, "call-downstream-service")
defer span.End()
// Create HTTP request
req, err := http.NewRequestWithContext(ctx, "GET", serviceURL, nil)
if err != nil {
return err
}
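// Propagate trace context to the downstream service.
// The otelhttp transport used below already injects these headers; the explicit
// Inject call shows what happens under the hood. Both rely on a propagator being
// registered at startup, e.g. otel.SetTextMapPropagator(propagation.TraceContext{}),
// because the default global propagator is a no-op.
otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))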
// Make request
client := SetupHTTPClient()
resp, err := client.Do(req)
if err != nil {
return err
}
defer resp.Body.Close()
fmt.Printf("Response status: %d\n", resp.StatusCode)
return nil
}
// HTTPHandler demonstrates server-side tracing
func HTTPHandler(w http.ResponseWriter, r *http.Request) {
tracer := otel.Tracer("user-service")
// Extract trace context from incoming request
ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
ctx, span := tracer.Start(ctx, "handle-user-request")
defer span.End()
// Process request
w.WriteHeader(http.StatusOK)
w.Write([]byte("User processed"))
}
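With the W3C Trace Context propagator, the trace context travels in a `traceparent` HTTP header of the form `version-traceid-parentid-flags`. The sketch below formats and parses that header with the standard library only, to show roughly what `Inject` and `Extract` exchange on the wire. `FormatTraceparent` and `ParseTraceparent` are hypothetical helpers, not OpenTelemetry functions, and the parser checks only the basic shape rather than every rule in the specification (for example, it does not reject all-zero IDs).

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// FormatTraceparent builds a version-00 traceparent value.
// traceID must be 32 hex characters and spanID 16 hex characters.
func FormatTraceparent(traceID, spanID string, sampled bool) string {
	flags := "00"
	if sampled {
		flags = "01"
	}
	return fmt.Sprintf("00-%s-%s-%s", traceID, spanID, flags)
}

// ParseTraceparent splits a version-00 header into its fields.
func ParseTraceparent(h string) (traceID, spanID string, sampled bool, err error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return "", "", false, errors.New("unsupported traceparent format")
	}
	if len(parts[1]) != 32 || len(parts[2]) != 16 || len(parts[3]) != 2 {
		return "", "", false, errors.New("wrong field length")
	}
	for _, p := range parts[1:] {
		if _, decErr := hex.DecodeString(p); decErr != nil {
			return "", "", false, errors.New("field is not hex")
		}
	}
	flags, _ := hex.DecodeString(parts[3]) // already validated above
	return parts[1], parts[2], flags[0]&0x01 == 1, nil
}

func main() {
	h := FormatTraceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true)
	fmt.Println(h)

	traceID, spanID, sampled, err := ParseTraceparent(h)
	fmt.Println(traceID, spanID, sampled, err)
}
```

The downstream service reuses the trace ID from the header and treats the span ID as the parent of the span it starts, which is how a trace stays connected across process boundaries.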
Bad: Manual Tracing Without Standards
package main
import (
"fmt"
"time"
)
// ❌ BAD: Manual tracing without standards
type ManualTrace struct {
TraceID string
Spans []ManualSpan
}
type ManualSpan struct {
Name string
StartTime time.Time
EndTime time.Time
Duration time.Duration
}
// This approach has many problems:
// 1. No context propagation between services
// 2. Manual span management is error-prone
// 3. No standard format for trace data
// 4. Difficult to correlate traces across services
// 5. No integration with observability platforms
func (t *ManualTrace) RecordSpan(name string, duration time.Duration) {
span := ManualSpan{
Name: name,
Duration: duration,
}
t.Spans = append(t.Spans, span)
}
func main() {
trace := ManualTrace{TraceID: "manual-123"}
// Manual timing
start := time.Now()
// Do work
duration := time.Since(start)
trace.RecordSpan("operation", duration)
fmt.Printf("Trace: %v\n", trace)
}
Advanced Patterns
Sampling Strategies
package main
import (
"go.opentelemetry.io/otel/sdk/trace"
)
// ConfigureSampling sets up different sampling strategies
func ConfigureSampling() trace.Sampler {
// Always sample (development)
// return trace.AlwaysSample()
// Never sample (disable tracing)
// return trace.NeverSample()
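// Sample 10% of traces, decided deterministically from the trace ID (production)
return trace.TraceIDRatioBased(0.1)
// Respect the parent's sampling decision and apply the ratio only to root spans
// return trace.ParentBased(trace.TraceIDRatioBased(0.1))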
}
// Adaptive sampling based on error rate
type AdaptiveSampler struct {
baseRate float64 // fraction of traces sampled under normal conditions
errorRate float64 // current error rate, updated by the application (guard with a mutex or atomic in real code)
}
func (s *AdaptiveSampler) ShouldSample(parameters trace.SamplingParameters) trace.SamplingResult {
// Sample every trace while the error rate is high.
// A production sampler should also copy the parent's trace state into the result.
if s.errorRate > 0.05 {
return trace.SamplingResult{Decision: trace.RecordAndSample}
}
// Otherwise delegate to the built-in ratio sampler at the base rate
return trace.TraceIDRatioBased(s.baseRate).ShouldSample(parameters)
}
func (s *AdaptiveSampler) Description() string {
return "AdaptiveSampler"
}
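A ratio-based sampler has to reach the same decision for a given trace in every service, which is why it derives the decision from the trace ID instead of drawing a fresh random number. The following is a simplified, standard-library-only sketch of that idea. `shouldSample` is a hypothetical helper, and the exact bytes and comparison used inside the OpenTelemetry SDK may differ.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// shouldSample maps the last 8 bytes of the trace ID to a 63-bit number
// and samples the trace when that number falls below ratio * 2^63.
func shouldSample(traceID [16]byte, ratio float64) bool {
	if ratio <= 0 {
		return false
	}
	if ratio >= 1 {
		return true
	}
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	upperBound := uint64(ratio * (1 << 63))
	return x < upperBound
}

func main() {
	const total = 10000
	sampled := 0
	for i := 0; i < total; i++ {
		var id [16]byte
		// Spread the test IDs across the value space with a multiplicative
		// hash; real trace IDs are random, this is for illustration only.
		binary.BigEndian.PutUint64(id[8:], uint64(i)*0x9E3779B97F4A7C15)
		if shouldSample(id, 0.1) {
			sampled++
		}
	}
	fmt.Printf("sampled %d of %d traces\n", sampled, total)
}
```

Because the decision is a pure function of the trace ID and the ratio, two services configured with the same ratio keep or drop the same traces, so sampled traces are not left with missing spans.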
Baggage for Cross-Cutting Concerns
package main
import (
"context"
"go.opentelemetry.io/otel/baggage"
)
// AddBaggageToContext adds metadata that propagates across services
func AddBaggageToContext(ctx context.Context, userID, tenantID string) (context.Context, error) {
// Create baggage members
userMember, err := baggage.NewMember("user.id", userID)
if err != nil {
return ctx, err
}
tenantMember, err := baggage.NewMember("tenant.id", tenantID)
if err != nil {
return ctx, err
}
// Create baggage
bag, err := baggage.New(userMember, tenantMember)
if err != nil {
return ctx, err
}
// Add to context
return baggage.ContextWithBaggage(ctx, bag), nil
}
// RetrieveBaggageFromContext extracts metadata from context
func RetrieveBaggageFromContext(ctx context.Context) map[string]string {
bag := baggage.FromContext(ctx)
result := make(map[string]string)
for _, member := range bag.Members() {
result[member.Key()] = member.Value()
}
return result
}
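On the wire, baggage travels in a `baggage` HTTP header as comma-separated `key=value` pairs with percent-encoded values (the W3C Baggage format). The sketch below uses only the standard library to show roughly what that header looks like. `EncodeBaggage` and `DecodeBaggage` are hypothetical helpers that ignore optional member properties and size limits; in a real service the OpenTelemetry baggage propagator does this work.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
	"strings"
)

// EncodeBaggage serializes members as "k1=v1,k2=v2" with percent-encoded values.
func EncodeBaggage(members map[string]string) string {
	keys := make([]string, 0, len(members))
	for k := range members {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order, purely for readability

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, k+"="+url.PathEscape(members[k]))
	}
	return strings.Join(pairs, ",")
}

// DecodeBaggage parses a header produced by EncodeBaggage.
func DecodeBaggage(header string) map[string]string {
	out := make(map[string]string)
	for _, pair := range strings.Split(header, ",") {
		k, v, ok := strings.Cut(strings.TrimSpace(pair), "=")
		if !ok || k == "" {
			continue // skip malformed entries
		}
		if decoded, err := url.PathUnescape(v); err == nil {
			out[k] = decoded
		}
	}
	return out
}

func main() {
	h := EncodeBaggage(map[string]string{"user.id": "42", "tenant.id": "acme corp"})
	fmt.Println(h) // tenant.id=acme%20corp,user.id=42
	fmt.Println(DecodeBaggage(h))
}
```

Since every entry is sent as a header on each outgoing request, baggage is best kept to a few small, non-sensitive values.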
Metrics Integration with Tracing
package main
import (
"context"
"time"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/metric"
)
// TraceWithMetrics combines tracing and metrics
func TraceWithMetrics(ctx context.Context, operationName string) error {
tracer := otel.Tracer("service")
meter := otel.Meter("service")
// Create span
ctx, span := tracer.Start(ctx, operationName)
defer span.End()
// Create counter (in production code, create instruments once at startup and check the error)
counter, _ := meter.Int64Counter("operations.total")
counter.Add(ctx, 1, metric.WithAttributes(
attribute.String("operation", operationName),
))
// Create histogram for duration, recorded in seconds
histogram, _ := meter.Float64Histogram("operation.duration", metric.WithUnit("s"))
start := time.Now()
// Do work
duration := time.Since(start).Seconds()
histogram.Record(ctx, duration, metric.WithAttributes(
attribute.String("operation", operationName),
))
return nil
}
Best Practices
1. Always Propagate Context
// ✅ GOOD: Always pass context through function calls
func ProcessRequest(ctx context.Context, data string) error {
// Context flows through the call chain
return validateData(ctx, data)
}
func validateData(ctx context.Context, data string) error {
// Context is available for tracing
return nil
}
// ❌ BAD: Losing context
func ProcessRequestBad(data string) error {
// No context passed - tracing breaks
return validateDataBad(data)
}
func validateDataBad(data string) error {
return nil
}
2. Use Meaningful Span Names
// ✅ GOOD: Descriptive span names
tracer.Start(ctx, "fetch-user-from-database")
tracer.Start(ctx, "validate-email-format")
tracer.Start(ctx, "send-confirmation-email")
// ❌ BAD: Vague span names
tracer.Start(ctx, "do-work")
tracer.Start(ctx, "process")
tracer.Start(ctx, "execute")
3. Add Relevant Attributes
// ✅ GOOD: Rich attributes for debugging
span.SetAttributes(
attribute.String("user.id", userID),
attribute.String("email", email),
attribute.Int("retry.count", retries),
attribute.Bool("is.admin", isAdmin),
)
// ❌ BAD: Vague, uninformative attributes
span.SetAttributes(
attribute.String("data", "some data"),
)
4. Handle Errors Properly
// ✅ GOOD: Record errors in spans
if err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, err.Error())
return err
}
// ❌ BAD: Ignore errors in tracing
if err != nil {
return err
}
5. Configure Appropriate Sampling
// ✅ GOOD: Use sampling in production
sampler := trace.TraceIDRatioBased(0.1) // 10% sampling
// ❌ BAD: Sample everything in production
sampler := trace.AlwaysSample() // High overhead
Common Pitfalls
1. Losing Context in Goroutines
// ❌ BAD: Context lost in goroutine
go func() {
// doWork receives no context, so any spans it starts are detached from the current trace
doWork()
}()
// ✅ GOOD: Pass context to goroutine
go func(ctx context.Context) {
// ctx is available
doWork(ctx)
}(ctx)
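The same rule can be shown without any tracing library: values stored in a context reach a goroutine only if the context is handed to it. In this minimal sketch the request ID stands in for the active span. `ctxKey`, `worker`, and `run` are hypothetical names used for illustration.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// ctxKey is an unexported key type, the usual way to avoid key collisions.
type ctxKey struct{}

// worker reports the request ID it can see in its context.
func worker(ctx context.Context) string {
	if id, ok := ctx.Value(ctxKey{}).(string); ok {
		return id
	}
	return "<none>"
}

// run stores requestID in a context and starts worker in a goroutine.
// When propagate is false the goroutine gets a fresh background context,
// which mimics forgetting to pass ctx along.
func run(requestID string, propagate bool) string {
	ctx := context.WithValue(context.Background(), ctxKey{}, requestID)

	var wg sync.WaitGroup
	var seen string
	wg.Add(1)
	go func(ctx context.Context) {
		defer wg.Done()
		if !propagate {
			ctx = context.Background()
		}
		seen = worker(ctx)
	}(ctx)
	wg.Wait() // the write to seen happens before Wait returns
	return seen
}

func main() {
	fmt.Println(run("req-123", true))  // req-123
	fmt.Println(run("req-123", false)) // <none>
}
```

With a real tracer the effect is the same: a span started from a fresh background context has no parent and shows up as a separate trace.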
2. Span Leaks
// ❌ BAD: Span not ended
ctx, span := tracer.Start(ctx, "operation")
// Forgot to defer span.End()
// ✅ GOOD: Always defer span.End()
ctx, span := tracer.Start(ctx, "operation")
defer span.End()
3. Over-Instrumentation
// ❌ BAD: One span per iteration, and every deferred End() waits until the function returns
for i := 0; i < 1000; i++ {
_, span := tracer.Start(ctx, "loop-iteration")
defer span.End()
}
// ✅ GOOD: Batch operations
ctx, span := tracer.Start(ctx, "process-batch")
defer span.End()
for i := 0; i < 1000; i++ {
// Process item
}
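One reason the per-iteration version is costly is that `defer` runs when the enclosing function returns, not at the end of each loop iteration, so every span stays open until the whole loop has finished. This standard-library sketch records the order of events to show the difference. `deferInLoop` and `closurePerIteration` are hypothetical names, and the appended strings stand in for starting and ending a span.

```go
package main

import "fmt"

// deferInLoop defers the "end" of each iteration inside one function,
// so all ends run together, in reverse order, after the loop.
func deferInLoop() []string {
	var events []string
	func() {
		for i := 0; i < 2; i++ {
			events = append(events, fmt.Sprintf("start %d", i))
			defer func(i int) {
				events = append(events, fmt.Sprintf("end %d", i))
			}(i)
		}
	}()
	return events
}

// closurePerIteration wraps each iteration in its own function,
// so the deferred "end" runs before the next iteration starts.
func closurePerIteration() []string {
	var events []string
	for i := 0; i < 2; i++ {
		func() {
			events = append(events, fmt.Sprintf("start %d", i))
			defer func() {
				events = append(events, fmt.Sprintf("end %d", i))
			}()
		}()
	}
	return events
}

func main() {
	fmt.Println(deferInLoop())         // [start 0 start 1 end 1 end 0]
	fmt.Println(closurePerIteration()) // [start 0 end 0 start 1 end 1]
}
```

If individual iterations really do need their own spans, moving the loop body into a helper function (or calling `span.End()` explicitly at the end of the iteration) keeps each span's duration accurate.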
Resources
- OpenTelemetry Go Documentation: https://opentelemetry.io/docs/instrumentation/go/
- Jaeger Documentation: https://www.jaegertracing.io/docs/
- Zipkin Documentation: https://zipkin.io/
- OpenTelemetry Specification: https://opentelemetry.io/docs/reference/specification/
- Distributed Tracing Best Practices: https://opentelemetry.io/docs/concepts/observability-primer/
Summary
Distributed tracing is essential for understanding and debugging microservice architectures. By implementing OpenTelemetry in your Go applications, you gain:
- Complete visibility into request flows across services
- Performance insights to identify bottlenecks
- Error tracking to quickly resolve issues
- Automatic dependency mapping of your system
- Production-ready observability with industry standards
Start with basic span creation and context propagation, then gradually add more sophisticated patterns like sampling, baggage, and metrics integration. Remember to always propagate context, use meaningful span names, and add relevant attributes for effective debugging.