Testing in Production: Feature Flags and Canary Releases

Introduction

Testing in production sounds risky, but some bugs only surface under real traffic, real data, and real infrastructure. Feature flags and canary releases let you expose new code to a small, controlled share of users and roll it back quickly when something goes wrong.

Why Test in Production?

┌─────────────────────────────────────────────────────────────┐
│            Testing in Production Benefits                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ✓ Real users, real data, real conditions                  │
│  ✓ Find issues CI can't catch                              │
│  ✓ Faster feedback loops                                   │
│  ✓ A/B test new features                                   │
│  ✓ Instant rollback if issues occur                         │
│                                                             │
│  Risks:                                                     │
│  ✗ Users affected by bugs                                   │
│  ✗ Potential service disruptions                           │
│                                                             │
│  Solution: FEATURE FLAGS + CANARY                         │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Feature Flags

Basic Implementation

// Feature flag service
class FeatureFlags {
  private flags = new Map<string, boolean>();
  
  enable(feature: string) {
    this.flags.set(feature, true);
  }
  
  disable(feature: string) {
    this.flags.set(feature, false);
  }
  
  isEnabled(feature: string): boolean {
    return this.flags.get(feature) ?? false;
  }
}

const flags = new FeatureFlags();

// Usage in a React component
function Dashboard() {
  return flags.isEnabled('new-dashboard') ? <NewDashboard /> : <LegacyDashboard />;
}

With Providers

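// Use a hosted provider such as LaunchDarkly, Split, or Statsig
import * as LaunchDarkly from 'launchdarkly-node-server-sdk';

const client = LaunchDarkly.init(process.env.LD_KEY!);

// Check a feature flag for a user; evaluates to false if the flag cannot be resolved
async function checkFlag(userId: string, flag: string): Promise<boolean> {
  // Wait until the SDK has loaded its flag data
  await client.waitForInitialization();
  return client.variation(flag, { key: userId }, false);
}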

// In Express route
app.get('/dashboard', async (req, res) => {
  const userId = req.user.id;
  const useNewDashboard = await checkFlag(userId, 'new-dashboard');
  
  if (useNewDashboard) {
    return res.render('dashboard-new');
  }
  return res.render('dashboard-legacy');
});

Gradual Rollout

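// Percentage rollout
// Simple 32-bit string hash (the same scheme as Java's String.hashCode)
function hashCode(value: string): number {
  let hash = 0;
  for (let i = 0; i < value.length; i++) {
    hash = (Math.imul(hash, 31) + value.charCodeAt(i)) | 0;
  }
  return hash;
}

function isInRollout(userId: string, percentage: number): boolean {
  // Deterministic selection: the same user always lands in the same bucket
  const bucket = Math.abs(hashCode(userId)) % 100;
  return bucket < percentage;
}

// Usage
const rollout = isInRollout(userId, 10); // 10% rollout

if (rollout) {
  // enableFeature is the application's own helper; it should enable the
  // feature for this user only, not globally
  enableFeature('new-checkout');
}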
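
One caveat with hashing the user ID alone is that the same users end up in the first 10% of every rollout. A minimal sketch of one way around that, salting the hash with the flag name: the names fnv1a, rolloutBucket, and isUserInRollout are hypothetical, and the hash is an FNV-1a-style function over UTF-16 code units rather than any provider's actual algorithm.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units; returns an unsigned integer.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// Bucket (0-99) for one user and one flag. Salting with the flag name gives
// each flag its own ordering of users.
function rolloutBucket(flagName: string, userId: string): number {
  return fnv1a(`${flagName}:${userId}`) % 100;
}

// True when the user falls inside the first `percentage` buckets of this flag.
function isUserInRollout(flagName: string, userId: string, percentage: number): boolean {
  return rolloutBucket(flagName, userId) < percentage;
}
```

Because the bucket is fixed per user and flag, raising the percentage from 10 to 30 keeps the original 10% of users in the rollout and only adds new ones.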

Canary Deployments

Architecture

┌─────────────────────────────────────────────────────────────┐
│                   Canary Deployment                            │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│                    Load Balancer                              │
│                         │                                    │
│           ┌─────────────┼─────────────┐                     │
│           │             │             │                     │
│      ┌────▼────┐   ┌────▼────┐   ┌────▼────┐              │
│      │ Canary  │   │ Canary  │   │  Main   │              │
│      │   v2    │   │   v2    │   │   v1    │              │
│      │  10%    │   │  10%    │   │  80%    │              │
│      └─────────┘   └─────────┘   └─────────┘              │
│           │             │             │                     │
│           └─────────────┼─────────────┘                     │
│                         ▼                                    │
│              Monitoring & Metrics                             │
│                         │                                    │
│              Promote or Rollback                            │
│                                                             │
└─────────────────────────────────────────────────────────────┘
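
The load balancer in the diagram splits requests by weight. In practice the load balancer or service mesh does this for you, but the selection itself can be sketched in a few lines. The Backend shape and the pickBackend function below are hypothetical names used only for illustration.

```typescript
interface Backend {
  name: string;
  weight: number; // relative share of traffic, e.g. 10 for 10%
}

// Pick a backend with probability proportional to its weight.
// `roll` is a number in [0, 1); it defaults to Math.random() and is a
// parameter only so the function can be tested deterministically.
function pickBackend(backends: Backend[], roll: number = Math.random()): Backend {
  const total = backends.reduce((sum, b) => sum + b.weight, 0);
  if (backends.length === 0 || total <= 0) {
    throw new Error('at least one backend with a positive weight is required');
  }
  let remaining = roll * total;
  for (const backend of backends) {
    remaining -= backend.weight;
    if (remaining < 0) return backend;
  }
  // Guards against floating-point rounding at the upper edge
  return backends[backends.length - 1];
}

// The split from the diagram: two canary instances at 10% each, main at 80%
const backends: Backend[] = [
  { name: 'canary-1', weight: 10 },
  { name: 'canary-2', weight: 10 },
  { name: 'main', weight: 80 },
];
```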

Kubernetes Canary

# kubernetes/canary-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
      version: canary
  template:
    metadata:
      labels:
        app: myapp
        version: canary
    spec:
      containers:
        - name: myapp
          image: myapp:v2
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: myapp
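spec:
  # Selects on app only, so canary pods and any stable pods labelled
  # app: myapp share traffic roughly in proportion to their replica counts
  selector:
    app: myapp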
  ports:
    - port: 80
      targetPort: 8080

Argo Rollouts

# argo-rollouts.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 10
  # selector and pod template omitted for brevity; a Rollout needs both,
  # just as a Deployment does
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 10m}
        - setWeight: 30
        - pause: {duration: 10m}
        - setWeight: 100
      canaryMetadata:
        labels:
          role: canary
      stableMetadata:
        labels:
          role: stable

Monitoring for Issues

Key Metrics

# Metrics to monitor
metrics:
  - "Error rate (should stay low)"
  - "Latency (p50, p95, p99)"
  - "HTTP status codes (4xx, 5xx)"
  - "Business metrics (conversions, signups)"
  - "User feedback"
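
If your metrics system does not report latency percentiles directly, they can be computed from raw samples. A minimal sketch using the nearest-rank method, assuming latencies are collected as an array of milliseconds; percentile and summarizeLatency are hypothetical helper names.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  if (p <= 0 || p > 100) throw new RangeError('p must be in (0, 100]');
  const sorted = [...samples].sort((a, b) => a - b);
  // Multiply before dividing to avoid floating-point surprises like 0.07 * 100
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// The three percentiles listed above, computed in one call
function summarizeLatency(samples: number[]): { p50: number; p95: number; p99: number } {
  return {
    p50: percentile(samples, 50),
    p95: percentile(samples, 95),
    p99: percentile(samples, 99),
  };
}
```

Averages hide tail latency, which is why p95 and p99 are usually watched alongside the median during a canary.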

Automated Rollback

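// Automated canary analysis
async function analyzeCanary() {
  const canary = await getMetrics('canary');
  // Baseline: the stable version over the same window
  // (assumes getMetrics accepts 'stable' the same way it accepts 'canary')
  const baseline = await getMetrics('stable');

  // No traffic yet: wait for data instead of promoting blindly
  if (canary.requests === 0) {
    return;
  }

  const errorRate = canary.errors / canary.requests;
  const latencyP99 = canary.latency.p99;

  // Rollback if error rate > 1%
  if (errorRate > 0.01) {
    await rollbackCanary();
    await alert('Canary rolled back - error rate exceeded 1%');
    return;
  }

  // Rollback if p99 latency is more than 50% above the baseline
  if (latencyP99 > baseline.latency.p99 * 1.5) {
    await rollbackCanary();
    await alert('Canary rolled back - latency degradation');
    return;
  }

  // Promote if metrics look good
  await promoteCanary();
}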

Advanced A/B Testing and Statistical Analysis

Experiment Design

A sound A/B test fixes its sample size in advance, based on the smallest effect you want to detect, the significance level, and the statistical power:

interface Variant {
  name: string;
}

interface ExperimentConfig {
  name: string;
  variants: Variant[];
  minimumDetectableEffect: number;  // relative change, e.g., 0.05 for a 5% improvement
  significanceLevel: number;        // e.g., 0.05 for 95% confidence
  statisticalPower: number;         // e.g., 0.80 for 80% power
}

function calculateSampleSize(config: ExperimentConfig): number {
  const { minimumDetectableEffect } = config;
  // Hard-coded z-scores: valid only for significanceLevel = 0.05 (two-sided)
  // and statisticalPower = 0.80
  const zAlpha = 1.96;
  const zBeta = 0.84;

  // Assume baseline conversion rate of 5%
  const baselineRate = 0.05;
  const variantRate = baselineRate * (1 + minimumDetectableEffect);

  const sampleSize =
    Math.pow(zAlpha + zBeta, 2) *
    (baselineRate * (1 - baselineRate) + variantRate * (1 - variantRate)) /
    Math.pow(variantRate - baselineRate, 2);

  return Math.ceil(sampleSize);  // users required per variant
}

// The required sample size determines the experiment duration
const requiredSample = calculateSampleSize({
  name: "checkout-redesign",
  variants: [{ name: "control" }, { name: "variant" }],
  minimumDetectableEffect: 0.1,  // Detect 10% relative change
  significanceLevel: 0.05,
  statisticalPower: 0.80,
});
// About 31,200 users per variant. With 10,000 daily visitors split evenly
// between the two variants, the experiment needs about 7 days.
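
For reference, the function above is the standard two-proportion sample size formula. Written out with the same assumptions as the code (5% baseline, 10% relative lift, two-sided 5% significance, 80% power), the arithmetic is:

```latex
n = \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\,
          \bigl[p_1(1-p_1) + p_2(1-p_2)\bigr]}{(p_2 - p_1)^{2}}

% p_1 = 0.05, \quad p_2 = 0.05 \times 1.10 = 0.055
n = \frac{(1.96 + 0.84)^{2}\,(0.0475 + 0.051975)}{(0.005)^{2}}
  = \frac{7.84 \times 0.099475}{0.000025}
  \approx 31{,}196 \text{ users per variant}
```

The squared difference in the denominator is why small effects are expensive: halving the detectable lift roughly quadruples the required sample.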

Bayesian Analysis for Faster Results

A Bayesian analysis reports the probability that the variant beats the control, which many teams find easier to interpret than a frequentist p-value:

class BayesianABTest {
  // Beta-Binomial model for conversion rates
  evaluate(control: { conversions: number; visitors: number },
           variant: { conversions: number; visitors: number }) {
    // Simulate posterior distributions using Beta distribution
    const simulations = 100000;
    let variantWins = 0;

    for (let i = 0; i < simulations; i++) {
      const controlRate = this.sampleBeta(
        control.conversions + 1,
        control.visitors - control.conversions + 1
      );
      const variantRate = this.sampleBeta(
        variant.conversions + 1,
        variant.visitors - variant.conversions + 1
      );

      if (variantRate > controlRate) {
        variantWins++;
      }
    }

    return {
      probabilityVariantIsBetter: variantWins / simulations,
      controlRate: control.conversions / control.visitors,
      variantRate: variant.conversions / variant.visitors,
      lift: ((variant.conversions / variant.visitors) /
             (control.conversions / control.visitors) - 1) * 100
    };
  }

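  private sampleBeta(alpha: number, beta: number): number {
    // If X ~ Gamma(alpha, 1) and Y ~ Gamma(beta, 1), then X / (X + Y) ~ Beta(alpha, beta)
    const x = this.sampleGamma(alpha);
    const y = this.sampleGamma(beta);
    return x / (x + y);
  }

  private sampleGamma(shape: number): number {
    // Marsaglia-Tsang method; shapes below 1 are sampled at shape + 1 and scaled back
    if (shape < 1) {
      return this.sampleGamma(shape + 1) * Math.pow(1 - Math.random(), 1 / shape);
    }
    const d = shape - 1 / 3;
    const c = 1 / Math.sqrt(9 * d);
    while (true) {
      const x = this.sampleNormal();
      const v = Math.pow(1 + c * x, 3);
      if (v <= 0) continue;
      const u = 1 - Math.random();  // in (0, 1], so Math.log(u) is finite
      if (Math.log(u) < 0.5 * x * x + d - d * v + d * Math.log(v)) {
        return d * v;
      }
    }
  }

  private sampleNormal(): number {
    // Box-Muller transform for a standard normal variate
    const u1 = 1 - Math.random();
    const u2 = Math.random();
    return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  }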
}
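
A Monte Carlo estimate like the one above is easy to get subtly wrong, so it helps to cross-check it against a closed-form approximation. The sketch below approximates each Beta posterior with a normal distribution, which is reasonable once each arm has a few hundred visitors. It is a sanity check under that assumption, not a replacement for the simulation; all function names are hypothetical.

```typescript
interface ArmData {
  conversions: number;
  visitors: number;
}

// Mean and variance of the Beta(conversions + 1, failures + 1) posterior
function betaPosterior(arm: ArmData): { mean: number; variance: number } {
  const a = arm.conversions + 1;
  const b = arm.visitors - arm.conversions + 1;
  const total = a + b;
  return { mean: a / total, variance: (a * b) / (total * total * (total + 1)) };
}

// Abramowitz-Stegun approximation 7.1.26 of the error function
// (absolute error below about 1.5e-7)
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Standard normal cumulative distribution function
function normalCdf(z: number): number {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

// P(variant rate > control rate) under the normal approximation
function approxProbabilityVariantIsBetter(control: ArmData, variant: ArmData): number {
  const c = betaPosterior(control);
  const v = betaPosterior(variant);
  return normalCdf((v.mean - c.mean) / Math.sqrt(v.variance + c.variance));
}
```

If the simulated probability and this approximation disagree by more than a percentage point or so on large samples, the sampler is the first place to look.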

Shadow Testing: Test with Production Traffic

Shadow testing (dark launching) sends a copy of production traffic to a new service and discards its responses, so users only ever see the response from the existing service:

interface ShadowConfig {
  enabled: boolean;
  captureRate: number;  // 0.0 to 1.0
  shadowService: string;
  timeout: number;      // milliseconds
}

class ShadowTester {
  // handle() and recordError() are assumed to be implemented elsewhere in this class.
  // The shadow service must not write to shared production state.

  async testEndpoint(
    originalRequest: Request,
    shadowConfig: ShadowConfig
  ): Promise<Response> {
    // Sample traffic for shadow testing
    const sampled = shadowConfig.enabled && Math.random() < shadowConfig.captureRate;
    // Clone before handling: a request body can only be read once
    const shadowCopy = sampled ? originalRequest.clone() : null;

    // Always serve the original response
    const originalResponse = await this.handle(originalRequest);

    if (shadowCopy) {
      // Fire and forget: shadow request with timeout
      this.shadowRequest(shadowCopy, originalResponse.status, shadowConfig).catch(err => {
        console.error(`Shadow test failed: ${err.message}`);
        // Never fail the original request
      });
    }

    return originalResponse;
  }

  private async shadowRequest(
    request: Request,
    originalStatus: number,
    config: ShadowConfig
  ): Promise<void> {
    const start = performance.now();

    try {
      // GET and HEAD requests must not carry a body
      const hasBody = request.method !== 'GET' && request.method !== 'HEAD';
      const shadowResponse = await fetch(config.shadowService, {
        method: request.method,
        headers: request.headers,
        body: hasBody ? await request.arrayBuffer() : undefined,
        signal: AbortSignal.timeout(config.timeout)
      });

      const latency = performance.now() - start;

      // Compare responses
      await this.recordComparison({
        statusMatch: originalStatus === shadowResponse.status,
        latencyMs: latency,
        shadowStatus: shadowResponse.status,
        originalStatus
      });
    } catch (error) {
      await this.recordError(error);
    }
  }

  private async recordComparison(data: Record<string, boolean | number>): Promise<void> {
    // Store in time-series database for analysis
    await metricsClient.increment("shadow_test.comparison", data);
  }
}
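
Matching status codes is a coarse signal: two services can both return 200 with different payloads. A minimal sketch of a body comparison that ignores fields expected to differ between services, such as request IDs and timestamps. The Json type, the diffJson function, and the ignored field names are illustrative assumptions.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

function isJsonObject(value: Json): value is { [key: string]: Json } {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Returns the paths at which the two JSON values differ, skipping any
// object key listed in ignoredKeys.
function diffJson(original: Json, shadow: Json, ignoredKeys: Set<string>, path = '$'): string[] {
  if (Array.isArray(original) && Array.isArray(shadow)) {
    const diffs: string[] = [];
    if (original.length !== shadow.length) diffs.push(`${path}.length`);
    const shared = Math.min(original.length, shadow.length);
    for (let i = 0; i < shared; i++) {
      diffs.push(...diffJson(original[i], shadow[i], ignoredKeys, `${path}[${i}]`));
    }
    return diffs;
  }
  if (isJsonObject(original) && isJsonObject(shadow)) {
    const keys = Array.from(new Set(Object.keys(original).concat(Object.keys(shadow)))).sort();
    const diffs: string[] = [];
    for (const key of keys) {
      if (ignoredKeys.has(key)) continue;
      if (!(key in original) || !(key in shadow)) {
        diffs.push(`${path}.${key}`); // key present on one side only
        continue;
      }
      diffs.push(...diffJson(original[key], shadow[key], ignoredKeys, `${path}.${key}`));
    }
    return diffs;
  }
  // Primitives, null, or mismatched types
  return original === shadow ? [] : [path];
}
```

Recording the differing paths rather than a single match flag makes it much easier to see which field a new service gets wrong.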

Synthetic Production Monitoring

Run synthetic transactions against production to detect issues before real users do:

interface CheckResult {
  passed: boolean;
  failures: PromiseRejectedResult[];
}

class SyntheticMonitor {
  async runCheck(): Promise<CheckResult> {
    const checks = [
      this.checkHealthEndpoint(),
      this.checkCriticalFlow(),
      this.checkLatency(),
      this.checkDatabaseConnectivity(),
    ];

    const results = await Promise.allSettled(checks);
    // Type guard so that failures is typed as PromiseRejectedResult[]
    const failures = results.filter(
      (r): r is PromiseRejectedResult => r.status === 'rejected'
    );

    if (failures.length > 0) {
      await this.alertOnFailure(failures);
      return { passed: false, failures };
    }

    return { passed: true, failures: [] };
  }

  private async checkCriticalFlow(): Promise<void> {
    // Simulate a complete user journey, ideally with a dedicated test account
    // so that synthetic orders can be filtered out of business metrics
    const session = await this.createSession();
    const product = await this.searchProduct(session, "widget");
    const cart = await this.addToCart(session, product.id);
    const order = await this.checkout(session, cart.id);

    // If we got an order ID without errors, the flow works
    if (!order.id) {
      throw new Error("Critical checkout flow failed");
    }
  }

  private async checkLatency(): Promise<void> {
    const thresholds = { p50: 200, p95: 500, p99: 1000 };  // milliseconds
    const quantiles = { p50: 0.5, p95: 0.95, p99: 0.99 };

    const latencies = await this.getRecentLatencies("checkout", 1000);
    if (latencies.length === 0) {
      throw new Error("No latency samples for checkout");
    }
    const sorted = [...latencies].sort((a, b) => a - b);

    // Check every percentile, not only p99
    for (const name of ["p50", "p95", "p99"] as const) {
      const index = Math.min(sorted.length - 1, Math.floor(sorted.length * quantiles[name]));
      if (sorted[index] > thresholds[name]) {
        throw new Error(`${name} latency exceeded ${thresholds[name]}ms`);
      }
    }
  }

  private async alertOnFailure(failures: PromiseRejectedResult[]): Promise<void> {
    await alertingClient.send({
      severity: "critical",
      title: "Synthetic monitor failure",
      detail: `${failures.length} checks failed`,
      timestamp: new Date().toISOString(),
    });
  }
}
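
One gap in a monitor like this is a check that hangs: Promise.allSettled waits for every check, so a single stalled request can block the whole run and the alert never fires. A minimal sketch of a timeout wrapper, assuming each check is a plain promise; withTimeout is a hypothetical helper.

```typescript
// Rejects with a descriptive error if `work` has not settled within `ms`
// milliseconds. The timer is always cleared so it cannot keep the process alive.
async function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```

Each check could then be wrapped before it is passed to Promise.allSettled, for example withTimeout(someCheck(), 5000, "health"), so a hung check is reported as a failure like any other. Note that the wrapper only stops waiting; it does not cancel the underlying work.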

Blue-Green Deployment with Traffic Mirroring

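Blue-green deployments reduce risk by running two identical environments side by side: one serves live traffic while the other holds the new version. Mirroring a share of live traffic to the idle environment exercises the new version before the switch: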

# Kubernetes blue-green deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
spec:
  replicas: 5
  selector:
    matchLabels:
      app: myapp
      color: blue
  template:
    metadata:
      labels:
        app: myapp
        color: blue
    spec:
      containers:
        - name: app
          image: myapp:v2.0.0  # New version
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
spec:
  replicas: 5
  selector:
    matchLabels:
      app: myapp
      color: green
  template:
    metadata:
      labels:
        app: myapp
        color: green
    spec:
      containers:
        - name: app
          image: myapp:v1.9.9  # Old version
---
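# Traffic mirroring: green serves all user traffic, and a copy of 10% of
# requests goes to blue, whose responses are discarded.
# Assumes Services named myapp-green and myapp-blue exist for the two Deployments.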
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
    - myapp
  http:
    - route:
        - destination:
            host: myapp-green
          weight: 100
      mirror:
        host: myapp-blue
      mirrorPercentage:
        value: 10.0  # Mirror 10% of traffic

Feature Flag Lifecycle Management

Every flag adds a branch that has to be tested and eventually deleted, so flags that are never cleaned up turn into technical debt:

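// Shape inferred from how the fields are used below
interface FlagDefinition {
  name: string;
  status: string;
  createdAt: Date;
  expiresAt?: Date;
  rolloutPercentage: number;
}

class FlagLifecycle {
  private flags: Map<string, FlagDefinition> = new Map();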

  registerFlag(flag: FlagDefinition): void {
    flag.createdAt = new Date();
    flag.status = 'active';
    this.flags.set(flag.name, flag);
  }

  async enforceCleanup(): Promise<void> {
    const now = new Date();

    for (const [name, flag] of this.flags) {
      // Check expiration
      if (flag.expiresAt && now > flag.expiresAt) {
        if (flag.rolloutPercentage === 100) {
          // Fully rolled out — remove flag code
          await this.scheduleRemoval(flag);
        } else {
          // Expired but not fully rolled out — alert
          await this.alertExpiredFlag(flag);
        }
      }

      // Check staleness
      const age = now.getTime() - flag.createdAt.getTime();
      if (age > 90 * 24 * 60 * 60 * 1000) {  // 90 days
        await this.alertStaleFlag(flag, age);
      }
    }
  }

  private async scheduleRemoval(flag: FlagDefinition): Promise<void> {
    // Create a ticket/issue for flag removal
    await issueTracker.create({
      title: `Remove feature flag: ${flag.name}`,
      description: `Flag ${flag.name} is at 100% rollout and should be removed from codebase.`,
      labels: ['flag-cleanup', 'tech-debt'],
      priority: 'medium'
    });
  }

  async generateFlagReport(): Promise<FlagReport> {
    const total = this.flags.size;
    const active = [...this.flags.values()].filter(f => f.status === 'active').length;
    const stale = [...this.flags.values()].filter(f => {
      const age = new Date().getTime() - f.createdAt.getTime();
      return age > 90 * 24 * 60 * 60 * 1000;
    }).length;

    return {
      totalFlags: total,
      activeFlags: active,
      staleFlags: stale,
      cleanupRate: total > 0 ? ((total - stale) / total) * 100 : 100,
      oldestFlag: [...this.flags.entries()]
        .sort((a, b) => a[1].createdAt.getTime() - b[1].createdAt.getTime())[0]?.[0]
    };
  }
}

// Centralized flag management
const flagDashboard = {
  productionFlags: [
    { name: "new-checkout", rollout: 100, age: 45, status: "cleanup-ready" },
    { name: "dark-mode", rollout: 50, age: 120, status: "stale-review" },
    { name: "ai-recs", rollout: 10, age: 30, status: "active-experiment" },
    { name: "legacy-ui", rollout: 0, age: 200, status: "deprecated-remove" },
  ]
};
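
The status column in a dashboard like this can be derived from rollout and age rather than maintained by hand. A minimal sketch whose rules are inferred from the four sample rows above, with the 90-day staleness threshold used earlier in this section; classifyFlag and the FlagSummary type are hypothetical.

```typescript
type FlagStatus = 'cleanup-ready' | 'deprecated-remove' | 'stale-review' | 'active-experiment';

interface FlagSummary {
  name: string;
  rollout: number; // percentage of users, 0-100
  age: number;     // days since the flag was created
}

const STALE_AFTER_DAYS = 90;

// Rules are checked from most to least specific
function classifyFlag(flag: FlagSummary): FlagStatus {
  if (flag.rollout === 100) return 'cleanup-ready';
  if (flag.rollout === 0 && flag.age > STALE_AFTER_DAYS) return 'deprecated-remove';
  if (flag.age > STALE_AFTER_DAYS) return 'stale-review';
  return 'active-experiment';
}
```

Deriving the status keeps the dashboard honest: a flag cannot stay marked as an active experiment simply because nobody updated the label.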

Production Canary Analysis with Prometheus

Automate canary promotion decisions with metrics-based analysis:

# Argo Rollouts analysis template
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-analysis
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.99
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status=~"2.."
            }[5m]))
            /
            sum(rate(http_requests_total{
              service="{{args.service-name}}"
            }[5m]))

    - name: error-rate
      interval: 1m
      successCondition: result[0] <= 0.01
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status=~"5.."
            }[5m]))
            /
            sum(rate(http_requests_total{
              service="{{args.service-name}}"
            }[5m]))

    - name: latency-p99
      interval: 1m
      # The histogram is recorded in seconds, so 0.5 means 500 ms
      successCondition: result[0] <= 0.5
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                service="{{args.service-name}}"
              }[5m])) by (le))

Key Takeaways

Feature flags - Toggle features without deploying
Canary releases - Expose a new version to a small percentage of traffic first
Monitor metrics - Error rate, latency, business metrics
Automated rollback - React quickly to issues
Shadow testing - Validate new services with real traffic, zero user impact
Synthetic monitoring - Detect failures before users notice
Bayesian A/B testing - Reports the probability that a variant wins, which is often easier to act on than a p-value
Flag lifecycle - Clean up flags to prevent technical debt
