Introduction
Testing in production sounds risky, but some bugs only show up under real traffic, real data, and real infrastructure. Practices such as feature flags and canary releases limit how many users a change can reach, so you can test in production while keeping that risk small.
Why Test in Production?
┌─────────────────────────────────────────────────────────────┐
│               Testing in Production Benefits                │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ✓ Real users, real data, real conditions                   │
│  ✓ Find issues CI can't catch                               │
│  ✓ Faster feedback loops                                    │
│  ✓ A/B test new features                                    │
│  ✓ Instant rollback if issues occur                         │
│                                                             │
│  Risks:                                                     │
│  ✗ Users affected by bugs                                   │
│  ✗ Potential service disruptions                            │
│                                                             │
│  Solution: FEATURE FLAGS + CANARY                           │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Feature Flags
Basic Implementation
// Feature flag service
class FeatureFlags {
private flags = new Map<string, boolean>();
enable(feature: string) {
this.flags.set(feature, true);
}
disable(feature: string) {
this.flags.set(feature, false);
}
isEnabled(feature: string): boolean {
return this.flags.get(feature) ?? false;
}
}
const flags = new FeatureFlags();
// Usage in code
if (flags.isEnabled('new-dashboard')) {
return <NewDashboard />;
} else {
return <LegacyDashboard />;
}
With Providers
// Use LaunchDarkly, Split, or Statsig
import * as LaunchDarkly from 'launchdarkly-node-server-sdk';
const client = LaunchDarkly.init(process.env.LD_KEY!);
// Check feature flag
async function checkFlag(userId: string, flag: string) {
const value = await client.variation(flag, {
key: userId
}, false);
return value;
}
// In Express route
app.get('/dashboard', async (req, res) => {
const userId = req.user.id;
const useNewDashboard = await checkFlag(userId, 'new-dashboard');
if (useNewDashboard) {
return res.render('dashboard-new');
}
return res.render('dashboard-legacy');
});
Gradual Rollout
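// Percentage rollout
// Deterministic 32-bit string hash: the same user always lands in the same bucket
function hashCode(input: string): number {
  let hash = 0;
  for (let i = 0; i < input.length; i++) {
    hash = (Math.imul(hash, 31) + input.charCodeAt(i)) | 0;
  }
  return hash;
}
async function isInRollout(userId: string, percentage: number): Promise<boolean> {
  // Map the hash to a bucket in 0-99; buckets below the percentage get the feature
  const bucket = Math.abs(hashCode(userId)) % 100;
  return bucket < percentage;
}
// Usage
const rollout = await isInRollout(userId, 10); // 10% rollout
if (rollout) {
  enableFeature('new-checkout'); // enableFeature is the application's own hook
}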
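One refinement worth considering: if every flag hashes the bare user ID, the same 10% of users receive every experimental feature. The sketch below salts the hash with the flag name so each flag gets its own cohort. The names `fnv1a`, `bucketFor`, and `inRollout` are illustrative and not taken from any flag provider; the hash is FNV-1a computed over UTF-16 code units.

```typescript
// FNV-1a (32-bit) over UTF-16 code units: small, deterministic, adequate for bucketing.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Salting with the flag name gives each flag an independent set of buckets.
function bucketFor(flagName: string, userId: string): number {
  return fnv1a(`${flagName}:${userId}`) % 100;
}

// Hypothetical helper: true when the user's bucket falls below the rollout percentage.
function inRollout(flagName: string, userId: string, percentage: number): boolean {
  return bucketFor(flagName, userId) < percentage;
}
```

Because a user's bucket never changes, raising the percentage from 10 to 30 only adds users; nobody who already has the feature loses it.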
Canary Deployments
Architecture
┌─────────────────────────────────────────────────────────────┐
│                      Canary Deployment                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│                        Load Balancer                        │
│                              │                              │
│                ┌─────────────┼─────────────┐                │
│                │             │             │                │
│           ┌────▼────┐   ┌────▼────┐   ┌────▼────┐           │
│           │ Canary  │   │ Canary  │   │  Main   │           │
│           │   v2    │   │   v2    │   │   v1    │           │
│           │   10%   │   │   10%   │   │   80%   │           │
│           └─────────┘   └─────────┘   └─────────┘           │
│                │             │             │                │
│                └─────────────┼─────────────┘                │
│                              ▼                              │
│                    Monitoring & Metrics                     │
│                              │                              │
│                     Promote or Rollback                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘
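The split in the diagram can be pictured as a weighted pick per request. The sketch below is only an illustration of that idea; in practice the load balancer or service mesh does this for you. `Backend`, `pickBackend`, and the backend names are hypothetical.

```typescript
interface Backend {
  name: string;
  weight: number; // relative share of traffic
}

// `roll` is a number in [0, 1); it is injectable so the pick is deterministic in tests.
function pickBackend(backends: Backend[], roll: number = Math.random()): Backend {
  const total = backends.reduce((sum, b) => sum + b.weight, 0);
  if (total <= 0) throw new Error("at least one backend needs a positive weight");
  let threshold = roll * total;
  for (const backend of backends) {
    if (threshold < backend.weight) return backend;
    threshold -= backend.weight;
  }
  return backends[backends.length - 1]; // guards against floating-point drift
}

// The 10% / 10% / 80% split from the diagram.
const backends: Backend[] = [
  { name: "canary-v2-a", weight: 10 },
  { name: "canary-v2-b", weight: 10 },
  { name: "main-v1", weight: 80 },
];
```

A random pick per request means one user can bounce between versions. If that matters, derive `roll` from a hash of the user ID instead of `Math.random()`.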
Kubernetes Canary
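# kubernetes/canary-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
      version: canary
  template:
    metadata:
      labels:
        app: myapp       # matched by the Service below
        version: canary  # must match spec.selector
    spec:
      containers:
        - name: myapp
          image: myapp:v2
          ports:
            - containerPort: 8080
---
# The Service selects on app only, so it spreads traffic across stable and canary pods
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
  ports:
    - port: 80
      targetPort: 8080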
Argo Rollouts
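# argo-rollouts.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 10
  # A Rollout needs a selector and pod template, just like a Deployment
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myapp:v2
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 10m}
        - setWeight: 30
        - pause: {duration: 10m}
        - setWeight: 100
      canaryMetadata:
        labels:
          role: canary
      stableMetadata:
        labels:
          role: stable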
Monitoring for Issues
Key Metrics
# Metrics to monitor
metrics:
- "Error rate (should stay low)"
- "Latency (p50, p95, p99)"
- "HTTP status codes (4xx, 5xx)"
- "Business metrics (conversions, signups)"
- "User feedback"
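To make the first two metrics concrete, here is a minimal sketch that derives an error rate and latency percentiles from raw request samples. It assumes you already have per-request status codes and durations; `RequestSample`, `percentile`, and `summarize` are illustrative names, and the percentile uses the nearest-rank definition.

```typescript
interface RequestSample {
  status: number;     // HTTP status code
  durationMs: number; // request latency in milliseconds
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarizes a window of requests; only 5xx responses count as errors here.
function summarize(samples: RequestSample[]) {
  const durations = samples.map((s) => s.durationMs);
  const serverErrors = samples.filter((s) => s.status >= 500).length;
  return {
    errorRate: serverErrors / samples.length,
    p50: percentile(durations, 50),
    p95: percentile(durations, 95),
    p99: percentile(durations, 99),
  };
}
```

In production these numbers usually come from a metrics system rather than in-process arrays, but the definitions are the same.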
Automated Rollback
// Automated canary analysis
async function analyzeCanary() {
  // Compare the canary with the stable version over the same time window
  // ('stable' is assumed to be the label of the current production version)
  const canary = await getMetrics('canary');
  const baseline = await getMetrics('stable');
  // Without traffic there is nothing to judge yet
  if (canary.requests === 0) {
    return;
  }
  const errorRate = canary.errors / canary.requests;
  // Rollback if error rate > 1%
  if (errorRate > 0.01) {
    await rollbackCanary();
    await alert('Canary rolled back - error rate exceeded 1%');
    return;
  }
  // Rollback if p99 latency is more than 50% above the stable baseline
  if (canary.latency.p99 > baseline.latency.p99 * 1.5) {
    await rollbackCanary();
    await alert('Canary rolled back - latency degradation');
    return;
  }
  // Promote if metrics look good
  await promoteCanary();
}
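It can help to keep the decision itself separate from the side effects, so the thresholds can be unit tested without touching a cluster. The following is a sketch under that assumption; `WindowMetrics`, `Thresholds`, and `judgeCanary` are hypothetical names, and the third verdict, `wait`, covers the case where the canary has not yet seen enough traffic.

```typescript
interface WindowMetrics {
  requests: number;
  errors: number;
  latencyP99Ms: number;
}

interface Thresholds {
  minRequests: number;     // below this there is too little data to judge (use at least 1)
  maxErrorRate: number;    // absolute ceiling, e.g. 0.01 for 1%
  maxLatencyRatio: number; // canary p99 divided by baseline p99, e.g. 1.5
}

type Verdict = "promote" | "rollback" | "wait";

// Pure decision function: no I/O, so every branch is easy to test.
function judgeCanary(canary: WindowMetrics, baseline: WindowMetrics, t: Thresholds): Verdict {
  if (canary.requests < t.minRequests) return "wait";
  if (canary.errors / canary.requests > t.maxErrorRate) return "rollback";
  if (canary.latencyP99Ms > baseline.latencyP99Ms * t.maxLatencyRatio) return "rollback";
  return "promote";
}
```

The caller then maps `rollback` and `promote` onto whatever deployment tooling is in use and simply re-runs the analysis on `wait`.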
Advanced A/B Testing and Statistical Analysis
Experiment Design
Proper A/B testing requires understanding statistical fundamentals:
interface Variant {
  name: string;
}
interface ExperimentConfig {
  name: string;
  variants: Variant[];
  minimumDetectableEffect: number; // relative lift, e.g., 0.05 for a 5% improvement
  significanceLevel: number; // e.g., 0.05 for 95% confidence
  statisticalPower: number; // e.g., 0.80 for 80% power
}
// Required sample size per variant for a two-sided test of two proportions
function calculateSampleSize(config: ExperimentConfig): number {
  const { minimumDetectableEffect, significanceLevel, statisticalPower } = config;
  // The z-scores below are hardcoded, so reject settings they do not match
  if (significanceLevel !== 0.05 || statisticalPower !== 0.80) {
    throw new Error("Only significanceLevel 0.05 and statisticalPower 0.80 are supported");
  }
  const zAlpha = 1.96; // two-sided z-score for 95% confidence
  const zBeta = 0.84; // z-score for 80% power
  // Assume baseline conversion rate of 5%
  const baselineRate = 0.05;
  const variantRate = baselineRate * (1 + minimumDetectableEffect);
  const sampleSize =
    Math.pow(zAlpha + zBeta, 2) *
    (baselineRate * (1 - baselineRate) + variantRate * (1 - variantRate)) /
    Math.pow(variantRate - baselineRate, 2);
  return Math.ceil(sampleSize);
}
// Required sample size and daily traffic together determine experiment duration
const requiredSample = calculateSampleSize({
  name: "checkout-redesign",
  variants: [{ name: "control" }, { name: "variant" }],
  minimumDetectableEffect: 0.1, // Detect 10% relative change
  significanceLevel: 0.05,
  statisticalPower: 0.80,
});
// requiredSample is 31,196 users per variant. With 10,000 daily visitors split
// evenly (5,000 per variant per day), the experiment needs about 7 days.
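For reference, the calculation above is the standard sample-size formula for comparing two proportions. Worked through with the example's inputs (baseline rate 5%, a 10% relative lift, so a variant rate of 5.5%), it looks like this:

```latex
\begin{aligned}
n &= \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 \left[\,p_1(1-p_1) + p_2(1-p_2)\,\right]}{(p_2 - p_1)^2} \\
  &= \frac{(1.96 + 0.84)^2 \,(0.0475 + 0.051975)}{(0.055 - 0.05)^2} \\
  &= \frac{7.84 \times 0.099475}{0.000025} \approx 31{,}195.4
\end{aligned}
```

Rounding up gives 31,196 users per variant. Note how sensitive the result is to the denominator: halving the detectable effect roughly quadruples the required sample.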
Bayesian Analysis for Faster Results
Bayesian analysis answers the question teams usually ask, namely how likely the variant is to beat the control, which is easier to act on than a frequentist p-value:
class BayesianABTest {
// Beta-Binomial model for conversion rates
evaluate(control: { conversions: number; visitors: number },
variant: { conversions: number; visitors: number }) {
// Simulate posterior distributions using Beta distribution
const simulations = 100000;
let variantWins = 0;
for (let i = 0; i < simulations; i++) {
const controlRate = this.sampleBeta(
control.conversions + 1,
control.visitors - control.conversions + 1
);
const variantRate = this.sampleBeta(
variant.conversions + 1,
variant.visitors - variant.conversions + 1
);
if (variantRate > controlRate) {
variantWins++;
}
}
return {
probabilityVariantIsBetter: variantWins / simulations,
controlRate: control.conversions / control.visitors,
variantRate: variant.conversions / variant.visitors,
lift: ((variant.conversions / variant.visitors) /
(control.conversions / control.visitors) - 1) * 100
};
}
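  private sampleBeta(alpha: number, beta: number): number {
    // Beta(alpha, beta) = X / (X + Y) with X ~ Gamma(alpha, 1) and Y ~ Gamma(beta, 1)
    const x = this.sampleGamma(alpha);
    const y = this.sampleGamma(beta);
    return x / (x + y);
  }
  private sampleGamma(shape: number): number {
    // Marsaglia-Tsang method for Gamma sampling; valid for shape >= 1,
    // which always holds here because every shape is a count plus 1
    const d = shape - 1 / 3;
    const c = 1 / Math.sqrt(9 * d);
    for (;;) {
      // Standard normal draw via Box-Muller
      const z = Math.sqrt(-2 * Math.log(1 - Math.random())) *
        Math.cos(2 * Math.PI * Math.random());
      const v = Math.pow(1 + c * z, 3);
      if (v <= 0) continue;
      if (Math.log(Math.random()) < 0.5 * z * z + d - d * v + d * Math.log(v)) {
        return d * v;
      }
    }
  }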
}
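A cheap cross-check for the simulation is a closed-form normal approximation of the same probability, which is reasonable once each arm has a few hundred visitors. The sketch below assumes the same uniform Beta(1, 1) prior as the class above; `Arm`, `posterior`, and `approxProbabilityVariantIsBetter` are illustrative names, and `erf` uses the Abramowitz-Stegun 7.1.26 approximation.

```typescript
interface Arm {
  conversions: number;
  visitors: number;
}

// Abramowitz-Stegun 7.1.26 approximation of erf (absolute error below 1.5e-7).
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Mean and variance of the Beta(conversions + 1, failures + 1) posterior.
function posterior(arm: Arm): { mean: number; variance: number } {
  const a = arm.conversions + 1;
  const b = arm.visitors - arm.conversions + 1;
  const n = a + b;
  return { mean: a / n, variance: (a * b) / (n * n * (n + 1)) };
}

// Normal approximation to P(variant rate > control rate).
function approxProbabilityVariantIsBetter(control: Arm, variant: Arm): number {
  const c = posterior(control);
  const v = posterior(variant);
  const z = (v.mean - c.mean) / Math.sqrt(c.variance + v.variance);
  return 0.5 * (1 + erf(z / Math.SQRT2));
}
```

If the simulated probability and this approximation disagree by more than a percentage point or two on large samples, the sampler is the first place to look.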
Shadow Testing: Test with Production Traffic
Shadow testing (dark launching) sends a copy of production traffic to a new service and discards that service's responses, so users only ever see the result from the existing one:
interface ShadowConfig {
enabled: boolean;
captureRate: number; // 0.0 to 1.0
shadowService: string;
timeout: number; // milliseconds
}
class ShadowTester {
  async testEndpoint(
    originalRequest: Request,
    shadowConfig: ShadowConfig
  ): Promise<Response> {
    // Decide up front and clone before handling: a request body can be read only once
    const sampled = shadowConfig.enabled && Math.random() < shadowConfig.captureRate;
    const shadowCopy = sampled ? originalRequest.clone() : null;
    // Always serve the original response (this.handle is the existing production handler)
    const originalResponse = await this.handle(originalRequest);
    if (shadowCopy) {
      // Fire and forget: shadow request with timeout
      this.shadowRequest(shadowCopy, originalResponse.status, shadowConfig).catch(err => {
        console.error(`Shadow test failed: ${err.message}`);
        // Never fail the original request
      });
    }
    return originalResponse;
  }
  private async shadowRequest(
    request: Request,
    originalStatus: number,
    config: ShadowConfig
  ): Promise<void> {
    const start = performance.now();
    try {
      const hasBody = request.method !== 'GET' && request.method !== 'HEAD';
      const shadowResponse = await fetch(config.shadowService, {
        method: request.method,
        headers: request.headers,
        // Buffer the body so it can be resent without streaming options
        body: hasBody ? await request.arrayBuffer() : undefined,
        signal: AbortSignal.timeout(config.timeout)
      });
      const latency = performance.now() - start;
      // Compare responses
      await this.recordComparison({
        statusMatch: originalStatus === shadowResponse.status,
        latencyMs: latency,
        shadowStatus: shadowResponse.status,
        originalStatus
      });
    } catch (error) {
      await this.recordError(error);
    }
  }
private async recordComparison(data: any): Promise<void> {
// Store in time-series database for analysis
await metricsClient.increment("shadow_test.comparison", data);
}
}
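Matching status codes is a weak signal on its own; two services can both return 200 with different bodies. One possible next step is to diff the JSON payloads while ignoring fields that legitimately differ between calls, such as timestamps or request IDs. This is a sketch with hypothetical names (`Json`, `diffJson`), not part of the class above.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Returns the paths at which two JSON values differ, skipping any key listed in ignoreKeys.
function diffJson(a: Json, b: Json, ignoreKeys: string[], path: string = "$"): string[] {
  if (a === b) return [];
  if (a === null || b === null || typeof a !== "object" || typeof b !== "object") {
    return [path]; // differing primitives, or a primitive versus a container
  }
  if (Array.isArray(a) || Array.isArray(b)) {
    if (!Array.isArray(a) || !Array.isArray(b)) return [path];
    if (a.length !== b.length) return [`${path}.length`];
    const arrayDiffs: string[] = [];
    for (let i = 0; i < a.length; i++) {
      arrayDiffs.push(...diffJson(a[i], b[i], ignoreKeys, `${path}[${i}]`));
    }
    return arrayDiffs;
  }
  const diffs: string[] = [];
  const keys = Array.from(new Set([...Object.keys(a), ...Object.keys(b)]));
  for (const key of keys) {
    if (ignoreKeys.includes(key)) continue;
    if (!(key in a) || !(key in b)) diffs.push(`${path}.${key}`);
    else diffs.push(...diffJson(a[key], b[key], ignoreKeys, `${path}.${key}`));
  }
  return diffs;
}
```

Recording the list of differing paths, rather than a single boolean, makes it much easier to see whether mismatches cluster around one field.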
Synthetic Production Monitoring
Run synthetic transactions against production to detect issues before real users do:
class SyntheticMonitor {
async runCheck(): Promise<CheckResult> {
const checks = [
this.checkHealthEndpoint(),
this.checkCriticalFlow(),
this.checkLatency(),
this.checkDatabaseConnectivity(),
];
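    const results = await Promise.allSettled(checks);
    // The type guard narrows the results to PromiseRejectedResult[] for alertOnFailure
    const failures = results.filter(
      (r): r is PromiseRejectedResult => r.status === 'rejected'
    );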
if (failures.length > 0) {
await this.alertOnFailure(failures);
return { passed: false, failures };
}
return { passed: true, failures: [] };
}
private async checkCriticalFlow(): Promise<void> {
// Simulate a complete user journey
const session = await this.createSession();
const product = await this.searchProduct(session, "widget");
const cart = await this.addToCart(session, product.id);
const order = await this.checkout(session, cart.id);
// If we got an order ID without errors, the flow works
if (!order.id) {
throw new Error("Critical checkout flow failed");
}
}
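  private async checkLatency(): Promise<void> {
    const thresholds = { p50: 200, p95: 500, p99: 1000 }; // max latency in milliseconds
    const latencies = await this.getRecentLatencies("checkout", 1000);
    if (latencies.length === 0) {
      throw new Error("No latency samples for checkout");
    }
    const sorted = [...latencies].sort((a, b) => a - b);
    for (const [name, max] of Object.entries(thresholds)) {
      const percent = Number(name.slice(1)); // "p99" -> 99
      const index = Math.min(sorted.length - 1, Math.floor((sorted.length * percent) / 100));
      if (sorted[index] > max) {
        throw new Error(`${name} latency ${sorted[index]}ms exceeded ${max}ms`);
      }
    }
  }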
private async alertOnFailure(failures: PromiseRejectedResult[]): Promise<void> {
await alertingClient.send({
severity: "critical",
title: "Synthetic monitor failure",
detail: `${failures.length} checks failed`,
timestamp: new Date().toISOString(),
});
}
}
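One gap worth closing in a monitor like this: `Promise.allSettled` waits for every check, so a single check that hangs stalls the whole run and no alert fires. A minimal sketch of a per-check deadline follows; `withTimeout`, `NamedCheck`, and `runChecks` are hypothetical names, not part of the class above.

```typescript
// Rejects if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

interface NamedCheck {
  name: string;
  run: () => Promise<void>;
}

// Runs every check with a deadline and returns the names of the ones that failed.
async function runChecks(checks: NamedCheck[], timeoutMs: number): Promise<string[]> {
  const results = await Promise.allSettled(
    checks.map((c) => withTimeout(c.run(), timeoutMs, c.name))
  );
  return checks.filter((_, i) => results[i].status === "rejected").map((c) => c.name);
}
```

Returning check names instead of raw rejection objects also makes the alert message more useful than a bare failure count.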
Blue-Green Deployment with Traffic Mirroring
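Blue-green deployments minimize risk by running two identical environments, one serving live traffic and one holding the new version, so a release or a rollback is just a traffic switch. In the example below, green is live and blue receives a mirrored copy of 10% of requests: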
# Kubernetes blue-green deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
spec:
  replicas: 5
  selector:
    matchLabels:
      app: myapp
      color: blue
  template:
    metadata:
      labels:
        app: myapp
        color: blue
    spec:
      containers:
        - name: app
          image: myapp:v2.0.0 # New version
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
spec:
  replicas: 5
  selector:
    matchLabels:
      app: myapp
      color: green
  template:
    metadata:
      labels:
        app: myapp
        color: green
    spec:
      containers:
        - name: app
          image: myapp:v1.9.9 # Old version
---
# Traffic mirroring: send copy of real traffic to blue
# (assumes Services named myapp-green and myapp-blue exist for each Deployment)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
    - myapp
  http:
    - route:
        - destination:
            host: myapp-green
          weight: 100
      mirror:
        host: myapp-blue
      mirrorPercentage:
        value: 10.0 # Mirror 10% of traffic
Feature Flag Lifecycle Management
Flags accumulate technical debt if not cleaned up:
class FlagLifecycle {
private flags: Map<string, FlagDefinition> = new Map();
registerFlag(flag: FlagDefinition): void {
flag.createdAt = new Date();
flag.status = 'active';
this.flags.set(flag.name, flag);
}
async enforceCleanup(): Promise<void> {
const now = new Date();
for (const [name, flag] of this.flags) {
// Check expiration
if (flag.expiresAt && now > flag.expiresAt) {
if (flag.rolloutPercentage === 100) {
// Fully rolled out — remove flag code
await this.scheduleRemoval(flag);
} else {
// Expired but not fully rolled out — alert
await this.alertExpiredFlag(flag);
}
}
// Check staleness
const age = now.getTime() - flag.createdAt.getTime();
if (age > 90 * 24 * 60 * 60 * 1000) { // 90 days
await this.alertStaleFlag(flag, age);
}
}
}
private async scheduleRemoval(flag: FlagDefinition): Promise<void> {
// Create a ticket/issue for flag removal
await issueTracker.create({
title: `Remove feature flag: ${flag.name}`,
description: `Flag ${flag.name} is at 100% rollout and should be removed from codebase.`,
labels: ['flag-cleanup', 'tech-debt'],
priority: 'medium'
});
}
async generateFlagReport(): Promise<FlagReport> {
const total = this.flags.size;
const active = [...this.flags.values()].filter(f => f.status === 'active').length;
const stale = [...this.flags.values()].filter(f => {
const age = new Date().getTime() - f.createdAt.getTime();
return age > 90 * 24 * 60 * 60 * 1000;
}).length;
return {
totalFlags: total,
activeFlags: active,
staleFlags: stale,
cleanupRate: total > 0 ? ((total - stale) / total) * 100 : 100,
oldestFlag: [...this.flags.entries()]
.sort((a, b) => a[1].createdAt.getTime() - b[1].createdAt.getTime())[0]?.[0]
};
}
}
// Centralized flag management
const flagDashboard = {
productionFlags: [
{ name: "new-checkout", rollout: 100, age: 45, status: "cleanup-ready" },
{ name: "dark-mode", rollout: 50, age: 120, status: "stale-review" },
{ name: "ai-recs", rollout: 10, age: 30, status: "active-experiment" },
{ name: "legacy-ui", rollout: 0, age: 200, status: "deprecated-remove" },
]
};
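The status labels in the dashboard can be derived rather than maintained by hand. The rules below are one guess that happens to reproduce the four sample rows; the 90-day cutoff is borrowed from the staleness check earlier in this section, and `FlagStatus` and `classifyFlag` are hypothetical names.

```typescript
type FlagStatus =
  | "cleanup-ready"
  | "stale-review"
  | "active-experiment"
  | "deprecated-remove";

const STALE_AFTER_DAYS = 90; // same 90-day cutoff as the staleness check

// Assumed rules: full rollout wins, then age decides between stale and active.
function classifyFlag(rolloutPercent: number, ageDays: number): FlagStatus {
  if (rolloutPercent >= 100) return "cleanup-ready";
  if (ageDays > STALE_AFTER_DAYS) {
    return rolloutPercent <= 0 ? "deprecated-remove" : "stale-review";
  }
  return "active-experiment";
}
```

Deriving the status this way keeps the dashboard honest: a flag cannot sit at `active-experiment` for a year just because nobody updated a field.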
Production Canary Analysis with Prometheus
Automate canary promotion decisions with metrics-based analysis:
# Argo Rollouts analysis template
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-analysis
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.99
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status=~"2.."
            }[5m]))
            /
            sum(rate(http_requests_total{
              service="{{args.service-name}}"
            }[5m]))
    - name: error-rate
      interval: 1m
      successCondition: result[0] <= 0.01
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status=~"5.."
            }[5m]))
            /
            sum(rate(http_requests_total{
              service="{{args.service-name}}"
            }[5m]))
    - name: latency-p99
      interval: 1m
      # The histogram is recorded in seconds, so 0.5 means 500 ms
      successCondition: result[0] <= 0.5
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                service="{{args.service-name}}"
              }[5m])) by (le))
Key Takeaways
- Feature flags - Toggle features without deploying
- Canary releases - Test with small percentage first
- Monitor metrics - Error rate, latency, business metrics
- Automated rollback - React quickly to issues
- Shadow testing - Validate new services with real traffic, zero user impact
- Synthetic monitoring - Detect failures before users notice
- Bayesian A/B testing - More intuitive than frequentist, faster decisions
- Flag lifecycle - Clean up flags to prevent technical debt