Introduction
In distributed systems, a single failing service can trigger a cascade of failures across your entire infrastructure. When Service A calls Service B, and Service B is slow or unresponsive, Service A’s threads get blocked waiting for responses. As requests pile up, Service A exhausts its thread pool and becomes unresponsive itself. Now Service C, which depends on Service A, starts failing too. Within minutes, your entire system is down.
The circuit breaker pattern prevents this cascade by monitoring calls to external services and “opening the circuit” when failures exceed a threshold. Once open, the circuit breaker immediately rejects requests without attempting the call, giving the failing service time to recover while keeping your system responsive.
This pattern is named after electrical circuit breakers in your home — when too much current flows through, the breaker trips and stops the flow to prevent damage.
How Circuit Breakers Work
A circuit breaker wraps calls to external services and tracks their success and failure rates. It operates in three states:
%%{init: {'theme':'base', 'themeVariables': { 'fontSize':'16px', 'fontFamily':'system-ui'}}}%%
graph LR
Start([Start]) --> Closed[CLOSED]
Closed -->|Failures exceed threshold| Open[OPEN]
Open -->|Timeout expires| HalfOpen[HALF-OPEN]
HalfOpen -->|Tests succeed| Closed
HalfOpen -->|Tests fail| Open
style Closed fill:#10b981,stroke:#059669,stroke-width:3px,color:#fff
style Open fill:#ef4444,stroke:#dc2626,stroke-width:3px,color:#fff
style HalfOpen fill:#f59e0b,stroke:#d97706,stroke-width:3px,color:#fff
style Start fill:#6366f1,stroke:#4f46e5,stroke-width:2px,color:#fff
State Descriptions:
- CLOSED: Normal operation - all requests pass through to the service
- OPEN: Failing fast - requests are rejected immediately without calling the service
- HALF-OPEN: Testing recovery - limited test requests are allowed through
State Transitions
Closed State (Normal Operation)
- All requests pass through to the downstream service
- Circuit breaker tracks success and failure rates
- If failure rate exceeds threshold, transition to Open
Open State (Failing Fast)
- All requests fail immediately without calling the service
- Returns a predefined error or fallback response
- After a timeout period, transition to Half-Open
Half-Open State (Testing Recovery)
- Allow a limited number of test requests through
- If test requests succeed, transition back to Closed
- If test requests fail, return to Open state
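The transitions listed above can be sketched as a small lookup table. This is an illustrative sketch rather than code from any library; the State and Event names are assumptions made for the example.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class Event(Enum):
    THRESHOLD_EXCEEDED = "threshold_exceeded"  # failure rate crossed the limit
    TIMEOUT_EXPIRED = "timeout_expired"        # open-state wait has elapsed
    TEST_SUCCEEDED = "test_succeeded"          # half-open test calls passed
    TEST_FAILED = "test_failed"                # a half-open test call failed


# (current state, event) -> next state; pairs not listed leave the state unchanged.
TRANSITIONS = {
    (State.CLOSED, Event.THRESHOLD_EXCEEDED): State.OPEN,
    (State.OPEN, Event.TIMEOUT_EXPIRED): State.HALF_OPEN,
    (State.HALF_OPEN, Event.TEST_SUCCEEDED): State.CLOSED,
    (State.HALF_OPEN, Event.TEST_FAILED): State.OPEN,
}


def next_state(state: State, event: Event) -> State:
    """Return the state that follows `state` when `event` occurs."""
    return TRANSITIONS.get((state, event), state)


print(next_state(State.OPEN, Event.TIMEOUT_EXPIRED))  # State.HALF_OPEN
```

Keeping the transitions in one table makes it easy to check that every arrow in the diagram has exactly one entry.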
Core Concepts
Failure Threshold
The circuit breaker monitors a sliding window of recent requests and calculates the failure rate. Once enough calls have been recorded and the failure rate reaches the configured threshold, the circuit opens.
// Resilience4j configuration
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
.failureRateThreshold(50) // Open at 50% failure rate
.slidingWindowSize(10) // Track last 10 calls
.minimumNumberOfCalls(5) // Need 5 calls before calculating rate
.waitDurationInOpenState(Duration.ofSeconds(30)) // Stay open for 30s
.permittedNumberOfCallsInHalfOpenState(3) // Allow 3 test calls
.build();
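To make the interaction between the window size, the minimum number of calls, and the threshold concrete, here is a minimal Python sketch of the same idea. It is not how Resilience4j is implemented; the CountBasedWindow class and its parameter names are hypothetical.

```python
from collections import deque


class CountBasedWindow:
    """Failure rate over the last `size` calls (hypothetical helper)."""

    def __init__(self, size: int, minimum_calls: int, threshold_percent: float):
        self.outcomes = deque(maxlen=size)  # True = failure; oldest entry drops first
        self.minimum_calls = minimum_calls
        self.threshold_percent = threshold_percent

    def record(self, failed: bool) -> None:
        self.outcomes.append(failed)

    def failure_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return 100.0 * sum(self.outcomes) / len(self.outcomes)

    def should_open(self) -> bool:
        # With too few samples the rate is not meaningful yet.
        if len(self.outcomes) < self.minimum_calls:
            return False
        return self.failure_rate() >= self.threshold_percent


window = CountBasedWindow(size=10, minimum_calls=5, threshold_percent=50)
for failed in [True, True, True, False]:
    window.record(failed)
print(window.should_open())  # False: only 4 calls recorded, 5 required
window.record(False)
print(window.failure_rate(), window.should_open())  # 60.0 True
```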
Sliding Window
Circuit breakers use either count-based or time-based sliding windows to aggregate call outcomes:
Count-Based Window: Tracks the last N calls
- Simple and predictable
- Good for high-traffic services
- Example: Last 100 requests
Time-Based Window: Tracks calls within a time period
- Better for variable traffic patterns
- Adapts to traffic spikes
- Example: Last 60 seconds
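A time-based window differs from a count-based one mainly in how old entries are evicted. The sketch below is one possible way to do it, assuming timestamps are passed in explicitly (in real code they would typically come from time.monotonic()); the TimeBasedWindow class is hypothetical.

```python
from collections import deque


class TimeBasedWindow:
    """Failure rate over calls from the last `seconds` seconds (hypothetical helper)."""

    def __init__(self, seconds: float):
        self.seconds = seconds
        self.calls = deque()  # (timestamp, failed) pairs, oldest first

    def record(self, now: float, failed: bool) -> None:
        self.calls.append((now, failed))

    def failure_rate(self, now: float) -> float:
        # Drop calls that have aged out of the window.
        while self.calls and now - self.calls[0][0] > self.seconds:
            self.calls.popleft()
        if not self.calls:
            return 0.0
        failures = sum(1 for _, failed in self.calls if failed)
        return 100.0 * failures / len(self.calls)


window = TimeBasedWindow(seconds=60)
window.record(now=0, failed=True)
window.record(now=10, failed=True)
window.record(now=50, failed=False)
print(window.failure_rate(now=55))  # 2 of 3 calls failed
print(window.failure_rate(now=65))  # the call at t=0 has aged out: 50.0
```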
Timeout and Recovery
After opening, the circuit breaker waits for a configured duration before transitioning to Half-Open. This gives the failing service time to recover without being overwhelmed by requests.
// Polly (.NET) configuration
var circuitBreakerPolicy = Policy
.Handle<HttpRequestException>()
.CircuitBreakerAsync(
exceptionsAllowedBeforeBreaking: 5,
durationOfBreak: TimeSpan.FromSeconds(30),
onBreak: (exception, duration) => {
Console.WriteLine($"Circuit opened for {duration.TotalSeconds}s");
},
onReset: () => {
Console.WriteLine("Circuit closed, service recovered");
}
);
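The wait before Half-Open is just a comparison between the time the circuit opened and the current time. As a rough sketch (the OpenStateTimer class is hypothetical, and the clock is injectable so the example runs without sleeping):

```python
import time


class OpenStateTimer:
    """Tracks when an open circuit may start admitting test calls (hypothetical helper)."""

    def __init__(self, wait_seconds: float, clock=time.monotonic):
        self.wait_seconds = wait_seconds
        self.clock = clock
        self.opened_at = None

    def trip(self) -> None:
        """Record the moment the circuit opened."""
        self.opened_at = self.clock()

    def may_probe(self) -> bool:
        # True once the circuit has been open for at least wait_seconds.
        if self.opened_at is None:
            return False
        return self.clock() - self.opened_at >= self.wait_seconds


fake_now = [0.0]  # mutable cell standing in for the system clock
timer = OpenStateTimer(wait_seconds=30, clock=lambda: fake_now[0])
timer.trip()
fake_now[0] = 10.0
print(timer.may_probe())  # False
fake_now[0] = 30.0
print(timer.may_probe())  # True
```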
Implementation Examples
Java with Resilience4j
Resilience4j is the modern replacement for Netflix Hystrix (now in maintenance mode). It’s lightweight, functional, and designed for Java 8+.
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class PaymentService {
    private final CircuitBreaker circuitBreaker;
    private final ExternalPaymentAPI paymentAPI;

    public PaymentService(ExternalPaymentAPI paymentAPI) {
        this.paymentAPI = paymentAPI;
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            .failureRateThreshold(50)
            .slowCallRateThreshold(50)
            .slowCallDurationThreshold(Duration.ofSeconds(2))
            .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
            .slidingWindowSize(10)
            .minimumNumberOfCalls(5)
            .waitDurationInOpenState(Duration.ofSeconds(30))
            .permittedNumberOfCallsInHalfOpenState(3)
            .automaticTransitionFromOpenToHalfOpenEnabled(true)
            .recordExceptions(IOException.class, TimeoutException.class)
            .ignoreExceptions(IllegalArgumentException.class)
            .build();
        CircuitBreakerRegistry registry = CircuitBreakerRegistry.of(config);
        this.circuitBreaker = registry.circuitBreaker("paymentService");

        // Register event listeners
        circuitBreaker.getEventPublisher()
            .onStateTransition(event ->
                System.out.println("Circuit breaker state: " + event.getStateTransition())
            )
            .onError(event ->
                System.out.println("Call failed: " + event.getThrowable().getMessage())
            );
    }

    public PaymentResult processPayment(PaymentRequest request) {
        Supplier<PaymentResult> decoratedSupplier = CircuitBreaker
            .decorateSupplier(circuitBreaker, () -> paymentAPI.charge(request));
        try {
            return decoratedSupplier.get();
        } catch (CallNotPermittedException e) {
            // Circuit is open, return fallback
            return PaymentResult.unavailable("Payment service temporarily unavailable");
        } catch (Exception e) {
            // Other errors
            return PaymentResult.error("Payment failed: " + e.getMessage());
        }
    }
}
.NET with Polly
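Polly is the de facto standard resilience library for .NET applications. The examples here use the Policy-based API from Polly v7; Polly v8 introduced a different, pipeline-based API.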
using Polly;
using Polly.CircuitBreaker;
using System;
using System.Net.Http;
using System.Threading.Tasks;

public class OrderService
{
    private readonly HttpClient _httpClient;
    private readonly AsyncCircuitBreakerPolicy<HttpResponseMessage> _circuitBreakerPolicy;

    public OrderService(HttpClient httpClient)
    {
        _httpClient = httpClient;
        _circuitBreakerPolicy = Policy
            .HandleResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)
            .Or<HttpRequestException>()
            .Or<TaskCanceledException>()
            .AdvancedCircuitBreakerAsync(
                failureThreshold: 0.5, // Open at 50% failure rate
                samplingDuration: TimeSpan.FromSeconds(10),
                minimumThroughput: 5, // Need 5 calls in window
                durationOfBreak: TimeSpan.FromSeconds(30),
                onBreak: (result, duration) =>
                {
                    Console.WriteLine($"Circuit opened for {duration.TotalSeconds}s");
                },
                onReset: () =>
                {
                    Console.WriteLine("Circuit closed");
                },
                onHalfOpen: () =>
                {
                    Console.WriteLine("Circuit half-open, testing...");
                }
            );
    }

    public async Task<Order> GetOrderAsync(string orderId)
    {
        try
        {
            var response = await _circuitBreakerPolicy.ExecuteAsync(async () =>
                await _httpClient.GetAsync($"/api/orders/{orderId}")
            );
            if (response.IsSuccessStatusCode)
            {
                return await response.Content.ReadAsAsync<Order>();
            }
            return Order.NotFound(orderId);
        }
        catch (BrokenCircuitException)
        {
            // Circuit is open, return cached data or default
            return await GetCachedOrderAsync(orderId)
                ?? Order.Unavailable(orderId);
        }
    }

    private async Task<Order> GetCachedOrderAsync(string orderId)
    {
        // Return cached data if available
        return null;
    }
}
Go Implementation
Go’s standard library doesn’t include a circuit breaker. Third-party packages such as gobreaker exist, but implementing a basic one is straightforward.
package circuitbreaker

import (
	"errors"
	"sync"
	"time"
)

type State int

const (
	StateClosed State = iota
	StateOpen
	StateHalfOpen
)

var (
	ErrCircuitOpen = errors.New("circuit breaker is open")
)

type CircuitBreaker struct {
	maxFailures int
	timeout     time.Duration
	halfOpenMax int

	mu            sync.RWMutex
	state         State
	failures      int
	lastFailTime  time.Time
	halfOpenCalls int
}

func New(maxFailures int, timeout time.Duration, halfOpenMax int) *CircuitBreaker {
	return &CircuitBreaker{
		maxFailures: maxFailures,
		timeout:     timeout,
		halfOpenMax: halfOpenMax,
		state:       StateClosed,
	}
}

func (cb *CircuitBreaker) Call(fn func() error) error {
	cb.mu.Lock()
	// Check if we should transition from Open to Half-Open
	if cb.state == StateOpen {
		if time.Since(cb.lastFailTime) > cb.timeout {
			cb.state = StateHalfOpen
			cb.halfOpenCalls = 0
		} else {
			cb.mu.Unlock()
			return ErrCircuitOpen
		}
	}
	// Reject if Half-Open and already testing
	if cb.state == StateHalfOpen && cb.halfOpenCalls >= cb.halfOpenMax {
		cb.mu.Unlock()
		return ErrCircuitOpen
	}
	if cb.state == StateHalfOpen {
		cb.halfOpenCalls++
	}
	cb.mu.Unlock()

	// Execute the function without holding the lock
	err := fn()

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.onFailure()
		return err
	}
	cb.onSuccess()
	return nil
}

func (cb *CircuitBreaker) onSuccess() {
	if cb.state == StateHalfOpen {
		// Successful test call, close the circuit
		cb.state = StateClosed
		cb.failures = 0
		cb.halfOpenCalls = 0
	} else if cb.state == StateClosed {
		// Reset failure count on success
		cb.failures = 0
	}
}

func (cb *CircuitBreaker) onFailure() {
	cb.failures++
	cb.lastFailTime = time.Now()
	if cb.state == StateHalfOpen {
		// Test failed, reopen circuit
		cb.state = StateOpen
		cb.halfOpenCalls = 0
	} else if cb.failures >= cb.maxFailures {
		// Too many failures, open circuit
		cb.state = StateOpen
	}
}

func (cb *CircuitBreaker) State() State {
	cb.mu.RLock()
	defer cb.mu.RUnlock()
	return cb.state
}

// Usage example: a separate file in its own main package.
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"

	"example.com/yourmodule/circuitbreaker" // placeholder: use your module's import path
)

func main() {
	cb := circuitbreaker.New(5, 30*time.Second, 3)
	err := cb.Call(func() error {
		// Call external service
		resp, err := http.Get("https://api.example.com/data")
		if err != nil {
			return err
		}
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return errors.New("service returned error")
		}
		return nil
	})
	if errors.Is(err, circuitbreaker.ErrCircuitOpen) {
		// Circuit is open, use fallback
		fmt.Println("Service unavailable, using cached data")
	} else if err != nil {
		// Other error
		fmt.Printf("Request failed: %v\n", err)
	}
}
Best Practices
1. Configure Thresholds Based on Traffic Patterns
Don’t use the same configuration for all services. High-traffic services need larger sliding windows, while low-traffic services need smaller ones.
// High-traffic service (1000+ req/min)
CircuitBreakerConfig highTraffic = CircuitBreakerConfig.custom()
.slidingWindowSize(100)
.minimumNumberOfCalls(20)
.failureRateThreshold(50)
.build();
// Low-traffic service (10-50 req/min)
CircuitBreakerConfig lowTraffic = CircuitBreakerConfig.custom()
.slidingWindowSize(10)
.minimumNumberOfCalls(5)
.failureRateThreshold(60)
.build();
2. Implement Fallback Strategies
When the circuit is open, don’t just return errors. Provide fallback responses:
- Return cached data
- Return default values
- Degrade functionality gracefully
- Queue requests for later processing
public UserProfile getUserProfile(String userId) {
try {
return circuitBreaker.executeSupplier(() ->
userService.fetchProfile(userId)
);
} catch (CallNotPermittedException e) {
// Circuit open, try cache
UserProfile cached = cache.get(userId);
if (cached != null) {
return cached.withStaleWarning();
}
// Return minimal profile
return UserProfile.minimal(userId);
}
}
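The same idea generalizes to a chain of fallbacks that are tried in order. The Python sketch below is illustrative only: CircuitOpenError, with_fallbacks, and the sample cache are hypothetical names, and the open circuit is simulated by raising the exception directly.

```python
class CircuitOpenError(Exception):
    """Raised by the (hypothetical) breaker when it rejects a call."""


def with_fallbacks(primary, fallbacks):
    """Call `primary`; if the circuit is open, try each fallback in order."""
    try:
        return primary()
    except CircuitOpenError:
        for fallback in fallbacks:
            result = fallback()
            if result is not None:
                return result
        raise  # no fallback produced a value; surface the original rejection


cache = {"42": {"id": "42", "name": "Ada", "stale": True}}


def fetch_profile():
    raise CircuitOpenError("user service circuit is open")  # simulate an open circuit


profile = with_fallbacks(
    fetch_profile,
    [
        lambda: cache.get("7"),             # cache miss returns None
        lambda: {"id": "7", "name": None},  # minimal default profile
    ],
)
print(profile)  # {'id': '7', 'name': None}
```

Ordering the fallbacks from most to least useful keeps the degraded response as close to the real one as the situation allows.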
3. Monitor Circuit Breaker State
Export circuit breaker metrics to your monitoring system. Track:
- State transitions (Closed → Open → Half-Open)
- Failure rates
- Call duration
- Number of rejected calls
// Resilience4j with Micrometer
CircuitBreaker circuitBreaker = registry.circuitBreaker("paymentService");
MeterRegistry meterRegistry = new SimpleMeterRegistry();
TaggedCircuitBreakerMetrics.ofCircuitBreakerRegistry(registry)
.bindTo(meterRegistry);
// Metrics available include:
// - resilience4j.circuitbreaker.state (one gauge per state tag: 1 = active, 0 = inactive)
// - resilience4j.circuitbreaker.calls (kind = successful, failed, ignored)
// - resilience4j.circuitbreaker.not.permitted.calls
// - resilience4j.circuitbreaker.failure.rate
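If your breaker does not ship with a metrics integration, the same signals can be collected by hand. The following is a minimal in-memory sketch; the BreakerMetrics class and the outcome labels are assumptions, and a real system would forward these counters to its monitoring backend.

```python
from collections import Counter


class BreakerMetrics:
    """In-memory counters for one circuit breaker (hypothetical helper)."""

    def __init__(self):
        self.transitions = Counter()  # keyed by (from_state, to_state)
        self.calls = Counter()        # keyed by outcome

    def on_transition(self, old: str, new: str) -> None:
        self.transitions[(old, new)] += 1

    def on_call(self, outcome: str) -> None:
        # outcome is one of "success", "failure", "rejected" in this sketch
        self.calls[outcome] += 1

    def failure_rate(self) -> float:
        # Rejected calls never reached the service, so they are excluded.
        attempted = self.calls["success"] + self.calls["failure"]
        return 100.0 * self.calls["failure"] / attempted if attempted else 0.0


metrics = BreakerMetrics()
for outcome in ["success", "failure", "failure", "rejected"]:
    metrics.on_call(outcome)
metrics.on_transition("closed", "open")
print(metrics.failure_rate())     # 2 failures out of 3 attempted calls
print(metrics.calls["rejected"])  # 1
```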
4. Use Different Timeouts for Different Failure Types
Not all failures are equal. Network timeouts might need longer recovery periods than HTTP 500 errors.
var policy = Policy
.Handle<TimeoutException>()
.CircuitBreakerAsync(3, TimeSpan.FromMinutes(2)) // Long timeout for network issues
.WrapAsync(
Policy
.Handle<HttpRequestException>()
.CircuitBreakerAsync(5, TimeSpan.FromSeconds(30)) // Short timeout for HTTP errors
);
5. Combine with Retry and Timeout Policies
Circuit breakers work best when combined with other resilience patterns:
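// Resilience4j combining multiple patterns.
// TimeLimiter works on asynchronous results, so the call is wrapped in a CompletionStage.
Retry retry = Retry.of("paymentService", RetryConfig.custom()
    .maxAttempts(3)
    .waitDuration(Duration.ofMillis(500))
    .build());
TimeLimiter timeLimiter = TimeLimiter.of(TimeLimiterConfig.custom()
    .timeoutDuration(Duration.ofSeconds(2))
    .build());
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
// Decorators are applied inside-out: time limit per attempt, then circuit breaker, then retry.
Supplier<CompletionStage<PaymentResult>> decoratedSupplier = Decorators
    .ofCompletionStage(() -> CompletableFuture.supplyAsync(() -> paymentAPI.charge(request)))
    .withTimeLimiter(timeLimiter, scheduler)
    .withCircuitBreaker(circuitBreaker)
    .withRetry(retry, scheduler)
    .decorate();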
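One detail worth getting right when retries wrap a circuit breaker is that a rejected call should not be retried, since the breaker is already signalling that the service needs a rest. Here is a rough language-neutral sketch in Python; call_with_retry, CircuitOpenError, and the breaker_call parameter are hypothetical and do not correspond to any library API.

```python
import time


class CircuitOpenError(Exception):
    """Raised by the (hypothetical) breaker when it rejects a call."""


def call_with_retry(breaker_call, fn, max_attempts=3, wait_seconds=0.5, sleep=time.sleep):
    """Retry `fn` through `breaker_call`, but never retry a rejected call."""
    for attempt in range(1, max_attempts + 1):
        try:
            return breaker_call(fn)
        except CircuitOpenError:
            raise  # the breaker is open: retrying would only add load and latency
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(wait_seconds * attempt)  # simple linear backoff


attempts = []


def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("temporary failure")
    return "charged"


# A pass-through stands in for the breaker; sleeping is stubbed out for the demo.
result = call_with_retry(lambda fn: fn(), flaky, sleep=lambda s: None)
print(result, len(attempts))  # charged 3
```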
Common Pitfalls
1. Setting Thresholds Too Low
Opening the circuit too aggressively can cause unnecessary service degradation. A few failures in a high-traffic system are normal.
Bad: A single failure in the first two calls opens the circuit
.failureRateThreshold(20) // Opens at 20% failure rate
.minimumNumberOfCalls(2) // With only 2 calls
Good: Require statistical significance
.failureRateThreshold(50) // Opens at 50% failure rate
.minimumNumberOfCalls(10) // Need at least 10 calls
2. Not Handling Circuit Open State
Failing to provide fallbacks when the circuit is open defeats the purpose of the pattern.
Bad: Propagate the error
public Data getData() {
return circuitBreaker.executeSupplier(() -> api.fetch());
// Throws CallNotPermittedException when open
}
Good: Provide fallback
public Data getData() {
try {
return circuitBreaker.executeSupplier(() -> api.fetch());
} catch (CallNotPermittedException e) {
return cache.getOrDefault(Data.empty());
}
}
3. Sharing Circuit Breakers Across Different Endpoints
Each external dependency should have its own circuit breaker. Sharing one breaker across multiple endpoints means one failing endpoint can block all others.
Bad: One breaker for entire service
CircuitBreaker breaker = registry.circuitBreaker("externalService");
breaker.executeSupplier(() -> api.getUsers());
breaker.executeSupplier(() -> api.getOrders()); // Blocked if getUsers fails
Good: Separate breakers per endpoint
CircuitBreaker userBreaker = registry.circuitBreaker("externalService-users");
CircuitBreaker orderBreaker = registry.circuitBreaker("externalService-orders");
userBreaker.executeSupplier(() -> api.getUsers());
orderBreaker.executeSupplier(() -> api.getOrders()); // Independent
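The isolation comes from keying breakers by name, so each dependency gets its own state. A minimal sketch of such a registry follows; the Breaker and BreakerRegistry classes are stand-ins written for this example, not part of any library.

```python
import threading


class Breaker:
    """Stand-in for a real circuit breaker; it only stores a name and a state."""

    def __init__(self, name: str):
        self.name = name
        self.state = "closed"


class BreakerRegistry:
    """Hands out one breaker per name (hypothetical helper)."""

    def __init__(self):
        self._breakers = {}
        self._lock = threading.Lock()

    def get(self, name: str) -> Breaker:
        # Create on first use; later lookups return the same instance.
        with self._lock:
            if name not in self._breakers:
                self._breakers[name] = Breaker(name)
            return self._breakers[name]


registry = BreakerRegistry()
registry.get("externalService-users").state = "open"  # users endpoint is failing
print(registry.get("externalService-orders").state)   # closed: orders is unaffected
```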
4. Ignoring Slow Calls
Circuit breakers should track not just failures, but also slow calls that tie up resources.
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
.failureRateThreshold(50)
.slowCallRateThreshold(50) // Also track slow calls
.slowCallDurationThreshold(Duration.ofSeconds(3)) // >3s is slow
.build();
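Conceptually, slow-call tracking means timing each call and keeping a second rate alongside the failure rate. The sketch below shows one way to classify calls; timed_call and slow_call_rate are hypothetical helpers, and the clock is injectable so the example is deterministic.

```python
import time


def timed_call(fn, slow_after_seconds, clock=time.monotonic):
    """Run fn and report whether it took longer than slow_after_seconds."""
    start = clock()
    result = fn()
    was_slow = (clock() - start) > slow_after_seconds
    return result, was_slow


def slow_call_rate(slow_flags):
    """Percentage of recorded calls that were slow."""
    if not slow_flags:
        return 0.0
    return 100.0 * sum(slow_flags) / len(slow_flags)


# A fake clock makes the example deterministic: the call appears to take 4 seconds.
ticks = iter([0.0, 4.0])
result, was_slow = timed_call(lambda: "ok", slow_after_seconds=3, clock=lambda: next(ticks))
print(result, was_slow)                           # ok True
print(slow_call_rate([True, False, True, True]))  # 75.0
```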
Service Mesh Integration
Modern service meshes like Istio and Linkerd implement circuit breaking at the infrastructure level, which can reduce the need for application-level libraries.
Istio Circuit Breaker
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service
spec:
  host: payment-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
        http2MaxRequests: 100
        maxRequestsPerConnection: 2
    outlierDetection:
      consecutive5xxErrors: 5  # replaces the deprecated consecutiveErrors field
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
      minHealthPercent: 50
Advantages:
- No code changes required
- Consistent behavior across all services
- Centralized configuration
- Language-agnostic
Disadvantages:
- Less fine-grained control
- Harder to test locally
- Requires service mesh infrastructure
Testing Circuit Breakers
Unit Testing
@Test
public void testCircuitOpensAfterFailures() {
    CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .failureRateThreshold(50)
        .slidingWindowSize(4)
        .minimumNumberOfCalls(4)
        .build();
    CircuitBreaker breaker = CircuitBreaker.of("test", config);

    // Simulate 3 failures
    for (int i = 0; i < 3; i++) {
        try {
            breaker.executeSupplier(() -> {
                throw new RuntimeException("Service down");
            });
        } catch (Exception e) {
            // Expected
        }
    }
    // Circuit should still be closed (need 4 calls minimum)
    assertEquals(CircuitBreaker.State.CLOSED, breaker.getState());

    // One more failure should open it
    try {
        breaker.executeSupplier(() -> {
            throw new RuntimeException("Service down");
        });
    } catch (Exception e) {
        // Expected
    }
    // Now circuit should be open
    assertEquals(CircuitBreaker.State.OPEN, breaker.getState());

    // Next call should fail fast
    assertThrows(CallNotPermittedException.class, () -> {
        breaker.executeSupplier(() -> "success");
    });
}
Integration Testing with Chaos Engineering
Use tools like Chaos Monkey or Toxiproxy to simulate service failures and verify circuit breaker behavior.
# Toxiproxy: Add latency to payment service
toxiproxy-cli toxic add payment-service -t latency -a latency=5000
# Verify circuit breaker opens after threshold
curl http://localhost:8080/api/orders/123
# Should return fallback response after circuit opens
Python Implementation
import functools
import threading
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, TypeVar

T = TypeVar('T')


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


@dataclass
class CircuitBreakerConfig:
    failure_threshold: int = 5
    success_threshold: int = 3
    timeout: float = 60.0
    half_open_max_calls: int = 3


class CircuitBreakerOpen(Exception):
    pass


class CircuitBreaker:
    def __init__(self, name: str, config: Optional[CircuitBreakerConfig] = None):
        self.name = name
        self.config = config or CircuitBreakerConfig()
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time: Optional[float] = None
        self.lock = threading.RLock()
        self.call_count = 0

    def call(self, func: Callable[..., T], *args, **kwargs) -> T:
        with self.lock:
            self._check_state()
            if self.state == CircuitState.OPEN:
                raise CircuitBreakerOpen(
                    f"Circuit {self.name} is OPEN. Failing fast."
                )
        # Run the protected call outside the lock so slow calls don't block others
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _check_state(self):
        if self.state == CircuitState.OPEN:
            if (self.last_failure_time and
                    time.time() - self.last_failure_time > self.config.timeout):
                self._transition_to_half_open()

    def _transition_to_half_open(self):
        print(f"[Circuit {self.name}] OPEN -> HALF_OPEN")
        self.state = CircuitState.HALF_OPEN
        self.call_count = 0
        self.success_count = 0

    def _on_success(self):
        with self.lock:
            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                self.call_count += 1
                if (self.success_count >= self.config.success_threshold or
                        self.call_count >= self.config.half_open_max_calls):
                    self._transition_to_closed()
            elif self.state == CircuitState.CLOSED:
                self.failure_count = 0

    def _on_failure(self):
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.state == CircuitState.HALF_OPEN:
                self._transition_to_open()
            elif (self.state == CircuitState.CLOSED and
                    self.failure_count >= self.config.failure_threshold):
                self._transition_to_open()

    def _transition_to_open(self):
        print(f"[Circuit {self.name}] {self.state.value} -> OPEN")
        self.state = CircuitState.OPEN
        self.failure_count = 0

    def _transition_to_closed(self):
        print(f"[Circuit {self.name}] {self.state.value} -> CLOSED")
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0

    def get_state(self) -> dict:
        with self.lock:
            return {
                "name": self.name,
                "state": self.state.value,
                "failure_count": self.failure_count,
                "success_count": self.success_count,
                "last_failure_time": self.last_failure_time,
            }


def circuit_breaker(circuit_name: str, config: Optional[CircuitBreakerConfig] = None):
    breaker = CircuitBreaker(circuit_name, config)

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return breaker.call(func, *args, **kwargs)

        wrapper.circuit_breaker = breaker
        return wrapper

    return decorator
Tools and Libraries
| Language | Library | Features |
|---|---|---|
| Java | Resilience4j | Circuit breaker, retry, rate limiter |
| .NET | Polly | Circuit breaker, retry, bulkhead |
| Python | pybreaker | Circuit breaker implementation |
| Go | gobreaker | Circuit breaker pattern |
| Node.js | opossum | Circuit breaker for Node.js |
| Ruby | breaker_machines | Modern circuit breaker |
AWS Lambda Considerations
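For serverless architectures, the circuit breaker pattern must be adapted to the ephemeral, stateless nature of functions. Breaker state held in memory survives only as long as a single execution environment and is not shared between concurrent instances, so the state usually has to live in an external store. The handler below shows only the fallback path; call_external_service, get_fallback_data, CircuitBreakerOpenError, and logger are assumed to be defined elsewhere:
def lambda_handler(event, context):
    try:
        result = call_external_service()
        return {"statusCode": 200, "body": result}
    except CircuitBreakerOpenError:
        # Circuit is open: serve fallback data instead of failing the request
        return {"statusCode": 200, "body": get_fallback_data()}
    except Exception as e:
        logger.error(f"Error: {str(e)}")
        return {"statusCode": 500, "body": "Internal error"}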
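Because function instances come and go, one common approach is to keep the breaker state in a store that every invocation can read. The sketch below illustrates the idea under loud assumptions: InMemoryStateStore is a stand-in for a real shared store such as a database table, call_with_shared_breaker is a hypothetical helper, and the read-modify-write is not atomic (a real store would need conditional writes).

```python
import time


class InMemoryStateStore:
    """Stand-in for a shared store such as a database table (assumption)."""

    def __init__(self):
        self._items = {}

    def get(self, key):
        return self._items.get(key)

    def put(self, key, value):
        self._items[key] = value


def call_with_shared_breaker(store, service, fn, fallback,
                             max_failures=3, open_seconds=30, now=time.time):
    """Consult breaker state in `store` before calling `fn`; update it afterwards."""
    state = store.get(service) or {"failures": 0, "opened_at": None}
    opened_at = state["opened_at"]
    if opened_at is not None and now() - opened_at < open_seconds:
        return fallback()  # circuit open: skip the call entirely
    try:
        result = fn()  # closed, or a test call after the open period elapsed
    except Exception:
        failures = state["failures"] + 1
        # A failed test call re-opens the circuit; otherwise open at the limit.
        tripped = failures >= max_failures or opened_at is not None
        store.put(service, {"failures": failures,
                            "opened_at": now() if tripped else None})
        return fallback()
    store.put(service, {"failures": 0, "opened_at": None})  # success closes the circuit
    return result


def failing():
    raise TimeoutError("upstream timed out")


store = InMemoryStateStore()
for _ in range(3):
    call_with_shared_breaker(store, "payments", failing, lambda: "fallback")
print(store.get("payments")["opened_at"] is not None)  # True: the circuit is now open
```

Wall-clock time (time.time) is used here rather than a monotonic clock because the timestamp has to be comparable across different execution environments.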
Resources
- Resilience4j Documentation — Modern Java resilience library
- Polly Documentation — .NET resilience and transient-fault-handling library
- AWS Prescriptive Guidance: Circuit Breaker Pattern — Cloud implementation patterns
- Martin Fowler: Circuit Breaker — Original pattern description
- Microsoft: Implement Circuit Breaker Pattern — .NET implementation guide