Introduction
In distributed systems, a single failing service can trigger a cascade of failures across your entire infrastructure. When Service A calls Service B, and Service B is slow or unresponsive, Service A’s threads get blocked waiting for responses. As requests pile up, Service A exhausts its thread pool and becomes unresponsive itself. Now Service C, which depends on Service A, starts failing too. Within minutes, your entire system is down.
The circuit breaker pattern prevents this cascade by monitoring calls to external services and “opening the circuit” when failures exceed a threshold. Once open, the circuit breaker immediately rejects requests without attempting the call, giving the failing service time to recover while keeping your system responsive.
This pattern is named after electrical circuit breakers in your home — when too much current flows through, the breaker trips and stops the flow to prevent damage.
How Circuit Breakers Work
A circuit breaker wraps calls to external services and tracks their success and failure rates. It operates in three states:
stateDiagram-v2
    [*] --> Closed
    Closed --> Open: Failure threshold exceeded
    Open --> HalfOpen: Timeout expires
    HalfOpen --> Closed: Test calls succeed
    HalfOpen --> Open: Test calls fail
    Closed --> Closed: Calls succeed
State Transitions
Closed State (Normal Operation)
- All requests pass through to the downstream service
- Circuit breaker tracks success and failure rates
- If failure rate exceeds threshold, transition to Open
Open State (Failing Fast)
- All requests fail immediately without calling the service
- Returns a predefined error or fallback response
- After a timeout period, transition to Half-Open
Half-Open State (Testing Recovery)
- Allow a limited number of test requests through
- If test requests succeed, transition back to Closed
- If test requests fail, return to Open state
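The transition rules above can be condensed into a small pure function. A minimal sketch for illustration — `nextState` and its flag arguments are not from any library:

```go
package main

import "fmt"

type State int

const (
    Closed State = iota
    Open
    HalfOpen
)

// nextState applies the transition rules from the state diagram:
// Closed opens when the failure threshold is exceeded, Open moves to
// Half-Open when the timeout expires, and Half-Open closes or reopens
// depending on whether the test calls succeed.
func nextState(s State, thresholdExceeded, timeoutExpired, testSucceeded bool) State {
    switch s {
    case Closed:
        if thresholdExceeded {
            return Open
        }
    case Open:
        if timeoutExpired {
            return HalfOpen
        }
    case HalfOpen:
        if testSucceeded {
            return Closed
        }
        return Open
    }
    return s
}

func main() {
    s := Closed
    s = nextState(s, true, false, false) // threshold exceeded -> Open
    s = nextState(s, false, true, false) // timeout expired -> Half-Open
    s = nextState(s, false, false, true) // test call succeeded -> Closed
    fmt.Println(s == Closed) // prints true
}
```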
Core Concepts
Failure Threshold
The circuit breaker monitors a sliding window of recent requests and calculates the failure rate. When failures exceed the configured threshold, the circuit opens.
// Resilience4j configuration
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                        // Open at 50% failure rate
    .slidingWindowSize(10)                           // Track last 10 calls
    .minimumNumberOfCalls(5)                         // Need 5 calls before calculating rate
    .waitDurationInOpenState(Duration.ofSeconds(30)) // Stay open for 30s
    .permittedNumberOfCallsInHalfOpenState(3)        // Allow 3 test calls
    .build();
Sliding Window
Circuit breakers use either count-based or time-based sliding windows to aggregate call outcomes:
Count-Based Window: Tracks the last N calls
- Simple and predictable
- Good for high-traffic services
- Example: Last 100 requests
Time-Based Window: Tracks calls within a time period
- Better for variable traffic patterns
- Adapts to traffic spikes
- Example: Last 60 seconds
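A count-based window is essentially a ring buffer of recent call outcomes. A minimal sketch — the `SlidingWindow` type here is illustrative, not a library API:

```go
package main

import "fmt"

// SlidingWindow records the outcome of the last N calls in a fixed-size
// ring buffer and reports the failure rate over them.
type SlidingWindow struct {
    outcomes []bool // true = failure
    next     int    // index of the next slot to overwrite
    filled   int    // how many slots hold real data
}

func NewSlidingWindow(size int) *SlidingWindow {
    return &SlidingWindow{outcomes: make([]bool, size)}
}

// Record stores one call outcome, overwriting the oldest entry once the
// window is full.
func (w *SlidingWindow) Record(failed bool) {
    w.outcomes[w.next] = failed
    w.next = (w.next + 1) % len(w.outcomes)
    if w.filled < len(w.outcomes) {
        w.filled++
    }
}

// FailureRate returns the percentage of failures among recorded calls,
// or 0 if nothing has been recorded yet.
func (w *SlidingWindow) FailureRate() float64 {
    if w.filled == 0 {
        return 0
    }
    failures := 0
    for i := 0; i < w.filled; i++ {
        if w.outcomes[i] {
            failures++
        }
    }
    return float64(failures) / float64(w.filled) * 100
}

func main() {
    w := NewSlidingWindow(10)
    for i := 0; i < 6; i++ {
        w.Record(false) // 6 successes
    }
    for i := 0; i < 4; i++ {
        w.Record(true) // 4 failures
    }
    fmt.Println(w.FailureRate()) // prints 40
}
```

A breaker built on this would open once `FailureRate()` crosses the configured threshold and at least `minimumNumberOfCalls` outcomes have been recorded.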
Timeout and Recovery
After opening, the circuit breaker waits for a configured duration before transitioning to Half-Open. This gives the failing service time to recover without being overwhelmed by requests.
// Polly (.NET) configuration
var circuitBreakerPolicy = Policy
    .Handle<HttpRequestException>()
    .CircuitBreakerAsync(
        exceptionsAllowedBeforeBreaking: 5,
        durationOfBreak: TimeSpan.FromSeconds(30),
        onBreak: (exception, duration) => {
            Console.WriteLine($"Circuit opened for {duration.TotalSeconds}s");
        },
        onReset: () => {
            Console.WriteLine("Circuit closed, service recovered");
        }
    );
Implementation Examples
Java with Resilience4j
Resilience4j is the modern replacement for Netflix Hystrix (now in maintenance mode). It’s lightweight, functional, and designed for Java 8+.
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class PaymentService {
    private final CircuitBreaker circuitBreaker;
    private final ExternalPaymentAPI paymentAPI;

    public PaymentService(ExternalPaymentAPI paymentAPI) {
        this.paymentAPI = paymentAPI;
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            .failureRateThreshold(50)
            .slowCallRateThreshold(50)
            .slowCallDurationThreshold(Duration.ofSeconds(2))
            .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
            .slidingWindowSize(10)
            .minimumNumberOfCalls(5)
            .waitDurationInOpenState(Duration.ofSeconds(30))
            .permittedNumberOfCallsInHalfOpenState(3)
            .automaticTransitionFromOpenToHalfOpenEnabled(true)
            .recordExceptions(IOException.class, TimeoutException.class)
            .ignoreExceptions(IllegalArgumentException.class)
            .build();

        CircuitBreakerRegistry registry = CircuitBreakerRegistry.of(config);
        this.circuitBreaker = registry.circuitBreaker("paymentService");

        // Register event listeners
        circuitBreaker.getEventPublisher()
            .onStateTransition(event ->
                System.out.println("Circuit breaker state: " + event.getStateTransition()))
            .onError(event ->
                System.out.println("Call failed: " + event.getThrowable().getMessage()));
    }

    public PaymentResult processPayment(PaymentRequest request) {
        Supplier<PaymentResult> decoratedSupplier = CircuitBreaker
            .decorateSupplier(circuitBreaker, () -> paymentAPI.charge(request));
        try {
            return decoratedSupplier.get();
        } catch (CallNotPermittedException e) {
            // Circuit is open, return fallback
            return PaymentResult.unavailable("Payment service temporarily unavailable");
        } catch (Exception e) {
            // Other errors
            return PaymentResult.error("Payment failed: " + e.getMessage());
        }
    }
}
.NET with Polly
Polly is the standard resilience library for .NET applications.
using Polly;
using Polly.CircuitBreaker;
using System;
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading.Tasks;

public class OrderService
{
    private readonly HttpClient _httpClient;
    private readonly AsyncCircuitBreakerPolicy<HttpResponseMessage> _circuitBreakerPolicy;

    public OrderService(HttpClient httpClient)
    {
        _httpClient = httpClient;
        _circuitBreakerPolicy = Policy
            .HandleResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)
            .Or<HttpRequestException>()
            .Or<TaskCanceledException>()
            .AdvancedCircuitBreakerAsync(
                failureThreshold: 0.5,                      // Open at 50% failure rate
                samplingDuration: TimeSpan.FromSeconds(10),
                minimumThroughput: 5,                       // Need 5 calls in window
                durationOfBreak: TimeSpan.FromSeconds(30),
                onBreak: (result, duration) =>
                {
                    Console.WriteLine($"Circuit opened for {duration.TotalSeconds}s");
                },
                onReset: () =>
                {
                    Console.WriteLine("Circuit closed");
                },
                onHalfOpen: () =>
                {
                    Console.WriteLine("Circuit half-open, testing...");
                }
            );
    }

    public async Task<Order> GetOrderAsync(string orderId)
    {
        try
        {
            var response = await _circuitBreakerPolicy.ExecuteAsync(async () =>
                await _httpClient.GetAsync($"/api/orders/{orderId}")
            );

            if (response.IsSuccessStatusCode)
            {
                return await response.Content.ReadFromJsonAsync<Order>();
            }
            return Order.NotFound(orderId);
        }
        catch (BrokenCircuitException)
        {
            // Circuit is open, return cached data or default
            return await GetCachedOrderAsync(orderId)
                ?? Order.Unavailable(orderId);
        }
    }

    private Task<Order> GetCachedOrderAsync(string orderId)
    {
        // Return cached data if available (always empty in this example)
        return Task.FromResult<Order>(null);
    }
}
Go Implementation
Go’s standard library doesn’t include a circuit breaker (third-party options such as sony/gobreaker exist), but implementing a basic one is straightforward.
package circuitbreaker

import (
    "errors"
    "fmt"
    "net/http"
    "sync"
    "time"
)

type State int

const (
    StateClosed State = iota
    StateOpen
    StateHalfOpen
)

var ErrCircuitOpen = errors.New("circuit breaker is open")

type CircuitBreaker struct {
    maxFailures int
    timeout     time.Duration
    halfOpenMax int

    mu            sync.RWMutex
    state         State
    failures      int
    lastFailTime  time.Time
    halfOpenCalls int
}

func New(maxFailures int, timeout time.Duration, halfOpenMax int) *CircuitBreaker {
    return &CircuitBreaker{
        maxFailures: maxFailures,
        timeout:     timeout,
        halfOpenMax: halfOpenMax,
        state:       StateClosed,
    }
}

func (cb *CircuitBreaker) Call(fn func() error) error {
    cb.mu.Lock()

    // Check if we should transition from Open to Half-Open
    if cb.state == StateOpen {
        if time.Since(cb.lastFailTime) > cb.timeout {
            cb.state = StateHalfOpen
            cb.halfOpenCalls = 0
        } else {
            cb.mu.Unlock()
            return ErrCircuitOpen
        }
    }

    // Reject if Half-Open and already testing
    if cb.state == StateHalfOpen && cb.halfOpenCalls >= cb.halfOpenMax {
        cb.mu.Unlock()
        return ErrCircuitOpen
    }
    if cb.state == StateHalfOpen {
        cb.halfOpenCalls++
    }
    cb.mu.Unlock()

    // Execute the function
    err := fn()

    cb.mu.Lock()
    defer cb.mu.Unlock()
    if err != nil {
        cb.onFailure()
        return err
    }
    cb.onSuccess()
    return nil
}

func (cb *CircuitBreaker) onSuccess() {
    if cb.state == StateHalfOpen {
        // Successful test call, close the circuit
        cb.state = StateClosed
        cb.failures = 0
        cb.halfOpenCalls = 0
    } else if cb.state == StateClosed {
        // Reset failure count on success
        cb.failures = 0
    }
}

func (cb *CircuitBreaker) onFailure() {
    cb.failures++
    cb.lastFailTime = time.Now()
    if cb.state == StateHalfOpen {
        // Test failed, reopen circuit
        cb.state = StateOpen
        cb.halfOpenCalls = 0
    } else if cb.failures >= cb.maxFailures {
        // Too many failures, open circuit
        cb.state = StateOpen
    }
}

func (cb *CircuitBreaker) State() State {
    cb.mu.RLock()
    defer cb.mu.RUnlock()
    return cb.state
}

// Usage example (would live in a separate main package)
func main() {
    cb := New(5, 30*time.Second, 3)

    err := cb.Call(func() error {
        // Call external service
        resp, err := http.Get("https://api.example.com/data")
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        if resp.StatusCode != 200 {
            return errors.New("service returned error")
        }
        return nil
    })

    if errors.Is(err, ErrCircuitOpen) {
        // Circuit is open, use fallback
        fmt.Println("Service unavailable, using cached data")
    } else if err != nil {
        // Other error
        fmt.Printf("Request failed: %v\n", err)
    }
}
Best Practices
1. Configure Thresholds Based on Traffic Patterns
Don’t use the same configuration for all services. High-traffic services need larger sliding windows, while low-traffic services need smaller ones.
// High-traffic service (1000+ req/min)
CircuitBreakerConfig highTraffic = CircuitBreakerConfig.custom()
    .slidingWindowSize(100)
    .minimumNumberOfCalls(20)
    .failureRateThreshold(50)
    .build();

// Low-traffic service (10-50 req/min)
CircuitBreakerConfig lowTraffic = CircuitBreakerConfig.custom()
    .slidingWindowSize(10)
    .minimumNumberOfCalls(5)
    .failureRateThreshold(60)
    .build();
2. Implement Fallback Strategies
When the circuit is open, don’t just return errors. Provide fallback responses:
- Return cached data
- Return default values
- Degrade functionality gracefully
- Queue requests for later processing
public UserProfile getUserProfile(String userId) {
    try {
        return circuitBreaker.executeSupplier(() ->
            userService.fetchProfile(userId)
        );
    } catch (CallNotPermittedException e) {
        // Circuit open, try cache
        UserProfile cached = cache.get(userId);
        if (cached != null) {
            return cached.withStaleWarning();
        }
        // Return minimal profile
        return UserProfile.minimal(userId);
    }
}
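The cache-based fallback is shown above; the last strategy, queueing requests for later processing, can be sketched in Go. The channel-backed queue and the `submitOrder` helper are illustrative assumptions, not part of any library:

```go
package main

import (
    "errors"
    "fmt"
)

var ErrCircuitOpen = errors.New("circuit breaker is open")

// pending buffers work that arrives while the circuit is open so it can
// be replayed once the downstream service recovers.
var pending = make(chan string, 100)

// submitOrder tries the live call first; if the circuit is open it queues
// the order instead of failing, so the caller still gets an acknowledgement.
func submitOrder(orderID string, call func(string) error) error {
    err := call(orderID)
    if errors.Is(err, ErrCircuitOpen) {
        select {
        case pending <- orderID:
            return nil // accepted for later processing
        default:
            return errors.New("queue full, order rejected")
        }
    }
    return err
}

func main() {
    // Simulate an open circuit: every call is rejected immediately.
    openCircuit := func(string) error { return ErrCircuitOpen }

    if err := submitOrder("order-42", openCircuit); err == nil {
        fmt.Println("order-42 queued for later processing")
    }
    fmt.Println("queued:", len(pending)) // prints queued: 1
}
```

A background worker would drain `pending` once the breaker closes again; that part is omitted here.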
3. Monitor Circuit Breaker State
Export circuit breaker metrics to your monitoring system. Track:
- State transitions (Closed → Open → Half-Open)
- Failure rates
- Call duration
- Number of rejected calls
// Resilience4j with Micrometer
CircuitBreaker circuitBreaker = registry.circuitBreaker("paymentService");
MeterRegistry meterRegistry = new SimpleMeterRegistry();

TaggedCircuitBreakerMetrics.ofCircuitBreakerRegistry(registry)
    .bindTo(meterRegistry);

// Metrics available:
// - resilience4j.circuitbreaker.state (gauge per state: closed, open, half_open)
// - resilience4j.circuitbreaker.calls (tagged by kind: successful, failed, not_permitted)
// - resilience4j.circuitbreaker.failure.rate
4. Use Different Timeouts for Different Failure Types
Not all failures are equal. Network timeouts might need longer recovery periods than HTTP 500 errors.
var policy = Policy
    .Handle<TimeoutException>()
    .CircuitBreakerAsync(3, TimeSpan.FromMinutes(2)) // Long break for network issues
    .WrapAsync(
        Policy
            .Handle<HttpRequestException>()
            .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30)) // Short break for HTTP errors
    );
5. Combine with Retry and Timeout Policies
Circuit breakers work best when combined with other resilience patterns:
// Resilience4j combining multiple patterns
Retry retry = Retry.of("paymentService", RetryConfig.custom()
    .maxAttempts(3)
    .waitDuration(Duration.ofMillis(500))
    .build());

TimeLimiter timeLimiter = TimeLimiter.of(TimeLimiterConfig.custom()
    .timeoutDuration(Duration.ofSeconds(2))
    .build());

// TimeLimiter works on asynchronous calls, so decorate a CompletionStage
ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
CompletionStage<PaymentResult> result = Decorators
    .ofCompletionStage(() -> CompletableFuture.supplyAsync(() -> paymentAPI.charge(request)))
    .withTimeLimiter(timeLimiter, scheduler)
    .withCircuitBreaker(circuitBreaker)
    .withRetry(retry, scheduler)
    .decorate()
    .get();
Common Pitfalls
1. Setting Thresholds Too Low
Opening the circuit too aggressively can cause unnecessary service degradation. A few failures in a high-traffic system are normal.
Bad: Open after 2 failures
.failureRateThreshold(20) // Opens at 20% failure rate
.minimumNumberOfCalls(2) // With only 2 calls
Good: Require statistical significance
.failureRateThreshold(50) // Opens at 50% failure rate
.minimumNumberOfCalls(10) // Need at least 10 calls
2. Not Handling Circuit Open State
Failing to provide fallbacks when the circuit is open defeats the purpose of the pattern.
Bad: Propagate the error
public Data getData() {
    return circuitBreaker.executeSupplier(() -> api.fetch());
    // Throws CallNotPermittedException when open
}

Good: Provide fallback

public Data getData() {
    try {
        return circuitBreaker.executeSupplier(() -> api.fetch());
    } catch (CallNotPermittedException e) {
        return cache.getOrDefault(Data.empty());
    }
}
3. Sharing Circuit Breakers Across Different Endpoints
Each external dependency should have its own circuit breaker. Sharing one breaker across multiple endpoints means one failing endpoint can block all others.
Bad: One breaker for entire service
CircuitBreaker breaker = registry.circuitBreaker("externalService");
breaker.executeSupplier(() -> api.getUsers());
breaker.executeSupplier(() -> api.getOrders()); // Blocked if getUsers fails
Good: Separate breakers per endpoint
CircuitBreaker userBreaker = registry.circuitBreaker("externalService-users");
CircuitBreaker orderBreaker = registry.circuitBreaker("externalService-orders");
userBreaker.executeSupplier(() -> api.getUsers());
orderBreaker.executeSupplier(() -> api.getOrders()); // Independent
4. Ignoring Slow Calls
Circuit breakers should track not just failures, but also slow calls that tie up resources.
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)
    .slowCallRateThreshold(50)                      // Also track slow calls
    .slowCallDurationThreshold(Duration.ofSeconds(3)) // >3s is slow
    .build();
Service Mesh Integration
Modern service meshes like Istio and Linkerd implement circuit breaking at the infrastructure level, which can reduce the need for application-level libraries.
Istio Circuit Breaker
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service
spec:
  host: payment-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
        http2MaxRequests: 100
        maxRequestsPerConnection: 2
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
      minHealthPercent: 50
Advantages:
- No code changes required
- Consistent behavior across all services
- Centralized configuration
- Language-agnostic
Disadvantages:
- Less fine-grained control
- Harder to test locally
- Requires service mesh infrastructure
Testing Circuit Breakers
Unit Testing
@Test
public void testCircuitOpensAfterFailures() {
    CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .failureRateThreshold(50)
        .slidingWindowSize(4)
        .minimumNumberOfCalls(4)
        .build();
    CircuitBreaker breaker = CircuitBreaker.of("test", config);

    // Simulate 3 failures
    for (int i = 0; i < 3; i++) {
        try {
            breaker.executeSupplier(() -> {
                throw new RuntimeException("Service down");
            });
        } catch (Exception e) {
            // Expected
        }
    }

    // Circuit should still be closed (need 4 calls minimum)
    assertEquals(CircuitBreaker.State.CLOSED, breaker.getState());

    // One more failure should open it
    try {
        breaker.executeSupplier(() -> {
            throw new RuntimeException("Service down");
        });
    } catch (Exception e) {
        // Expected
    }

    // Now circuit should be open
    assertEquals(CircuitBreaker.State.OPEN, breaker.getState());

    // Next call should fail fast
    assertThrows(CallNotPermittedException.class, () -> {
        breaker.executeSupplier(() -> "success");
    });
}
Integration Testing with Chaos Engineering
Use tools like Chaos Monkey or Toxiproxy to simulate service failures and verify circuit breaker behavior.
# Toxiproxy: Add latency to payment service
toxiproxy-cli toxic add payment-service -t latency -a latency=5000
# Verify circuit breaker opens after threshold
curl http://localhost:8080/api/orders/123
# Should return fallback response after circuit opens
Resources
- Resilience4j Documentation — Modern Java resilience library
- Polly Documentation — .NET resilience and transient-fault-handling library
- AWS Prescriptive Guidance: Circuit Breaker Pattern — Cloud implementation patterns
- Martin Fowler: Circuit Breaker — Original pattern description
- Microsoft: Implement Circuit Breaker Pattern — .NET implementation guide