When a downstream service fails, callers that keep hammering it waste resources and risk cascading the failure upstream. The circuit breaker pattern prevents this by stopping requests to a failing service once it crosses a failure threshold, giving it time to recover. This guide covers circuit breaker implementation, best practices, and integration with other resilience patterns.
The Problem: Cascading Failures
Without Circuit Breaker:
┌─────────┐     ┌─────────┐     ┌─────────┐
│ Service │────▶│ Service │────▶│ Service │
│    A    │     │    B    │     │    C    │
└─────────┘     └─────────┘     └─────────┘
                     │               │
                     ▼               ▼
                  Failed!      Overwhelmed!
                     │               │
                     ▼               ▼
                ┌─────────────────────────┐
                │    CASCADE FAILURE!     │
                │    All services down    │
                └─────────────────────────┘
The Solution: Circuit Breaker
With Circuit Breaker:
┌─────────┐     ┌─────────────────┐     ┌─────────┐
│ Service │────▶│ Circuit Breaker │────▶│ Service │
│    A    │     │                 │     │    C    │
└─────────┘     └────────┬────────┘     └─────────┘
                         │
                         ▼
                    ┌──────────┐
                    │ Service  │
                    │ B is     │
                    │ failing! │
                    └────┬─────┘
                         │
                         ▼
                ┌─────────────────┐
                │  Fast failure   │
                │  (don't wait)   │
                └─────────────────┘
Circuit Breaker States
┌────────────────────┐
│       CLOSED       │◀────────────────────┐
│ (normal operation) │                     │
└─────────┬──────────┘                     │
          │ Failure threshold reached      │
          ▼                                │
┌────────────────────┐                     │
│        OPEN        │◀──────────┐         │
│ (reject requests)  │           │         │
└─────────┬──────────┘           │         │
          │ Timeout              │ Failure │ Success
          ▼                      │         │
┌────────────────────┐           │         │
│     HALF-OPEN      │───────────┴─────────┘
│     (testing)      │
└────────────────────┘
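The transitions in the diagram can also be written as a small lookup table. This is only a minimal sketch: the event names (`failure_threshold_reached`, `timeout_elapsed`, `probe_succeeded`, `probe_failed`) are illustrative labels chosen for this example, not part of any library.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


# (current state, event) -> next state; event names are illustrative
TRANSITIONS = {
    (State.CLOSED, "failure_threshold_reached"): State.OPEN,
    (State.OPEN, "timeout_elapsed"): State.HALF_OPEN,
    (State.HALF_OPEN, "probe_succeeded"): State.CLOSED,
    (State.HALF_OPEN, "probe_failed"): State.OPEN,
}


def next_state(state: State, event: str) -> State:
    """Return the next state, or the current one if the event causes no transition."""
    return TRANSITIONS.get((state, event), state)


state = State.CLOSED
for event in ["failure_threshold_reached", "timeout_elapsed", "probe_failed"]:
    state = next_state(state, event)
print(state)  # State.OPEN: the failed probe sent the breaker back to OPEN
```

Any pair missing from the table leaves the state unchanged, which matches the diagram: a success while CLOSED, for example, keeps the breaker CLOSED.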
Implementation
Python Implementation
import threading
import time
from enum import Enum
from typing import Any, Callable, Optional


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        success_threshold: int = 2,
        timeout: float = 60.0,
        excluded_exceptions: tuple = (),
    ):
        self.failure_threshold = failure_threshold  # failures before opening
        self.success_threshold = success_threshold  # half-open successes before closing
        self.timeout = timeout  # seconds to stay open before probing again
        self.excluded_exceptions = excluded_exceptions
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time: Optional[float] = None
        self._lock = threading.Lock()

    def call(self, func: Callable, *args, **kwargs) -> Any:
        if not self._can_execute():
            raise CircuitBreakerOpenError(
                f"Circuit breaker is {self.state.value}"
            )
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception as e:
            if self._should_handle(e):
                self._on_failure()
            raise  # re-raise with the original traceback

    def _can_execute(self) -> bool:
        with self._lock:
            if self.state == CircuitState.CLOSED:
                return True
            if self.state == CircuitState.OPEN:
                # Check if timeout has passed (monotonic clock is immune to wall-clock changes)
                if time.monotonic() - self.last_failure_time >= self.timeout:
                    self._transition_to_half_open()
                    return True
                return False
            # HALF_OPEN: let trial requests through until the outcome is decided
            return True

    def _should_handle(self, exception: Exception) -> bool:
        # Don't count excluded exceptions
        return not isinstance(exception, self.excluded_exceptions)

    def _on_success(self):
        with self._lock:
            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.success_threshold:
                    self._transition_to_closed()
            else:
                # Reset failure count on success
                self.failure_count = 0

    def _on_failure(self):
        with self._lock:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            if self.state == CircuitState.HALF_OPEN:
                # Any failure while testing reopens the circuit
                self._transition_to_open()
            elif self.failure_count >= self.failure_threshold:
                self._transition_to_open()

    # The _transition_* helpers are only called while self._lock is held
    def _transition_to_open(self):
        self.state = CircuitState.OPEN
        print(f"Circuit breaker opened after {self.failure_count} failures")

    def _transition_to_half_open(self):
        self.state = CircuitState.HALF_OPEN
        self.success_count = 0
        print("Circuit breaker half-open, testing...")

    def _transition_to_closed(self):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        print("Circuit breaker closed, back to normal")


class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""
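# Usage
import requests

breaker = CircuitBreaker(failure_threshold=3, timeout=30)


def call_external_service():
    # Set a request timeout so a hung connection surfaces as a failure
    response = requests.get("https://api.example.com/data", timeout=5)
    response.raise_for_status()  # treat HTTP error statuses as failures too
    return response.json()


# Wrapped call
try:
    result = breaker.call(call_external_service)
except CircuitBreakerOpenError:
    # Handle gracefully: return cached data or a default
    result = get_cached_data()  # get_cached_data() is assumed to exist elsewhere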
Decorator Version
from functools import wraps

import requests


def circuit_breaker(failure_threshold=5, timeout=60):
    def decorator(func):
        # One breaker instance is shared by all calls to the decorated function
        breaker = CircuitBreaker(
            failure_threshold=failure_threshold,
            timeout=timeout,
        )

        @wraps(func)
        def wrapper(*args, **kwargs):
            return breaker.call(func, *args, **kwargs)

        # Expose breaker for inspection
        wrapper.circuit_breaker = breaker
        return wrapper

    return decorator


# Usage
@circuit_breaker(failure_threshold=3, timeout=30)
def fetch_user(user_id):
    # requests needs an absolute URL; the host here is a placeholder
    return requests.get(f"https://api.example.com/users/{user_id}", timeout=5).json()


# Check circuit state
if fetch_user.circuit_breaker.state == CircuitState.OPEN:
    print("Service is currently unavailable")
Integration with Libraries
Resilience4j (Java)
import java.time.Duration;
import java.util.function.Supplier;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

// Configure circuit breaker
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                        // Open at 50% failure
    .waitDurationInOpenState(Duration.ofSeconds(30)) // Stay open for 30 s before probing
    .slidingWindowSize(10)                           // Evaluate the last 10 calls
    .minimumNumberOfCalls(5)                         // Calls required before the rate is computed
    .permittedNumberOfCallsInHalfOpenState(3)        // Trial calls allowed while half-open
    .build();

CircuitBreaker circuitBreaker = CircuitBreaker.of("externalService", config);

// Decorate and call (externalApiCall() is assumed to be defined elsewhere)
Supplier<String> decoratedSupplier = CircuitBreaker
    .decorateSupplier(circuitBreaker, () -> externalApiCall());
String result = decoratedSupplier.get();
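Unlike the consecutive-failure counter in the Python implementation, this configuration trips on a failure rate measured over a sliding window. The idea can be sketched in Python as follows. This is an illustration of the concept under simplified assumptions, not Resilience4j's actual implementation, and the class and method names (`FailureRateWindow`, `record`, `should_open`) are hypothetical.

```python
from collections import deque


class FailureRateWindow:
    """Count-based sliding window over the most recent call outcomes."""

    def __init__(self, window_size=10, minimum_calls=5, failure_rate_threshold=50.0):
        self.outcomes = deque(maxlen=window_size)  # True means the call failed
        self.minimum_calls = minimum_calls
        self.failure_rate_threshold = failure_rate_threshold  # percent

    def record(self, failed: bool) -> None:
        # The deque drops the oldest outcome once window_size is exceeded
        self.outcomes.append(failed)

    def failure_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return 100.0 * sum(self.outcomes) / len(self.outcomes)

    def should_open(self) -> bool:
        # Too few samples: do not judge the service on a handful of calls
        if len(self.outcomes) < self.minimum_calls:
            return False
        return self.failure_rate() >= self.failure_rate_threshold


window = FailureRateWindow()
for failed in [False, True, False, True]:
    window.record(failed)
print(window.should_open())  # False: only 4 calls recorded, below minimum_calls
window.record(True)
print(window.should_open())  # True: 3 of 5 calls failed (60% >= 50%)
```

A rate over a window tolerates isolated failures under heavy traffic better than a fixed count does, at the cost of needing a minimum sample before it can decide anything.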
Hystrix (Java)
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// userService and cache are assumed to be available to the command (e.g. injected)
class GetUserCommand extends HystrixCommand<User> {
    private final Long userId;

    public GetUserCommand(Long userId) {
        super(HystrixCommandGroupKey.Factory.asKey("UserService"));
        this.userId = userId;
    }

    @Override
    protected User run() throws Exception {
        return userService.getUser(userId);
    }

    @Override
    protected User getFallback() {
        return cache.get(userId); // Return cached data
    }
}

// Usage
User user = new GetUserCommand(123L).execute();
Go (Golang)
import (
	"time"

	"github.com/sony/gobreaker"
)

var cb *gobreaker.CircuitBreaker

func init() {
	settings := gobreaker.Settings{
		Name:        "external-api",
		MaxRequests: 3,                // Requests allowed through while half-open
		Interval:    10 * time.Second, // Period after which counts are cleared while closed
		Timeout:     30 * time.Second, // Time spent open before moving to half-open
		ReadyToTrip: func(counts gobreaker.Counts) bool {
			// Require a few requests first so a single early failure cannot trip the breaker
			failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
			return counts.Requests >= 3 && failureRatio >= 0.5
		},
	}
	cb = gobreaker.NewCircuitBreaker(settings)
}

func CallExternalAPI() (interface{}, error) {
	// externalService is assumed to be defined elsewhere
	return cb.Execute(func() (interface{}, error) {
		return externalService.Call()
	})
}
Fallback Strategies
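import json
import time


def get_user_fallback(user_id, error):
    """Try fallback strategies in order; `error` is the exception that triggered the fallback."""
    # redis, queue and get_stale_user_data are assumed to be defined elsewhere
    # Strategy 1: Return cached data
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)
    # Strategy 2: Return stale data, if any exists
    stale = get_stale_user_data(user_id)
    if stale:
        return stale
    # Strategy 3: Queue the lookup for a later retry
    queue.push({"user_id": user_id, "retry_at": time.time() + 60})
    # Strategy 4: Return a default user so the caller still gets an answer
    return {"id": user_id, "name": "Unknown User"}


# Usage with fallback (circuit_breaker takes no fallback argument, so wrap the call)
@circuit_breaker(failure_threshold=3, timeout=30)
def fetch_user_from_api(user_id):
    return api.get_user(user_id)


def get_user(user_id):
    try:
        return fetch_user_from_api(user_id)
    except Exception as error:
        # Covers rejected calls (circuit open) as well as failed ones
        return get_user_fallback(user_id, error)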
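A decorator can also take the fallback as a parameter, so that every failed or rejected call is routed to it. The sketch below is one possible shape, kept self-contained by using a deliberately simplified breaker (consecutive-failure count, a single probe after the timeout, no locking). The names `breaker_with_fallback` and `BreakerOpen` and the `clock` parameter are assumptions made for this example.

```python
import time
from functools import wraps


class BreakerOpen(Exception):
    """Passed to the fallback when a call is rejected (hypothetical name)."""


def breaker_with_fallback(fallback, failure_threshold=3, timeout=30.0, clock=time.monotonic):
    def decorator(func):
        state = {"failures": 0, "opened_at": None}  # not thread-safe; illustration only

        @wraps(func)
        def wrapper(*args, **kwargs):
            opened_at = state["opened_at"]
            if opened_at is not None and clock() - opened_at < timeout:
                # Circuit is open: skip the remote call entirely
                return fallback(*args, error=BreakerOpen("circuit open"), **kwargs)
            try:
                result = func(*args, **kwargs)
            except Exception as exc:
                state["failures"] += 1
                if state["failures"] >= failure_threshold:
                    state["opened_at"] = clock()  # open, or reopen after a failed probe
                return fallback(*args, error=exc, **kwargs)
            # Success closes the circuit and clears the failure count
            state["failures"] = 0
            state["opened_at"] = None
            return result

        return wrapper

    return decorator


def user_fallback(user_id, error):
    return {"id": user_id, "name": "Unknown User"}


@breaker_with_fallback(user_fallback, failure_threshold=2, timeout=30.0)
def get_user(user_id):
    raise ConnectionError("user service unreachable")  # simulated outage


print(get_user(42))  # {'id': 42, 'name': 'Unknown User'}
```

Passing the triggering exception as `error` lets the fallback distinguish a rejected call from a failed one, for example to skip queueing a retry while the circuit is open.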
Monitoring and Observability
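import prometheus_client as prom

# Metrics
circuit_requests = prom.Counter(
    'circuit_breaker_requests_total',
    'Total requests recorded by the breaker',
    ['name', 'state']
)
circuit_failures = prom.Counter(
    'circuit_breaker_failures_total',
    'Total failures',
    ['name', 'state']
)


class InstrumentedCircuitBreaker(CircuitBreaker):
    def __init__(self, name: str, **kwargs):
        super().__init__(**kwargs)
        self.name = name  # label value identifying this breaker in the metrics

    def _record(self, counter):
        # Called after the state update, so the label shows the state after the call
        counter.labels(name=self.name, state=self.state.value).inc()

    def _on_failure(self):
        super()._on_failure()
        self._record(circuit_failures)
        self._record(circuit_requests)  # failed calls are requests too

    def _on_success(self):
        super()._on_success()
        self._record(circuit_requests)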
Best Practices
Do’s
- Use appropriate thresholds - Test under load to find the right values
- Implement fallbacks - Always have a backup plan
- Monitor circuit state - Track state changes in your metrics
- Log state changes - They are important for debugging
- Use with timeouts - Combine the breaker with request timeouts
Don’ts
- Don’t set thresholds too low - Transient failures will trip the breaker
- Don’t forget timeouts - A service might hang rather than fail
- Don’t ignore excluded exceptions - Business errors (such as validation failures) are not service failures and should not trip the breaker the way network timeouts do
- Don’t over-engineer - Not every call needs a circuit breaker
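The timeout advice in both lists matters because a breaker only counts calls that raise: a call that hangs never reports a failure. Most HTTP clients accept a timeout directly (for example the `timeout` argument of `requests.get`). For calls without one, a deadline can be enforced from outside, as in this minimal sketch. The helper name `call_with_deadline` and the pool size are assumptions for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError

# Shared worker pool; the size is an arbitrary choice for this example
_executor = ThreadPoolExecutor(max_workers=4)


def call_with_deadline(func, *args, deadline_seconds=2.0, **kwargs):
    """Run func in a worker thread and raise TimeoutError if it takes too long."""
    future = _executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=deadline_seconds)
    except FutureTimeoutError:
        # Best effort only: a call that is already running cannot be interrupted
        future.cancel()
        raise TimeoutError(f"call exceeded {deadline_seconds}s") from None


def slow_service():
    time.sleep(0.5)  # simulates a hung dependency
    return "late"


try:
    call_with_deadline(slow_service, deadline_seconds=0.1)
except TimeoutError as exc:
    print(f"counted as failure: {exc}")
```

With the `CircuitBreaker` class from the implementation section, the wrapped call would look like `breaker.call(call_with_deadline, slow_service, deadline_seconds=0.1)`, so that hangs increment the failure count like any other error. The worker thread keeps running after the deadline, which is why a client-level timeout is preferable when one is available.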
Comparison with Other Patterns
| Pattern | Purpose | Use Case |
|---|---|---|
| Circuit Breaker | Prevent cascading failures | External API calls |
| Bulkhead | Isolate resources | Thread pools, connections |
| Retry | Handle transient failures | Network calls |
| Timeout | Limit wait time | All remote calls |
| Rate Limiter | Prevent overload | Public APIs |
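As a sketch of how the Retry and Circuit Breaker rows combine: retries absorb transient failures, but once the breaker is open a retry loop should stop immediately instead of adding load. The example below assumes exponential backoff with full jitter; `CircuitOpenError` stands in for whatever error the breaker raises when open, and `retry_with_backoff` is a hypothetical helper, not a library function.

```python
import random
import time


class CircuitOpenError(Exception):
    """Stand-in for the breaker's 'circuit is open' error (hypothetical name)."""


def retry_with_backoff(func, max_attempts=4, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call func, retrying failures with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except CircuitOpenError:
            raise  # breaker is open: retrying would only add load
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Upper bound doubles each attempt: base, 2*base, 4*base, ... capped at max_delay
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


attempts = []


def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("transient failure")
    return "ok"


print(retry_with_backoff(flaky))  # ok (succeeds on the third attempt)
```

In this arrangement the retry loop wraps the breaker-protected call, so every attempt is still counted by the breaker and a persistent outage opens the circuit rather than being hidden by retries.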
Conclusion
The circuit breaker is a fundamental resilience pattern that prevents cascading failures in distributed systems. Start with conservative thresholds (5-10 failures, 30-60 second timeout) and tune them based on behavior observed in production. Combine circuit breakers with retries (with exponential backoff), bulkheads, and timeouts for comprehensive fault isolation. Monitor circuit breaker state closely: frequent openings indicate underlying issues that should be fixed rather than just contained.
Resources
- Circuit Breaker (Martin Fowler) - Original pattern description
- Resilience4j Documentation - Java circuit breaker library
- Netflix Hystrix - Production reference implementation (in maintenance mode, no longer actively developed)