Circuit Breaker Pattern: Fault Isolation for Distributed Systems
The circuit breaker pattern prevents cascading failures by stopping requests to failing services. This guide covers implementation, best practices, and integration.
The Problem: Cascading Failures
Without Circuit Breaker:
┌─────────┐     ┌─────────┐     ┌─────────┐
│ Service │────▶│ Service │────▶│ Service │
│    A    │     │    B    │     │    C    │
└─────────┘     └─────────┘     └─────────┘
                     │               │
                     ▼               ▼
                  Failed!       Overwhelmed!
                     │               │
                     ▼               ▼
                ┌─────────────────────────┐
                │    CASCADE FAILURE!     │
                │    All services down    │
                └─────────────────────────┘
The Solution: Circuit Breaker
With Circuit Breaker:
┌─────────┐     ┌───────────────┐     ┌─────────┐
│ Service │────▶│Circuit Breaker│────▶│ Service │
│    A    │     │               │     │    C    │
└─────────┘     └───────┬───────┘     └─────────┘
                        │
                        ▼
                   ┌──────────┐
                   │ Service  │
                   │ B is     │
                   │ failing! │
                   └────┬─────┘
                        │
                        ▼
                ┌───────────────┐
                │ Fast failure  │
                │ (don't wait)  │
                └───────────────┘
Circuit Breaker States
CLOSED (normal operation) ── success ───────────────────▶ CLOSED (normal operation)
CLOSED (normal operation) ── failure threshold reached ─▶ OPEN (reject requests)
OPEN (reject requests)    ── timeout ───────────────────▶ HALF-OPEN (testing)
HALF-OPEN (testing)       ── success ───────────────────▶ CLOSED (reset)
HALF-OPEN (testing)       ── failure ───────────────────▶ OPEN (retry)
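The state transitions can also be written down as a small lookup table. The sketch below is only an illustration of the state machine: the event names (`threshold_reached`, `timeout_elapsed`, `probe_succeeded`, `probe_failed`) are invented here, and a working breaker also has to track counters and timestamps to decide when those events fire.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


# (current state, event) -> next state. The event names are hypothetical.
TRANSITIONS = {
    (State.CLOSED, "threshold_reached"): State.OPEN,
    (State.OPEN, "timeout_elapsed"): State.HALF_OPEN,
    (State.HALF_OPEN, "probe_succeeded"): State.CLOSED,
    (State.HALF_OPEN, "probe_failed"): State.OPEN,
}


def next_state(state: State, event: str) -> State:
    """Return the next state, or the current one if the event does not apply."""
    return TRANSITIONS.get((state, event), state)
```

Any pair missing from the table leaves the state unchanged, which matches the idea that an ordinary success in CLOSED keeps the circuit closed.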
Implementation
Python Implementation
import threading
import time
from enum import Enum
from typing import Any, Callable, Optional


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        success_threshold: int = 2,
        timeout: float = 60.0,
        excluded_exceptions: tuple = (),
    ):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.success_threshold = success_threshold  # half-open successes before closing
        self.timeout = timeout  # seconds to stay open before probing again
        self.excluded_exceptions = excluded_exceptions  # never counted as failures
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time: Optional[float] = None
        self._lock = threading.Lock()

    def call(self, func: Callable, *args, **kwargs) -> Any:
        if not self._can_execute():
            raise CircuitBreakerOpenError(
                f"Circuit breaker is {self.state.value}"
            )
        try:
            result = func(*args, **kwargs)
        except Exception as e:
            if self._should_handle(e):
                self._on_failure()
            raise  # re-raise with the original traceback
        self._on_success()
        return result

    def _can_execute(self) -> bool:
        with self._lock:
            if self.state == CircuitState.CLOSED:
                return True
            if self.state == CircuitState.OPEN:
                # Check if timeout has passed (monotonic clock ignores wall-clock jumps)
                if time.monotonic() - self.last_failure_time >= self.timeout:
                    self._transition_to_half_open()
                    return True
                return False
            # HALF_OPEN - let trial requests through
            # (this simple version does not cap how many run concurrently)
            return True

    def _should_handle(self, exception: Exception) -> bool:
        # Don't count excluded exceptions
        return not isinstance(exception, self.excluded_exceptions)

    def _on_success(self):
        with self._lock:
            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.success_threshold:
                    self._transition_to_closed()
            else:
                # Reset failure count on success
                self.failure_count = 0

    def _on_failure(self):
        with self._lock:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            if self.state == CircuitState.HALF_OPEN:
                # A single failed probe reopens the circuit
                self._transition_to_open()
            elif self.failure_count >= self.failure_threshold:
                self._transition_to_open()

    # The _transition_* helpers must be called with self._lock held.
    def _transition_to_open(self):
        self.state = CircuitState.OPEN
        print(f"Circuit breaker opened after {self.failure_count} failures")

    def _transition_to_half_open(self):
        self.state = CircuitState.HALF_OPEN
        self.success_count = 0
        print("Circuit breaker half-open, testing...")

    def _transition_to_closed(self):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        print("Circuit breaker closed, back to normal")
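# Usage
import requests

breaker = CircuitBreaker(failure_threshold=3, timeout=30)


def call_external_service():
    # Set a request timeout so a hung service surfaces as a failure
    response = requests.get("https://api.example.com/data", timeout=5)
    # Raise on HTTP error statuses (4xx/5xx) so they count as failures
    response.raise_for_status()
    return response.json()


# Wrapped call
try:
    result = breaker.call(call_external_service)
except CircuitBreakerOpenError:
    # Handle gracefully - return cached data or default.
    # get_cached_data() is a placeholder for your own cache lookup.
    result = get_cached_data()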
Decorator Version
from functools import wraps

import requests


def circuit_breaker(failure_threshold=5, timeout=60):
    def decorator(func):
        # One breaker per decorated function, shared by all of its callers
        breaker = CircuitBreaker(
            failure_threshold=failure_threshold,
            timeout=timeout,
        )

        @wraps(func)
        def wrapper(*args, **kwargs):
            return breaker.call(func, *args, **kwargs)

        # Expose breaker for inspection
        wrapper.circuit_breaker = breaker
        return wrapper

    return decorator


# Usage
@circuit_breaker(failure_threshold=3, timeout=30)
def fetch_user(user_id):
    # requests needs an absolute URL; api.example.com is a placeholder host
    return requests.get(f"https://api.example.com/users/{user_id}", timeout=5).json()


# Check circuit state
if fetch_user.circuit_breaker.state == CircuitState.OPEN:
    print("Service is currently unavailable")
Integration with Libraries
Resilience4j (Java)
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

import java.time.Duration;
import java.util.function.Supplier;

// Configure circuit breaker
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                         // Open at a 50% failure rate
    .waitDurationInOpenState(Duration.ofSeconds(30))  // Stay open for 30 s before probing
    .slidingWindowSize(10)                            // Evaluate the last 10 calls
    .minimumNumberOfCalls(5)                          // Need 5 calls before the rate is computed
    .permittedNumberOfCallsInHalfOpenState(3)         // Trial calls allowed when half-open
    .build();

CircuitBreaker circuitBreaker = CircuitBreaker.of("externalService", config);

// Decorate and call (externalApiCall() stands in for your remote call)
Supplier<String> decoratedSupplier = CircuitBreaker
    .decorateSupplier(circuitBreaker, () -> externalApiCall());
String result = decoratedSupplier.get();
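To make the `failureRateThreshold`, `slidingWindowSize`, and `minimumNumberOfCalls` settings more concrete, here is a rough Python sketch of a count-based sliding window. It illustrates the idea only and is not Resilience4j's implementation; the class name `FailureRateWindow` and its methods are invented for this example.

```python
from collections import deque


class FailureRateWindow:
    """Count-based sliding window over the outcomes of the most recent calls."""

    def __init__(self, window_size=10, minimum_calls=5, failure_rate_threshold=50.0):
        self.minimum_calls = minimum_calls
        self.failure_rate_threshold = failure_rate_threshold  # percent
        # deque(maxlen=...) drops the oldest outcome once the window is full.
        self._outcomes = deque(maxlen=window_size)

    def record(self, success: bool) -> None:
        self._outcomes.append(success)

    def failure_rate(self) -> float:
        """Percentage of failed calls in the window (0.0 when it is empty)."""
        if not self._outcomes:
            return 0.0
        failures = sum(1 for ok in self._outcomes if not ok)
        return 100.0 * failures / len(self._outcomes)

    def should_open(self) -> bool:
        # With too few samples, do not judge the service yet.
        if len(self._outcomes) < self.minimum_calls:
            return False
        return self.failure_rate() >= self.failure_rate_threshold


window = FailureRateWindow(window_size=10, minimum_calls=5, failure_rate_threshold=50.0)
for ok in [True, False, False, True, False]:
    window.record(ok)
print(window.failure_rate(), window.should_open())  # 60.0 True
```

Compared with the consecutive-failure counter in the Python implementation above, a rate over a window tolerates an occasional success in the middle of an outage without resetting.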
Hystrix (Java)
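import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Note: Hystrix is in maintenance mode and no longer actively developed.
class GetUserCommand extends HystrixCommand<User> {
    private final Long userId;

    public GetUserCommand(Long userId) {
        super(HystrixCommandGroupKey.Factory.asKey("UserService"));
        this.userId = userId;
    }

    @Override
    protected User run() throws Exception {
        // userService is assumed to be a client available to this class
        return userService.getUser(userId);
    }

    @Override
    protected User getFallback() {
        // cache is assumed to be available to this class
        return cache.get(userId); // Return cached data
    }
}

// Usage
User user = new GetUserCommand(123L).execute();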
Go (Golang)
import (
    "time"

    "github.com/sony/gobreaker"
)

var cb *gobreaker.CircuitBreaker

func init() {
    settings := gobreaker.Settings{
        Name:        "external-api",
        MaxRequests: 3,                // requests allowed through while half-open
        Interval:    10 * time.Second, // how often counts are cleared while closed
        Timeout:     30 * time.Second, // time spent open before going half-open
        ReadyToTrip: func(counts gobreaker.Counts) bool {
            // Require a few samples so a single early failure cannot trip the breaker
            failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
            return counts.Requests >= 3 && failureRatio >= 0.5
        },
    }
    cb = gobreaker.NewCircuitBreaker(settings)
}

// externalService is a placeholder for your own client.
func CallExternalAPI() (interface{}, error) {
    return cb.Execute(func() (interface{}, error) {
        return externalService.Call()
    })
}
Fallback Strategies
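import json
import time


# redis, queue, api and get_stale_user_data are placeholders for your own
# clients and helpers. get_stale_user_data is assumed to return None when
# it has nothing to offer.
def get_user_fallback(user_id, error):
    """Try fallback strategies in order, from most to least useful."""
    # Strategy 1: Return cached data
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)

    # Strategy 2: Return stale data
    stale = get_stale_user_data(user_id)
    if stale is not None:
        return stale

    # Strategy 3: Queue for retry
    queue.push({"user_id": user_id, "retry_at": time.time() + 60})

    # Strategy 4: Return default user
    return {"id": user_id, "name": "Unknown User"}


# Usage with fallback. This assumes a circuit_breaker decorator extended to
# accept a `fallback` callable; the decorator shown earlier does not.
@circuit_breaker(fallback=get_user_fallback)
def get_user(user_id):
    return api.get_user(user_id)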
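The `circuit_breaker` decorator defined earlier takes no `fallback` argument, so one way to wire a fallback in is a separate wrapper that catches the open-circuit error. The following is a minimal sketch under that assumption: `with_fallback` and `default_user` are hypothetical names, and `CircuitBreakerOpenError` is redefined here only to keep the example self-contained.

```python
from functools import wraps


class CircuitBreakerOpenError(Exception):
    """Stand-in for the breaker's 'circuit is open' exception."""


def with_fallback(fallback, exceptions=(CircuitBreakerOpenError,)):
    """Call fallback(*args, error=exc, **kwargs) when func raises one of `exceptions`."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except exceptions as exc:
                return fallback(*args, error=exc, **kwargs)
        return wrapper
    return decorator


def default_user(user_id, error):
    # Hypothetical fallback: return a placeholder record.
    return {"id": user_id, "name": "Unknown User"}


@with_fallback(default_user)
def get_user(user_id):
    # Simulates a call that an open breaker has rejected.
    raise CircuitBreakerOpenError("Circuit breaker is open")


print(get_user(42))  # {'id': 42, 'name': 'Unknown User'}
```

In real use, `@with_fallback(...)` would sit above `@circuit_breaker(...)` on the same function, so the breaker still sees and counts the underlying failures while callers receive the fallback value.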
Monitoring and Observability
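import prometheus_client as prom

# Metrics
circuit_requests = prom.Counter(
    'circuit_breaker_requests_total',
    'Total requests',
    ['name', 'state']
)
circuit_failures = prom.Counter(
    'circuit_breaker_failures_total',
    'Total failures',
    ['name', 'state']
)


class InstrumentedCircuitBreaker(CircuitBreaker):
    def __init__(self, name: str, **kwargs):
        super().__init__(**kwargs)
        self.name = name  # label value identifying this breaker in the metrics

    def _on_failure(self):
        super()._on_failure()
        # `state` is the state after the failure was recorded
        labels = {"name": self.name, "state": self.state.value}
        circuit_requests.labels(**labels).inc()  # failed calls are requests too
        circuit_failures.labels(**labels).inc()

    def _on_success(self):
        super()._on_success()
        circuit_requests.labels(
            name=self.name,
            state=self.state.value
        ).inc()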
Best Practices
Do’s
- Use appropriate thresholds - Test under load to find the right values
- Implement fallbacks - Always have a backup plan
- Monitor circuit state - Track state changes in your metrics
- Log state changes - Important for debugging
- Use with timeouts - Combine with request timeouts
Don’ts
- Don’t set thresholds too low - Transient failures will trip the breaker
- Don’t forget timeouts - A service might hang rather than fail
- Don’t ignore excluded exceptions - Network timeouts ≠ business errors; exclude the latter so they don’t trip the breaker
- Don’t over-engineer - Not every call needs a circuit breaker
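The advice about timeouts matters because a call that hangs never raises, so the breaker's failure counter never moves. For HTTP clients the library's own timeout parameter is usually the right tool. For arbitrary callables, one possible sketch is to run the call in a worker thread and give up waiting after a deadline; `call_with_timeout` below is a hypothetical helper, not part of any library.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared pool for timed calls; the size is an arbitrary choice for this sketch.
_executor = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, timeout_seconds, *args, **kwargs):
    """Run func in a worker thread; raise TimeoutError if it takes too long."""
    future = _executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # The worker thread keeps running in the background; Python cannot
        # kill it, so this only stops the caller from waiting.
        raise TimeoutError(f"call exceeded {timeout_seconds}s") from None
```

Passing such a helper to `breaker.call(call_with_timeout, slow_function, 2.0)` would turn a hang into an exception the breaker can count. The abandoned thread still occupies a pool slot until it finishes, which is one reason the bulkhead pattern is often used alongside.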
Comparison with Other Patterns
| Pattern | Purpose | Use Case |
|---|---|---|
| Circuit Breaker | Prevent cascading failures | External API calls |
| Bulkhead | Isolate resources | Thread pools, connections |
| Retry | Handle transient failures | Network calls |
| Timeout | Limit wait time | All remote calls |
| Rate Limiter | Prevent overload | Public APIs |
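These patterns are often combined. As a rough sketch of how Retry and Circuit Breaker can fit together, the helper below retries transient errors with exponential backoff but gives up immediately when the circuit is open. The `retry` function and its parameters are hypothetical, and `CircuitBreakerOpenError` is redefined only so the example runs on its own.

```python
import time


class CircuitBreakerOpenError(Exception):
    """Stand-in for the breaker's 'circuit is open' exception."""


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry func with exponential backoff, but never retry an open circuit."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitBreakerOpenError:
            # The breaker is already failing fast; retrying only adds load.
            raise
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
```

Here `func` would typically be a zero-argument callable that goes through the breaker, for example `lambda: breaker.call(call_external_service)`, so every attempt is still counted by the breaker.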