Introduction
In distributed systems, a single failing service can trigger cascading failures across your infrastructure. When a downstream service becomes slow or unavailable, callers that wait for timeouts tie up threads, connections, and memory, which can bring down the whole application. The Circuit Breaker pattern addresses this by wrapping calls to the dependency in a proxy that monitors failures and “trips” once they cross a threshold, so that further requests fail fast instead of piling up on the failing service.
The Problem: Cascade Failures
Cascade Failure Scenario

    User Request
         |
         v
    +-----------+
    | Service A |
    +-----------+
         | (slow/down)
         v
    +-----------+      +-----------+
    | Service B |----->| Service C |
    | (timeout) |      | (healthy) |
    +-----------+      +-----------+
         |
         v
    Threads blocked waiting for Service B
    Memory exhaustion -> More timeouts -> Crash
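The thread exhaustion shown above can be estimated with Little's law: the average number of requests in flight equals the arrival rate multiplied by the time each request holds a thread. The sketch below is a back-of-the-envelope illustration only; the function name and all numbers (a 50-thread pool, 20 requests per second, a 5-second timeout) are assumptions, not measurements.

```python
def threads_in_flight(arrival_rate_per_s: float, hold_time_s: float) -> float:
    """Little's law: average concurrency = arrival rate x time in system."""
    return arrival_rate_per_s * hold_time_s


POOL_SIZE = 50        # assumed worker-thread pool size of Service A
ARRIVAL_RATE = 20.0   # assumed requests per second reaching Service A

# Healthy dependency: each call holds a thread for about 100 ms.
healthy = threads_in_flight(ARRIVAL_RATE, 0.1)

# Failing dependency: each call holds a thread until the 5 s timeout fires.
degraded = threads_in_flight(ARRIVAL_RATE, 5.0)

print(f"healthy:  about {healthy:.0f} of {POOL_SIZE} threads busy")
print(f"degraded: about {degraded:.0f} threads needed, "
      f"pool exhausted: {degraded > POOL_SIZE}")
```

Under these assumed numbers the same traffic needs fifty times more threads once every call waits for the timeout, which is why failing fast matters.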
How Circuit Breaker Works
The Circuit Breaker acts as a state machine with three states:
Circuit Breaker States

    +----------+    failure threshold    +-----------+
    |  CLOSED  |------------------------>|   OPEN    |
    | (normal) |                         | (blocked) |
    +----------+                         +-----------+
         ^                                     |
         |         success threshold           |
         +-------------------------------------+
                      (half-open)

    CLOSED:    Requests pass through normally
               Failures are counted

    OPEN:      Requests fail immediately
               Returns fallback response

    HALF-OPEN: Test requests allowed
               If successful -> CLOSED
               If failed -> OPEN again
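The three states and their transitions can be written down as a small lookup table. This is a minimal sketch for illustration: the `Event` names are assumptions introduced here, and the recovery-timeout event that moves OPEN to HALF-OPEN is taken from the implementations later in this article rather than from the diagram.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class Event(Enum):  # hypothetical event names
    FAILURE_THRESHOLD_REACHED = "failure_threshold_reached"
    RECOVERY_TIMEOUT_ELAPSED = "recovery_timeout_elapsed"
    SUCCESS_THRESHOLD_REACHED = "success_threshold_reached"
    TEST_REQUEST_FAILED = "test_request_failed"


# Allowed transitions; any (state, event) pair not listed leaves the state unchanged.
TRANSITIONS = {
    (State.CLOSED, Event.FAILURE_THRESHOLD_REACHED): State.OPEN,
    (State.OPEN, Event.RECOVERY_TIMEOUT_ELAPSED): State.HALF_OPEN,
    (State.HALF_OPEN, Event.SUCCESS_THRESHOLD_REACHED): State.CLOSED,
    (State.HALF_OPEN, Event.TEST_REQUEST_FAILED): State.OPEN,
}


def next_state(state: State, event: Event) -> State:
    """Return the state that follows `state` when `event` occurs."""
    return TRANSITIONS.get((state, event), state)


# Walk through one full trip-and-recover cycle.
state = State.CLOSED
for event in (
    Event.FAILURE_THRESHOLD_REACHED,
    Event.RECOVERY_TIMEOUT_ELAPSED,
    Event.SUCCESS_THRESHOLD_REACHED,
):
    state = next_state(state, event)
    print(f"{event.value} -> {state.value}")
```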
Implementation Examples
Python Implementation
import threading
import time
from enum import Enum
from functools import wraps
from typing import Any, Callable, Optional, Type


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 60,
        success_threshold: int = 3,
        expected_exception: Type[Exception] = Exception,
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout  # seconds to stay OPEN
        self.success_threshold = success_threshold
        self.expected_exception = expected_exception
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time: Optional[float] = None
        self._lock = threading.Lock()

    def call(self, func: Callable, *args, **kwargs) -> Any:
        # Check and update the state under the lock so that concurrent
        # callers see a consistent view.
        with self._lock:
            if self.state == CircuitState.OPEN:
                if self._should_attempt_reset():
                    self._transition_to_half_open()
                else:
                    raise CircuitBreakerOpenError("Circuit breaker is OPEN")
        # The protected call runs outside the lock so that a slow call
        # does not block other callers.
        try:
            result = func(*args, **kwargs)
        except self.expected_exception:
            self._on_failure()
            raise  # re-raise with the original traceback
        self._on_success()
        return result

    def _should_attempt_reset(self) -> bool:
        if self.last_failure_time is None:
            return False
        # time.monotonic() is unaffected by system clock adjustments.
        return (time.monotonic() - self.last_failure_time) >= self.recovery_timeout

    def _on_success(self):
        with self._lock:
            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.success_threshold:
                    self._transition_to_closed()
            else:
                # A success while CLOSED resets the consecutive-failure count.
                self.failure_count = 0

    def _on_failure(self):
        with self._lock:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            if self.state == CircuitState.HALF_OPEN:
                # Any failure during a test request reopens the circuit.
                self._transition_to_open()
            elif self.failure_count >= self.failure_threshold:
                self._transition_to_open()

    # The transition helpers are always called with self._lock held.
    def _transition_to_open(self):
        self.state = CircuitState.OPEN
        self.success_count = 0

    def _transition_to_half_open(self):
        self.state = CircuitState.HALF_OPEN
        self.success_count = 0

    def _transition_to_closed(self):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None


def circuit_breaker(circuit: CircuitBreaker):
    """Decorator that routes every call to `func` through `circuit`."""
    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args, **kwargs):
            return circuit.call(func, *args, **kwargs)
        return wrapper
    return decorator
Using the Circuit Breaker
from typing import Optional

import requests

# Create a circuit breaker instance
circuit = CircuitBreaker(
    failure_threshold=3,
    recovery_timeout=30,
    success_threshold=2,
)


# Apply it to a function
@circuit_breaker(circuit)
def call_external_api(url: str) -> dict:
    response = requests.get(url, timeout=5)
    response.raise_for_status()
    return response.json()


# Usage with fallback
def get_user_data(user_id: str) -> Optional[dict]:
    try:
        return call_external_api(f"https://api.example.com/users/{user_id}")
    except (CircuitBreakerOpenError, requests.RequestException):
        # The circuit is open or the request failed: return cached data or a
        # default response. get_cached_user is assumed to be defined elsewhere.
        return get_cached_user(user_id)
Java Implementation with Resilience4j
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import io.vavr.control.Try; // Try comes from the Vavr library, which must be on the classpath

import java.time.Duration;
import java.util.function.Supplier;

public class CircuitBreakerExample {

    public static void main(String[] args) {
        // Configure circuit breaker
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            .failureRateThreshold(50)                          // Open at 50% failure rate
            .slowCallRateThreshold(100)                        // Open when 100% of calls are slow
            .slowCallDurationThreshold(Duration.ofSeconds(2))  // Calls over 2 seconds count as slow
            .waitDurationInOpenState(Duration.ofSeconds(30))   // Stay open for 30 seconds
            .permittedNumberOfCallsInHalfOpenState(3)          // Test calls allowed in half-open
            .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
            .slidingWindowSize(10)                             // Evaluate the last 10 calls
            .minimumNumberOfCalls(5)                           // Need 5 calls before computing rates
            .build();

        CircuitBreakerRegistry registry = CircuitBreakerRegistry.of(config);
        CircuitBreaker circuitBreaker = registry.circuitBreaker("externalService");

        // Decorate your function
        Supplier<String> decoratedSupplier = CircuitBreaker
            .decorateSupplier(circuitBreaker, () -> callExternalService());

        // Execute with fallback
        String result = Try.ofSupplier(decoratedSupplier)
            .recover(Exception.class, e -> getFallbackData())
            .get();
    }

    private static String callExternalService() {
        // Your external service call here
        return "Success";
    }

    private static String getFallbackData() {
        return "Fallback data";
    }
}
Go Implementation
package circuitbreaker

import (
	"errors"
	"sync"
	"time"
)

// ErrOpen is returned when a call is rejected because the circuit is open.
var ErrOpen = errors.New("circuit breaker is open")

type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

type CircuitBreaker struct {
	mu           sync.Mutex
	state        State
	failureCount int
	successCount int
	lastFailure  time.Time

	// Configuration
	FailureThreshold int
	SuccessThreshold int
	Timeout          time.Duration
}

func New(failureThreshold, successThreshold int, timeout time.Duration) *CircuitBreaker {
	return &CircuitBreaker{
		state:            Closed,
		FailureThreshold: failureThreshold,
		SuccessThreshold: successThreshold,
		Timeout:          timeout,
	}
}

func (cb *CircuitBreaker) Execute(fn func() error) error {
	// Check the state under the lock, but release it before calling fn so
	// that concurrent calls are not serialized behind a slow dependency.
	cb.mu.Lock()
	if cb.state == Open {
		if time.Since(cb.lastFailure) > cb.Timeout {
			cb.state = HalfOpen
			cb.successCount = 0
		} else {
			cb.mu.Unlock()
			return ErrOpen
		}
	}
	cb.mu.Unlock()

	err := fn()

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.handleFailure()
		return err
	}
	cb.handleSuccess()
	return nil
}

// handleFailure must be called with cb.mu held.
func (cb *CircuitBreaker) handleFailure() {
	cb.failureCount++
	cb.lastFailure = time.Now()
	if cb.state == HalfOpen || cb.failureCount >= cb.FailureThreshold {
		cb.state = Open
	}
}

// handleSuccess must be called with cb.mu held.
func (cb *CircuitBreaker) handleSuccess() {
	cb.failureCount = 0
	if cb.state == HalfOpen {
		cb.successCount++
		if cb.successCount >= cb.SuccessThreshold {
			cb.state = Closed
		}
	}
}
Configuration Best Practices
Setting Thresholds
# Conservative settings for critical services
critical_circuit = CircuitBreaker(
    failure_threshold=3,    # Open after 3 consecutive failures
    recovery_timeout=30,    # Try again after 30 seconds
    success_threshold=2,    # Need 2 successes to close
)

# Relaxed settings for non-critical services
non_critical_circuit = CircuitBreaker(
    failure_threshold=10,   # Open after 10 consecutive failures
    recovery_timeout=120,   # Try again after 2 minutes
    success_threshold=5,    # Need 5 successes to close
)
Monitoring and Observability
import logging
from datetime import datetime, timezone


class ObservableCircuitBreaker(CircuitBreaker):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.logger = logging.getLogger(__name__)
        self.state_changes = []

    def _transition_to_open(self):
        super()._transition_to_open()
        self._log_state_change("OPEN")
        self._alert_on_call()

    def _transition_to_closed(self):
        super()._transition_to_closed()
        self._log_state_change("CLOSED")

    def _transition_to_half_open(self):
        super()._transition_to_half_open()
        self._log_state_change("HALF_OPEN")

    def _log_state_change(self, new_state: str):
        event = {
            # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC time.
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "state": new_state,
            "failure_count": self.failure_count,
        }
        self.state_changes.append(event)
        self.logger.warning("Circuit breaker transitioned to %s", new_state)

    def _alert_on_call(self):
        # Send alert to monitoring system
        pass
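Recorded state changes become more useful once they are turned into a metric. The sketch below sums how long a breaker has been OPEN, given a chronological list of events shaped like the ones `_log_state_change` appends. The function name `time_spent_open` and the sample events are made up for illustration, and the code assumes all timestamps (including `now`) are either timezone-aware or all naive.

```python
from datetime import datetime, timedelta, timezone


def time_spent_open(state_changes: list, now: datetime) -> timedelta:
    """Sum the durations of all OPEN periods in a chronological event list."""
    total = timedelta()
    opened_at = None
    for event in state_changes:
        timestamp = datetime.fromisoformat(event["timestamp"])
        if event["state"] == "OPEN":
            if opened_at is None:
                opened_at = timestamp
        elif opened_at is not None:
            # Any other state (HALF_OPEN or CLOSED) ends the OPEN period.
            total += timestamp - opened_at
            opened_at = None
    if opened_at is not None:
        total += now - opened_at  # the breaker is still open
    return total


# Hypothetical sample: open 30 s, failed test request, open another 30 s, recovered.
events = [
    {"timestamp": "2024-01-01T12:00:00+00:00", "state": "OPEN", "failure_count": 3},
    {"timestamp": "2024-01-01T12:00:30+00:00", "state": "HALF_OPEN", "failure_count": 3},
    {"timestamp": "2024-01-01T12:00:31+00:00", "state": "OPEN", "failure_count": 4},
    {"timestamp": "2024-01-01T12:01:01+00:00", "state": "HALF_OPEN", "failure_count": 4},
    {"timestamp": "2024-01-01T12:01:03+00:00", "state": "CLOSED", "failure_count": 0},
]
now = datetime(2024, 1, 1, 12, 5, 0, tzinfo=timezone.utc)
print(time_spent_open(events, now).total_seconds())  # 60.0
```

A value like this can be exported as a gauge or counter so that dashboards show how much time each dependency spends unavailable.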
Fallback Strategies
Different Fallback Approaches
import requests

# The helper functions called below (get_product_from_api, call_pricing_service,
# and so on) are assumed to be defined elsewhere and protected by a breaker.

# The breaker must outlive individual calls: creating it inside the function
# would reset the failure count on every request, so it would never open.
product_circuit = CircuitBreaker(failure_threshold=3, recovery_timeout=60)


# 1. Return cached data
def get_product_with_cache(product_id: str) -> dict:
    try:
        return product_circuit.call(get_product_from_api, product_id)
    except (CircuitBreakerOpenError, requests.RequestException):
        return get_product_from_cache(product_id)


# 2. Return default values
def calculate_discount_with_default(order_id: str) -> float:
    try:
        return call_pricing_service(order_id)
    except (CircuitBreakerOpenError, requests.RequestException):
        return 0.0  # Default: no discount


# 3. Queue for later processing
def process_order_with_queue(order: dict) -> bool:
    try:
        return call_inventory_service(order)
    except (CircuitBreakerOpenError, requests.RequestException):
        queue_order_for_retry(order)
        return True  # Report the order as accepted; it is processed later


# 4. Degrade gracefully
def get_recommendations_degraded(user_id: str) -> list:
    try:
        return call_ml_service(user_id)
    except (CircuitBreakerOpenError, requests.RequestException):
        # Return popular items instead of personalized ones
        return get_popular_items()
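The four approaches above share the same try/except shape, so the pattern can be factored into a reusable decorator. This is a minimal sketch: `with_fallback`, `popular_items`, and `recommendations` are hypothetical names, and built-in exceptions stand in for the breaker and HTTP errors you would list in a real service.

```python
from functools import wraps
from typing import Any, Callable, Tuple, Type


def with_fallback(
    fallback: Callable[..., Any],
    exceptions: Tuple[Type[BaseException], ...] = (Exception,),
) -> Callable:
    """Call `fallback` with the same arguments when the primary call raises."""
    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except exceptions:
                return fallback(*args, **kwargs)
        return wrapper
    return decorator


def popular_items(user_id: str) -> list:
    # Hypothetical non-personalized fallback.
    return ["bestseller-1", "bestseller-2"]


@with_fallback(popular_items, exceptions=(ConnectionError, TimeoutError))
def recommendations(user_id: str) -> list:
    # Stand-in for a remote call that is currently failing.
    raise ConnectionError("recommendation service unavailable")


print(recommendations("user-42"))  # ['bestseller-1', 'bestseller-2']
```

Listing the exception types explicitly keeps programming errors such as `TypeError` visible instead of silently masking them with fallback data.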
Common Pitfalls
Pitfall 1: Setting Thresholds Too Low
# BAD: Circuit opens too easily
circuit = CircuitBreaker(failure_threshold=1, recovery_timeout=5)
# GOOD: More resilient configuration
circuit = CircuitBreaker(failure_threshold=5, recovery_timeout=30)
Pitfall 2: No Fallback Strategy
# BAD: No fallback - will still fail
result = call_external_service() # Raises exception when open
# GOOD: Always has fallback
result = get_data_with_fallback() # Returns cached/default data
Pitfall 3: Ignoring Slow Calls
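# BAD: Only counts exceptions
circuit = CircuitBreaker(failure_threshold=5)
# GOOD: Also handles slow calls
# Note: slow_call_timeout and slow_call_threshold are illustrative. The
# CircuitBreaker class defined earlier does not accept them and would need
# to be extended to support them.
circuit = CircuitBreaker(
    failure_threshold=5,
    slow_call_timeout=5,    # Treat calls slower than 5 seconds as failures
    slow_call_threshold=3,
)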
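One way to make an exception-counting breaker react to slow calls is to convert them into exceptions before the breaker sees the result. In the sketch below, `SlowCallError`, `fail_if_slow`, and the injectable `clock` parameter are hypothetical names introduced for illustration. Note that this approach classifies a call as slow only after it returns and does not interrupt it, so a client-side timeout is still needed.

```python
import time
from typing import Any, Callable


class SlowCallError(Exception):
    """Raised after a call completes but took longer than allowed."""


def fail_if_slow(
    func: Callable[..., Any],
    slow_call_timeout: float,
    clock: Callable[[], float] = time.monotonic,
) -> Callable[..., Any]:
    """Wrap `func` so that calls slower than `slow_call_timeout` seconds raise."""
    def wrapper(*args, **kwargs):
        started = clock()
        result = func(*args, **kwargs)
        elapsed = clock() - started
        if elapsed > slow_call_timeout:
            # The result is discarded: the caller sees a failure, which an
            # exception-counting breaker records like any other error.
            raise SlowCallError(
                f"call took {elapsed:.2f}s (limit {slow_call_timeout}s)"
            )
        return result
    return wrapper


def slow_lookup() -> str:
    time.sleep(0.05)  # simulate a slow dependency
    return "ok"


guarded = fail_if_slow(slow_lookup, slow_call_timeout=0.01)
try:
    guarded()
except SlowCallError as exc:
    print(f"counted as failure: {exc}")
```

Passing the wrapped function to a breaker's call method then lets slow responses count toward the failure threshold without changing the breaker itself.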
Tools and Libraries
| Language | Library | Features |
|---|---|---|
| Java | Resilience4j | Circuit breaker, retry, rate limiter |
| .NET | Polly | Circuit breaker, retry, bulkhead |
| Python | pybreaker | Circuit breaker implementation |
| Go | gobreaker | Circuit breaker pattern |
| Node.js | opossum | Circuit breaker for Node.js |
| Ruby | breaker_machines | Modern circuit breaker |
AWS Lambda Considerations
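In serverless architectures, breaker state held in memory is local to a single Lambda execution environment and is lost when that environment is recycled. Sharing state across concurrent invocations requires an external store or an orchestrator such as Step Functions. Within a handler, the fallback logic looks the same as elsewhere:
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)


# AWS Lambda handler with circuit breaker fallback.
# call_external_service and get_fallback_data are assumed to be defined
# elsewhere, with call_external_service protected by a circuit breaker.
def lambda_handler(event, context):
    try:
        # Attempt the operation
        result = call_external_service()
        return {"statusCode": 200, "body": result}
    except CircuitBreakerOpenError:
        # Return cached or default response
        return {"statusCode": 200, "body": get_fallback_data()}
    except Exception as e:
        # Log and return error
        logger.error("Error: %s", e)
        return {"statusCode": 500, "body": "Internal error"}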
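One option for sharing breaker state across Lambda execution environments, which do not share memory, is to keep the counters in an external store. The sketch below uses a plain dict as a stand-in for that store; `SharedStateBreaker`, its method names, and the record layout are assumptions made for illustration. A real store would also need atomic updates (for example conditional writes), which this sketch does not attempt.

```python
import time
from typing import Callable, Dict


class SharedStateBreaker:
    """Circuit breaker whose counters live in an external key-value store."""

    def __init__(
        self,
        store: Dict[str, dict],                  # stand-in for a shared store client
        name: str,
        failure_threshold: int = 3,
        recovery_timeout: float = 30.0,
        clock: Callable[[], float] = time.time,  # wall clock, shared across hosts
    ):
        self.store = store
        self.name = name
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock

    def allow_request(self) -> bool:
        record = self.store.get(self.name)
        if record is None or record["failures"] < self.failure_threshold:
            return True  # closed
        # Open: allow a trial request once the recovery timeout has elapsed.
        return self.clock() - record["last_failure"] >= self.recovery_timeout

    def record_failure(self) -> None:
        record = self.store.get(self.name, {"failures": 0, "last_failure": 0.0})
        self.store[self.name] = {
            "failures": record["failures"] + 1,
            "last_failure": self.clock(),
        }

    def record_success(self) -> None:
        self.store.pop(self.name, None)  # reset to closed


shared_store: Dict[str, dict] = {}
breaker = SharedStateBreaker(shared_store, "externalService", failure_threshold=2)
breaker.record_failure()
breaker.record_failure()
print(breaker.allow_request())  # False while within the recovery timeout
```

A handler would call `allow_request()` before the remote call and `record_success()` or `record_failure()` afterwards, at the cost of one or two extra store round trips per invocation.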
Conclusion
The Circuit Breaker pattern is essential for building resilient distributed systems. By preventing cascade failures and providing graceful degradation, it helps maintain system availability even when dependencies fail. Key takeaways:
- Set appropriate thresholds based on service criticality
- Always implement fallbacks to provide degraded functionality
- Monitor circuit breaker state to detect issues early
- Use with other patterns like retry, timeout, and bulkhead for comprehensive resilience
- Choose a recovery timeout long enough for the dependency to recover before test requests resume
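As an illustration of combining patterns, the sketch below pairs retry with a breaker: transient errors are retried with exponential backoff, but an open circuit is never retried, since that would only add latency. `CircuitOpenError` is a stand-in for whatever error your breaker raises while open, and `retry_unless_open` and `flaky` are hypothetical names.

```python
import time
from typing import Any, Callable


class CircuitOpenError(Exception):
    """Stand-in for the error a breaker raises while open."""


def retry_unless_open(
    func: Callable[[], Any],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Retry `func` with exponential backoff, but never retry an open circuit."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # the breaker already decided to fail fast
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


calls = {"count": 0}


def flaky() -> str:
    # Fails twice with a transient error, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


print(retry_unless_open(flaky, base_delay=0.01))  # ok
```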
Implement circuit breakers at service boundaries, especially for external API calls and downstream dependencies. This pattern, combined with proper monitoring and fallback strategies, forms the foundation of resilient microservices architecture.
Resources
- Circuit Breaker Pattern - AWS Architecture
- Resilience4j Documentation
- Polly Library for .NET
- Building Microservices by Sam Newman