⚡ Calmops

Circuit Breaker Pattern: Fault Isolation for Distributed Systems

The circuit breaker pattern prevents cascading failures by stopping requests to failing services. This guide covers implementation, best practices, and integration.

The Problem: Cascading Failures

Without Circuit Breaker:
┌─────────┐     ┌─────────┐     ┌─────────┐
│ Service │────▶│ Service │────▶│ Service │
│    A    │     │    B    │     │    C    │
└─────────┘     └─────────┘     └─────────┘
                     │               │
                     ▼               ▼
                  Failed!      Overwhelmed!
                     │               │
                     ▼               ▼
                ┌─────────────────────────┐
                │    CASCADE FAILURE!     │
                │    All services down    │
                └─────────────────────────┘

The Solution: Circuit Breaker

With Circuit Breaker:
┌─────────┐     ┌─────────────────┐     ┌─────────┐
│ Service │────▶│ Circuit Breaker │────▶│ Service │
│    A    │     │                 │     │    C    │
└─────────┘     └────────┬────────┘     └─────────┘
                         │
                         ▼
                    ┌──────────┐
                    │ Service  │
                    │   B is   │
                    │ failing! │
                    └────┬─────┘
                         │
                         ▼
                ┌─────────────────┐
                │  Fast failure   │
                │  (don't wait)   │
                └─────────────────┘

Circuit Breaker States

┌──────────────┐   failure threshold reached   ┌──────────────┐
│    CLOSED    │──────────────────────────────▶│     OPEN     │
│   (normal    │                               │   (reject    │
│  operation)  │                               │  requests)   │
└──────────────┘                               └──────────────┘
        ▲                                         │        ▲
        │ success                         timeout │        │ failure
        │                                         ▼        │
        │                                      ┌──────────────┐
        └──────────────────────────────────────┤  HALF-OPEN   │
                                               │  (testing)   │
                                               └──────────────┘
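
The state machine can also be read as a small lookup table, which may make the legal transitions easier to check at a glance. The sketch below is illustrative only: the `Event` names are made up for this example and do not appear in the implementation that follows.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class Event(Enum):
    # Hypothetical event names, chosen for this sketch
    FAILURE_THRESHOLD_REACHED = "failure_threshold_reached"
    TIMEOUT_ELAPSED = "timeout_elapsed"
    TRIAL_SUCCEEDED = "trial_succeeded"
    TRIAL_FAILED = "trial_failed"


# Every (state, event) pair that changes the state
TRANSITIONS = {
    (State.CLOSED, Event.FAILURE_THRESHOLD_REACHED): State.OPEN,
    (State.OPEN, Event.TIMEOUT_ELAPSED): State.HALF_OPEN,
    (State.HALF_OPEN, Event.TRIAL_SUCCEEDED): State.CLOSED,
    (State.HALF_OPEN, Event.TRIAL_FAILED): State.OPEN,
}


def next_state(state: State, event: Event) -> State:
    """Return the new state, or the current one if the event does not apply."""
    return TRANSITIONS.get((state, event), state)
```

Any pair missing from the table leaves the state unchanged, so a timeout event has no effect on a closed breaker.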

Implementation

Python Implementation

import time
from enum import Enum
from typing import Callable, Any
import threading

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        success_threshold: int = 2,
        timeout: float = 60.0,
        excluded_exceptions: tuple = ()
    ):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout = timeout
        self.excluded_exceptions = excluded_exceptions
        
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self._lock = threading.Lock()
    
    def call(self, func: Callable, *args, **kwargs) -> Any:
        if not self._can_execute():
            raise CircuitBreakerOpenError(
                f"Circuit breaker is {self.state.value}"
            )
        
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception as e:
            if self._should_handle(e):
                self._on_failure()
            raise  # re-raise the original exception unchanged
    
    def _can_execute(self) -> bool:
        with self._lock:
            if self.state == CircuitState.CLOSED:
                return True
            
            if self.state == CircuitState.OPEN:
                # Check if timeout has passed
                if time.time() - self.last_failure_time >= self.timeout:
                    self._transition_to_half_open()
                    return True
                return False
            
            # HALF_OPEN - let trial requests through
            # (this simple version does not cap how many run at once)
            return True
    
    def _should_handle(self, exception: Exception) -> bool:
        # Don't count excluded exceptions
        return not isinstance(exception, self.excluded_exceptions)
    
    def _on_success(self):
        with self._lock:
            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.success_threshold:
                    self._transition_to_closed()
            else:
                # Reset failure count on success
                self.failure_count = 0
    
    def _on_failure(self):
        with self._lock:
            self.failure_count += 1
            self.last_failure_time = time.time()
            
            if self.state == CircuitState.HALF_OPEN:
                self._transition_to_open()
            elif self.failure_count >= self.failure_threshold:
                self._transition_to_open()
    
    def _transition_to_open(self):
        self.state = CircuitState.OPEN
        print(f"Circuit breaker opened after {self.failure_count} failures")
    
    def _transition_to_half_open(self):
        self.state = CircuitState.HALF_OPEN
        self.success_count = 0
        print("Circuit breaker half-open, testing...")
    
    def _transition_to_closed(self):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        print("Circuit breaker closed, back to normal")


class CircuitBreakerOpenError(Exception):
    pass


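# Usage
import requests

breaker = CircuitBreaker(failure_threshold=3, timeout=30)

def call_external_service():
    # A request timeout keeps a hung service from blocking the caller
    response = requests.get("https://api.example.com/data", timeout=5)
    response.raise_for_status()  # count HTTP error responses as failures
    return response.json()

def get_data():
    # Wrapped call
    try:
        return breaker.call(call_external_service)
    except CircuitBreakerOpenError:
        # Handle gracefully - return cached data or default
        # (get_cached_data is a placeholder for your own cache lookup)
        return get_cached_data()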
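
In the class above, `_can_execute` lets every caller through once the breaker is half-open, so a burst of traffic can land on a service that is still recovering. One possible way to cap trial calls is a small permit counter. The `HalfOpenGate` class below is a hypothetical helper written for this sketch and is not part of the implementation above.

```python
import threading


class HalfOpenGate:
    """Caps how many trial calls may be in flight while a breaker is half-open."""

    def __init__(self, max_trial_calls: int = 1):
        self._max_trial_calls = max_trial_calls
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Take a trial permit if one is free; never blocks."""
        with self._lock:
            if self._in_flight >= self._max_trial_calls:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Give a permit back once the trial call has finished."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1
```

To wire it in, the half-open branch of `_can_execute` would return `gate.try_acquire()` instead of `True`, and `call` would release the permit in a `finally` block after a trial call completes.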

Decorator Version

from functools import wraps

def circuit_breaker(failure_threshold=5, timeout=60):
    def decorator(func):
        breaker = CircuitBreaker(
            failure_threshold=failure_threshold,
            timeout=timeout
        )
        
        @wraps(func)
        def wrapper(*args, **kwargs):
            return breaker.call(func, *args, **kwargs)
        
        # Expose breaker for inspection
        wrapper.circuit_breaker = breaker
        return wrapper
    return decorator


# Usage
@circuit_breaker(failure_threshold=3, timeout=30)
def fetch_user(user_id):
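    # requests needs an absolute URL; the host is the example domain used earlier
    return requests.get(f"https://api.example.com/users/{user_id}", timeout=5).json()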

# Check circuit state
if fetch_user.circuit_breaker.state == CircuitState.OPEN:
    print("Service is currently unavailable")

Integration with Libraries

Resilience4j (Java)

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;
import java.util.function.Supplier;

// Configure circuit breaker
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                    // Open at 50% failure
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .slidingWindowSize(10)
    .minimumNumberOfCalls(5)
    .permittedNumberOfCallsInHalfOpenState(3)
    .build();

CircuitBreaker circuitBreaker = CircuitBreaker.of("externalService", config);

// Decorate and call
Supplier<String> decoratedSupplier = CircuitBreaker
    .decorateSupplier(circuitBreaker, () -> externalApiCall());

String result = decoratedSupplier.get();
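
The configuration above trips on a failure rate over a window of recent calls, not on a run of consecutive failures as the Python class does. A rough Python illustration of a count-based window is shown below. It is not Resilience4j's implementation, and the `FailureRateWindow` name and its methods are invented for this sketch; only the three parameters mirror `slidingWindowSize`, `minimumNumberOfCalls`, and `failureRateThreshold`.

```python
from collections import deque


class FailureRateWindow:
    """Tracks the failure rate over the last `window_size` calls."""

    def __init__(
        self,
        window_size: int = 10,
        minimum_calls: int = 5,
        failure_rate_threshold: float = 50.0,  # percent
    ):
        # deque(maxlen=...) drops the oldest outcome automatically
        self._outcomes = deque(maxlen=window_size)  # True means the call failed
        self.minimum_calls = minimum_calls
        self.failure_rate_threshold = failure_rate_threshold

    def record(self, failed: bool) -> None:
        self._outcomes.append(failed)

    def failure_rate(self) -> float:
        """Failure rate in percent over the recorded window."""
        if not self._outcomes:
            return 0.0
        return 100.0 * sum(self._outcomes) / len(self._outcomes)

    def should_open(self) -> bool:
        # Too few samples: the rate is not meaningful yet
        if len(self._outcomes) < self.minimum_calls:
            return False
        return self.failure_rate() >= self.failure_rate_threshold
```

The minimum-calls guard is what keeps two early failures from tripping the breaker at a nominal 100% failure rate.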

Hystrix (Java)

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

class GetUserCommand extends HystrixCommand<User> {
    private final Long userId;
    
    public GetUserCommand(Long userId) {
        super(HystrixCommandGroupKey.Factory.asKey("UserService"));
        this.userId = userId;
    }
    
    @Override
    protected User run() throws Exception {
        return userService.getUser(userId);
    }
    
    @Override
    protected User getFallback() {
        return cache.get(userId);  // Return cached data
    }
}

// Usage
User user = new GetUserCommand(123L).execute();

Go (Golang)

import (
    "time"

    "github.com/sony/gobreaker"
)

var cb *gobreaker.CircuitBreaker

func init() {
    settings := gobreaker.Settings{
        Name:        "external-api",
        MaxRequests: 3,
        Interval:    10 * time.Second,
        Timeout:     30 * time.Second,
        ReadyToTrip: func(counts gobreaker.Counts) bool {
            failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
            return failureRatio >= 0.5
        },
    }
    cb = gobreaker.NewCircuitBreaker(settings)
}

func CallExternalAPI() (interface{}, error) {
    result, err := cb.Execute(func() (interface{}, error) {
        return externalService.Call()
    })
    return result, err
}

Fallback Strategies

import json
import time

def get_user_fallback(user_id, error):
    """Try fallback strategies in order until one yields a result."""
    # `redis`, `queue`, and get_stale_user_data are assumed to be set up elsewhere

    # Strategy 1: Return cached data
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)

    # Strategy 2: Return stale data
    stale = get_stale_user_data(user_id)
    if stale:
        return stale

    # Strategy 3: Queue for retry (the caller still gets an answer below)
    queue.push({"user_id": user_id, "retry_at": time.time() + 60})

    # Strategy 4: Return default user
    return {"id": user_id, "name": "Unknown User"}


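# Usage with fallback
# (assumes circuit_breaker is extended to accept a `fallback` argument and to
# call it with the original arguments plus the error; the decorator shown
# earlier only takes failure_threshold and timeout)
@circuit_breaker(fallback=get_user_fallback)
def get_user(user_id):
    return api.get_user(user_id)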
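
One way to get fallback behaviour without changing the breaker itself is a separate decorator that catches any exception and delegates to the fallback. The sketch below is an assumption about how that could look; `with_fallback` and `unknown_user` are names made up for this example.

```python
from functools import wraps


def with_fallback(fallback):
    """Call `fallback(*args, error=exc, **kwargs)` when the wrapped function raises."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as error:
                # The fallback receives the original arguments plus the error
                return fallback(*args, error=error, **kwargs)
        return wrapper
    return decorator


def unknown_user(user_id, error):
    # Simplest possible fallback: a placeholder record
    return {"id": user_id, "name": "Unknown User"}


@with_fallback(unknown_user)
def get_user(user_id):
    # Stand-in for a remote call that is currently failing
    raise ConnectionError("user service unavailable")
```

Stacked above a breaker decorator, the same wrapper would also catch `CircuitBreakerOpenError`, so rejected calls and failed calls end up in the same fallback.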

Monitoring and Observability

import prometheus_client as prom

# Metrics
circuit_requests = prom.Counter(
    'circuit_breaker_requests_total',
    'Total requests',
    ['name', 'state']
)

circuit_failures = prom.Counter(
    'circuit_breaker_failures_total',
    'Total failures',
    ['name', 'state']
)

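class InstrumentedCircuitBreaker(CircuitBreaker):
    def __init__(self, name: str, **kwargs):
        super().__init__(**kwargs)
        self.name = name  # label value for the metrics above

    def _record_request(self):
        # Count every completed call, successful or not
        circuit_requests.labels(
            name=self.name,
            state=self.state.value
        ).inc()

    def _on_failure(self):
        super()._on_failure()
        self._record_request()
        circuit_failures.labels(
            name=self.name,
            state=self.state.value
        ).inc()

    def _on_success(self):
        super()._on_success()
        self._record_request()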
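
Request and failure counters say how busy a breaker is, but the state transitions are often the more useful signal for alerting. Below is a minimal standard-library sketch of a transition recorder. `TransitionLog` is a hypothetical helper, and where it gets called from (for example the `_transition_to_*` methods) is left as an assumption.

```python
import logging
from collections import Counter

logger = logging.getLogger("circuit_breaker")


class TransitionLog:
    """Logs and counts state changes for one named breaker."""

    def __init__(self, name: str):
        self.name = name
        self.counts = Counter()  # keyed by (old_state, new_state)

    def record(self, old_state: str, new_state: str) -> None:
        if old_state == new_state:
            return  # not a transition, nothing to report
        self.counts[(old_state, new_state)] += 1
        logger.warning(
            "circuit %s changed state: %s -> %s",
            self.name, old_state, new_state,
        )
```

A growing `("closed", "open")` count for one dependency is usually a clearer alert condition than a raw failure total.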

Best Practices

Do’s

  1. Use appropriate thresholds - Test under load to find the right values
  2. Implement fallbacks - Always have a backup plan
  3. Monitor circuit state - Track state changes in your metrics
  4. Log state changes - Important for debugging
  5. Use with timeouts - Combine with request timeouts

Don’ts

  1. Don’t set thresholds too low - Transient failures will trip the breaker
  2. Don’t forget timeouts - Service might hang, not fail
  3. Don’t ignore excluded exceptions - Network timeouts ≠ business errors
  4. Don’t over-engineer - Not every call needs a circuit breaker
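
Both lists stress timeouts because a breaker only counts calls that actually fail, and a hung call never does. When a client library has no timeout option of its own, one possible workaround is to run the call in a worker thread and stop waiting after a deadline. The `call_with_timeout` helper below is a sketch written for this article, not a library function.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError

# Small shared pool; size it to the expected concurrency
_executor = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, timeout_seconds, *args, **kwargs):
    """Run func in a worker thread and raise TimeoutError if it takes too long."""
    future = _executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeoutError:
        # cancel() only helps if the task has not started yet;
        # a call that is already running keeps its worker thread busy
        future.cancel()
        raise TimeoutError(f"call exceeded {timeout_seconds}s") from None
```

Because the worker thread is not interrupted, this only bounds how long the caller waits. For HTTP calls, the client's own timeout (such as the `timeout` argument of `requests.get`) is the better first choice.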

Comparison with Other Patterns

Pattern          Purpose                      Use Case
---------------  ---------------------------  -------------------------
Circuit Breaker  Prevent cascading failures   External API calls
Bulkhead         Isolate resources            Thread pools, connections
Retry            Handle transient failures    Network calls
Timeout          Limit wait time              All remote calls
Rate Limiter     Prevent overload             Public APIs
