Circuit Breaker Pattern: Fault Isolation for Distributed Systems

Created: February 26, 2026 Larry Qu 6 min read

When a downstream service fails, callers that keep hammering it waste resources and risk cascading the failure upstream. The circuit breaker pattern prevents this by stopping requests to a failing service once it crosses a failure threshold, giving it time to recover. This guide covers circuit breaker implementation, best practices, and integration with other resilience patterns.

The Problem: Cascading Failures

Without Circuit Breaker:
┌─────────┐     ┌─────────┐     ┌─────────┐
│ Service │────▶│ Service │────▶│ Service │
│    A    │     │    B    │     │    C    │
└─────────┘     └─────────┘     └─────────┘
                      │               │
                      ▼               ▼
                   Failed!         Overwhelmed!
                      │               │
                      ▼               ▼
                 ┌─────────────────────────┐
                 │   CASCADE FAILURE!     │
                 │   All services down    │
                 └─────────────────────────┘

The Solution: Circuit Breaker

With Circuit Breaker:
┌─────────┐     ┌──────────────┐     ┌─────────┐
│ Service │────▶│Circuit Breaker│────▶│ Service │
│    A    │     │              │     │    C    │
└─────────┘     └──────┬───────┘     └─────────┘
                   ┌──────────┐
                   │ Service  │
                   │   B is   │
                   │ failing! │
                   └────┬─────┘
                   ┌─────────────────┐
                   │  Fast failure   │
                   │  (don't wait)   │
                   └─────────────────┘

Circuit Breaker States

                 failures >= threshold
   ┌──────────┐ ───────────────────────▶ ┌──────────┐
   │  CLOSED  │                          │   OPEN   │
   │ (normal  │                          │ (reject  │
   │operation)│                          │ requests)│
   └──────────┘                          └──────────┘
        ▲                                  │      ▲
        │                          timeout │      │ any failure
        │ successes >= threshold   elapsed ▼      │
        │                          ┌─────────────────┐
        └──────────────────────────│    HALF-OPEN    │
                                   │    (testing)    │
                                   └─────────────────┘
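
The transitions in the diagram can also be written as a small lookup table. The sketch below is only an illustration; the event names (`threshold_reached`, `timeout_elapsed`, `probes_succeeded`, `probe_failed`) are assumptions chosen for this example and do not come from any library.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


# (current state, event) -> next state. Event names are illustrative.
TRANSITIONS = {
    (State.CLOSED, "threshold_reached"): State.OPEN,
    (State.OPEN, "timeout_elapsed"): State.HALF_OPEN,
    (State.HALF_OPEN, "probes_succeeded"): State.CLOSED,
    (State.HALF_OPEN, "probe_failed"): State.OPEN,
}


def next_state(state: State, event: str) -> State:
    """Return the next state, or the current one if the event does not apply."""
    return TRANSITIONS.get((state, event), state)
```

Any pair that is missing from the table leaves the state unchanged, which is why a success while the circuit is open has no effect.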

Implementation

Python Implementation

import time
from enum import Enum
from typing import Callable, Any
import threading

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        success_threshold: int = 2,
        timeout: float = 60.0,
        excluded_exceptions: tuple = ()
    ):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout = timeout
        self.excluded_exceptions = excluded_exceptions
        
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self._lock = threading.Lock()
    
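    def call(self, func: Callable, *args, **kwargs) -> Any:
        """Run func through the breaker, failing fast while the circuit is open."""
        if not self._can_execute():
            raise CircuitBreakerOpenError(
                f"Circuit breaker is {self.state.value}"
            )

        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception as e:
            # Excluded exceptions propagate without counting toward the threshold
            if self._should_handle(e):
                self._on_failure()
            raise  # Bare raise re-raises the original exception unchanged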
    
    def _can_execute(self) -> bool:
        with self._lock:
            if self.state == CircuitState.CLOSED:
                return True

            if self.state == CircuitState.OPEN:
                # After the timeout, let trial requests through
                # (monotonic clock: unaffected by system clock changes)
                if time.monotonic() - self.last_failure_time >= self.timeout:
                    self._transition_to_half_open()
                    return True
                return False

            # HALF_OPEN - trial requests pass; this simple version does not cap them
            return True

    def _should_handle(self, exception: Exception) -> bool:
        # Don't count excluded exceptions
        return not isinstance(exception, self.excluded_exceptions)

    def _on_success(self):
        with self._lock:
            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.success_threshold:
                    self._transition_to_closed()
            else:
                # Reset failure count on success
                self.failure_count = 0

    def _on_failure(self):
        with self._lock:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()

            if self.state == CircuitState.HALF_OPEN:
                # A failed trial request reopens the circuit immediately
                self._transition_to_open()
            elif self.failure_count >= self.failure_threshold:
                self._transition_to_open()
    
    def _transition_to_open(self):
        self.state = CircuitState.OPEN
        print(f"Circuit breaker opened after {self.failure_count} failures")
    
    def _transition_to_half_open(self):
        self.state = CircuitState.HALF_OPEN
        self.success_count = 0
        print("Circuit breaker half-open, testing...")
    
    def _transition_to_closed(self):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        print("Circuit breaker closed, back to normal")


class CircuitBreakerOpenError(Exception):
    pass


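# Usage
import requests

breaker = CircuitBreaker(failure_threshold=3, timeout=30)

def call_external_service():
    # A request timeout makes a hung service surface as a failure
    response = requests.get("https://api.example.com/data", timeout=5)
    response.raise_for_status()  # Raise on HTTP error responses so they count too
    return response.json()

# Wrapped call
try:
    result = breaker.call(call_external_service)
except CircuitBreakerOpenError:
    # Handle gracefully - return cached data or default
    result = get_cached_data()  # get_cached_data() is assumed to exist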
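
In the class above, every caller that arrives while the circuit is half-open is let through. Some libraries cap the number of trial calls instead. A minimal sketch of such a cap, assuming a hypothetical `HalfOpenGate` helper that is not part of the class above:

```python
import threading


class HalfOpenGate:
    """Caps how many trial calls may be in flight while half-open (hypothetical helper)."""

    def __init__(self, max_probes: int = 1):
        self.max_probes = max_probes
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Return True if the caller may send a trial request."""
        with self._lock:
            if self._in_flight >= self.max_probes:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Mark a trial request as finished, whether it succeeded or failed."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1
```

A breaker could call `try_acquire()` in its half-open branch, reject the call when it returns `False`, and call `release()` in a `finally` block once the trial request completes.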

Decorator Version

from functools import wraps

def circuit_breaker(failure_threshold=5, timeout=60):
    def decorator(func):
        breaker = CircuitBreaker(
            failure_threshold=failure_threshold,
            timeout=timeout
        )
        
        @wraps(func)
        def wrapper(*args, **kwargs):
            return breaker.call(func, *args, **kwargs)
        
        # Expose breaker for inspection
        wrapper.circuit_breaker = breaker
        return wrapper
    return decorator


# Usage
@circuit_breaker(failure_threshold=3, timeout=30)
def fetch_user(user_id):
    # requests needs an absolute URL; the host here is a placeholder
    return requests.get(f"https://api.example.com/users/{user_id}", timeout=5).json()

# Check circuit state
if fetch_user.circuit_breaker.state == CircuitState.OPEN:
    print("Service is currently unavailable")

Integration with Libraries

Resilience4j (Java)

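import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;
import java.util.function.Supplier;

// Configure circuit breaker
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                         // Open at 50% failure rate
    .waitDurationInOpenState(Duration.ofSeconds(30))  // Stay open for 30s before probing
    .slidingWindowSize(10)                            // Judge the last 10 calls
    .minimumNumberOfCalls(5)                          // Need 5 calls before computing the rate
    .permittedNumberOfCallsInHalfOpenState(3)         // Allow 3 trial calls when half-open
    .build();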

CircuitBreaker circuitBreaker = CircuitBreaker.of("externalService", config);

// Decorate and call
Supplier<String> decoratedSupplier = CircuitBreaker
    .decorateSupplier(circuitBreaker, () -> externalApiCall());

String result = decoratedSupplier.get();
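
To make the configuration above more concrete, here is a rough Python sketch of a count-based sliding window that signals when a failure-rate threshold is crossed. It illustrates the idea only and is not Resilience4j's implementation; the class name `SlidingWindowRate` and its parameters are assumptions that mirror the option names above.

```python
from collections import deque


class SlidingWindowRate:
    """Tracks the outcomes of the last `window_size` calls (illustrative only)."""

    def __init__(self, window_size: int = 10, minimum_calls: int = 5,
                 failure_rate_threshold: float = 50.0):
        self.minimum_calls = minimum_calls
        self.failure_rate_threshold = failure_rate_threshold
        # A deque with maxlen drops the oldest outcome automatically
        self._outcomes = deque(maxlen=window_size)

    def record(self, success: bool) -> None:
        self._outcomes.append(success)

    def failure_rate(self) -> float:
        """Percentage of failed calls in the window, 0.0 if the window is empty."""
        if not self._outcomes:
            return 0.0
        failures = sum(1 for ok in self._outcomes if not ok)
        return 100.0 * failures / len(self._outcomes)

    def should_open(self) -> bool:
        # Do not judge the service until enough calls have been recorded
        if len(self._outcomes) < self.minimum_calls:
            return False
        return self.failure_rate() >= self.failure_rate_threshold
```

Compared with the consecutive-failure counter in the Python class earlier, a rate over a window tolerates occasional successes mixed into a mostly failing stream.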

Hystrix (Java)

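// Note: Hystrix is in maintenance mode; Netflix recommends Resilience4j for new projects
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

class GetUserCommand extends HystrixCommand<User> {
    private final Long userId;

    public GetUserCommand(Long userId) {
        super(HystrixCommandGroupKey.Factory.asKey("UserService"));
        this.userId = userId;
    }

    @Override
    protected User run() throws Exception {
        // userService and cache are assumed to be fields supplied by the application
        return userService.getUser(userId);
    }

    @Override
    protected User getFallback() {
        return cache.get(userId);  // Return cached data
    }
}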

// Usage
User user = new GetUserCommand(123L).execute();

Go (Golang)

import (
    "time"

    "github.com/sony/gobreaker"
)

var cb *gobreaker.CircuitBreaker

func init() {
    settings := gobreaker.Settings{
        Name:        "external-api",
        MaxRequests: 3,
        Interval:    10 * time.Second,
        Timeout:     30 * time.Second,
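        ReadyToTrip: func(counts gobreaker.Counts) bool {
            // Require a few requests first so a single failure cannot trip the
            // circuit (3 is an illustrative minimum, tune it for your traffic)
            failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
            return counts.Requests >= 3 && failureRatio >= 0.5
        },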
    }
    cb = gobreaker.NewCircuitBreaker(settings)
}

func CallExternalAPI() (interface{}, error) {
    result, err := cb.Execute(func() (interface{}, error) {
        return externalService.Call()
    })
    return result, err
}

Fallback Strategies

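import json
import time

# redis, queue, api and get_stale_user_data are assumed to be provided by the application

def get_user_fallback(user_id, error):
    """Try fallback strategies in order, from most to least useful"""

    # Strategy 1: Return cached data
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)

    # Strategy 2: Return stale data, if any exists
    stale = get_stale_user_data(user_id)
    if stale is not None:
        return stale

    # Strategy 3: Queue for retry
    queue.push({"user_id": user_id, "retry_at": time.time() + 60})

    # Strategy 4: Return default user so the caller can still respond
    return {"id": user_id, "name": "Unknown User"}


# Usage with fallback
@circuit_breaker(failure_threshold=3, timeout=30)
def get_user(user_id):
    return api.get_user(user_id)

def get_user_or_fallback(user_id):
    try:
        return get_user(user_id)
    except Exception as error:  # Includes CircuitBreakerOpenError
        return get_user_fallback(user_id, error)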
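
Fallback handling can also be packaged as its own decorator. The following is a minimal sketch, assuming a hypothetical `with_fallback` helper that calls the fallback with the original arguments plus the error, which matches the `get_user_fallback(user_id, error)` signature. The `default_user` and `load_user` functions exist only to demonstrate it.

```python
from functools import wraps


def with_fallback(fallback):
    """Route any exception from the wrapped function to `fallback` (hypothetical helper).

    Stack it above a circuit breaker decorator so that calls rejected by an
    open circuit are also handled by the fallback.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as error:
                # Pass the original arguments, then the error
                return fallback(*args, error, **kwargs)
        return wrapper
    return decorator


def default_user(user_id, error):
    return {"id": user_id, "name": "Unknown User"}


@with_fallback(default_user)
def load_user(user_id):
    # Stands in for a remote call that is currently failing
    raise ConnectionError("service unavailable")
```

Catching every `Exception` keeps the sketch short, but it can also hide programming errors. In real code it is usually safer to catch only the error types you expect.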

Monitoring and Observability

import prometheus_client as prom

# Metrics
circuit_requests = prom.Counter(
    'circuit_breaker_requests_total',
    'Total requests',
    ['name', 'state']
)

circuit_failures = prom.Counter(
    'circuit_breaker_failures_total',
    'Total failures',
    ['name', 'state']
)

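class InstrumentedCircuitBreaker(CircuitBreaker):
    def __init__(self, name: str, **kwargs):
        super().__init__(**kwargs)
        # The base class has no name, so store one to use as a metric label
        self.name = name

    def call(self, func, *args, **kwargs):
        # Count every attempt, including calls rejected while the circuit is open
        circuit_requests.labels(
            name=self.name,
            state=self.state.value
        ).inc()
        return super().call(func, *args, **kwargs)

    def _on_failure(self):
        super()._on_failure()
        circuit_failures.labels(
            name=self.name,
            state=self.state.value
        ).inc()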

Best Practices

Do’s

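  1. Use appropriate thresholds - Load-test to find values that trip on real outages but not on brief blips
  2. Implement fallbacks - Serve cached data, a default, or a clear error while the circuit is open
  3. Monitor circuit state - Track state changes in your metrics
  4. Log state changes - Transition logs show when and why a circuit opened
  5. Use with timeouts - Combine with request timeouts so slow calls register as failures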

Don’ts

  1. Don’t set thresholds too low - Transient failures will trip the circuit unnecessarily
  2. Don’t forget timeouts - A service that hangs never raises an error, so the breaker never sees a failure
  3. Don’t count business errors as failures - Exclude them from the breaker; network timeouts ≠ business errors
  4. Don’t over-engineer - Not every call needs a circuit breaker
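
The point about timeouts can be illustrated with a small helper that stops waiting on a call after a deadline, so that a hung service surfaces as an exception the breaker can count. This is a sketch under stated assumptions: `call_with_deadline` and `slow_call` are hypothetical names, and for HTTP calls the client's own timeout (for example `requests.get(url, timeout=5)`) is usually the better tool.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_deadline(func, seconds, *args, **kwargs):
    """Run func in a worker thread and give up waiting after `seconds`."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=seconds)
    except FutureTimeout:
        # The worker thread is not cancelled; it keeps running in the background
        raise TimeoutError(f"call exceeded {seconds}s") from None
    finally:
        executor.shutdown(wait=False)


def slow_call():
    time.sleep(0.5)  # Stands in for a service that hangs
    return "late"
```

Wrapping the protected function this way means a call that exceeds the deadline raises `TimeoutError`, which a breaker would then record like any other failure.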

Comparison with Other Patterns

| Pattern         | Purpose                    | Use Case                  |
|-----------------|----------------------------|---------------------------|
| Circuit Breaker | Prevent cascading failures | External API calls        |
| Bulkhead        | Isolate resources          | Thread pools, connections |
| Retry           | Handle transient failures  | Network calls             |
| Timeout         | Limit wait time            | All remote calls          |
| Rate Limiter    | Prevent overload           | Public APIs               |

Conclusion

The circuit breaker is a fundamental resilience pattern that prevents cascading failures in distributed systems. Start with conservative thresholds (5-10 failures, 30-60 second timeout) and tune based on observed behavior in production. Combine circuit breakers with retries (with exponential backoff), bulkheads, and timeouts for comprehensive fault isolation. Monitor circuit breaker state closely: frequent openings indicate underlying issues that should be addressed rather than just contained.
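
As one way to combine retries with a circuit breaker, the sketch below retries transient errors with exponential backoff but gives up immediately when the circuit is open. The names `CircuitOpenError` and `retry_with_backoff` are assumptions for this example; in practice the exception would be whatever your breaker raises when it rejects a call.

```python
import time


class CircuitOpenError(Exception):
    """Stand-in for the breaker's 'circuit is open' error (name is illustrative)."""


def retry_with_backoff(func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry `func` with exponential backoff, but never retry an open circuit."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            # The breaker is already shedding load; retrying would defeat it
            raise
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of attempts, surface the last error
            sleep(base_delay * (2 ** attempt))  # Delay doubles after each attempt
```

The `sleep` parameter is injectable so the backoff schedule can be tested without waiting. Here `func` would be the breaker-wrapped call, so each retry still passes through the breaker and counts toward its threshold.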
