
Retry Pattern with Exponential Backoff: Resilient Error Recovery

Published: February 28, 2026 Updated: May 11, 2026 Larry Qu 10 min read

The Retry Pattern is a fundamental resilience pattern that handles transient failures by automatically retrying failed operations. When combined with exponential backoff and jitter, it becomes a powerful tool for building robust distributed systems.

When to Retry and When to Fail

Not every failure should be retried. The key distinction is whether the failure is transient or permanent. Transient failures (network timeouts, database deadlocks, 503 Service Unavailable, 429 Too Many Requests) are worth retrying because the underlying condition may resolve on its own. Permanent failures (400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found, 422 validation errors) should not be retried, because the same request will fail in the same way.
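
As a minimal sketch of that distinction, the status codes named above can be encoded in a small lookup. The names `classify_status`, `TRANSIENT_STATUSES`, and `PERMANENT_STATUSES` are hypothetical, and how to treat statuses the paragraph does not mention is left to the caller.

```python
# Hypothetical helper: the two sets mirror the status codes listed in the text.
TRANSIENT_STATUSES = frozenset({429, 503})
PERMANENT_STATUSES = frozenset({400, 401, 403, 404, 422})


def classify_status(status: int) -> str:
    """Return "retry", "fail", or "unknown" for an HTTP status code."""
    if status in TRANSIENT_STATUSES:
        return "retry"
    if status in PERMANENT_STATUSES:
        return "fail"
    # Not covered by the lists above; the caller decides on a policy.
    return "unknown"


for code in (503, 404, 500):
    print(code, classify_status(code))
```

Returning "unknown" rather than guessing keeps the policy decision explicit for codes such as 500, 502, or 504.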

Exponential backoff is the standard approach: first retry after 100ms, then 200ms, 400ms, 800ms, and so on. Adding jitter prevents the thundering herd problem, where all retrying clients synchronize and hit the server simultaneously. Implement a retry budget that limits both total retry time (e.g., max 30 seconds) and the number of retries (e.g., 3-5 attempts). Integrate with a circuit breaker: if retries keep failing, open the circuit and fail fast. The most dangerous pattern is infinite retries with no backoff — this is a self-inflicted DDoS. Real-world SDKs such as the AWS SDKs implement exponential backoff with jitter by default.
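
The schedule and the retry budget described above can be sketched together. This is an illustration under stated assumptions, not a reference implementation: `backoff_schedule` and its parameters are hypothetical names, `max_attempts` counts the initial call, and jitter is applied as a uniform draw between zero and the exponential delay.

```python
import random
from typing import List, Optional


def backoff_schedule(
    base: float = 0.1,
    factor: float = 2.0,
    max_attempts: int = 5,
    max_total: float = 30.0,
    jitter: bool = True,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the waits (in seconds) between attempts, within both budgets."""
    rng = rng or random.Random()
    delays: List[float] = []
    total = 0.0
    # max_attempts includes the first call, so there are at most
    # max_attempts - 1 waits.
    for retry in range(max_attempts - 1):
        delay = base * factor ** retry
        if jitter:
            # Spread clients out: pick a point in [0, delay].
            delay = rng.uniform(0, delay)
        # Stop scheduling once the total-time budget would be exceeded.
        if total + delay > max_total:
            break
        delays.append(delay)
        total += delay
    return delays


print(backoff_schedule(jitter=False))
print(backoff_schedule(base=1.0, max_attempts=10, max_total=10.0, jitter=False))
```

With jitter disabled, the first call reproduces the 100ms, 200ms, 400ms, 800ms sequence from the text; the second call is cut short by the time budget before the attempt limit is reached.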

Understanding the Retry Pattern

Why Retries Matter

┌─────────────────────────────────────────────────────────────────┐
│              Transient Failures Are Common                       │
│                                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    │
│  │  Network     │    │  Service     │    │  Database    │    │
│  │  Timeout     │    │  Restart     │    │  Lock Wait   │    │
│  └──────────────┘    └──────────────┘    └──────────────┘    │
│        │                  │                   │                │
│        ▼                  ▼                   ▼                │
│   Usually recover quickly with retry!                          │
│                                                                 │
│  Statistics:                                                   │
│  - 60% of failures are transient                               │
│  - 90% succeed on retry within 3 attempts                      │
│  - Exponential backoff reduces load by 99%                      │
└─────────────────────────────────────────────────────────────────┘

Without Retry Pattern

┌─────────────────────────────────────────────────────────────────┐
│              Immediate Failure (No Retry)                         │
│                                                                 │
│  Request ──► ✗ Connection timeout                               │
│                │                                                │
│                ▼                                                │
│         ┌─────────────┐                                         │
│         │   ERROR!    │  User sees failure immediately          │
│         │  User upset │                                         │
│         └─────────────┘                                         │
│                                                                 │
│  ✗ Wasted opportunity                                          │
│  ✗ Poor user experience                                         │
│  ✗ No recovery attempt                                          │
└─────────────────────────────────────────────────────────────────┘

With Retry Pattern

┌─────────────────────────────────────────────────────────────────┐
│              Retry with Exponential Backoff                      │
│                                                                 │
│  Request ──► ✗ Connection timeout                               │
│                │                                                │
│                ▼                                                │
│         Wait 100ms ──► ✗ Still timeout                         │
│                           │                                     │
│                           ▼                                     │
│                 Wait 200ms ──► ✗ Still timeout                 │
│                                   │                             │
│                                   ▼                             │
│                         Wait 400ms ──► ✓ Success!               │
│                                                            │    │
│                                                            ▼    │
│                                                     ┌──────────┐│
│                                                     │ SUCCESS! ││
│                                                     └──────────┘│
│                                                                 │
│  ✓ Automatic recovery                                         │
│  ✓ Better user experience                                     │
│  ✓ Reduced load from retry storms                              │
└─────────────────────────────────────────────────────────────────┘

Implementation

Basic Retry with Exponential Backoff

import asyncio
import random
import time
from functools import wraps
from typing import Callable, Optional, Tuple, Type


class RetryConfig:
    """Settings shared by the retry helpers below (all delays in seconds)."""

    def __init__(
        self,
        max_attempts: int = 3,
        base_delay: float = 0.1,
        max_delay: float = 30.0,
        exponential_base: float = 2.0,
        jitter: bool = True
    ):
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.exponential_base = exponential_base
        self.jitter = jitter


def calculate_delay(attempt: int, config: RetryConfig) -> float:
    """Return the wait before the next try; `attempt` is zero-based."""
    delay = config.base_delay * (config.exponential_base ** attempt)
    delay = min(delay, config.max_delay)

    if config.jitter:
        # Scale by a random factor in [0.5, 1.5) so clients do not retry in
        # lockstep. The jittered value can exceed max_delay by up to 50%.
        delay = delay * (0.5 + random.random())

    return delay


async def retry_async(
    func: Callable,
    *args,
    config: Optional[RetryConfig] = None,
    exceptions: Tuple[Type[Exception], ...] = (Exception,),
    **kwargs
):
    """Await func(*args, **kwargs), retrying on `exceptions` with backoff."""
    config = config or RetryConfig()

    for attempt in range(config.max_attempts):
        try:
            return await func(*args, **kwargs)

        except exceptions:
            # Out of attempts: re-raise the original error with its traceback.
            if attempt >= config.max_attempts - 1:
                raise
            await asyncio.sleep(calculate_delay(attempt, config))

    # Reached only when max_attempts < 1, so no call was ever made.
    raise ValueError("max_attempts must be at least 1")

def retry_decorator(config: RetryConfig = None):
    """Decorator form; picks the async or sync wrapper based on `func`."""
    # Resolve the default once, under a new name. Rebinding `config` inside a
    # wrapper would make it a local variable there and raise UnboundLocalError.
    resolved = config or RetryConfig()

    def decorator(func: Callable):
        @wraps(func)
        async def async_wrapper(*args, **kwargs):
            return await retry_async(func, *args, config=resolved, **kwargs)

        @wraps(func)
        def sync_wrapper(*args, **kwargs):
            for attempt in range(resolved.max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    # Last attempt: let the original error propagate.
                    if attempt >= resolved.max_attempts - 1:
                        raise
                    time.sleep(calculate_delay(attempt, resolved))

        if asyncio.iscoroutinefunction(func):
            return async_wrapper
        return sync_wrapper

    return decorator
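
To see the retry loop in action without a real network, the sketch below drives a deliberately flaky coroutine. It is self-contained, so it uses a compact stand-in named `retry_call` rather than the helpers above; `retry_call`, `make_flaky`, and the tiny 10ms base delay are assumptions made for the demonstration.

```python
import asyncio
import random


async def retry_call(func, *, max_attempts=3, base_delay=0.01,
                     retry_on=(ConnectionError,)):
    """Compact stand-in: retry only the exception types listed in retry_on."""
    for attempt in range(max_attempts):
        try:
            return await func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            # Random wait inside an exponentially growing window.
            await asyncio.sleep(random.uniform(0, base_delay * 2 ** attempt))


def make_flaky(failures, error):
    """Build a coroutine function that raises `error` on its first calls."""
    calls = {"count": 0}

    async def flaky():
        calls["count"] += 1
        if calls["count"] <= failures:
            raise error
        return "ok"

    return flaky, calls


# Transient case: two simulated connection errors, then success.
flaky, transient_calls = make_flaky(2, ConnectionError("simulated outage"))
result = asyncio.run(retry_call(flaky))

# Permanent case: ValueError is not in retry_on, so it surfaces immediately.
broken, permanent_calls = make_flaky(2, ValueError("bad input"))
try:
    asyncio.run(retry_call(broken))
except ValueError:
    pass

print(result, transient_calls["count"], permanent_calls["count"])
```

The transient case needs three calls to succeed, while the permanent case stops after a single call because its exception type is not in the retry list.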

Configurable Retry Strategies

class RetryStrategy:
    def get_delay(self, attempt: int) -> float:
        raise NotImplementedError


class ExponentialBackoff(RetryStrategy):
    def __init__(
        self, 
        base: float = 0.1, 
        max_delay: float = 30.0,
        multiplier: float = 2.0
    ):
        self.base = base
        self.max_delay = max_delay
        self.multiplier = multiplier
    
    def get_delay(self, attempt: int) -> float:
        delay = self.base * (self.multiplier ** attempt)
        return min(delay, self.max_delay)


class LinearBackoff(RetryStrategy):
    def __init__(self, base: float = 0.1, increment: float = 0.1):
        self.base = base
        self.increment = increment
    
    def get_delay(self, attempt: int) -> float:
        return self.base + (attempt * self.increment)


class ConstantBackoff(RetryStrategy):
    def __init__(self, delay: float = 1.0):
        self.delay = delay
    
    def get_delay(self, attempt: int) -> float:
        return self.delay


class FibonacciBackoff(RetryStrategy):
    def __init__(self, multiplier: float = 1.0):
        self.multiplier = multiplier
        self._cache = {0: 1, 1: 1}
    
    def _fib(self, n: int) -> float:
        if n in self._cache:
            return self._cache[n]
        self._cache[n] = self._fib(n-1) + self._fib(n-2)
        return self._cache[n]
    
    def get_delay(self, attempt: int) -> float:
        return self._fib(attempt) * self.multiplier
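
To compare how quickly each strategy grows, the same formulas can be written as plain functions and tabulated. This is a sketch that mirrors the classes above with their default parameters; the function names are hypothetical, and the Fibonacci version is iterative instead of cached.

```python
def exponential(attempt, base=0.1, multiplier=2.0, max_delay=30.0):
    """Same formula as ExponentialBackoff.get_delay."""
    return min(base * multiplier ** attempt, max_delay)


def linear(attempt, base=0.1, increment=0.1):
    """Same formula as LinearBackoff.get_delay."""
    return base + attempt * increment


def fibonacci(attempt, multiplier=1.0):
    """Same sequence as FibonacciBackoff: 1, 1, 2, 3, 5, 8, ..."""
    a, b = 1, 1
    for _ in range(attempt):
        a, b = b, a + b
    return a * multiplier


for attempt in range(6):
    print(attempt,
          round(exponential(attempt), 3),
          round(linear(attempt), 3),
          fibonacci(attempt))
```

Only the exponential strategy above enforces a cap. The linear and Fibonacci variants grow without bound, so a caller would likely want to clamp them as well.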

Jitter Strategies

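import random


class Jitter:
    """Jitter variants; each takes the capped backoff delay in seconds."""

    @staticmethod
    def no_jitter(delay: float) -> float:
        return delay

    @staticmethod
    def full_jitter(delay: float) -> float:
        # Uniform in [0, delay): widest spread, lowest minimum wait.
        return delay * random.random()

    @staticmethod
    def equal_jitter(delay: float) -> float:
        # Half fixed, half random: uniform in [delay/2, delay).
        return delay / 2 + (delay / 2) * random.random()

    @staticmethod
    def decorrelated_jitter(
        delay: float, last_delay: float = None, max_delay: float = 30.0
    ) -> float:
        # Grows from the previous delay rather than from the attempt number.
        if last_delay is None:
            last_delay = delay
        new_delay = last_delay * random.uniform(1.3, 2.0)
        return min(new_delay, max_delay)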


class ExponentialBackoffWithJitter(ExponentialBackoff):
    def __init__(self, base: float = 0.1, max_delay: float = 30.0, jitter_type: str = "full"):
        super().__init__(base, max_delay)
        self.jitter_type = jitter_type
        self.last_delay = base
    
    def get_delay(self, attempt: int) -> float:
        delay = self.base * (2 ** attempt)
        delay = min(delay, self.max_delay)
        
        if self.jitter_type == "full":
            delay = Jitter.full_jitter(delay)
        elif self.jitter_type == "equal":
            delay = Jitter.equal_jitter(delay)
        elif self.jitter_type == "decorrelated":
            delay = Jitter.decorrelated_jitter(delay, self.last_delay)
            self.last_delay = delay
        elif self.jitter_type == "none":
            delay = Jitter.no_jitter(delay)
        
        return delay
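
The decorrelated variant above multiplies the previous delay by a factor between 1.3 and 2.0. For comparison, the formulation popularized by the AWS Architecture Blog draws uniformly between the base delay and three times the previous delay, then applies the cap. The sketch below shows that alternative; the function name, defaults, and the optional `rng` parameter are assumptions.

```python
import random


def aws_style_decorrelated_jitter(previous_delay, base=0.1, cap=30.0, rng=None):
    """sleep = min(cap, uniform(base, previous_delay * 3))."""
    rng = rng or random
    return min(cap, rng.uniform(base, previous_delay * 3))


# Feed each delay back in as the next call's previous_delay.
delay = 0.1
history = []
for _ in range(8):
    delay = aws_style_decorrelated_jitter(delay)
    history.append(round(delay, 3))
print(history)
```

Because the lower bound is always the base delay, the sequence can shrink as well as grow, which is what decorrelates successive waits from the attempt count.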

Handling Different Failure Types

Transient vs Permanent Errors

class RetryableError(Exception):
    """Transient errors that should be retried."""
    pass

class NonRetryableError(Exception):
    """Permanent errors that should not be retried."""
    pass

class ServiceUnavailableError(RetryableError):
    pass

# Named RequestTimeoutError so it does not shadow Python's built-in TimeoutError.
class RequestTimeoutError(RetryableError):
    pass

class RateLimitError(RetryableError):
    def __init__(self, retry_after: float = 60.0):
        self.retry_after = retry_after
        super().__init__(f"Rate limited, retry after {retry_after}s")

class ValidationError(NonRetryableError):
    pass

class AuthenticationError(NonRetryableError):
    pass

class NotFoundError(NonRetryableError):
    pass


class SmartRetryHandler:
    def __init__(self, config: RetryConfig):
        self.config = config

    def should_retry(self, exception: Exception, attempt: int) -> Tuple[bool, float]:
        """Return (should_retry, delay_seconds); `attempt` is zero-based."""
        # Attempt N has just failed the (N+1)-th try, so stop after the last one.
        if attempt >= self.config.max_attempts - 1:
            return False, 0.0

        if isinstance(exception, NonRetryableError):
            return False, 0.0

        # Checked before the generic branch so the server's hint wins.
        if isinstance(exception, RateLimitError):
            return True, exception.retry_after

        # Application-defined transient errors plus the built-in network errors.
        if isinstance(exception, (RetryableError, TimeoutError, ConnectionError)):
            return True, calculate_delay(attempt, self.config)

        # Unknown errors are treated as permanent rather than retried blindly.
        return False, 0.0

HTTP-Specific Retry Logic

import asyncio
from typing import Optional, Tuple

import aiohttp


class HTTPRetryConfig:
    def __init__(
        self,
        max_attempts: int = 3,
        retry_on_status: Tuple[int, ...] = (429, 500, 502, 503, 504),
        retry_on_timeout: bool = True,
        **backoff_kwargs
    ):
        self.max_attempts = max_attempts
        self.retry_on_status = retry_on_status
        self.retry_on_timeout = retry_on_timeout
        # Remaining keyword arguments (base_delay, max_delay, ...) go to RetryConfig.
        self.backoff = RetryConfig(max_attempts=max_attempts, **backoff_kwargs)


async def fetch_with_retry(
    session: aiohttp.ClientSession,
    url: str,
    config: Optional[HTTPRetryConfig] = None,
    **kwargs
) -> aiohttp.ClientResponse:
    """GET `url`, retrying on the configured statuses, timeouts, and client errors.

    The response is returned unreleased so the body can still be read; the
    caller is responsible for releasing it, e.g. with `response.release()`.
    """
    config = config or HTTPRetryConfig()

    for attempt in range(config.max_attempts):
        is_last_attempt = attempt >= config.max_attempts - 1
        delay = calculate_delay(attempt, config.backoff)

        try:
            response = await session.get(url, **kwargs)
        except asyncio.TimeoutError:
            if not config.retry_on_timeout or is_last_attempt:
                raise
        except aiohttp.ClientError:
            if is_last_attempt:
                raise
        else:
            if response.status not in config.retry_on_status or is_last_attempt:
                return response

            # Prefer the server's hint. Retry-After may also be an HTTP date,
            # which float() rejects; in that case keep the computed backoff.
            try:
                delay = float(response.headers["Retry-After"])
            except (KeyError, ValueError):
                pass
            # Return the connection to the pool before waiting.
            response.release()

        await asyncio.sleep(delay)
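
The `Retry-After` header can carry either a number of seconds or an HTTP date, so a plain `float()` conversion handles only the first form. The sketch below parses both using the standard library; `parse_retry_after` and its `now` parameter (used to make the example deterministic) are hypothetical names.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: str, now: Optional[datetime] = None) -> Optional[float]:
    """Return the wait in seconds, or None if the header cannot be parsed."""
    value = value.strip()
    if value.isdigit():
        # Delay-seconds form, e.g. "120".
        return float(value)
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", never a negative wait.
    return max(0.0, (when - now).total_seconds())


reference = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
print(parse_retry_after("120"))
print(parse_retry_after("Wed, 21 Oct 2015 07:28:00 GMT", now=reference))
print(parse_retry_after("not a date"))
```

A caller would typically fall back to the computed backoff when this returns None, and may also want to clamp very large server-supplied values to its own maximum delay.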

Database Retry Logic

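import asyncio

import asyncpg


class DatabaseRetryHandler:
    def __init__(self, config: RetryConfig, dsn: str):
        self.config = config
        # Assumption: the caller passes the DSN used to open the original
        # connection, so a replacement can be opened after a connection failure.
        # This avoids reading private attributes of asyncpg.Connection.
        self.dsn = dsn

    async def execute_with_retry(
        self,
        conn: asyncpg.Connection,
        query: str,
        *args
    ):
        """Run `query`, retrying connection failures and serialization conflicts.

        A replacement connection opened here is local to this call; the
        caller's original connection is closed, not repaired.
        """
        for attempt in range(self.config.max_attempts):
            try:
                return await conn.fetch(query, *args)

            except asyncpg.exceptions.ConnectionFailureError:
                if attempt >= self.config.max_attempts - 1:
                    raise
                await asyncio.sleep(calculate_delay(attempt, self.config))
                conn = await self._reconnect(conn)

            except asyncpg.exceptions.SerializationError:
                # A concurrent transaction conflicted. Re-running is only safe
                # for a standalone statement; inside an explicit transaction
                # the whole transaction has to be retried instead.
                if attempt >= self.config.max_attempts - 1:
                    raise
                await asyncio.sleep(calculate_delay(attempt, self.config))

            # Any other exception is treated as permanent and propagates.

    async def _reconnect(self, old_conn):
        try:
            await old_conn.close()
        except Exception:
            pass  # the old connection is already unusable
        return await asyncpg.connect(self.dsn)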
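
Another way to decide which database errors are transient is to look at the PostgreSQL SQLSTATE code rather than the driver's exception class. The sketch below is driver-agnostic: `40001` (serialization_failure), `40P01` (deadlock_detected), and class `08` (connection exceptions) are standard PostgreSQL codes, while the helper names and the `FakeDbError` class are hypothetical. It assumes the driver exposes the code as an `sqlstate` attribute, which should be checked for the driver in use.

```python
from typing import Optional

# SQLSTATE values from the PostgreSQL error-code appendix.
RETRYABLE_SQLSTATES = {
    "40001": "serialization_failure",
    "40P01": "deadlock_detected",
}
CONNECTION_CLASS = "08"  # class 08: connection exceptions


def is_retryable_sqlstate(sqlstate: Optional[str]) -> bool:
    """True for codes that usually clear up when the work is re-run."""
    if not sqlstate:
        return False
    return sqlstate in RETRYABLE_SQLSTATES or sqlstate.startswith(CONNECTION_CLASS)


def is_retryable_db_error(exc: BaseException) -> bool:
    """Assumes the driver stores the code on the exception as `sqlstate`."""
    return is_retryable_sqlstate(getattr(exc, "sqlstate", None))


class FakeDbError(Exception):
    """Stand-in for a driver exception, used only for this demonstration."""

    def __init__(self, sqlstate: str):
        super().__init__(sqlstate)
        self.sqlstate = sqlstate


print(is_retryable_db_error(FakeDbError("40P01")))  # deadlock
print(is_retryable_db_error(FakeDbError("23505")))  # unique_violation
```

Constraint violations such as `23505` fall into the permanent category: repeating the same statement would fail the same way.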

Circuit Breaker Integration

import asyncio
import time
from typing import Callable, Type

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: int = 60,
        expected_exception: Type[Exception] = Exception
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.expected_exception = expected_exception
        
        self.failure_count = 0
        self.last_failure_time = None
        self.state = "closed"  # closed, open, half-open
    
    def can_execute(self) -> bool:
        if self.state == "closed":
            return True
        
        if self.state == "open":
            if time.time() - self.last_failure_time >= self.recovery_timeout:
                self.state = "half-open"
                return True
            return False
        
        return True
    
    def record_success(self):
        self.failure_count = 0
        self.state = "closed"
    
    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        
        if self.failure_count >= self.failure_threshold:
            self.state = "open"


class RetryWithCircuitBreaker:
    def __init__(self, retry_config: RetryConfig, circuit_breaker: CircuitBreaker):
        self.retry_config = retry_config
        self.circuit_breaker = circuit_breaker
    
    async def execute(self, func: Callable, *args, **kwargs):
        for attempt in range(self.retry_config.max_attempts):
            # Re-check before every attempt so retries stop as soon as the
            # circuit opens, not only before the first call.
            if not self.circuit_breaker.can_execute():
                raise Exception("Circuit breaker is open")

            try:
                result = await func(*args, **kwargs)

            except self.circuit_breaker.expected_exception:
                self.circuit_breaker.record_failure()

                # Last attempt: let the original error propagate.
                if attempt >= self.retry_config.max_attempts - 1:
                    raise
                await asyncio.sleep(calculate_delay(attempt, self.retry_config))

            else:
                self.circuit_breaker.record_success()
                return result
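
The closed, open, and half-open transitions are easier to follow with a controllable clock. The sketch below is a simplified stand-in for the breaker above, not a copy of it: `DemoBreaker`, its injectable `clock` parameter, and the rule that a failed half-open trial reopens the circuit immediately are assumptions made for the demonstration.

```python
import time
from typing import Callable, List


class DemoBreaker:
    def __init__(self, failure_threshold: int = 2, recovery_timeout: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock  # injectable so the example needs no real waiting
        self.failure_count = 0
        self.opened_at = 0.0
        self.state = "closed"

    def can_execute(self) -> bool:
        # After the recovery window, allow a trial call (half-open).
        if self.state == "open" and self.clock() - self.opened_at >= self.recovery_timeout:
            self.state = "half-open"
        return self.state != "open"

    def record_success(self) -> None:
        self.failure_count = 0
        self.state = "closed"

    def record_failure(self) -> None:
        self.failure_count += 1
        # A failed trial call, or too many failures, (re)opens the circuit.
        if self.state == "half-open" or self.failure_count >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()


now = [0.0]  # fake clock, advanced by hand
breaker = DemoBreaker(clock=lambda: now[0])
trace: List[str] = []

breaker.record_failure()
breaker.record_failure()
trace.append(breaker.state)               # threshold reached

trace.append(str(breaker.can_execute()))  # still inside the recovery window

now[0] = 61.0
breaker.can_execute()
trace.append(breaker.state)               # one trial call is allowed

breaker.record_failure()
trace.append(breaker.state)               # the trial call failed

now[0] = 130.0
breaker.can_execute()
breaker.record_success()
trace.append(breaker.state)               # the trial call succeeded

print(trace)
```

Using a monotonic clock by default avoids surprises when the wall clock is adjusted while the circuit is open.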

Monitoring and Observability

import logging
from dataclasses import dataclass, field

@dataclass
class RetryMetrics:
    attempts: int = 0
    successes: int = 0
    failures: int = 0
    total_retries: int = 0
    total_delay: float = 0.0

class RetryWithMetrics:
    def __init__(self, config: RetryConfig, logger: logging.Logger = None):
        self.config = config
        self.logger = logger or logging.getLogger(__name__)
        self.metrics = RetryMetrics()
    
    async def execute(self, func: Callable, *args, **kwargs):
        self.metrics.attempts += 1
        
        for attempt in range(self.config.max_attempts):
            try:
                result = await func(*args, **kwargs)
                self.metrics.successes += 1
                self.metrics.total_retries += attempt
                
                if attempt > 0:
                    self.logger.info(
                        f"Succeeded after {attempt + 1} attempts"
                    )
                
                return result
                
            except Exception as e:
                if attempt < self.config.max_attempts - 1:
                    delay = calculate_delay(attempt, self.config)
                    self.metrics.total_delay += delay
                    
                    self.logger.warning(
                        f"Attempt {attempt + 1} failed: {e}. "
                        f"Retrying in {delay:.2f}s"
                    )
                    
                    await asyncio.sleep(delay)
                else:
                    self.metrics.failures += 1
                    self.logger.error(
                        f"All {self.config.max_attempts} attempts failed: {e}"
                    )
                    raise
        
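        # Reached only when max_attempts < 1; a bare `raise` here would fail
        # because there is no active exception to re-raise.
        raise ValueError("max_attempts must be at least 1")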

Best Practices

GOOD_PATTERNS = {
    "use_exponential_backoff": """
# Exponential backoff quickly eases pressure on a struggling service

✅ Good:
delay = base * (2 ** attempt)  # 0.1s, 0.2s, 0.4s, 0.8s...
delay = min(delay, max_delay)

❌ Bad:
delay = base * attempt  # 0s, 0.1s, 0.2s, 0.3s...
# Delays grow too slowly to relieve an overloaded service
""",
    
    "add_jitter": """
# Jitter randomizes retry times to reduce collisions

✅ Good:
delay = delay * (0.5 + random.random())

❌ Bad:
# No jitter = synchronized retries
# All clients retry at exactly same time
""",
    
    "distinguish_error_types": """
# Only retry transient errors

✅ Good:
if isinstance(e, ValidationError):
    raise immediately
if isinstance(e, TimeoutError):
    retry with backoff

❌ Bad:
retry(Exception)  # Never retry everything!
"""
}

BAD_PATTERNS = {
    "retry_everything": """
❌ Bad:
# Retry authentication errors?
try:
    return await make_request()
except Exception:
    return await retry()  # Wrong!

# Authentication failures won't succeed on retry

✅ Good:
async def make_request():
    try:
        return await http.request()
    except TimeoutError:
        return await retry()
    except ValidationError:
        raise  # Don't retry validation
""",
    
    "no_max_attempts": """
❌ Bad:
while True:
    try:
        return await request()
    except:
        await asyncio.sleep(1)

# Infinite retry loop!

✅ Good:
for attempt in range(max_attempts):
    try:
        return await request()
    except RetryableError:
        if attempt == max_attempts - 1:
            raise
        await asyncio.sleep(delay)
""",
    
    "ignore_circuit_breaker": """
❌ Bad:
# Retry forever even when service is down
for i in range(1000):
    try:
        await failing_service()
    except:
        await sleep(backoff)

# Hammering a dead service!

✅ Good:
# Use circuit breaker
breaker = CircuitBreaker(failure_threshold=5)
async def call():
    if not breaker.can_execute():
        raise ServiceUnavailableError()
    try:
        result = await failing_service()
    except Exception:
        breaker.record_failure()
        raise
    breaker.record_success()
    return result
"""
}

Summary

The Retry Pattern with Exponential Backoff is essential for building resilient systems:

  • Exponential Backoff - Increase delay exponentially between retries (prevents overload)
  • Jitter - Add randomness to prevent synchronized retry storms
  • Error Classification - Distinguish between retryable and non-retryable errors
  • Circuit Breaker - Stop retrying when service is clearly down
  • Monitoring - Track retry success rates and delays

Key configuration tips:

  • Base delay: 100-500ms
  • Max attempts: 3-5
  • Max delay: 30-60 seconds
  • Jitter: Always enabled for production systems
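
When picking values from these ranges, it helps to know the longest time a caller could spend waiting. The sketch below computes that upper bound for the `calculate_delay` formula used earlier in this article, where jitter scales each delay by a factor below 1.5. The function name `worst_case_wait` is hypothetical, and time spent inside the calls themselves is not included.

```python
def worst_case_wait(base_delay: float, max_attempts: int, max_delay: float,
                    jitter_factor: float = 1.5) -> float:
    """Upper bound on total sleep time across all retries, in seconds."""
    # max_attempts includes the first call, so there are max_attempts - 1 waits.
    waits = (min(base_delay * 2 ** n, max_delay) for n in range(max_attempts - 1))
    return jitter_factor * sum(waits)


# Base 500ms, 5 attempts, 30s cap: (0.5 + 1 + 2 + 4) * 1.5
print(worst_case_wait(0.5, 5, 30.0))
# Base 100ms, 3 attempts: (0.1 + 0.2) * 1.5
print(round(worst_case_wait(0.1, 3, 30.0), 3))
```

Comparing this bound against the caller's own timeout shows whether the retry settings can ever complete before the caller gives up.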

The combination of retries, backoff, jitter, and circuit breakers provides defense in depth for distributed systems.
