Retry Pattern with Exponential Backoff: Resilient Error Recovery

The Retry Pattern is a fundamental resilience pattern that handles transient failures by automatically retrying failed operations. When combined with exponential backoff and jitter, it becomes a powerful tool for building robust distributed systems.

Understanding the Retry Pattern

Why Retries Matter

┌─────────────────────────────────────────────────────────────────┐
│                  Transient Failures Are Common                  │
│                                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐       │
│  │  Network     │    │  Service     │    │  Database    │       │
│  │  Timeout     │    │  Restart     │    │  Lock Wait   │       │
│  └──────────────┘    └──────────────┘    └──────────────┘       │
│         │                   │                   │               │
│         ▼                   ▼                   ▼               │
│  Usually recover quickly with retry!                            │
│                                                                 │
│  Statistics:                                                    │
│  - 60% of failures are transient                                │
│  - 90% succeed on retry within 3 attempts                       │
│  - Exponential backoff reduces load by 99%                      │
└─────────────────────────────────────────────────────────────────┘

Without Retry Pattern

┌─────────────────────────────────────────────────────────────────┐
│                  Immediate Failure (No Retry)                   │
│                                                                 │
│  Request ──► ✗ Connection timeout                               │
│                │                                                │
│                ▼                                                │
│         ┌─────────────┐                                         │
│         │   ERROR!    │  User sees failure immediately          │
│         │  User upset │                                         │
│         └─────────────┘                                         │
│                                                                 │
│  ✗ Wasted opportunity                                           │
│  ✗ Poor user experience                                         │
│  ✗ No recovery attempt                                          │
└─────────────────────────────────────────────────────────────────┘

With Retry Pattern

┌─────────────────────────────────────────────────────────────────┐
│                 Retry with Exponential Backoff                  │
│                                                                 │
│  Request ──► ✗ Connection timeout                               │
│                │                                                │
│                ▼                                                │
│         Wait 100ms ──► ✗ Still timeout                          │
│                           │                                     │
│                           ▼                                     │
│                 Wait 200ms ──► ✗ Still timeout                  │
│                                   │                             │
│                                   ▼                             │
│                         Wait 400ms ──► ✓ Success!               │
│                                                          │      │
│                                                          ▼      │
│                                                     ┌──────────┐│
│                                                     │ SUCCESS! ││
│                                                     └──────────┘│
│                                                                 │
│  ✓ Automatic recovery                                           │
│  ✓ Better user experience                                       │
│  ✓ Reduced load from retry storms                               │
└─────────────────────────────────────────────────────────────────┘
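
The waits in the diagram double after each failure. As a rough sketch of that arithmetic, assuming a 100 ms base, a factor of 2, and a 30 s cap (the helper name `backoff_schedule` is made up for illustration and is not part of the code in this article):

```python
def backoff_schedule(base: float, factor: float, retries: int, cap: float) -> list[float]:
    """Return the un-jittered wait (in seconds) before each retry."""
    return [min(base * factor ** n, cap) for n in range(retries)]


# Three retries with a 100 ms base reproduce the waits in the diagram.
schedule = backoff_schedule(base=0.1, factor=2.0, retries=3, cap=30.0)
print(schedule)  # [0.1, 0.2, 0.4]
```

The cap matters once the exponent grows: without it, the tenth retry in this configuration would wait over 50 seconds.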

Implementation

Basic Retry with Exponential Backoff

import asyncio
import random
import time
from functools import wraps
from typing import Callable, Type, Tuple

class RetryConfig:
    """Settings shared by the retry helpers below. All delays are in seconds."""

    def __init__(
        self,
        max_attempts: int = 3,
        base_delay: float = 0.1,
        max_delay: float = 30.0,
        exponential_base: float = 2.0,
        jitter: bool = True
    ):
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.exponential_base = exponential_base
        self.jitter = jitter


def calculate_delay(attempt: int, config: RetryConfig) -> float:
    """Return the wait before the next try; attempt is 0 after the first failure."""
    delay = config.base_delay * (config.exponential_base ** attempt)
    delay = min(delay, config.max_delay)
    
    if config.jitter:
        # Scale by a random factor in [0.5, 1.5). Because the cap is applied
        # first, the jittered delay can exceed max_delay by up to 50%.
        delay = delay * (0.5 + random.random())
    
    return delay


async def retry_async(
    func: Callable,
    *args,
    config: RetryConfig = None,
    exceptions: Tuple[Type[Exception], ...] = (Exception,),
    **kwargs
):
    config = config or RetryConfig()
    
    for attempt in range(config.max_attempts):
        try:
            return await func(*args, **kwargs)
            
        except exceptions:
            # Out of attempts: re-raise with the original traceback intact.
            if attempt >= config.max_attempts - 1:
                raise
            await asyncio.sleep(calculate_delay(attempt, config))
    
    # Only reachable when the loop never ran.
    raise ValueError("max_attempts must be at least 1")


def retry_decorator(config: RetryConfig = None):
    # Resolve the default once, under a new name. Rebinding `config` inside
    # a wrapper would make it a local variable and raise UnboundLocalError.
    resolved = config or RetryConfig()
    
    def decorator(func: Callable):
        @wraps(func)
        async def async_wrapper(*args, **kwargs):
            return await retry_async(func, *args, config=resolved, **kwargs)
        
        @wraps(func)
        def sync_wrapper(*args, **kwargs):
            last_exception = None
            
            for attempt in range(resolved.max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    last_exception = e
                    if attempt < resolved.max_attempts - 1:
                        delay = calculate_delay(attempt, resolved)
                        time.sleep(delay)
            
            raise last_exception
        
        # Pick the wrapper that matches the decorated function.
        if asyncio.iscoroutinefunction(func):
            return async_wrapper
        return sync_wrapper
    
    return decorator
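
To see the retry loop in action without a real network, the sketch below condenses it into one function and runs it against a stand-in that fails twice before succeeding. The names `retry_call`, `flaky_fetch`, and `calls` are hypothetical, and the delays are shortened to keep the demo fast.

```python
import asyncio
import random


async def retry_call(func, max_attempts=3, base_delay=0.01):
    """Condensed illustration of the async retry loop, not a replacement for it."""
    for attempt in range(max_attempts):
        try:
            return await func()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts
            # Exponential backoff with a jitter factor in [0.5, 1.5).
            delay = base_delay * 2 ** attempt * (0.5 + random.random())
            await asyncio.sleep(delay)


calls = {"count": 0}


async def flaky_fetch():
    """Stand-in for a network call: fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


result = asyncio.run(retry_call(flaky_fetch))
print(result, calls["count"])  # ok 3
```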

Configurable Retry Strategies

class RetryStrategy:
    def get_delay(self, attempt: int) -> float:
        raise NotImplementedError


class ExponentialBackoff(RetryStrategy):
    def __init__(
        self, 
        base: float = 0.1, 
        max_delay: float = 30.0,
        multiplier: float = 2.0
    ):
        self.base = base
        self.max_delay = max_delay
        self.multiplier = multiplier
    
    def get_delay(self, attempt: int) -> float:
        delay = self.base * (self.multiplier ** attempt)
        return min(delay, self.max_delay)


class LinearBackoff(RetryStrategy):
    def __init__(self, base: float = 0.1, increment: float = 0.1):
        self.base = base
        self.increment = increment
    
    def get_delay(self, attempt: int) -> float:
        return self.base + (attempt * self.increment)


class ConstantBackoff(RetryStrategy):
    def __init__(self, delay: float = 1.0):
        self.delay = delay
    
    def get_delay(self, attempt: int) -> float:
        return self.delay


class FibonacciBackoff(RetryStrategy):
    def __init__(self, multiplier: float = 1.0):
        self.multiplier = multiplier
        self._cache = {0: 1, 1: 1}
    
    def _fib(self, n: int) -> float:
        if n in self._cache:
            return self._cache[n]
        self._cache[n] = self._fib(n-1) + self._fib(n-2)
        return self._cache[n]
    
    def get_delay(self, attempt: int) -> float:
        return self._fib(attempt) * self.multiplier
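
For a feel of how these strategies differ, the following sketch prints the first five delays each one would produce. It restates the formulas as plain functions, using the same default parameters as the classes above, so that it runs on its own; `first_delays`, `fib`, and `strategies` are illustrative names only.

```python
def first_delays(strategy, attempts=5):
    """Delays for attempts 0..attempts-1, rounded for display."""
    return [round(strategy(n), 2) for n in range(attempts)]


def fib(n):
    """Fibonacci with fib(0) == fib(1) == 1, matching the cache seed above."""
    a, b = 1, 1
    for _ in range(n):
        a, b = b, a + b
    return a


strategies = {
    "exponential": lambda n: min(0.1 * 2.0 ** n, 30.0),
    "linear": lambda n: 0.1 + n * 0.1,
    "constant": lambda n: 1.0,
    "fibonacci": lambda n: fib(n) * 1.0,
}

for name, strategy in strategies.items():
    print(f"{name:12s} {first_delays(strategy)}")
```

Exponential backoff starts as short as linear but overtakes it within a few attempts, which is what makes it suitable when a service needs real time to recover.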

Jitter Strategies

import random
import math

class Jitter:
    @staticmethod
    def no_jitter(delay: float) -> float:
        return delay
    
    @staticmethod
    def full_jitter(delay: float) -> float:
        return delay * random.random()
    
    @staticmethod
    def equal_jitter(delay: float) -> float:
        return delay / 2 + (delay / 2) * random.random()
    
    @staticmethod
    def decorrelated_jitter(delay: float, last_delay: float = None) -> float:
        if last_delay is None:
            last_delay = delay
        new_delay = last_delay * random.uniform(1.3, 2.0)
        return min(new_delay, 30.0)


class ExponentialBackoffWithJitter(ExponentialBackoff):
    JITTER_TYPES = ("full", "equal", "decorrelated", "none")
    
    def __init__(self, base: float = 0.1, max_delay: float = 30.0, jitter_type: str = "full"):
        if jitter_type not in self.JITTER_TYPES:
            raise ValueError(f"Unknown jitter_type: {jitter_type!r}")
        super().__init__(base, max_delay)
        self.jitter_type = jitter_type
        # State for decorrelated jitter; use one instance per retry sequence.
        self.last_delay = base
    
    def get_delay(self, attempt: int) -> float:
        # Capped exponential delay from the parent class.
        delay = super().get_delay(attempt)
        
        if self.jitter_type == "full":
            delay = Jitter.full_jitter(delay)
        elif self.jitter_type == "equal":
            delay = Jitter.equal_jitter(delay)
        elif self.jitter_type == "decorrelated":
            # decorrelated_jitter caps at a fixed 30s, so apply max_delay as well.
            delay = min(Jitter.decorrelated_jitter(delay, self.last_delay), self.max_delay)
            self.last_delay = delay
        elif self.jitter_type == "none":
            delay = Jitter.no_jitter(delay)
        
        return delay
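
The difference between full and equal jitter is easiest to see by sampling. This sketch restates the two formulas as standalone functions and draws 1,000 samples of each for a 0.8 s capped delay; the seed and sample count are arbitrary choices for the demonstration.

```python
import random


def full_jitter(delay: float) -> float:
    """Uniform in [0, delay): maximum spread, but may retry almost immediately."""
    return delay * random.random()


def equal_jitter(delay: float) -> float:
    """Uniform in [delay/2, delay): keeps at least half of the backoff."""
    return delay / 2 + (delay / 2) * random.random()


random.seed(42)  # fixed seed so repeated runs sample the same values
capped_delay = 0.8
full = [full_jitter(capped_delay) for _ in range(1000)]
equal = [equal_jitter(capped_delay) for _ in range(1000)]

print(f"full : min={min(full):.3f} max={max(full):.3f} mean={sum(full) / len(full):.3f}")
print(f"equal: min={min(equal):.3f} max={max(equal):.3f} mean={sum(equal) / len(equal):.3f}")
```

Full jitter averages about half the capped delay, while equal jitter averages about three quarters of it and never drops below half.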

Handling Different Failure Types

Transient vs Permanent Errors

class RetryableError(Exception):
    """Transient errors that should be retried."""
    pass

class NonRetryableError(Exception):
    """Permanent errors that should not be retried."""
    pass

class ServiceUnavailableError(RetryableError):
    pass

class OperationTimeoutError(RetryableError):
    """Named to avoid shadowing Python's built-in TimeoutError."""
    pass

class RateLimitError(RetryableError):
    def __init__(self, retry_after: float = 60.0):
        self.retry_after = retry_after
        super().__init__(f"Rate limited, retry after {retry_after}s")

class ValidationError(NonRetryableError):
    pass

class AuthenticationError(NonRetryableError):
    pass

class NotFoundError(NonRetryableError):
    pass


class SmartRetryHandler:
    def __init__(self, config: RetryConfig):
        self.config = config
    
    def should_retry(self, exception: Exception, attempt: int) -> tuple[bool, float]:
        if attempt >= self.config.max_attempts:
            return False, 0
        
        if isinstance(exception, NonRetryableError):
            return False, 0
        
        if isinstance(exception, RateLimitError):
            # Honor the server-provided wait instead of the computed backoff.
            return True, exception.retry_after
        
        if isinstance(exception, (RetryableError, TimeoutError, ConnectionError)):
            return True, calculate_delay(attempt, self.config)
        
        # Unknown errors are treated as permanent: retrying everything hides
        # bugs and delays the failure report.
        return False, 0

HTTP-Specific Retry Logic

import aiohttp

class HTTPRetryConfig:
    def __init__(
        self,
        max_attempts: int = 3,
        retry_on_status: Tuple[int, ...] = (429, 500, 502, 503, 504),
        retry_on_timeout: bool = True,
        **backoff_kwargs
    ):
        self.max_attempts = max_attempts
        self.retry_on_status = retry_on_status
        self.retry_on_timeout = retry_on_timeout
        self.backoff = RetryConfig(max_attempts=max_attempts, **backoff_kwargs)


async def fetch_with_retry(
    session: aiohttp.ClientSession,
    url: str,
    config: HTTPRetryConfig = None,
    **kwargs
) -> aiohttp.ClientResponse:
    """GET url with retries. The caller must read and release the returned response."""
    config = config or HTTPRetryConfig()
    
    for attempt in range(config.max_attempts):
        is_last = attempt >= config.max_attempts - 1
        try:
            # No `async with` here: leaving that block releases the connection,
            # so a response returned from inside it could no longer be read.
            response = await session.get(url, **kwargs)
            
            if response.status not in config.retry_on_status or is_last:
                return response
            
            delay = calculate_delay(attempt, config.backoff)
            try:
                # Retry-After may also be an HTTP date; keep the computed
                # backoff when the value is not a number of seconds.
                delay = float(response.headers["Retry-After"])
            except (KeyError, ValueError):
                pass
            response.release()  # hand the connection back before waiting
            
        except asyncio.TimeoutError:
            if not config.retry_on_timeout or is_last:
                raise
            delay = calculate_delay(attempt, config.backoff)
        
        except aiohttp.ClientError:
            if is_last:
                raise
            delay = calculate_delay(attempt, config.backoff)
        
        await asyncio.sleep(delay)
    
    # Only reachable when the loop never ran.
    raise ValueError("max_attempts must be at least 1")
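
The `Retry-After` header can carry either a number of seconds or an HTTP date, and calling `float()` on a date raises `ValueError`. One possible way to handle both forms with only the standard library is sketched below; `parse_retry_after` is a hypothetical helper, and returning `None` for unparseable values (so the caller can fall back to computed backoff) is an assumption of this sketch.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: str, now: Optional[datetime] = None) -> Optional[float]:
    """Return the wait in seconds for a Retry-After value, or None if unparseable."""
    value = value.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", not a negative sleep.
    return max(0.0, (when - now).total_seconds())


fixed_now = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
print(parse_retry_after("120"))                                       # 120.0
print(parse_retry_after("Wed, 21 Oct 2015 07:28:00 GMT", fixed_now))  # 60.0
print(parse_retry_after("soon"))                                      # None
```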

Database Retry Logic

import asyncpg

class DatabaseRetryHandler:
    def __init__(self, config: RetryConfig, dsn: str):
        self.config = config
        # Kept for reconnects, so no private Connection attributes are needed.
        self.dsn = dsn
    
    async def execute_with_retry(
        self,
        conn: asyncpg.Connection,
        query: str,
        *args
    ):
        for attempt in range(self.config.max_attempts):
            is_last = attempt >= self.config.max_attempts - 1
            try:
                return await conn.fetch(query, *args)
                
            except asyncpg.exceptions.ConnectionFailureError:
                if is_last:
                    raise
                await asyncio.sleep(calculate_delay(attempt, self.config))
                # Note: the caller's reference to the old connection is now stale.
                conn = await self._reconnect(conn)
                
            except asyncpg.exceptions.SerializationError:
                # Only safe for standalone statements; inside a transaction
                # the whole transaction has to be retried.
                if is_last:
                    raise
                await asyncio.sleep(calculate_delay(attempt, self.config))
        
        # Only reachable when the loop never ran. Other exceptions propagate.
        raise ValueError("max_attempts must be at least 1")
    
    async def _reconnect(self, old_conn):
        try:
            await old_conn.close()
        except Exception:
            pass  # the old connection is already unusable
        return await asyncpg.connect(self.dsn)

Circuit Breaker Integration

import asyncio

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: int = 60,
        expected_exception: Type[Exception] = Exception
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.expected_exception = expected_exception
        
        self.failure_count = 0
        self.last_failure_time = None
        self.state = "closed"  # closed, open, half-open
    
    def can_execute(self) -> bool:
        if self.state == "closed":
            return True
        
        if self.state == "open":
            if time.time() - self.last_failure_time >= self.recovery_timeout:
                self.state = "half-open"
                return True
            return False
        
        return True
    
    def record_success(self):
        self.failure_count = 0
        self.state = "closed"
    
    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        
        if self.failure_count >= self.failure_threshold:
            self.state = "open"


class RetryWithCircuitBreaker:
    def __init__(self, retry_config: RetryConfig, circuit_breaker: CircuitBreaker):
        self.retry_config = retry_config
        self.circuit_breaker = circuit_breaker
    
    async def execute(self, func: Callable, *args, **kwargs):
        for attempt in range(self.retry_config.max_attempts):
            # Re-check before every attempt: the breaker may open mid-sequence.
            if not self.circuit_breaker.can_execute():
                raise RuntimeError("Circuit breaker is open")
            
            try:
                result = await func(*args, **kwargs)
                self.circuit_breaker.record_success()
                return result
                
            except self.circuit_breaker.expected_exception:
                self.circuit_breaker.record_failure()
                
                if attempt >= self.retry_config.max_attempts - 1:
                    raise
                await asyncio.sleep(calculate_delay(attempt, self.retry_config))
        
        # Only reachable when the loop never ran.
        raise ValueError("max_attempts must be at least 1")
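
The breaker's closed, open, and half-open transitions can be traced without waiting for real time to pass. The sketch below is a condensed, hypothetical variant (`TinyBreaker`) with an injectable clock added purely for the demonstration; the class above reads `time.time()` directly.

```python
import time


class TinyBreaker:
    """Condensed breaker with an injectable clock, for illustration only."""

    def __init__(self, failure_threshold=2, recovery_timeout=60.0, clock=None):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock or time.monotonic
        self.failure_count = 0
        self.last_failure_time = None
        self.state = "closed"

    def can_execute(self):
        if self.state == "open":
            if self.clock() - self.last_failure_time >= self.recovery_timeout:
                self.state = "half-open"  # let one probe request through
                return True
            return False
        return True

    def record_success(self):
        self.failure_count = 0
        self.state = "closed"

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = self.clock()
        if self.failure_count >= self.failure_threshold:
            self.state = "open"


now = [0.0]  # fake clock, advanced by hand
breaker = TinyBreaker(clock=lambda: now[0])
states = [breaker.state]

breaker.record_failure()
breaker.record_failure()           # threshold reached
states.append(breaker.state)
blocked = breaker.can_execute()    # still inside the recovery window

now[0] = 61.0                      # recovery timeout has elapsed
allowed = breaker.can_execute()    # moves the breaker to half-open
states.append(breaker.state)

breaker.record_success()           # probe succeeded
states.append(breaker.state)
print(states)  # ['closed', 'open', 'half-open', 'closed']
```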

Monitoring and Observability

import logging
from dataclasses import dataclass, field

@dataclass
class RetryMetrics:
    attempts: int = 0
    successes: int = 0
    failures: int = 0
    total_retries: int = 0
    total_delay: float = 0.0

class RetryWithMetrics:
    def __init__(self, config: RetryConfig, logger: logging.Logger = None):
        self.config = config
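        # Fall back to a module logger so the log calls below never hit None.
        self.logger = logger or logging.getLogger(__name__)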
        self.metrics = RetryMetrics()
    
    async def execute(self, func: Callable, *args, **kwargs):
        self.metrics.attempts += 1
        
        for attempt in range(self.config.max_attempts):
            try:
                result = await func(*args, **kwargs)
                self.metrics.successes += 1
                self.metrics.total_retries += attempt
                
                if attempt > 0:
                    self.logger.info(
                        f"Succeeded after {attempt + 1} attempts"
                    )
                
                return result
                
            except Exception as e:
                if attempt < self.config.max_attempts - 1:
                    delay = calculate_delay(attempt, self.config)
                    self.metrics.total_delay += delay
                    
                    self.logger.warning(
                        f"Attempt {attempt + 1} failed: {e}. "
                        f"Retrying in {delay:.2f}s"
                    )
                    
                    await asyncio.sleep(delay)
                else:
                    self.metrics.failures += 1
                    self.logger.error(
                        f"All {self.config.max_attempts} attempts failed: {e}"
                    )
                    raise
        
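        # Only reachable when the loop never ran; a bare `raise` here would
        # fail with "No active exception to re-raise".
        raise ValueError("max_attempts must be at least 1")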

Best Practices

GOOD_PATTERNS = {
    "use_exponential_backoff": """
# Exponential backoff spaces retries further apart, easing load on a struggling service

✅ Good:
delay = base * (2 ** attempt)  # 0.1s, 0.2s, 0.4s, 0.8s...
delay = min(delay, max_delay)

❌ Bad:
delay = base * (attempt + 1)  # 0.1s, 0.2s, 0.3s...
# Grows slowly, so retries still pile up under heavy load
""",
    
    "add_jitter": """
# Jitter randomizes retry times to reduce collisions

✅ Good:
delay = delay * (0.5 + random.random())

❌ Bad:
# No jitter = synchronized retries
# All clients retry at exactly the same time
""",
    
    "distinguish_error_types": """
# Only retry transient errors

✅ Good:
if isinstance(e, ValidationError):
    raise  # permanent: fail immediately
if isinstance(e, TimeoutError):
    await asyncio.sleep(delay)  # transient: back off, then retry

❌ Bad:
retry(Exception)  # Never retry everything!
"""
}

BAD_PATTERNS = {
    "retry_everything": """
❌ Bad:
# Retries every failure, including authentication errors
try:
    return await make_request()
except Exception:
    return await retry()  # Wrong!

# Authentication failures won't succeed on retry

✅ Good:
async def make_request():
    try:
        return await http.request()
    except TimeoutError:
        return await retry()
    except ValidationError:
        raise  # Don't retry validation
""",
    
    "no_max_attempts": """
❌ Bad:
while True:
    try:
        return await request()
    except:
        await asyncio.sleep(1)

# Infinite retry loop!

✅ Good:
for attempt in range(max_attempts):
    try:
        return await request()
    except Exception:
        if attempt == max_attempts - 1:
            raise
        await asyncio.sleep(delay)
""",
    
    "ignore_circuit_breaker": """
❌ Bad:
# Keeps retrying even when the service is down
for i in range(1000):
    try:
        await failing_service()
    except:
        await asyncio.sleep(backoff)

# Hammering a dead service!

✅ Good:
# Use circuit breaker
breaker = CircuitBreaker(failure_threshold=5)
async def call():
    if not breaker.can_execute():
        raise ServiceUnavailableError()
    try:
        result = await failing_service()
    except Exception:
        breaker.record_failure()
        raise
    breaker.record_success()
    return result
"""
}

Summary

The Retry Pattern with Exponential Backoff is essential for building resilient systems:

  • Exponential Backoff - Increase delay exponentially between retries (prevents overload)
  • Jitter - Add randomness to prevent synchronized retry storms
  • Error Classification - Distinguish between retryable and non-retryable errors
  • Circuit Breaker - Stop retrying when service is clearly down
  • Monitoring - Track retry success rates and delays

Key configuration tips:

  • Base delay: 100-500ms
  • Max attempts: 3-5
  • Max delay: 30-60 seconds
  • Jitter: Always enabled for production systems
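
When choosing values from these ranges, it helps to know the longest a caller could end up sleeping. The sketch below sums the capped delays and multiplies by 1.5, the upper bound of the `0.5 + random.random()` jitter factor used earlier; `worst_case_wait` is a hypothetical helper, and the figure excludes the time spent in the attempts themselves.

```python
def worst_case_wait(base, max_attempts, max_delay, factor=2.0, jitter_max=1.5):
    """Upper bound on total sleep time across all retries, in seconds."""
    retries = max_attempts - 1  # there is no sleep after the final attempt
    return sum(
        min(base * factor ** n, max_delay) * jitter_max
        for n in range(retries)
    )


# Low end of the suggested ranges: 100 ms base, 3 attempts.
print(f"{worst_case_wait(0.1, 3, 30.0):.2f}s")  # 0.45s
# High end: 500 ms base, 5 attempts.
print(f"{worst_case_wait(0.5, 5, 60.0):.2f}s")  # 11.25s
```

If the upper figure exceeds what a user-facing request can tolerate, reduce the attempt count before reducing the base delay.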

The combination of retries, backoff, jitter, and circuit breakers provides defense in depth for distributed systems.
