The Retry Pattern is a fundamental resilience pattern that handles transient failures by automatically retrying failed operations. When combined with exponential backoff and jitter, it becomes a powerful tool for building robust distributed systems.
Understanding the Retry Pattern
Why Retries Matter
┌───────────────────────────────────────────────────────────────────┐
│                   Transient Failures Are Common                   │
│                                                                   │
│   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐        │
│   │   Network    │    │   Service    │    │   Database   │        │
│   │   Timeout    │    │   Restart    │    │   Lock Wait  │        │
│   └──────────────┘    └──────────────┘    └──────────────┘        │
│          │                   │                   │                │
│          ▼                   ▼                   ▼                │
│            Usually recover quickly with a retry                   │
│                                                                   │
│   Rough, commonly cited figures:                                  │
│   - ~60% of failures are transient                                │
│   - ~90% succeed on retry within 3 attempts                       │
│   - Exponential backoff can cut retry load by ~99%                │
└───────────────────────────────────────────────────────────────────┘
Without Retry Pattern
┌───────────────────────────────────────────────────────────────────┐
│                   Immediate Failure (No Retry)                    │
│                                                                   │
│   Request ──▶ ✗ Connection timeout                                │
│                      │                                            │
│                      ▼                                            │
│              ┌──────────────┐                                     │
│              │    ERROR!    │  User sees the failure              │
│              └──────────────┘  immediately                        │
│                                                                   │
│   ✗ Wasted opportunity                                            │
│   ✗ Poor user experience                                          │
│   ✗ No recovery attempt                                           │
└───────────────────────────────────────────────────────────────────┘
With Retry Pattern
┌───────────────────────────────────────────────────────────────────┐
│                 Retry with Exponential Backoff                    │
│                                                                   │
│   Request     ──▶ ✗ Connection timeout                            │
│                       │                                           │
│                       ▼                                           │
│   Wait 100ms  ──▶ ✗ Still timing out                              │
│                       │                                           │
│                       ▼                                           │
│   Wait 200ms  ──▶ ✗ Still timing out                              │
│                       │                                           │
│                       ▼                                           │
│   Wait 400ms  ──▶ ✓ SUCCESS!                                      │
│                                                                   │
│   ✓ Automatic recovery                                            │
│   ✓ Better user experience                                        │
│   ✓ Reduced load from retry storms                                │
└───────────────────────────────────────────────────────────────────┘
Implementation
Basic Retry with Exponential Backoff
import asyncio
import random
import time
from functools import wraps
from typing import Callable, Optional, Tuple, Type

class RetryConfig:
    def __init__(
        self,
        max_attempts: int = 3,
        base_delay: float = 0.1,
        max_delay: float = 30.0,
        exponential_base: float = 2.0,
        jitter: bool = True
    ):
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.exponential_base = exponential_base
        self.jitter = jitter

def calculate_delay(attempt: int, config: RetryConfig) -> float:
    """Exponential backoff, capped at max_delay, with optional jitter."""
    delay = config.base_delay * (config.exponential_base ** attempt)
    delay = min(delay, config.max_delay)
    if config.jitter:
        # Randomize within [0.5x, 1.5x] to desynchronize competing clients
        delay = delay * (0.5 + random.random())
    return delay

async def retry_async(
    func: Callable,
    *args,
    config: Optional[RetryConfig] = None,
    exceptions: Tuple[Type[Exception], ...] = (Exception,),
    **kwargs
):
    config = config or RetryConfig()
    last_exception = None
    for attempt in range(config.max_attempts):
        try:
            return await func(*args, **kwargs)
        except exceptions as e:
            last_exception = e
            if attempt < config.max_attempts - 1:
                await asyncio.sleep(calculate_delay(attempt, config))
            else:
                raise
    raise last_exception  # only reachable if max_attempts < 1

def retry_decorator(config: Optional[RetryConfig] = None):
    def decorator(func: Callable):
        # Resolve the config once here: reassigning `config` inside a
        # wrapper would make it an unbound local and crash at call time.
        resolved = config or RetryConfig()

        @wraps(func)
        async def async_wrapper(*args, **kwargs):
            return await retry_async(func, *args, config=resolved, **kwargs)

        @wraps(func)
        def sync_wrapper(*args, **kwargs):
            last_exception = None
            for attempt in range(resolved.max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    last_exception = e
                    if attempt < resolved.max_attempts - 1:
                        time.sleep(calculate_delay(attempt, resolved))
            raise last_exception

        if asyncio.iscoroutinefunction(func):
            return async_wrapper
        return sync_wrapper
    return decorator
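To see what the configuration above actually produces, here is a minimal, self-contained sketch. `delay_schedule` restates `calculate_delay` with jitter disabled so the numbers are deterministic (using the same defaults: base 0.1s, factor 2, cap 30s), and the `flaky` coroutine mimics a dependency that recovers on its third call; both names are illustrative, not part of the code above.

```python
import asyncio

def delay_schedule(attempts, base=0.1, factor=2.0, cap=30.0):
    # Same formula as calculate_delay, with jitter off for determinism
    return [min(base * factor ** n, cap) for n in range(attempts)]

print(delay_schedule(5))  # [0.1, 0.2, 0.4, 0.8, 1.6]

# A dependency that fails twice, then succeeds -- the exact failure
# shape the retry loop is designed for.
calls = {"n": 0}

async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

async def retry(func, attempts=3):
    for attempt in range(attempts):
        try:
            return await func()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(0)  # zero delay keeps the demo instant

print(asyncio.run(retry(flaky)))  # ok
```

Note how the schedule doubles each attempt: that doubling, not the retry itself, is what keeps a struggling service from being flooded.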
Configurable Retry Strategies
class RetryStrategy:
    """Interface: map an attempt number to a delay in seconds."""
    def get_delay(self, attempt: int) -> float:
        raise NotImplementedError

class ExponentialBackoff(RetryStrategy):
    def __init__(
        self,
        base: float = 0.1,
        max_delay: float = 30.0,
        multiplier: float = 2.0
    ):
        self.base = base
        self.max_delay = max_delay
        self.multiplier = multiplier

    def get_delay(self, attempt: int) -> float:
        delay = self.base * (self.multiplier ** attempt)
        return min(delay, self.max_delay)

class LinearBackoff(RetryStrategy):
    def __init__(self, base: float = 0.1, increment: float = 0.1):
        self.base = base
        self.increment = increment

    def get_delay(self, attempt: int) -> float:
        return self.base + (attempt * self.increment)

class ConstantBackoff(RetryStrategy):
    def __init__(self, delay: float = 1.0):
        self.delay = delay

    def get_delay(self, attempt: int) -> float:
        return self.delay

class FibonacciBackoff(RetryStrategy):
    """Middle ground: grows faster than linear, slower than exponential."""
    def __init__(self, multiplier: float = 1.0):
        self.multiplier = multiplier
        self._cache = {0: 1, 1: 1}  # memoize so lookups stay O(n)

    def _fib(self, n: int) -> float:
        if n not in self._cache:
            self._cache[n] = self._fib(n - 1) + self._fib(n - 2)
        return self._cache[n]

    def get_delay(self, attempt: int) -> float:
        return self._fib(attempt) * self.multiplier
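For comparison, here are the four strategies restated as plain functions (same defaults as the classes above, so this is a sketch of their behavior, not a replacement for them), printing each schedule side by side:

```python
def exponential(n, base=0.1, mult=2.0, cap=30.0):
    return min(base * mult ** n, cap)

def linear(n, base=0.1, inc=0.1):
    return base + n * inc

def constant(n, delay=1.0):
    return delay

def fibonacci(n, mult=1.0):
    # Iterative Fibonacci seeded 1, 1 to match the class above
    a, b = 1, 1
    for _ in range(n):
        a, b = b, a + b
    return a * mult

print("attempt  exp     linear  const  fib")
for attempt in range(6):
    print(f"{attempt:>7}  {exponential(attempt):<6.2f}  "
          f"{linear(attempt):<6.2f}  {constant(attempt):<5.2f}  "
          f"{fibonacci(attempt):.2f}")
```

The table makes the trade-off visible: exponential backs off fastest (and hits the cap soonest), linear and Fibonacci are gentler, and constant never adapts at all.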
Jitter Strategies
import random

class Jitter:
    @staticmethod
    def no_jitter(delay: float) -> float:
        return delay

    @staticmethod
    def full_jitter(delay: float) -> float:
        # Anywhere in [0, delay]: maximum spread across clients
        return delay * random.random()

    @staticmethod
    def equal_jitter(delay: float) -> float:
        # Keep at least half the delay: [delay/2, delay]
        return delay / 2 + (delay / 2) * random.random()

    @staticmethod
    def decorrelated_jitter(delay: float, last_delay: float = None) -> float:
        # Grow from the previous delay rather than the attempt number
        if last_delay is None:
            last_delay = delay
        new_delay = last_delay * random.uniform(1.3, 2.0)
        return min(new_delay, 30.0)

class ExponentialBackoffWithJitter(ExponentialBackoff):
    def __init__(self, base: float = 0.1, max_delay: float = 30.0, jitter_type: str = "full"):
        super().__init__(base, max_delay)
        self.jitter_type = jitter_type
        self.last_delay = base

    def get_delay(self, attempt: int) -> float:
        delay = self.base * (2 ** attempt)
        delay = min(delay, self.max_delay)
        if self.jitter_type == "full":
            delay = Jitter.full_jitter(delay)
        elif self.jitter_type == "equal":
            delay = Jitter.equal_jitter(delay)
        elif self.jitter_type == "decorrelated":
            delay = Jitter.decorrelated_jitter(delay, self.last_delay)
            self.last_delay = delay
        elif self.jitter_type == "none":
            delay = Jitter.no_jitter(delay)
        return delay
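A quick empirical check of the bounds each style guarantees: full jitter spreads retries over the whole of [0, delay], while equal jitter never drops below delay/2. The seed is fixed only to make the sample reproducible; the two helper functions restate the static methods above so this sketch runs on its own.

```python
import random

random.seed(42)  # reproducible sampling

def full_jitter(delay):
    return delay * random.random()           # [0, delay]

def equal_jitter(delay):
    return delay / 2 + (delay / 2) * random.random()  # [delay/2, delay]

base = 4.0
full = [full_jitter(base) for _ in range(1000)]
equal = [equal_jitter(base) for _ in range(1000)]

print(f"full jitter:  min={min(full):.2f}  max={max(full):.2f}")
print(f"equal jitter: min={min(equal):.2f}  max={max(equal):.2f}")
```

Full jitter gives the best de-correlation at the cost of occasionally retrying almost immediately; equal jitter trades some spread for a guaranteed minimum wait.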
Handling Different Failure Types
Transient vs Permanent Errors
class RetryableError(Exception):
    """Transient errors that should be retried."""
    pass

class NonRetryableError(Exception):
    """Permanent errors that should not be retried."""
    pass

class ServiceUnavailableError(RetryableError):
    pass

class TimeoutError(RetryableError):  # deliberately shadows the builtin here
    pass

class RateLimitError(RetryableError):
    def __init__(self, retry_after: float = 60.0):
        self.retry_after = retry_after
        super().__init__(f"Rate limited, retry after {retry_after}s")

class ValidationError(NonRetryableError):
    pass

class AuthenticationError(NonRetryableError):
    pass

class NotFoundError(NonRetryableError):
    pass

class SmartRetryHandler:
    def __init__(self, config: RetryConfig):
        self.config = config

    def should_retry(self, exception: Exception, attempt: int) -> tuple[bool, float]:
        if attempt >= self.config.max_attempts:
            return False, 0
        if isinstance(exception, NonRetryableError):
            return False, 0
        if isinstance(exception, RateLimitError):
            # The server told us exactly how long to wait
            return True, exception.retry_after
        if isinstance(exception, RetryableError):
            return True, calculate_delay(attempt, self.config)
        if isinstance(exception, ConnectionError):
            # Builtin network errors are usually transient too
            return True, calculate_delay(attempt, self.config)
        # Unknown errors: retry cautiously with backoff
        return True, calculate_delay(attempt, self.config)
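The decision logic boils down to a small pure function. This sketch collapses `SmartRetryHandler.should_retry` into a standalone `decide()` helper so it can be exercised directly; the exception names mirror the classes above, while `decide`, `max_attempts`, and `base` are illustrative defaults, not part of the original code.

```python
class RetryableError(Exception): pass
class NonRetryableError(Exception): pass

class RateLimitError(RetryableError):
    def __init__(self, retry_after=60.0):
        self.retry_after = retry_after
        super().__init__(f"Rate limited, retry after {retry_after}s")

class ValidationError(NonRetryableError): pass

def decide(exc, attempt, max_attempts=3, base=0.1):
    """Map an exception to a (should_retry, delay_seconds) decision."""
    if attempt >= max_attempts or isinstance(exc, NonRetryableError):
        return (False, 0.0)
    if isinstance(exc, RateLimitError):
        return (True, exc.retry_after)       # honor the server's hint
    return (True, min(base * 2 ** attempt, 30.0))

print(decide(ValidationError("bad input"), 0))  # (False, 0.0)
print(decide(RateLimitError(5.0), 0))           # (True, 5.0)
print(decide(RetryableError("timeout"), 1))     # (True, 0.2)
```

Keeping the decision in one pure function like this also makes the classification trivially unit-testable, separate from any sleeping or I/O.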
HTTP-Specific Retry Logic
import aiohttp

class HTTPRetryConfig:
    def __init__(
        self,
        max_attempts: int = 3,
        retry_on_status: Tuple[int, ...] = (429, 500, 502, 503, 504),
        retry_on_timeout: bool = True,
        **backoff_kwargs
    ):
        self.max_attempts = max_attempts
        self.retry_on_status = retry_on_status
        self.retry_on_timeout = retry_on_timeout
        self.backoff = RetryConfig(max_attempts=max_attempts, **backoff_kwargs)

async def fetch_with_retry(
    session: aiohttp.ClientSession,
    url: str,
    config: HTTPRetryConfig = None,
    **kwargs
) -> aiohttp.ClientResponse:
    config = config or HTTPRetryConfig()
    last_exception = None
    for attempt in range(config.max_attempts):
        try:
            # No `async with` here: that would close the response before
            # the caller could read its body. The caller owns the returned
            # response and is responsible for releasing it.
            response = await session.get(url, **kwargs)
            if (response.status in config.retry_on_status
                    and attempt < config.max_attempts - 1):
                delay = calculate_delay(attempt, config.backoff)
                # Honor the server's hint when present (delta-seconds form)
                if "Retry-After" in response.headers:
                    delay = float(response.headers["Retry-After"])
                response.release()  # return the connection to the pool
                await asyncio.sleep(delay)
                continue
            return response
        except asyncio.TimeoutError as e:
            last_exception = e
            if not config.retry_on_timeout or attempt >= config.max_attempts - 1:
                raise
            await asyncio.sleep(calculate_delay(attempt, config.backoff))
        except aiohttp.ClientError as e:
            last_exception = e
            if attempt < config.max_attempts - 1:
                await asyncio.sleep(calculate_delay(attempt, config.backoff))
            else:
                raise
    raise last_exception
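One caveat with `float(response.headers["Retry-After"])`: per RFC 7231 the header may carry either delta-seconds or an HTTP-date, so a bare `float()` will blow up on the date form. Here is a hedged sketch of a parser handling both, using only the standard library (`parse_retry_after` is a hypothetical helper name, not an aiohttp API):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_retry_after(value, now=None):
    """Retry-After is delta-seconds ("120") or an HTTP-date (RFC 7231).
    Return a non-negative delay in seconds."""
    try:
        return max(0.0, float(value))
    except ValueError:
        pass
    when = parsedate_to_datetime(value)  # raises on malformed input
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())

print(parse_retry_after("120"))  # 120.0
print(parse_retry_after("Wed, 21 Oct 2015 07:28:00 GMT",
                        now=datetime(2015, 10, 21, 7, 27,
                                     tzinfo=timezone.utc)))  # 60.0
```

Clamping to zero matters: a date in the past (clock skew, stale caches) should mean "retry now," not a negative sleep.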
Database Retry Logic
import asyncpg

class DatabaseRetryHandler:
    def __init__(self, config: RetryConfig, dsn: str):
        self.config = config
        # Keep the connection string ourselves: asyncpg exposes no public
        # attributes for reconstructing a connection's parameters.
        self.dsn = dsn

    async def execute_with_retry(
        self,
        conn: asyncpg.Connection,
        query: str,
        *args
    ):
        last_exception = None
        for attempt in range(self.config.max_attempts):
            try:
                return await conn.fetch(query, *args)
            except asyncpg.exceptions.ConnectionFailureError as e:
                # Connection dropped: back off, then reconnect before retrying
                last_exception = e
                if attempt < self.config.max_attempts - 1:
                    await asyncio.sleep(calculate_delay(attempt, self.config))
                    conn = await self._reconnect(conn)
                else:
                    raise
            except asyncpg.exceptions.SerializationError as e:
                # Serialization failures are safe to retry from scratch
                last_exception = e
                if attempt < self.config.max_attempts - 1:
                    await asyncio.sleep(calculate_delay(attempt, self.config))
                else:
                    raise
            # Anything else (syntax errors, constraint violations, ...) is
            # not transient and propagates immediately.
        raise last_exception

    async def _reconnect(self, old_conn):
        try:
            await old_conn.close()
        except Exception:
            pass
        return await asyncpg.connect(self.dsn)
Circuit Breaker Integration
import asyncio
import time

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: int = 60,
        expected_exception: Type[Exception] = Exception
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.expected_exception = expected_exception
        self.failure_count = 0
        self.last_failure_time = None
        self.state = "closed"  # closed, open, half-open

    def can_execute(self) -> bool:
        if self.state == "closed":
            return True
        if self.state == "open":
            # After the recovery timeout, allow a single probe request
            if time.time() - self.last_failure_time >= self.recovery_timeout:
                self.state = "half-open"
                return True
            return False
        return True  # half-open: let the probe through

    def record_success(self):
        self.failure_count = 0
        self.state = "closed"

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = "open"

class RetryWithCircuitBreaker:
    def __init__(self, retry_config: RetryConfig, circuit_breaker: CircuitBreaker):
        self.retry_config = retry_config
        self.circuit_breaker = circuit_breaker

    async def execute(self, func: Callable, *args, **kwargs):
        if not self.circuit_breaker.can_execute():
            raise Exception("Circuit breaker is open")
        last_exception = None
        for attempt in range(self.retry_config.max_attempts):
            try:
                result = await func(*args, **kwargs)
                self.circuit_breaker.record_success()
                return result
            except self.circuit_breaker.expected_exception as e:
                last_exception = e
                self.circuit_breaker.record_failure()
                if attempt < self.retry_config.max_attempts - 1:
                    await asyncio.sleep(calculate_delay(attempt, self.retry_config))
                else:
                    raise
        raise last_exception
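The state machine is easiest to verify with an injectable clock. This is a condensed, hypothetical variant of the `CircuitBreaker` above (`Breaker`, `threshold`, `recovery`, and the `clock` parameter are demo-only names) whose closed → open → half-open → closed transitions can be stepped deterministically:

```python
class Breaker:
    """Minimal circuit breaker with an injectable clock for testing."""
    def __init__(self, threshold=3, recovery=5, clock=None):
        self.threshold, self.recovery = threshold, recovery
        self.clock = clock or (lambda: 0.0)
        self.failures, self.opened_at, self.state = 0, None, "closed"

    def can_execute(self):
        # An open breaker transitions to half-open once the timeout elapses
        if self.state == "open" and self.clock() - self.opened_at >= self.recovery:
            self.state = "half-open"
        return self.state != "open"

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.state, self.opened_at = "open", self.clock()

    def record_success(self):
        self.failures, self.state = 0, "closed"

now = {"t": 0.0}  # fake clock we can advance by hand
b = Breaker(threshold=3, recovery=5, clock=lambda: now["t"])

for _ in range(3):
    b.record_failure()      # three failures trip the breaker
print(b.state)              # open
print(b.can_execute())      # False -- calls are short-circuited

now["t"] = 6.0              # recovery timeout elapses
print(b.can_execute())      # True -- probe allowed (half-open)
b.record_success()
print(b.state)              # closed
```

Passing the clock in as a function is the design choice worth copying: it makes the timeout logic testable without real sleeps.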
Monitoring and Observability
import logging
from dataclasses import dataclass

@dataclass
class RetryMetrics:
    attempts: int = 0       # calls to execute()
    successes: int = 0
    failures: int = 0       # calls that exhausted every attempt
    total_retries: int = 0
    total_delay: float = 0.0

class RetryWithMetrics:
    def __init__(self, config: RetryConfig, logger: logging.Logger = None):
        self.config = config
        self.logger = logger or logging.getLogger(__name__)
        self.metrics = RetryMetrics()

    async def execute(self, func: Callable, *args, **kwargs):
        self.metrics.attempts += 1
        last_exception = None
        for attempt in range(self.config.max_attempts):
            try:
                result = await func(*args, **kwargs)
                self.metrics.successes += 1
                self.metrics.total_retries += attempt
                if attempt > 0:
                    self.logger.info(f"Succeeded after {attempt + 1} attempts")
                return result
            except Exception as e:
                last_exception = e
                if attempt < self.config.max_attempts - 1:
                    delay = calculate_delay(attempt, self.config)
                    self.metrics.total_delay += delay
                    self.logger.warning(
                        f"Attempt {attempt + 1} failed: {e}. "
                        f"Retrying in {delay:.2f}s"
                    )
                    await asyncio.sleep(delay)
                else:
                    self.metrics.failures += 1
                    self.logger.error(
                        f"All {self.config.max_attempts} attempts failed: {e}"
                    )
                    raise
        raise last_exception
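A compact end-to-end sketch of what those counters capture: a hypothetical `flaky()` dependency that fails twice, driven by a simplified metrics loop (`run_with_metrics` and the zero-second sleeps are demo-only choices so the example runs instantly):

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Metrics:
    attempts: int = 0
    successes: int = 0
    failures: int = 0
    total_retries: int = 0

state = {"calls": 0}

async def flaky():
    # Fails on the first two calls, then recovers
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("transient")
    return "ok"

async def run_with_metrics(func, m, max_attempts=5):
    m.attempts += 1
    for attempt in range(max_attempts):
        try:
            result = await func()
            m.successes += 1
            m.total_retries += attempt  # retries = attempts beyond the first
            return result
        except ConnectionError:
            if attempt == max_attempts - 1:
                m.failures += 1
                raise
            await asyncio.sleep(0)      # no real delay in the demo

m = Metrics()
print(asyncio.run(run_with_metrics(flaky, m)))  # ok
print(m)  # total_retries == 2: two failed attempts before success
```

In production the interesting signal is the ratio of `total_retries` to `attempts`: a rising ratio means a dependency is degrading before it fails outright.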
Best Practices
GOOD_PATTERNS = {
    "use_exponential_backoff": """
    # Exponential backoff prevents a thundering herd
    ✓ Good:
        delay = base * (2 ** attempt)  # 0.1s, 0.2s, 0.4s, 0.8s...
        delay = min(delay, max_delay)
    ✗ Bad:
        delay = base * (attempt + 1)   # 0.1s, 0.2s, 0.3s...
        # Grows too slowly -- still causes stampedes at higher loads
    """,
    "add_jitter": """
    # Jitter randomizes retry times to reduce collisions
    ✓ Good:
        delay = delay * (0.5 + random.random())
    ✗ Bad:
        # No jitter = synchronized retries
        # All clients retry at exactly the same time
    """,
    "distinguish_error_types": """
    # Only retry transient errors
    ✓ Good:
        if isinstance(e, ValidationError):
            raise  # fail immediately
        if isinstance(e, TimeoutError):
            retry_with_backoff()
    ✗ Bad:
        retry_on(Exception)  # Never retry everything!
    """
}
BAD_PATTERNS = {
    "retry_everything": """
    ✗ Bad:
        # Retrying authentication errors?
        try:
            return await make_request()
        except Exception:
            return await retry()  # Wrong!
        # Authentication failures will not succeed on retry
    ✓ Good:
        async def make_request():
            try:
                return await http.request()
            except TimeoutError:
                return await retry()
            except ValidationError:
                raise  # don't retry validation errors
    """,
    "no_max_attempts": """
    ✗ Bad:
        while True:
            try:
                return await request()
            except Exception:
                await asyncio.sleep(1)
        # Infinite retry loop!
    ✓ Good:
        for attempt in range(max_attempts):
            try:
                return await request()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                await asyncio.sleep(delay)
    """,
    "ignore_circuit_breaker": """
    ✗ Bad:
        # Retrying forever even when the service is down
        for i in range(1000):
            try:
                await failing_service()
            except Exception:
                await asyncio.sleep(backoff)
        # Hammering a dead service!
    ✓ Good:
        # Use a circuit breaker
        breaker = CircuitBreaker(failure_threshold=5)
        async def call():
            if not breaker.can_execute():
                raise ServiceUnavailableError()
            try:
                return await failing_service()
            except Exception:
                breaker.record_failure()
                raise
    """
}
Summary
The Retry Pattern with Exponential Backoff is essential for building resilient systems:
- Exponential Backoff - Increase delay exponentially between retries (prevents overload)
- Jitter - Add randomness to prevent synchronized retry storms
- Error Classification - Distinguish between retryable and non-retryable errors
- Circuit Breaker - Stop retrying when service is clearly down
- Monitoring - Track retry success rates and delays
Key configuration tips:
- Base delay: 100-500ms
- Max attempts: 3-5
- Max delay: 30-60 seconds
- Jitter: Always enabled for production systems
The combination of retries, backoff, jitter, and circuit breakers provides defense in depth for distributed systems.