The Retry Pattern is a fundamental resilience pattern that handles transient failures by automatically retrying failed operations. When combined with exponential backoff and jitter, it becomes a powerful tool for building robust distributed systems.
When to Retry and When to Fail
Not every failure should be retried. The key distinction is whether the failure is transient or permanent. Transient failures, such as network timeouts, database deadlocks, 503 Service Unavailable, or 429 Too Many Requests, are worth retrying because the underlying condition may resolve on its own. Permanent failures, including 400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found, and 422 Unprocessable Entity (validation errors), should not be retried, because the same request will fail identically.
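The classification above can be sketched as a small lookup, assuming the failure surfaces as an HTTP status code. The name `is_retryable_status` and the two sets are illustrative, not part of any library; the codes are the ones listed in the paragraph above.

```python
# Codes the text above treats as transient (worth retrying).
TRANSIENT_STATUSES = frozenset({429, 503})

# Codes the text above treats as permanent (retrying cannot help).
PERMANENT_STATUSES = frozenset({400, 401, 403, 404, 422})


def is_retryable_status(status: int) -> bool:
    """Return True only for codes classified as transient.

    Anything not explicitly listed is treated as non-retryable, which is
    the conservative default for this sketch.
    """
    return status in TRANSIENT_STATUSES


for code in (503, 429, 404, 422):
    print(code, "retry" if is_retryable_status(code) else "fail fast")
```

Defaulting unknown codes to "do not retry" is a design choice for this sketch; a real client may prefer to add further 5xx codes to the transient set.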
Exponential backoff is the standard approach: first retry after 100ms, then 200ms, 400ms, 800ms, and so on. Adding jitter prevents the thundering herd problem, where all retrying clients synchronize and hit the server simultaneously. Implement a retry budget that limits both total retry time (e.g., max 30 seconds) and the number of retries (e.g., 3-5 attempts). Integrate with a circuit breaker: if retries keep failing, open the circuit and fail fast. The most dangerous pattern is infinite retries with no backoff — this is a self-inflicted DDoS. Real-world SDKs such as the AWS SDKs implement exponential backoff with jitter by default.
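A minimal sketch of the retry budget described above, assuming a synchronous operation. The function name `retry_with_budget` and its parameters (`sleep`, `clock`, and `rng` are injectable only to make the sketch testable) are hypothetical, not taken from any SDK.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_budget(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_total_seconds: float = 30.0,
    retry_on: Tuple[Type[Exception], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call operation until it succeeds or either budget runs out."""
    deadline = clock() + max_total_seconds
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # attempt budget exhausted
            # Exponential ceiling (100ms, 200ms, 400ms, ...) scaled by a
            # random factor in [0, 1): the "full jitter" variant.
            delay = base_delay * (2 ** attempt) * rng()
            if clock() + delay > deadline:
                raise  # sleeping would exceed the time budget
            sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```

Exceptions outside `retry_on` propagate immediately, so permanent failures are never retried. Whichever limit is reached first, attempts or total time, ends the loop by re-raising the last error.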
Understanding the Retry Pattern
Why Retries Matter
┌─────────────────────────────────────────────────────────────────┐
│ Transient Failures Are Common │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Network │ │ Service │ │ Database │ │
│ │ Timeout │ │ Restart │ │ Lock Wait │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ Usually recover quickly with retry! │
│ │
│ Statistics: │
│ - 60% of failures are transient │
│ - 90% succeed on retry within 3 attempts │
│ - Exponential backoff reduces load by 99% │
└─────────────────────────────────────────────────────────────────┘
Without Retry Pattern
┌─────────────────────────────────────────────────────────────────┐
│ Immediate Failure (No Retry) │
│ │
│ Request ──► ✗ Connection timeout │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ ERROR! │ User sees failure immediately │
│ │ User upset │ │
│ └─────────────┘ │
│ │
│ ✗ Wasted opportunity │
│ ✗ Poor user experience │
│ ✗ No recovery attempt │
└─────────────────────────────────────────────────────────────────┘
With Retry Pattern
┌─────────────────────────────────────────────────────────────────┐
│ Retry with Exponential Backoff │
│ │
│ Request ──► ✗ Connection timeout │
│ │ │
│ ▼ │
│ Wait 100ms ──► ✗ Still timeout │
│ │ │
│ ▼ │
│ Wait 200ms ──► ✗ Still timeout │
│ │ │
│ ▼ │
│ Wait 400ms ──► ✓ Success! │
│ │ │
│ ▼ │
│ ┌──────────┐│
│ │ SUCCESS! ││
│ └──────────┘│
│ │
│ ✓ Automatic recovery │
│ ✓ Better user experience │
│ ✓ Reduced load from retry storms │
└─────────────────────────────────────────────────────────────────┘
Implementation
Basic Retry with Exponential Backoff
import asyncio
import random
import time
from functools import wraps
from typing import Callable, Optional, Tuple, Type

class RetryConfig:
    def __init__(
        self,
        max_attempts: int = 3,
        base_delay: float = 0.1,
        max_delay: float = 30.0,
        exponential_base: float = 2.0,
        jitter: bool = True
    ):
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.exponential_base = exponential_base
        self.jitter = jitter

def calculate_delay(attempt: int, config: RetryConfig) -> float:
    # attempt is zero-based: 0 -> base_delay, 1 -> base_delay * exponential_base, ...
    delay = config.base_delay * (config.exponential_base ** attempt)
    delay = min(delay, config.max_delay)
    if config.jitter:
        # Scale by a random factor in [0.5, 1.5). Because jitter is applied
        # after the cap, the result can exceed max_delay by up to 50%.
        delay = delay * (0.5 + random.random())
    return delay

async def retry_async(
    func: Callable,
    *args,
    config: Optional[RetryConfig] = None,
    exceptions: Tuple[Type[Exception], ...] = (Exception,),
    **kwargs
):
    config = config or RetryConfig()
    last_exception = None
    for attempt in range(config.max_attempts):
        try:
            return await func(*args, **kwargs)
        except exceptions as e:
            last_exception = e
            if attempt < config.max_attempts - 1:
                delay = calculate_delay(attempt, config)
                await asyncio.sleep(delay)
            else:
                # Final attempt failed: re-raise with the original traceback
                raise
    raise last_exception

def retry_decorator(config: Optional[RetryConfig] = None):
    def decorator(func: Callable):
        @wraps(func)
        async def async_wrapper(*args, **kwargs):
            return await retry_async(func, *args, config=config, **kwargs)

        @wraps(func)
        def sync_wrapper(*args, **kwargs):
            # Use a separate local name: assigning to `config` here would make
            # it local to sync_wrapper and raise UnboundLocalError.
            cfg = config or RetryConfig()
            last_exception = None
            for attempt in range(cfg.max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    last_exception = e
                    if attempt < cfg.max_attempts - 1:
                        delay = calculate_delay(attempt, cfg)
                        time.sleep(delay)
            raise last_exception

        # Pick the wrapper that matches the decorated function
        if asyncio.iscoroutinefunction(func):
            return async_wrapper
        return sync_wrapper
    return decorator
Configurable Retry Strategies
class RetryStrategy:
    def get_delay(self, attempt: int) -> float:
        raise NotImplementedError

class ExponentialBackoff(RetryStrategy):
    def __init__(
        self,
        base: float = 0.1,
        max_delay: float = 30.0,
        multiplier: float = 2.0
    ):
        self.base = base
        self.max_delay = max_delay
        self.multiplier = multiplier

    def get_delay(self, attempt: int) -> float:
        delay = self.base * (self.multiplier ** attempt)
        return min(delay, self.max_delay)

class LinearBackoff(RetryStrategy):
    def __init__(self, base: float = 0.1, increment: float = 0.1):
        self.base = base
        self.increment = increment

    def get_delay(self, attempt: int) -> float:
        return self.base + (attempt * self.increment)

class ConstantBackoff(RetryStrategy):
    def __init__(self, delay: float = 1.0):
        self.delay = delay

    def get_delay(self, attempt: int) -> float:
        return self.delay

class FibonacciBackoff(RetryStrategy):
    def __init__(self, multiplier: float = 1.0):
        self.multiplier = multiplier
        self._cache = {0: 1, 1: 1}

    def _fib(self, n: int) -> float:
        if n in self._cache:
            return self._cache[n]
        self._cache[n] = self._fib(n-1) + self._fib(n-2)
        return self._cache[n]

    def get_delay(self, attempt: int) -> float:
        return self._fib(attempt) * self.multiplier
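To compare the strategies above, the first few delays each one produces with its default parameters can be computed directly. The standalone functions below are illustrative restatements of the `get_delay` formulas, written so the snippet runs on its own; they are not additional API.

```python
def exponential(attempt: int, base: float = 0.1,
                multiplier: float = 2.0, max_delay: float = 30.0) -> float:
    return min(base * multiplier ** attempt, max_delay)


def linear(attempt: int, base: float = 0.1, increment: float = 0.1) -> float:
    return base + attempt * increment


def fibonacci(attempt: int, multiplier: float = 1.0) -> float:
    # Same sequence as the cached version above: fib(0) = fib(1) = 1.
    a, b = 1, 1
    for _ in range(attempt):
        a, b = b, a + b
    return a * multiplier


# Delays in seconds for attempts 0..5, rounded for display.
schedules = {
    name: [round(fn(n), 2) for n in range(6)]
    for name, fn in (("exponential", exponential),
                     ("linear", linear),
                     ("fibonacci", fibonacci))
}

for name, delays in schedules.items():
    print(f"{name:12s} {delays}")
```

With these defaults the Fibonacci schedule is in whole seconds while the other two start at 100ms, so in practice its `multiplier` would likely be set to something like 0.1 to make the curves comparable.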
Jitter Strategies
import random
import math

class Jitter:
    @staticmethod
    def no_jitter(delay: float) -> float:
        return delay

    @staticmethod
    def full_jitter(delay: float) -> float:
        return delay * random.random()

    @staticmethod
    def equal_jitter(delay: float) -> float:
        return delay / 2 + (delay / 2) * random.random()

    @staticmethod
    def decorrelated_jitter(delay: float, last_delay: float = None) -> float:
        if last_delay is None:
            last_delay = delay
        new_delay = last_delay * random.uniform(1.3, 2.0)
        return min(new_delay, 30.0)

class ExponentialBackoffWithJitter(ExponentialBackoff):
    def __init__(self, base: float = 0.1, max_delay: float = 30.0, jitter_type: str = "full"):
        super().__init__(base, max_delay)
        self.jitter_type = jitter_type
        self.last_delay = base

    def get_delay(self, attempt: int) -> float:
        delay = self.base * (2 ** attempt)
        delay = min(delay, self.max_delay)
        if self.jitter_type == "full":
            delay = Jitter.full_jitter(delay)
        elif self.jitter_type == "equal":
            delay = Jitter.equal_jitter(delay)
        elif self.jitter_type == "decorrelated":
            delay = Jitter.decorrelated_jitter(delay, self.last_delay)
            self.last_delay = delay
        elif self.jitter_type == "none":
            delay = Jitter.no_jitter(delay)
        return delay
Handling Different Failure Types
Transient vs Permanent Errors
class RetryableError(Exception):
    """Transient errors that should be retried."""
    pass

class NonRetryableError(Exception):
    """Permanent errors that should not be retried."""
    pass

class ServiceUnavailableError(RetryableError):
    pass

# Note: this name shadows the built-in TimeoutError within this module.
class TimeoutError(RetryableError):
    pass

class RateLimitError(RetryableError):
    def __init__(self, retry_after: float = 60.0):
        self.retry_after = retry_after
        super().__init__(f"Rate limited, retry after {retry_after}s")

class ValidationError(NonRetryableError):
    pass

class AuthenticationError(NonRetryableError):
    pass

class NotFoundError(NonRetryableError):
    pass

class SmartRetryHandler:
    def __init__(self, config: RetryConfig):
        self.config = config

    def should_retry(self, exception: Exception, attempt: int) -> tuple[bool, float]:
        """Return (retry?, delay in seconds) for the given failure."""
        if attempt >= self.config.max_attempts:
            return False, 0
        if isinstance(exception, NonRetryableError):
            return False, 0
        # Checked before the generic RetryableError branch so that the
        # server-provided wait time takes precedence over computed backoff.
        if isinstance(exception, RateLimitError):
            return True, exception.retry_after
        if isinstance(exception, RetryableError):
            delay = calculate_delay(attempt, self.config)
            return True, delay
        if isinstance(exception, (TimeoutError, ConnectionError)):
            delay = calculate_delay(attempt, self.config)
            return True, delay
        # Unclassified exceptions are retried here; return (False, 0) instead
        # if only explicitly classified errors should be retried.
        return True, calculate_delay(attempt, self.config)
HTTP-Specific Retry Logic
import aiohttp

class HTTPRetryConfig:
    def __init__(
        self,
        max_attempts: int = 3,
        retry_on_status: Tuple[int, ...] = (429, 500, 502, 503, 504),
        retry_on_timeout: bool = True,
        **backoff_kwargs
    ):
        self.max_attempts = max_attempts
        self.retry_on_status = retry_on_status
        self.retry_on_timeout = retry_on_timeout
        self.backoff = RetryConfig(max_attempts=max_attempts, **backoff_kwargs)

async def fetch_with_retry(
    session: aiohttp.ClientSession,
    url: str,
    config: HTTPRetryConfig = None,
    **kwargs
) -> aiohttp.ClientResponse:
    config = config or HTTPRetryConfig()
    last_exception = None
    for attempt in range(config.max_attempts):
        try:
            async with session.get(url, **kwargs) as response:
                if response.status in config.retry_on_status:
                    if attempt < config.max_attempts - 1:
                        delay = calculate_delay(attempt, config.backoff)
                        if "Retry-After" in response.headers:
                            try:
                                delay = float(response.headers["Retry-After"])
                            except ValueError:
                                # Retry-After may also be an HTTP-date;
                                # fall back to the computed backoff delay.
                                pass
                        await asyncio.sleep(delay)
                        continue
                # Buffer the body before the context manager releases the
                # connection, so callers can still read it afterwards.
                await response.read()
                return response
        except asyncio.TimeoutError as e:
            last_exception = e
            if not config.retry_on_timeout or attempt >= config.max_attempts - 1:
                raise
            await asyncio.sleep(calculate_delay(attempt, config.backoff))
        except aiohttp.ClientError as e:
            last_exception = e
            if attempt < config.max_attempts - 1:
                await asyncio.sleep(calculate_delay(attempt, config.backoff))
            else:
                raise
    raise last_exception
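The `Retry-After` header can carry either a number of seconds or an HTTP-date, so a plain `float()` conversion only covers the first form. The sketch below handles both using only the standard library; the helper name `parse_retry_after` and its `now` parameter are hypothetical, introduced here for illustration and testability.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: str, now: Optional[datetime] = None) -> Optional[float]:
    """Return the number of seconds to wait, or None if the value is unparseable."""
    value = value.strip()
    if value.isdigit():
        # Delay-seconds form, e.g. "120"
        return float(value)
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without zone information as UTC
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry immediately"
    return max(0.0, (when - now).total_seconds())
```

A caller could use the returned value when it is not `None` and fall back to the computed backoff delay otherwise. Capping the server-supplied wait at some maximum is also worth considering, so that a very large value cannot stall the client.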
Database Retry Logic
import asyncpg

class DatabaseRetryHandler:
    def __init__(self, config: RetryConfig):
        self.config = config

    async def execute_with_retry(
        self,
        conn: asyncpg.Connection,
        query: str,
        *args
    ):
        last_exception = None
        for attempt in range(self.config.max_attempts):
            try:
                return await conn.fetch(query, *args)
            except asyncpg.exceptions.ConnectionFailureError as e:
                # Connection dropped: back off, then reconnect before retrying
                last_exception = e
                if attempt < self.config.max_attempts - 1:
                    await asyncio.sleep(calculate_delay(attempt, self.config))
                    conn = await self._reconnect(conn)
                else:
                    raise
            except asyncpg.exceptions.SerializationError as e:
                # Serialization conflict: the same statement may succeed on retry
                last_exception = e
                if attempt < self.config.max_attempts - 1:
                    await asyncio.sleep(calculate_delay(attempt, self.config))
                else:
                    raise
            # Any other exception propagates immediately without a retry
        raise last_exception

    async def _reconnect(self, old_conn):
        try:
            await old_conn.close()
        except Exception:
            # The old connection is already unusable; ignore close errors
            pass
        # NOTE: this reads private, undocumented attributes of the connection,
        # which may not exist in your asyncpg version. Keeping the connection
        # parameters yourself, or using a connection pool, is safer.
        return await asyncpg.connect(
            host=old_conn._host,
            port=old_conn._port,
            database=old_conn._database,
            user=old_conn._user
        )
Circuit Breaker Integration
import asyncio
import time

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: int = 60,
        expected_exception: Type[Exception] = Exception
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.expected_exception = expected_exception
        self.failure_count = 0
        self.last_failure_time = None
        self.state = "closed"  # closed, open, half-open

    def can_execute(self) -> bool:
        if self.state == "closed":
            return True
        if self.state == "open":
            # After the recovery timeout, let a trial request through
            if time.time() - self.last_failure_time >= self.recovery_timeout:
                self.state = "half-open"
                return True
            return False
        return True

    def record_success(self):
        self.failure_count = 0
        self.state = "closed"

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = "open"

class RetryWithCircuitBreaker:
    def __init__(self, retry_config: RetryConfig, circuit_breaker: CircuitBreaker):
        self.retry_config = retry_config
        self.circuit_breaker = circuit_breaker

    async def execute(self, func: Callable, *args, **kwargs):
        # Fail fast while the circuit is open
        if not self.circuit_breaker.can_execute():
            raise Exception("Circuit breaker is open")
        last_exception = None
        for attempt in range(self.retry_config.max_attempts):
            try:
                result = await func(*args, **kwargs)
                self.circuit_breaker.record_success()
                return result
            except self.circuit_breaker.expected_exception as e:
                last_exception = e
                self.circuit_breaker.record_failure()
                if attempt < self.retry_config.max_attempts - 1:
                    delay = calculate_delay(attempt, self.retry_config)
                    await asyncio.sleep(delay)
                else:
                    raise
        raise last_exception
Monitoring and Observability
import logging
from dataclasses import dataclass
from typing import Optional

@dataclass
class RetryMetrics:
    attempts: int = 0        # calls to execute(), not individual tries
    successes: int = 0
    failures: int = 0
    total_retries: int = 0
    total_delay: float = 0.0

class RetryWithMetrics:
    def __init__(self, config: RetryConfig, logger: Optional[logging.Logger] = None):
        self.config = config
        # Fall back to a module logger so logging calls never hit None
        self.logger = logger or logging.getLogger(__name__)
        self.metrics = RetryMetrics()

    async def execute(self, func: Callable, *args, **kwargs):
        self.metrics.attempts += 1
        for attempt in range(self.config.max_attempts):
            try:
                result = await func(*args, **kwargs)
                self.metrics.successes += 1
                self.metrics.total_retries += attempt
                if attempt > 0:
                    self.logger.info(
                        f"Succeeded after {attempt + 1} attempts"
                    )
                return result
            except Exception as e:
                if attempt < self.config.max_attempts - 1:
                    delay = calculate_delay(attempt, self.config)
                    self.metrics.total_delay += delay
                    self.logger.warning(
                        f"Attempt {attempt + 1} failed: {e}. "
                        f"Retrying in {delay:.2f}s"
                    )
                    await asyncio.sleep(delay)
                else:
                    self.metrics.failures += 1
                    self.logger.error(
                        f"All {self.config.max_attempts} attempts failed: {e}"
                    )
                    raise
        # Only reachable when max_attempts < 1; a bare `raise` here would
        # fail because there is no active exception to re-raise.
        raise ValueError("max_attempts must be at least 1")
Best Practices
GOOD_PATTERNS = {
    "use_exponential_backoff": """
    # Exponential backoff prevents thundering herd
    ✅ Good:
        delay = base * (2 ** attempt)  # 0.1s, 0.2s, 0.4s, 0.8s...
        delay = min(delay, max_delay)
    ❌ Bad:
        delay = base * attempt  # 0.1s, 0.2s, 0.3s...
        # Still causes stampede at higher loads
    """,
    "add_jitter": """
    # Jitter randomizes retry times to reduce collisions
    ✅ Good:
        delay = delay * (0.5 + random.random())
    ❌ Bad:
        # No jitter = synchronized retries
        # All clients retry at exactly same time
    """,
    "distinguish_error_types": """
    # Only retry transient errors
    ✅ Good:
        if isinstance(e, ValidationError):
            raise immediately
        if isinstance(e, TimeoutError):
            retry with backoff
    ❌ Bad:
        retry(Exception)  # Never retry everything!
    """
}
BAD_PATTERNS = {
    "retry_everything": """
    ❌ Bad:
        # Retry authentication errors?
        try:
            return await make_request()
        except Exception:
            return await retry()  # Wrong!
        # Authentication failures won't succeed on retry
    ✅ Good:
        async def make_request():
            try:
                return await http.request()
            except TimeoutError:
                return await retry()
            except ValidationError:
                raise  # Don't retry validation
    """,
    "no_max_attempts": """
    ❌ Bad:
        while True:
            try:
                return await request()
            except:
                await asyncio.sleep(1)
        # Infinite retry loop!
    ✅ Good:
        for attempt in range(max_attempts):
            try:
                return await request()
            except:
                if attempt == max_attempts - 1:
                    raise
                await sleep(delay)
    """,
    "ignore_circuit_breaker": """
    ❌ Bad:
        # Keep retrying even when service is down
        for i in range(1000):
            try:
                await failing_service()
            except:
                await sleep(backoff)
        # Hammering a dead service!
    ✅ Good:
        # Use circuit breaker
        breaker = CircuitBreaker(failure_threshold=5)
        async def call():
            if not breaker.can_execute():
                raise ServiceUnavailableError()
            try:
                result = await failing_service()
            except Exception as e:
                breaker.record_failure()
                raise
            breaker.record_success()  # reset the failure count
            return result
    """
}
Summary
The Retry Pattern with Exponential Backoff is essential for building resilient systems:
- Exponential Backoff - Increase delay exponentially between retries (prevents overload)
- Jitter - Add randomness to prevent synchronized retry storms
- Error Classification - Distinguish between retryable and non-retryable errors
- Circuit Breaker - Stop retrying when service is clearly down
- Monitoring - Track retry success rates and delays
Key configuration tips:
- Base delay: 100-500ms
- Max attempts: 3-5
- Max delay: 30-60 seconds
- Jitter: Always enabled for production systems
The combination of retries, backoff, jitter, and circuit breakers provides defense in depth for distributed systems.