In microservices architectures, traditional ACID transactions don’t work across service boundaries. The Saga pattern provides a solution by managing distributed transactions through a sequence of local transactions with compensating actions.
In this guide, we’ll explore the Saga pattern, its two main approaches (choreography and orchestration), implementation strategies, and best practices.
The Distributed Transaction Problem
Why ACID Doesn’t Work
# Traditional ACID in microservices
problem:
description: "Multiple services, each with their own database"
example_order:
- "Order Service: INSERT order (local transaction)"
- "Payment Service: Process payment (different DB)"
- "Inventory Service: Reserve items (different DB)"
issue: "No single transaction can span all these services"
solutions:
- "Two-Phase Commit (2PC)" โ "Not recommended for microservices"
- "Saga Pattern" โ "Recommended approach"
Saga vs 2PC
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Transaction Approaches โ
โ โ
โ Two-Phase Commit (2PC): โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ โ โ
โ โ Coordinator โ โ
โ โ โ โ โ
โ โ โผ โ โ
โ โ Prepare โโโโบ All Services Lock Resources โ โ
โ โ โ โ โ
โ โ โผ โ โ
โ โ Commit โโโโบ All Services Commit โ โ
โ โ โ โ โ
โ โ โผ โ โ
โ โ (Or Rollback if any fails) โ โ
โ โ โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ Problem: Blocks resources, single point of failure โ
โ โ
โ Saga Pattern: โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ โ โ
โ โ Service A โโโบ Service B โโโบ Service C โ โ
โ โ โ โ โ โ โ
โ โ โผ โผ โผ โ โ
โ โ Local TX Local TX Local TX โ โ
โ โ โ โ โ โ โ
โ โ โ โ โผ โ โ
โ โ โ โ Compensate โ โ
โ โ โ โโโโโโบ if needed โ โ
โ โ โ โ โ
โ โ โโโโโโโโโโโโโโโโบ Compensate โ โ
โ โ if needed โ โ
โ โ โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ Benefit: No blocking, each service owns its data โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Understanding Sagas
What is a Saga?
# A saga is a sequence of local transactions
# Each transaction has a compensating transaction
saga_definition = """
Saga = Sequence of Local Transactions + Compensating Transactions
Example: Order Placement
1. Create Order (local TX)
โ Compensate: Cancel Order
2. Reserve Inventory (local TX)
โ Compensate: Release Inventory
3. Process Payment (local TX)
โ Compensate: Refund Payment
4. Ship Order (local TX)
โ Compensate: Mark as Returned
If any step fails, execute compensating transactions in reverse.
"""
# Each step returns success or triggers compensation
class SagaStep:
def execute(self, data):
"""Execute the step"""
pass
def compensate(self, data):
"""Roll back the step if needed"""
pass
Saga Properties
# Saga characteristics
saga_properties = {
"atomicity": {
"description": "Saga either completes or compensates",
"not_guaranteed": "Unlike ACID, saga doesn't guarantee isolation"
},
"consistency": {
"description": "Data may be temporarily inconsistent",
"mitigation": "Use idempotency and eventual consistency patterns"
},
"isolation": {
"description": "Not guaranteed between sagas",
"mitigation": "Use optimistic concurrency, semantic locks"
},
"durability": {
"description": "Each local transaction is durable",
"mitigation": "Saga state persisted to database"
}
}
Choreography-Based Saga
Services communicate through events. Each service listens for events and publishes events when performing actions.
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Choreography-Based Saga โ
โ โ
โ Order Service Payment Service Inventory Service โ
โ โ โ โ โ
โ โ OrderCreated โ โ โ
โ โโโโโโโโโโโโโโโโบ โ โ โ
โ โ โ โ โ
โ โ PaymentProcessed โ โ
โ โโโโโโโโโโโโโโโโโ โ โ โ
โ โ โ โ โ
โ โ โ InventoryReserved โ
โ โ โโโโโโโโโโโโโโโโโบโ โ
โ โ โ โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ โ โ โ
โ โ OrderCompleted โ โ
โ โ โ โ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Implementation: Choreography
# Event classes
class Event:
def __init__(self, type, data):
self.type = type
self.data = data
# Order Service
class OrderService:
def __init__(self, event_bus):
self.event_bus = event_bus
def create_order(self, order_data):
# 1. Create order in pending state
order = Order(
id=generate_uuid(),
customer_id=order_data['customer_id'],
items=order_data['items'],
status='pending'
)
self.order_repo.save(order)
# 2. Publish OrderCreated event
event = Event('OrderCreated', {
'order_id': order.id,
'customer_id': order.customer_id,
'items': order.items,
'total': order.total
})
self.event_bus.publish(event)
return order
def handle_order_compensating(self, event_data):
# Handle compensation from other service
order = self.order_repo.find_by_id(event_data['order_id'])
order.status = 'cancelled'
order.cancellation_reason = 'payment_failed'
self.order_repo.save(order)
# Payment Service
class PaymentService:
def __init__(self, event_bus):
self.event_bus = event_bus
def handle_order_created(self, event_data):
try:
# Process payment
payment = self.payment_gateway.charge(
customer_id=event_data['customer_id'],
amount=event_data['total']
)
# Publish success event
event = Event('PaymentProcessed', {
'order_id': event_data['order_id'],
'payment_id': payment.id
})
self.event_bus.publish(event)
except PaymentFailed as e:
# Publish failure event for compensation
event = Event('PaymentFailed', {
'order_id': event_data['order_id'],
'reason': str(e)
})
self.event_bus.publish(event)
def handle_payment_compensating(self, event_data):
# Refund payment
payment = self.payment_gateway.refund(event_data['payment_id'])
# Inventory Service
class InventoryService:
def __init__(self, event_bus):
self.event_bus = event_bus
def handle_payment_processed(self, event_data):
try:
# Reserve inventory
for item in event_data['items']:
self.inventory.reserve(
product_id=item['product_id'],
quantity=item['quantity']
)
event = Event('InventoryReserved', {
'order_id': event_data['order_id']
})
self.event_bus.publish(event)
except OutOfStock as e:
event = Event('InventoryFailed', {
'order_id': event_data['order_id'],
'reason': str(e)
})
self.event_bus.publish(event)
def compensate_inventory(self, order_id):
# Release reserved inventory
self.inventory.release(order_id)
# Event Bus (Simple implementation)
class EventBus:
def __init__(self):
self.subscribers = {}
def subscribe(self, event_type, handler):
if event_type not in self.subscribers:
self.subscribers[event_type] = []
self.subscribers[event_type].append(handler)
def publish(self, event):
handlers = self.subscribers.get(event.type, [])
for handler in handlers:
handler(event.data)
Orchestration-Based Saga
A central coordinator (orchestrator) manages the saga execution. The orchestrator tells participants what to do and handles failures.
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Orchestration-Based Saga โ
โ โ
โ Orchestrator โ
โ โ โ
โ โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโ โ
โ โผ โผ โผ โ
โ โโโโโโโโโโ โโโโโโโโโโ โโโโโโโโโโ โ
โ โ Order โ โPayment โ โInventoryโ โ
โ โService โ โService โ โ Service โ โ
โ โโโโโโโโโโ โโโโโโโโโโ โโโโโโโโโโ โ
โ โ โ โ โโโโโโโโโโโโโโดโโโโโโโโโโโโโโ โ
โ โ โ
โ โ โ
โ Results โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Implementation: Orchestration
# Saga Orchestrator
class OrderSagaOrchestrator:
def __init__(self):
self.steps = [
SagaStep('create_order', self.create_order, self.compensate_order),
SagaStep('process_payment', self.process_payment, self.refund_payment),
SagaStep('reserve_inventory', self.reserve_inventory, self.release_inventory),
SagaStep('ship_order', self.ship_order, self.cancel_shipment)
]
def execute(self, order_data):
completed_steps = []
saga_data = {'order_data': order_data}
for step in self.steps:
try:
result = step.execute(saga_data)
saga_data[step.name] = result
completed_steps.append(step)
except Exception as e:
# Compensate in reverse order
self._compensate(completed_steps, saga_data, str(e))
raise SagaFailed(str(e))
return saga_data
def _compensate(self, completed_steps, saga_data, reason):
for step in reversed(completed_steps):
try:
step.compensate(saga_data)
except Exception as e:
# Log compensation failure
log.error(f"Compensation failed for {step.name}: {e}")
# Store for retry
self.compensation_queue.add({
'step': step.name,
'data': saga_data,
'reason': reason
})
# Step implementations
def create_order(self, data):
order = order_service.create_order(data['order_data'])
data['order_id'] = order.id
return order
def compensate_order(self, data):
if 'order_id' in data:
order_service.cancel_order(data['order_id'])
def process_payment(self, data):
payment = payment_service.charge(
customer_id=data['order_data']['customer_id'],
amount=data['order_data']['total']
)
data['payment_id'] = payment.id
return payment
def refund_payment(self, data):
if 'payment_id' in data:
payment_service.refund(data['payment_id'])
def reserve_inventory(self, data):
inventory_service.reserve_for_order(
order_id=data['order_id'],
items=data['order_data']['items']
)
def release_inventory(self, data):
inventory_service.release_for_order(data['order_id'])
def ship_order(self, data):
shipment = shipping_service.create_shipment(
order_id=data['order_id'],
address=data['order_data']['shipping_address']
)
data['tracking_number'] = shipment.tracking_number
return shipment
def cancel_shipment(self, data):
if 'tracking_number' in data:
shipping_service.cancel(data['tracking_number'])
Saga State Management
# Persist saga state for recovery
class SagaState:
PENDING = 'pending'
IN_PROGRESS = 'in_progress'
COMPLETED = 'completed'
COMPENSATING = 'compensating'
FAILED = 'failed'
class SagaRecord:
"""Database record for saga state"""
def __init__(self, saga_id, saga_type, current_step,
saga_data, status, compensation_data=None):
self.id = saga_id
self.type = saga_type
self.current_step = current_step
self.saga_data = saga_data # JSON
self.status = status
self.compensation_data = compensation_data # JSON
self.created_at = datetime.utcnow()
self.updated_at = datetime.utcnow()
class SagaStateStore:
def __init__(self, db):
self.db = db
def create(self, saga_id, saga_type, initial_data):
record = SagaRecord(
saga_id=saga_id,
saga_type=saga_type,
current_step=0,
saga_data=json.dumps(initial_data),
status=SagaState.PENDING
)
self._save(record)
return record
def update_step(self, saga_id, step, data):
record = self._find(saga_id)
record.current_step = step
record.saga_data = json.dumps(data)
record.updated_at = datetime.utcnow()
self._save(record)
def mark_compensating(self, saga_id, compensation_data):
record = self._find(saga_id)
record.status = SagaState.COMPENSATING
record.compensation_data = json.dumps(compensation_data)
self._save(record)
def mark_failed(self, saga_id, error):
record = self._find(saga_id)
record.status = SagaState.FAILED
record.error = error
self._save(record)
def mark_completed(self, saga_id):
record = self._find(saga_id)
record.status = SagaState.COMPENSATING
self._save(record)
Handling Failures
Compensating Transactions
# Compensation strategies
compensation_strategies = {
"retry": {
"description": "Retry failed compensation",
"use_when": "Temporary failures (network timeout)"
},
"manual": {
"description": "Alert operators for manual intervention",
"use_when": "Complex failures requiring human judgment"
},
"saga_orchestrator": {
"description": "Orchestrator handles compensation",
"use_when": "Orchestration-based saga"
}
}
# Example compensation handler
class CompensationHandler:
def __init__(self, saga_store, notification_service):
self.saga_store = saga_store
self.notification = notification_service
def handle_failure(self, saga_id):
saga = self.saga_store.find(saga_id)
if saga.status == SagaState.COMPENSATING:
# Retry compensation
self._retry_compensation(saga)
elif saga.compensation_attempts > 3:
# Escalate to manual intervention
self.notification.alert_ops_team(saga)
self.saga_store.mark_need_manual(saga_id)
def _retry_compensation(self, saga):
# Re-execute compensation logic
for step in reversed(saga.completed_steps):
try:
step.compensate(saga.data)
except:
saga.compensation_attempts += 1
raise
Idempotency
# Make operations idempotent to handle retries
class IdempotentPaymentService:
def __init__(self, payment_gateway, idempotency_store):
self.gateway = payment_gateway
self.store = idempotency_store
def charge(self, customer_id, amount, idempotency_key):
# Check if already processed
existing = self.store.get(idempotency_key)
if existing:
return existing
# Process payment
result = self.gateway.charge(customer_id, amount)
# Store result
self.store.set(idempotency_key, result)
return result
# Idempotency key can be derived from business data
def get_idempotency_key(action, *args):
# payment_customer123_amount100
return f"{action}_{'_'.join(str(a) for a in args)}"
Choreography vs Orchestration
Comparison
# Choreography vs Orchestration
choreography:
pros:
- Loose coupling between services
- No central point of failure
- Each service owns its logic
cons:
- Hard to track overall progress
- Complex to debug
- Implicit flow, harder to understand
best_for:
- Simple workflows (2-3 services)
- Teams that prefer event-driven
- When services are independent
orchestration:
pros:
- Centralized flow control
- Easy to track progress
- Clear error handling
cons:
- Central orchestrator becomes bottleneck
- More coupling to orchestrator
- Single point of failure (mitigate with redundancy)
best_for:
- Complex workflows (many services)
- When business logic is complex
- Need for compensation logic
Decision Matrix
def choose_saga_approach(service_count, team_size, complexity):
if service_count <= 3 and complexity == 'low':
return "choreography"
elif complexity == 'high':
return "orchestration"
elif team_size > 10:
return "orchestration" # Easier to understand
else:
return "choreography"
Implementation Example: Order Service
# Complete example with both approaches
# ========== CHOREOGRAPHY VERSION ==========
class OrderServiceChoreography:
"""Event-driven saga using choreography"""
def create_order(self, order_data):
# Create order
order = Order(
id=generate_uuid(),
status='pending',
**order_data
)
self.repo.save(order)
# Emit event - Saga starts
self.event_bus.publish('OrderCreated', {
'order_id': order.id,
**order_data
})
return order
# ========== ORCHESTRATION VERSION ==========
class OrderSagaOrchestrator:
"""Central orchestrator for order saga"""
def create_order(self, order_data):
# Create saga record
saga_id = generate_uuid()
saga_store.create(saga_id, 'order', order_data)
try:
# Execute saga steps
result = self._execute_order_saga(saga_id, order_data)
saga_store.mark_completed(saga_id)
return result
except SagaFailure as e:
saga_store.mark_failed(saga_id, str(e))
raise
def _execute_order_saga(self, saga_id, order_data):
# Step 1: Create order
order = self.order_service.create(order_data)
saga_store.update_step(saga_id, 'order_created',
{'order_id': order.id})
# Step 2: Process payment
try:
payment = self.payment_service.charge(
customer_id=order.customer_id,
amount=order.total
)
saga_store.update_step(saga_id, 'payment_processed',
{'payment_id': payment.id})
except PaymentFailed:
# Compensate order creation
self.order_service.cancel(order.id)
raise
# Step 3: Reserve inventory
try:
self.inventory_service.reserve(order.items)
saga_store.update_step(saga_id, 'inventory_reserved', {})
except InventoryError:
# Compensate payment
self.payment_service.refund(payment.id)
# Compensate order
self.order_service.cancel(order.id)
raise
# Step 4: Ship order
shipment = self.shipping_service.create(
order_id=order.id,
address=order.shipping_address
)
# Update order status
order.status = 'completed'
self.order_service.save(order)
return order
Best Practices
# Saga Best Practices
design:
- Keep sagas short (ideally < 10 steps)
- Make compensation idempotent
- Use idempotency keys for all operations
- Log saga state for debugging
timeout:
- Set appropriate timeouts for each step
- Use circuit breakers to prevent cascade failures
- Have clear escalation paths
monitoring:
- Track saga execution time
- Alert on stuck sagas
- Monitor compensation frequency
data:
- Never hold locks across saga steps
- Design for eventual consistency
- Use optimistic concurrency where possible
Conclusion
The Saga pattern is essential for managing distributed transactions in microservices:
- Choreography: Event-driven, loose coupling, harder to track
- Orchestration: Centralized, easier to understand, potential bottleneck
- Compensation: Each step must have a compensating action
- Idempotency: Essential for handling retries safely
- State Management: Persist saga state for recovery
Choose choreography for simple workflows and orchestration for complex business logic.
Comments