Introduction
Microservices architecture structures an application as a collection of loosely coupled services. Each service is independently deployable and scalable, enabling technology diversity and team autonomy. When implemented correctly, microservices allow organizations to scale development velocity by enabling multiple teams to work independently on different services without stepping on each other.
This guide covers the full spectrum of microservices architecture, from decomposition strategies and communication patterns to data management, observability, and production operations. It is designed for engineers and architects evaluating or implementing microservices, providing practical patterns and code examples that address real-world challenges.
When to Use Microservices
Microservices are not the right choice for every application. They introduce significant complexity in distributed system coordination, data consistency, and operational overhead. Before adopting microservices, evaluate whether your organization truly needs them.
You Should Consider Microservices When
- Your engineering team has grown beyond 10-15 developers and coordination on a shared codebase becomes painful
- Different parts of your application have different scaling requirements (e.g., the search feature needs 10x the compute of user management)
- You need to deploy changes to one part of the application without redeploying everything
- Different teams need the freedom to use different technology stacks for different problems
- Your application has clear domain boundaries that map naturally to independent services
You Should Stick with a Monolith When
- Your team is small (fewer than 10 developers)
- Your application is simple enough that a monolith serves all needs efficiently
- You are building an MVP where speed of iteration matters more than scalability
- Your organization lacks the operational maturity to manage distributed systems
- Your domain has tightly coupled data that is difficult to split
A common and recommended approach is the modular monolith: a single deployable unit with strict module boundaries that mirror service boundaries. This allows you to validate your domain decomposition before paying the operational cost of distributed services. When the modular monolith outgrows its deployment model, extracting modules into independent services becomes a mechanical exercise.
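One way to keep those module boundaries honest is to declare allowed dependencies explicitly and fail the build when a module reaches across them. A minimal sketch, assuming CI collects (importer, imported) pairs from the codebase; the module names and the `ALLOWED_DEPENDENCIES` map are illustrative, not from any particular tool:

```python
# Hypothetical boundary declaration for a modular monolith.
ALLOWED_DEPENDENCIES = {
    "ordering": {"catalog", "shared"},
    "billing": {"ordering", "shared"},
    "catalog": {"shared"},
    "shared": set(),
}

def check_import(importing_module: str, imported_module: str) -> bool:
    """Return True if the dependency respects the declared boundaries."""
    allowed = ALLOWED_DEPENDENCIES.get(importing_module, set())
    return imported_module == importing_module or imported_module in allowed

def find_violations(edges):
    """edges: (importer, imported) pairs gathered from the codebase, e.g. via AST parsing."""
    return [(a, b) for a, b in edges if not check_import(a, b)]
```

A CI step that runs `find_violations` over the real import graph keeps the monolith's boundaries as strict as service boundaries would be.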
Service Decomposition
Decomposition Strategies
Decomposing a system into microservices requires identifying the right service boundaries. The most effective strategies align services with business capabilities rather than technical layers:
# Example: E-commerce microservices decomposition

# User Service: user management and authentication
class UserService:
    def create_user(self, email: str, name: str) -> User:
        pass

    def authenticate(self, email: str, password: str) -> Token:
        pass

# Product Service: product catalog
class ProductService:
    def get_product(self, product_id: str) -> Product:
        pass

    def search_products(self, query: str) -> List[Product]:
        pass

# Order Service: order management
class OrderService:
    def create_order(self, user_id: str, items: List[OrderItem]) -> Order:
        pass

    def get_order(self, order_id: str) -> Order:
        pass

    def cancel_order(self, order_id: str) -> Order:
        pass

# Inventory Service: stock management
class InventoryService:
    def reserve_stock(self, items: List[StockItem]) -> bool:
        pass

    def release_stock(self, reservation_id: str):
        pass

# Payment Service: payment processing
class PaymentService:
    def process_payment(self, order_id: str, amount: float) -> Payment:
        pass

    def refund_payment(self, payment_id: str) -> Refund:
        pass

# Notification Service: email, SMS, push
class NotificationService:
    def send_order_confirmation(self, user_id: str, order_id: str):
        pass

    def send_shipping_update(self, user_id: str, tracking: str):
        pass
Decomposition by Subdomain (Domain-Driven Design)
Domain-Driven Design (DDD) provides the most reliable framework for identifying service boundaries. Each microservice should map to a DDD subdomain or bounded context:
import uuid
from datetime import datetime, timedelta
from typing import List

# Each bounded context owns its data and behavior completely

# Bounded Context: Ordering
class Order:
    def __init__(self, order_id: str, user_id: str, items: List[OrderItem],
                 shipping_address: Address):
        self.order_id = order_id
        self.user_id = user_id
        self.items = items
        self.shipping_address = shipping_address
        self.status = OrderStatus.PENDING

    def calculate_total(self) -> float:
        subtotal = sum(item.line_total for item in self.items)
        discount = self._apply_discounts(subtotal)
        tax = self._calculate_tax(subtotal - discount)
        return subtotal - discount + tax

    def _apply_discounts(self, subtotal: float) -> float:
        """Only the Ordering context knows about discount rules."""
        if subtotal > 100:
            return subtotal * 0.10  # 10% discount for large orders
        if len(self.items) > 5:
            return subtotal * 0.05  # 5% bulk discount
        return 0.0

    def _calculate_tax(self, amount: float) -> float:
        """Only the Ordering context owns tax calculation logic."""
        # Tax rules are complex and specific to this bounded context
        return amount * TaxRate.for_region(self.shipping_address.region)

# Bounded Context: Billing (separate service, separate database)
class Invoice:
    def __init__(self, order_id: str, amount: float):
        self.invoice_id = str(uuid.uuid4())
        self.order_id = order_id
        self.amount = amount
        self.status = InvoiceStatus.UNPAID
        self.due_date = datetime.now() + timedelta(days=30)
Decomposition Anti-Patterns
| Anti-Pattern | Why It Fails |
|---|---|
| Split by layer (frontend, backend, database) | Creates chatty services that require coordinated changes |
| Shared database across services | Tight coupling; changes in one service break others |
| Nano-services (too many tiny services) | Operational complexity outweighs benefits |
| Service per entity (UserService for User table) | Encourages anemic domain models with logic scattered across services |
| Premature decomposition | You don’t know the boundaries until you’ve built the system |
Communication Patterns
Synchronous Communication (REST/gRPC)
Synchronous communication works well for query operations and request-response workflows where the client needs an immediate answer:
# Synchronous: REST call via HTTP
class OrderService:
    def __init__(self, http_client, order_repository, product_service_url: str):
        self.http_client = http_client
        self.order_repository = order_repository
        self.product_url = product_service_url

    def create_order(self, user_id: str, items: List[dict]):
        # Check product availability synchronously
        for item in items:
            response = self.http_client.get(
                f"{self.product_url}/products/{item['product_id']}"
            )
            product = response.json()
            if product["stock"] < item["quantity"]:
                raise OutOfStockError(item["product_id"],
                                      available=product["stock"])
        order = Order(user_id=user_id, items=items)
        self.order_repository.save(order)
        return order
// gRPC service definition (preferred for internal service-to-service)
syntax = "proto3";

service ProductService {
  rpc GetProduct (GetProductRequest) returns (Product);
  rpc CheckAvailability (CheckAvailabilityRequest) returns (AvailabilityResponse);
  rpc SearchProducts (SearchRequest) returns (SearchResponse);
}

message GetProductRequest {
  string product_id = 1;
}

message Product {
  string id = 1;
  string name = 2;
  double price = 3;
  int32 stock = 4;
  repeated string categories = 5;
}

message CheckAvailabilityRequest {
  string product_id = 1;
  int32 quantity = 2;
}

message AvailabilityResponse {
  bool available = 1;
  int32 available_stock = 2;
}
Asynchronous Communication (Events)
Asynchronous communication via events is preferred for operations that can be processed eventually, and for propagating state changes across services:
# Asynchronous: Event-driven communication
class OrderService:
    def __init__(self, event_bus, order_repository):
        self.event_bus = event_bus
        self.order_repository = order_repository

    def create_order(self, user_id: str, items: List[dict]):
        order = Order(user_id=user_id, items=items)
        self.order_repository.save(order)
        # Publish event for async processing by other services
        self.event_bus.publish(OrderCreatedEvent(
            order_id=order.id,
            user_id=user_id,
            items=[OrderItemDTO(**item) for item in items],
            total=order.calculate_total(),
            timestamp=datetime.utcnow(),
        ))
        return order

# Separate service subscribes to the event
class NotificationService:
    def __init__(self, event_bus):
        event_bus.subscribe(OrderCreatedEvent, self.handle_order_created)

    async def handle_order_created(self, event: OrderCreatedEvent):
        user = await self.user_service.get_user(event.user_id)
        # Send confirmation email
        await self.email_service.send(
            to=user.email,
            template="order_confirmation",
            data={"order_id": event.order_id, "items": event.items},
        )
        # Schedule delivery notification
        if user.preferences.push_enabled:
            await self.push_service.schedule(
                user_id=event.user_id,
                message=f"Order {event.order_id} confirmed!",
                delay=timedelta(hours=1),
            )
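The examples above assume an `event_bus` object. In production this would sit on a broker such as Kafka or RabbitMQ, but the publish/subscribe contract can be sketched in memory; this toy version dispatches synchronously and all class names are illustrative:

```python
from collections import defaultdict

class InMemoryEventBus:
    """Dispatches events to subscribers by event type. Synchronous for brevity."""
    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event):
        for handler in self._handlers[type(event)]:
            handler(event)

class OrderCreatedEvent:
    def __init__(self, order_id: str):
        self.order_id = order_id

bus = InMemoryEventBus()
received = []
bus.subscribe(OrderCreatedEvent, lambda e: received.append(e.order_id))
bus.publish(OrderCreatedEvent("ord_1"))  # the subscriber runs immediately
```

A broker-backed bus would deliver asynchronously and durably, but the subscribe/publish shape used throughout this guide stays the same.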
Synchronous vs. Asynchronous: Decision Guide
| Consideration | Synchronous | Asynchronous |
|---|---|---|
| Client needs immediate response | ✓ | ✗ |
| Operation must be atomic | ✓ | ✗ (eventual consistency) |
| Multiple services need the data | ✗ (tight coupling) | ✓ (event propagation) |
| Failure isolation | ✗ (cascading failures) | ✓ (independent retries) |
| Debugging and tracing | Easier (linear flow) | Harder (asynchronous flow) |
| Complexity | Lower | Higher |
Service Discovery
In a dynamic microservices environment, service instances come and go. Service discovery provides a way for services to find each other without hardcoded addresses:
# Consul-based service discovery
import consul

class ServiceDiscovery:
    def __init__(self, consul_host: str = "localhost"):
        self.client = consul.Consul(host=consul_host)

    def register(self, service_name: str, instance_id: str,
                 address: str, port: int):
        self.client.agent.service.register(
            name=service_name,
            service_id=instance_id,
            address=address,
            port=port,
            check=consul.Check.tcp(address, port, "10s"),
        )

    def discover(self, service_name: str) -> List[dict]:
        _, services = self.client.health.service(
            service_name, passing=True
        )
        return [
            {
                "address": s["Service"]["Address"],
                "port": s["Service"]["Port"],
            }
            for s in services
        ]

    def deregister(self, instance_id: str):
        self.client.agent.service.deregister(instance_id)
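`discover()` returns a list of healthy instances, and the caller still has to pick one. A simple client-side round-robin selector over that result might look like this (a sketch; real clients also refresh the instance list and skip failing hosts):

```python
import itertools

class RoundRobinSelector:
    """Rotates through the instances returned by service discovery."""
    def __init__(self, instances):
        self._cycle = itertools.cycle(instances)

    def next_instance(self) -> dict:
        return next(self._cycle)

instances = [
    {"address": "10.0.0.1", "port": 8080},
    {"address": "10.0.0.2", "port": 8080},
]
selector = RoundRobinSelector(instances)
picked = [selector.next_instance()["address"] for _ in range(3)]
```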
For Kubernetes-native environments, built-in DNS-based service discovery eliminates the need for external service discovery tools:
# Kubernetes Service: automatically discoverable via DNS
apiVersion: v1
kind: Service
metadata:
  name: product-service
spec:
  selector:
    app: product-service
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
---
# Other services resolve to: product-service.namespace.svc.cluster.local
API Gateway Pattern
The API gateway serves as the single entry point for all client requests, handling cross-cutting concerns that individual services should not manage independently:
class APIGateway:
    """Single entry point for all client requests with cross-cutting concerns."""

    def __init__(self):
        self.routes = {
            "/api/users": "http://user-service:8080",
            "/api/products": "http://product-service:8080",
            "/api/orders": "http://order-service:8080",
            "/api/payments": "http://payment-service:8080",
            "/api/search": "http://search-service:8080",
        }

    async def handle_request(self, request: Request) -> Response:
        # 1. Authentication: validate before routing
        user = await self.authenticate(request)
        if not user:
            return Response(status_code=401, body="Unauthorized")

        # 2. Rate limiting per client
        if not self.rate_limiter.check(user.id, request.path):
            return Response(status_code=429, body="Rate limit exceeded")

        # 3. Request logging
        request_id = str(uuid.uuid4())
        correlation_id = request.headers.get("X-Correlation-ID", request_id)

        # 4. Route and forward
        for prefix, service_url in self.routes.items():
            if request.path.startswith(prefix):
                return await self._forward_with_retry(
                    method=request.method,
                    url=service_url + request.path,
                    body=request.body,
                    headers=self._build_headers(user, correlation_id),
                )
        return Response(status_code=404, body="Not found")

    async def _forward_with_retry(
        self, method: str, url: str, body: dict, headers: dict
    ) -> Response:
        max_retries = 2
        for attempt in range(max_retries):
            try:
                return await self.http_client.request(
                    method, url, json=body, headers=headers,
                    timeout=5.0,
                )
            except RequestTimeout:
                if attempt == max_retries - 1:
                    raise
            except ServiceUnavailable:
                # Circuit breaker logic would go here
                raise

    def _build_headers(self, user: User, correlation_id: str) -> dict:
        return {
            "X-User-ID": user.id,
            "X-User-Role": user.role,
            "X-Correlation-ID": correlation_id,
            "X-Request-Start": str(time.time()),
        }
Gateway Responsibilities
| Concern | Implementation |
|---|---|
| Authentication | Validate JWT, extract user context |
| Rate limiting | Token bucket per client/IP |
| Request routing | Path-based to appropriate service |
| Response aggregation | Combine responses from multiple services |
| Protocol translation | HTTP to gRPC, REST to GraphQL |
| Circuit breaking | Fail fast when downstream services are down |
| Request/response transformation | Add/remove headers, format conversion |
| CORS management | Single CORS policy for all services |
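The rate-limiting row above names a token bucket per client. A minimal sketch of that mechanism with an injectable clock for testability; the rates and the keying scheme are illustrative:

```python
import time

class TokenBucket:
    """Admits `rate` requests per second with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def check(self) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# With a frozen clock, a capacity-2 bucket admits exactly two requests
bucket = TokenBucket(rate=1.0, capacity=2, clock=lambda: 0.0)
results = [bucket.check(), bucket.check(), bucket.check()]
```

A gateway would keep one bucket per (client, route) key, in a local dict or in Redis when the gateway itself runs as multiple instances.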
Data Management
Database per Service
Each microservice owns its data and exposes it only through its API. Direct database access from other services is forbidden:
# Order Service: owns order data exclusively
class OrderRepository:
    def __init__(self, db_session):
        self.session = db_session

    def save(self, order: Order) -> Order:
        self.session.add(order)
        self.session.commit()
        return order

    def find_by_id(self, order_id: str) -> Optional[Order]:
        return self.session.query(Order).filter_by(id=order_id).first()

    def find_by_user(self, user_id: str, limit: int = 20) -> List[Order]:
        return (
            self.session.query(Order)
            .filter_by(user_id=user_id)
            .order_by(Order.created_at.desc())
            .limit(limit)
            .all()
        )

# BAD: other services must NOT access the order database directly
# from order_service.models import Order  # DON'T DO THIS
# Order.query.filter_by(user_id=user_id).all()  # DON'T DO THIS
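The approved alternative is a thin client over the owning service's public API. The sketch below injects the transport so it can run without a network; `OrderClient` and the endpoint path are hypothetical:

```python
class OrderClient:
    """Other services reach order data only through the Order Service's API."""
    def __init__(self, http_get, base_url: str = "http://order-service:8080"):
        self._get = http_get      # injected so tests can substitute a fake
        self._base = base_url

    def orders_for_user(self, user_id: str) -> list:
        return self._get(f"{self._base}/api/orders?user_id={user_id}")

# Usage with a fake transport standing in for an HTTP library:
fake_get = lambda url: [{"id": "ord_1"}] if "user_42" in url else []
client = OrderClient(fake_get)
orders = client.orders_for_user("user_42")
```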
Saga Pattern for Distributed Transactions
When an operation spans multiple services, a saga coordinates the steps and provides compensation actions for rollback:
# Orchestration-based saga: a central coordinator reacts to events and issues commands
class OrderSaga:
    """Coordinates the order creation saga across services."""

    def __init__(self, event_bus, order_repository):
        self.event_bus = event_bus
        self.order_repository = order_repository
        self._register_handlers()

    def _register_handlers(self):
        self.event_bus.subscribe(OrderCreated, self.on_order_created)
        self.event_bus.subscribe(InventoryReserved, self.on_inventory_reserved)
        self.event_bus.subscribe(PaymentProcessed, self.on_payment_processed)
        self.event_bus.subscribe(InventoryReservationFailed,
                                 self.on_inventory_failed)
        self.event_bus.subscribe(PaymentFailed, self.on_payment_failed)

    async def on_order_created(self, event: OrderCreated):
        """Step 1: Order created -> try to reserve inventory."""
        await self.event_bus.publish(ReserveInventoryCommand(
            order_id=event.order_id,
            items=event.items,
        ))

    async def on_inventory_reserved(self, event: InventoryReserved):
        """Step 2: Inventory reserved -> process payment."""
        await self.event_bus.publish(ProcessPaymentCommand(
            order_id=event.order_id,
            amount=event.total,
            user_id=event.user_id,
        ))

    async def on_payment_processed(self, event: PaymentProcessed):
        """Step 3: Payment successful -> confirm order."""
        self.order_repository.update_status(
            event.order_id, OrderStatus.CONFIRMED
        )
        await self.event_bus.publish(OrderConfirmed(
            order_id=event.order_id,
        ))

    async def on_inventory_failed(self, event: InventoryReservationFailed):
        """Compensation: Inventory unavailable -> cancel order."""
        self.order_repository.update_status(
            event.order_id, OrderStatus.CANCELLED
        )
        await self.event_bus.publish(OrderCancelled(
            order_id=event.order_id,
            reason="Inventory unavailable",
        ))

    async def on_payment_failed(self, event: PaymentFailed):
        """Compensation: Payment failed -> release inventory, cancel order."""
        await self.event_bus.publish(ReleaseInventoryCommand(
            order_id=event.order_id,
            items=event.items,
        ))
        self.order_repository.update_status(
            event.order_id, OrderStatus.CANCELLED
        )
CQRS (Command Query Responsibility Segregation)
CQRS separates read and write models, allowing each to be optimized independently. This is particularly valuable for services with asymmetric read/write patterns:
# Command side: optimized for writes
class OrderCommandHandler:
    def __init__(self, order_repository, event_bus):
        self.order_repository = order_repository
        self.event_bus = event_bus

    def handle_create_order(self, command: CreateOrderCommand) -> Order:
        order = Order(
            user_id=command.user_id,
            items=[OrderItem(**i) for i in command.items],
        )
        # Validate business rules
        if not order.has_valid_items():
            raise InvalidOrderError("Order contains invalid items")
        saved = self.order_repository.save(order)
        self.event_bus.publish(OrderCreated.from_order(saved))
        return saved

# Query side: optimized for reads (could use different storage)
class OrderQueryHandler:
    def __init__(self, read_db):
        # read_db could be a read replica, elasticsearch, or materialized view
        self.db = read_db

    def get_user_orders(self, user_id: str, page: int = 1) -> OrderListResponse:
        """Read-optimized query with pre-joined data."""
        return self.db.query(
            """
            SELECT o.id, o.total, o.status, o.created_at,
                   COUNT(i.id) AS item_count,
                   JSON_AGG(
                       JSON_BUILD_OBJECT(
                           'name', p.name,
                           'price', i.price,
                           'quantity', i.quantity
                       )
                   ) AS items
            FROM order_views o
            JOIN order_item_views i ON o.id = i.order_id
            JOIN product_views p ON i.product_id = p.id
            WHERE o.user_id = $1
            GROUP BY o.id
            ORDER BY o.created_at DESC
            LIMIT 20 OFFSET $2
            """,
            user_id, (page - 1) * 20,
        )
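The read side is typically kept current by a projector that consumes the same domain events the command side publishes and updates a denormalized view. An in-memory sketch; the event shapes and field names are illustrative:

```python
class OrderProjection:
    """Maintains a per-user read model from order events."""
    def __init__(self):
        self.by_user = {}  # user_id -> list of order summaries

    def on_order_created(self, event: dict):
        summary = {
            "order_id": event["order_id"],
            "status": "PENDING",
            "total": event["total"],
        }
        self.by_user.setdefault(event["user_id"], []).append(summary)

    def on_order_confirmed(self, event: dict):
        for summaries in self.by_user.values():
            for summary in summaries:
                if summary["order_id"] == event["order_id"]:
                    summary["status"] = "CONFIRMED"

proj = OrderProjection()
proj.on_order_created({"order_id": "o1", "user_id": "u1", "total": 42.0})
proj.on_order_confirmed({"order_id": "o1"})
```

Because the projection lags the write model by the event-delivery delay, queries against it are eventually consistent.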
Observability
Distributed Tracing
Tracing requests across service boundaries is essential for debugging performance issues and understanding system behavior:
# OpenTelemetry distributed tracing
from opentelemetry import trace
from opentelemetry.instrumentation.requests import RequestsInstrumentor

RequestsInstrumentor().instrument()  # auto-injects trace context into outgoing HTTP calls
tracer = trace.get_tracer(__name__)

class OrderService:
    @tracer.start_as_current_span("order_service.create_order")
    def create_order(self, user_id: str, items: List[dict]) -> Order:
        current_span = trace.get_current_span()
        current_span.set_attribute("user_id", user_id)
        current_span.set_attribute("item_count", len(items))

        order = Order(user_id=user_id, items=items)
        self.order_repository.save(order)

        # Propagate trace context to downstream services
        with tracer.start_as_current_span("inventory.check"):
            inventory_result = self.inventory_client.check_availability(
                items=items,
                metadata={
                    "trace_id": current_span.get_span_context().trace_id,
                },
            )
        return order
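Across HTTP boundaries, trace context travels as a W3C `traceparent` header of the form `version-traceid-spanid-flags`. OpenTelemetry propagators build and parse this automatically; the hand-rolled sketch below only makes the format concrete:

```python
import random

def make_traceparent(trace_id: int = None, span_id: int = None) -> str:
    """Builds a W3C traceparent header: 2-hex version, 32-hex trace id,
    16-hex parent span id, 2-hex flags ("01" means sampled)."""
    if trace_id is None:
        trace_id = random.getrandbits(128)
    if span_id is None:
        span_id = random.getrandbits(64)
    return f"00-{trace_id:032x}-{span_id:016x}-01"
```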
Structured Logging
Each log entry should include correlation IDs and service context to enable cross-service debugging:
import structlog

logger = structlog.get_logger()

class OrderService:
    async def process_order(self, order_id: str) -> Order:
        logger.info("processing_order", order_id=order_id)
        try:
            order = await self.order_repository.find_by_id(order_id)
            if not order:
                logger.warning("order_not_found", order_id=order_id)
                raise OrderNotFoundError(order_id)
            result = await self._execute_processing(order)
            logger.info(
                "order_processed",
                order_id=order_id,
                status=result.status,
                processing_time_ms=result.duration_ms,
            )
            return result
        except PaymentDeclinedError as e:
            logger.warning(
                "payment_declined",
                order_id=order_id,
                reason=e.reason,
                payment_provider=e.provider,
            )
            raise
        except Exception:
            logger.exception("order_processing_failed", order_id=order_id)
            raise
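To stamp every entry with the request's correlation ID without threading it through each call, bind it once per request in a `contextvars.ContextVar` (structlog offers this pattern via `structlog.contextvars`; the helper below is a self-contained stand-in):

```python
import contextvars

correlation_id = contextvars.ContextVar("correlation_id", default="unknown")

def bind_correlation_id(value: str) -> None:
    """Called once at the start of request handling, e.g. in middleware."""
    correlation_id.set(value)

def log_event(event: str, **fields) -> dict:
    """Toy stand-in for a structlog processor that stamps every entry."""
    return {"event": event, "correlation_id": correlation_id.get(), **fields}

bind_correlation_id("req-123")
entry = log_event("processing_order", order_id="ord_1")
```

Because `ContextVar` is task-local, concurrent requests handled by different asyncio tasks each keep their own ID.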
Health Checks and Readiness Probes
Every service should expose health endpoints for orchestration platforms:
# FastAPI health endpoints
from fastapi import Depends, FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/healthz")
async def liveness():
    """Kubernetes liveness probe: is the process alive?"""
    return {"status": "alive"}

@app.get("/ready")
async def readiness(deps: DependencyChecker = Depends(DependencyChecker)):
    """Kubernetes readiness probe: can the service handle requests?"""
    statuses = await deps.check_all()
    dependencies = [{"name": s.name, "ready": s.ready} for s in statuses]
    if not all(s.ready for s in statuses):
        return JSONResponse(
            status_code=503,
            content={"ready": False, "dependencies": dependencies},
        )
    return {"ready": True, "dependencies": dependencies}
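The `DependencyChecker` used above is not spelled out. One plausible shape runs each registered async check under a timeout, so a hung dependency marks the probe not-ready instead of hanging it; the class and its defaults are illustrative:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class DependencyStatus:
    name: str
    ready: bool

class DependencyChecker:
    """Runs registered async checks concurrently, each under a timeout."""
    def __init__(self, timeout: float = 2.0):
        self._checks = {}
        self._timeout = timeout

    def register(self, name: str, check_factory) -> None:
        """check_factory: zero-arg callable returning an awaitable."""
        self._checks[name] = check_factory

    async def check_all(self):
        async def run(name, factory):
            try:
                await asyncio.wait_for(factory(), self._timeout)
                return DependencyStatus(name, True)
            except Exception:
                return DependencyStatus(name, False)
        return await asyncio.gather(
            *(run(n, f) for n, f in self._checks.items())
        )

async def demo():
    checker = DependencyChecker(timeout=0.1)
    checker.register("database", lambda: asyncio.sleep(0))   # responds
    checker.register("cache", lambda: asyncio.sleep(10))     # hangs
    return await checker.check_all()

statuses = asyncio.run(demo())
```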
Deployment and Operations
Containerization
Each microservice is packaged as a container image for consistent deployment:
# Multi-stage Docker build: optimized for size and security
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
FROM python:3.12-slim
RUN groupadd -r app && useradd -r -g app app
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
COPY . .
USER app
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/healthz')"
EXPOSE 8080
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: production
spec:
  replicas: 3
  strategy:
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
        version: v2.1.0
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
    spec:
      serviceAccountName: order-service
      containers:
        - name: order-service
          image: registry.example.com/order-service:v2.1.0
          ports:
            - containerPort: 8080
          env:
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: order-db-credentials
                  key: host
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: order-db-credentials
                  key: password
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "1"
              memory: "1Gi"
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          lifecycle:
            preStop:
              exec:
                # Give load balancers time to drain; SIGTERM follows automatically
                command: ["sh", "-c", "sleep 10"]
---
apiVersion: v1
kind: Service
metadata:
  name: order-service
spec:
  selector:
    app: order-service
  ports:
    - port: 80
      targetPort: 8080
Testing Microservices
Test Strategy
| Test Type | Scope | Speed | Confidence |
|---|---|---|---|
| Unit tests | Single class/function | Milliseconds | Low |
| Integration tests | Service + dependencies | Seconds | Medium |
| Contract tests | Service boundaries | Minutes | High |
| End-to-end tests | Multiple services | Minutes | Very high |
| Chaos tests | Resilience under failure | Hours | Highest |
Contract Testing with Pact
Contract tests verify that services agree on API semantics without requiring full end-to-end deployment:
# Consumer-side contract test (pact-python)
import atexit
import unittest

from pact import Consumer, Provider

pact = Consumer("OrderService").has_pact_with(
    Provider("ProductService"), pact_dir="./pacts"
)
pact.start_service()
atexit.register(pact.stop_service)

class ProductServicePactTest(unittest.TestCase):
    def test_get_product(self):
        expected = {
            "id": "prod_123",
            "name": "Wireless Mouse",
            "price": 29.99,
            "stock": 150,
        }
        (pact
         .given("product exists")
         .upon_receiving("a request for a product")
         .with_request(method="GET", path="/api/products/prod_123")
         .will_respond_with(200, body=expected))
        with pact:
            result = self.order_service.product_client.get_product("prod_123")
        self.assertEqual(result.id, "prod_123")
        self.assertEqual(result.price, 29.99)

    def test_product_not_found(self):
        (pact
         .given("product does not exist")
         .upon_receiving("a request for a missing product")
         .with_request(method="GET", path="/api/products/prod_999")
         .will_respond_with(404, body={"error": "Product not found"}))
        with pact:
            with self.assertRaises(ProductNotFoundError):
                self.order_service.product_client.get_product("prod_999")
Common Pitfalls and How to Avoid Them
1. Shared Database
Resist the temptation to share databases between services. A shared database creates hidden coupling: a schema change in one service can break another. Each service must own its data exclusively.
2. Chatty Communication
Design service APIs for coarse-grained operations. A service that fetches a user, then their orders, then order items from three different endpoints creates excessive network round-trips. Instead, provide composite endpoints:
# BAD: Three round-trips
user = user_service.get_user(user_id)
orders = order_service.get_orders(user_id=user.id)
items = order_service.get_order_items(order_id=orders[0].id)

# GOOD: Single composite endpoint
dashboard = order_service.get_user_dashboard(user_id=user.id)
3. Ignoring Failure
Distributed systems fail in complex ways. Network partitions, slow responses, and transient errors are normal, not exceptional. Every service call must handle timeouts, retries with backoff, and graceful degradation:
import asyncio

import aiohttp
from tenacity import retry, stop_after_attempt, wait_exponential

class ResilientClient:
    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=1, max=10),
    )
    async def call_downstream(self, url: str) -> dict:
        try:
            timeout = aiohttp.ClientTimeout(total=5)
            async with aiohttp.ClientSession(timeout=timeout) as session:
                async with session.get(url) as response:
                    return await response.json()
        except asyncio.TimeoutError:
            logger.warning("downstream_timeout", url=url)
            raise
        except aiohttp.ClientError as e:
            logger.error("downstream_failure", url=url, error=str(e))
            raise
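The gateway example earlier left a placeholder where circuit-breaker logic would go. A minimal breaker opens after a run of consecutive failures, rejects calls while open, and lets a trial request through after a cooldown; the thresholds and injectable clock are illustrative:

```python
import time

class CircuitBreaker:
    """Fail fast after `max_failures` consecutive errors; retry after `reset_after` seconds."""
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.reset_after:
            return True  # half-open: permit one trial request
        return False

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()

breaker = CircuitBreaker(max_failures=2, clock=lambda: 100.0)
breaker.record_failure()
breaker.record_failure()      # second failure opens the breaker
blocked = not breaker.allow()
breaker.record_success()      # a successful trial closes it again
```

Callers wrap each downstream request: check `allow()` first, then report the outcome with `record_success()` or `record_failure()`.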
4. Distributed Monolith
A distributed monolith happens when services are deployed separately but still require coordinated deployments because of tight coupling. Symptoms include:
- Changing one service requires changes in multiple other services
- Services share model classes or libraries
- Feature development spans more than 2-3 services
Prevention: enforce strict domain boundaries, use event-driven communication, and keep service APIs stable.
Conclusion
Microservices offer independence and scalability but introduce complexity. Decompose by business capability using domain-driven design, prefer asynchronous communication with events for state changes, implement API gateways for cross-cutting concerns, and design for failure from day one.
The most effective path to microservices is evolutionary: start with a well-structured modular monolith, validate your domain boundaries, and extract services incrementally as your organization’s needs and capabilities grow. Microservices are a means to an end (faster, safer software delivery), not an end in themselves.
Resources
- “Building Microservices” by Sam Newman, the definitive guide
- “Microservices Patterns” by Chris Richardson, practical patterns for distributed data management
- “Domain-Driven Design” by Eric Evans, the foundation for service decomposition
- Microservices.io, a pattern catalog by Chris Richardson
- OpenTelemetry Documentation, observability standards
- Kubernetes Documentation, container orchestration
- Pact Documentation, contract testing framework