LLM Orchestration Patterns: Chains, Agents, Tools, and Memory
Building production LLM applications requires more than just API calls to GPT-4 or Claude. You need orchestration: the ability to compose multiple LLM interactions, integrate external tools, maintain context, and make autonomous decisions. This is where frameworks like LangChain and LlamaIndex shine.
But with great power comes complexity. Should you use a simple chain or a full agent? When does memory become essential? How do you design tools that LLMs can reliably use? This guide explores the core building blocks of LLM orchestration, helping you make informed architectural decisions for your AI systems.
Understanding the Orchestration Landscape
Before diving into specifics, let’s establish a mental model. LLM orchestration frameworks provide four fundamental primitives:
- Chains: Deterministic sequences of operations (LLM calls, data transformations, API requests)
- Agents: Autonomous systems that decide which actions to take based on observations
- Tools: Functions that extend LLM capabilities (search, calculation, database queries)
- Memory: Mechanisms for maintaining context across interactions
The key insight: start simple with chains, add tools when you need external capabilities, introduce agents when you need dynamic decision-making, and layer in memory when context matters.
Chains: The Foundation of Orchestration
Chains are the simplest orchestration pattern: a predefined sequence of steps executed in order. Think of them as pipelines where each step’s output feeds into the next.
Simple Chains: Linear Execution
The most basic chain is a single LLM call with a prompt template:
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.llms import OpenAI

# Define a prompt template
prompt = PromptTemplate(
    input_variables=["product"],
    template="Generate 5 creative marketing slogans for {product}"
)

# Create a chain
chain = LLMChain(llm=OpenAI(temperature=0.7), prompt=prompt)

# Execute
result = chain.run(product="eco-friendly water bottles")
When to use simple chains:
- Single-purpose tasks with predictable inputs
- Content generation with consistent structure
- Data transformation pipelines
- Situations where determinism is critical
Sequential Chains: Multi-Step Processing
Sequential chains connect multiple LLM calls, passing outputs forward:
from langchain.chains import SimpleSequentialChain

# Chain 1: Generate a product description
description_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        input_variables=["product"],
        template="Write a detailed product description for {product}"
    )
)

# Chain 2: Extract key features from description
features_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        input_variables=["description"],
        template="Extract 5 key features from this description:\n{description}"
    )
)

# Combine into sequential chain
overall_chain = SimpleSequentialChain(
    chains=[description_chain, features_chain],
    verbose=True
)

result = overall_chain.run("wireless noise-canceling headphones")
When to use sequential chains:
- Multi-stage content pipelines (draft → refine → format)
- Analysis workflows (extract → summarize → categorize)
- Data enrichment processes
- When each step depends on the previous output
Parallel Chains: Concurrent Execution
For independent operations, parallel execution improves performance:
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
import asyncio

async def parallel_analysis(text):
    # Define multiple independent analyses
    sentiment_chain = LLMChain(
        llm=llm,
        prompt=PromptTemplate(
            input_variables=["text"],
            template="Analyze the sentiment of: {text}"
        )
    )
    entities_chain = LLMChain(
        llm=llm,
        prompt=PromptTemplate(
            input_variables=["text"],
            template="Extract named entities from: {text}"
        )
    )
    topics_chain = LLMChain(
        llm=llm,
        prompt=PromptTemplate(
            input_variables=["text"],
            template="Identify main topics in: {text}"
        )
    )

    # Execute in parallel
    results = await asyncio.gather(
        sentiment_chain.arun(text=text),
        entities_chain.arun(text=text),
        topics_chain.arun(text=text)
    )

    return {
        "sentiment": results[0],
        "entities": results[1],
        "topics": results[2]
    }
When to use parallel chains:
- Independent analyses on the same input
- Multiple perspectives on a problem
- Performance-critical applications
- Ensemble approaches (combining multiple model outputs)
Router Chains: Conditional Logic
Router chains select different paths based on input characteristics:
from langchain.chains.router import MultiPromptChain
from langchain.chains import ConversationChain

# Define specialized chains
physics_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        template="As a physics expert, answer: {input}",
        input_variables=["input"]
    )
)

programming_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        template="As a senior developer, answer: {input}",
        input_variables=["input"]
    )
)

# Router decides which chain to use
router_chain = MultiPromptChain(
    router_chain=...,  # LLM-based router
    destination_chains={
        "physics": physics_chain,
        "programming": programming_chain
    },
    default_chain=ConversationChain(llm=llm)
)
When to use router chains:
- Domain-specific expertise routing
- Multi-tenant applications with different behaviors
- Complexity-based routing (simple vs. complex queries)
- Cost optimization (route simple queries to cheaper models)
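If you'd rather not reach for `MultiPromptChain`, the routing idea itself is small enough to hand-roll. In this sketch, `classify` stands in for an LLM call and the category names are illustrative assumptions, not LangChain APIs; in practice the destination values would be chain callables like `physics_chain.run`.

```python
# A hand-rolled router: a classifier (any callable, typically an LLM call)
# labels the query, then a dict dispatches to the matching chain.
ROUTER_PROMPT = """Classify the question into exactly one category:
physics, programming, or general.

Question: {question}
Category:"""

def route(question, classify, destinations, default):
    """Dispatch `question` to the chain matching the classifier's answer."""
    category = classify(ROUTER_PROMPT.format(question=question)).strip().lower()
    return destinations.get(category, default)(question)

# Usage with stub chains; swap in e.g. physics_chain.run in practice.
destinations = {
    "physics": lambda q: f"[physics] {q}",
    "programming": lambda q: f"[programming] {q}",
}
answer = route(
    "Why is the sky blue?",
    classify=lambda prompt: "physics",  # stand-in for an LLM call
    destinations=destinations,
    default=lambda q: f"[general] {q}",
)
```

The `default` branch matters: LLM classifiers occasionally return labels outside your taxonomy, and falling back to a generalist chain beats raising a KeyError.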
Chain Design Principles
Keep chains focused: Each chain should have a single, clear purpose. Avoid monolithic chains that try to do everything.
Handle errors gracefully: Chains can fail at any step. Implement retry logic and fallbacks:
from langchain.chains import LLMChain
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10))
def run_chain_with_retry(chain, input_data):
    try:
        return chain.run(input_data)
    except Exception as e:
        print(f"Chain failed: {e}")
        raise
Optimize for latency: Use streaming for long outputs, parallel execution for independent operations, and caching for repeated queries.
Test deterministically: Use temperature=0 for testing to ensure reproducible outputs.
Tools: Extending LLM Capabilities
LLMs are powerful but limited: they can’t browse the web, query databases, or perform precise calculations. Tools bridge this gap by giving LLMs access to external functions.
Anatomy of a Tool
A tool consists of three components:
- Name: A clear, descriptive identifier
- Description: Explains what the tool does and when to use it (critical for agent decision-making)
- Function: The actual implementation
from langchain.tools import Tool
from langchain.utilities import GoogleSearchAPIWrapper

# Example: Search tool
search = GoogleSearchAPIWrapper()

search_tool = Tool(
    name="Google Search",
    description="Useful for finding current information about events, people, or facts. Input should be a search query.",
    func=search.run
)
Tool Design Principles
Write excellent descriptions: The LLM uses descriptions to decide when to call tools. Be specific about inputs and use cases:
# Bad description
description = "Searches the database"
# Good description
description = """
Searches the customer database by email or customer ID.
Input should be either:
- An email address (e.g., [email protected])
- A customer ID (e.g., CUST-12345)
Returns customer details including name, purchase history, and support tickets.
Use this when you need to look up specific customer information.
"""
Validate inputs: LLMs can generate malformed inputs. Always validate and sanitize:
from pydantic import BaseModel, validator
from langchain.tools import StructuredTool

class SearchInput(BaseModel):
    query: str
    max_results: int = 5

    @validator('query')
    def query_must_not_be_empty(cls, v):
        if not v or not v.strip():
            raise ValueError('Query cannot be empty')
        return v.strip()

    @validator('max_results')
    def max_results_must_be_reasonable(cls, v):
        if v < 1 or v > 20:
            raise ValueError('max_results must be between 1 and 20')
        return v

def search_function(query: str, max_results: int) -> str:
    # Implementation
    pass

search_tool = StructuredTool.from_function(
    func=search_function,
    name="search",
    description="Search the web for information",
    args_schema=SearchInput
)
Handle errors gracefully: Tools can fail. Return informative error messages that help the agent recover:
def robust_calculator(expression: str) -> str:
    """Evaluates mathematical expressions safely."""
    try:
        # Use a safe eval library, never the built-in eval
        result = safe_eval(expression)
        return f"Result: {result}"
    except ZeroDivisionError:
        return "Error: Division by zero. Please modify the expression."
    except SyntaxError:
        return "Error: Invalid mathematical expression. Please check syntax."
    except Exception as e:
        return f"Error: Could not evaluate expression. {str(e)}"
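The `safe_eval` called above is left abstract; libraries like `numexpr` or `asteval` fill that role. A dependency-free sketch using Python's `ast` module shows the idea: parse the expression and walk only arithmetic nodes, so arbitrary code can never execute.

```python
import ast
import operator

# Whitelist of arithmetic operations; anything else is rejected.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def safe_eval(expression: str) -> float:
    """Evaluate a purely arithmetic expression without exec/eval risks."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        # Function calls, attribute access, names, etc. all land here
        raise SyntaxError(f"Unsupported expression: {expression!r}")
    return _eval(ast.parse(expression, mode="eval"))
```

Division by zero still raises `ZeroDivisionError` from `operator.truediv`, so the `except` branches in `robust_calculator` line up with this implementation.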
Keep tools focused: Each tool should do one thing well. Avoid Swiss Army knife tools:
# Bad: One tool that does everything
def database_tool(action, table, query, data):
    if action == "select":
        # ...
    elif action == "insert":
        # ...
    elif action == "update":
        # ...

# Good: Separate tools for different operations
def query_customers(email: str) -> str:
    """Query customer information by email."""
    # ...

def update_customer(email: str, field: str, value: str) -> str:
    """Update a specific customer field."""
    # ...
Tool Composition
Complex capabilities emerge from composing simple tools:
from langchain.agents import Tool

tools = [
    Tool(
        name="Search",
        func=search.run,
        description="Search for current information"
    ),
    Tool(
        name="Calculator",
        func=calculator.run,
        description="Perform mathematical calculations"
    ),
    Tool(
        name="Database Query",
        func=db_query.run,
        description="Query the customer database"
    ),
    Tool(
        name="Send Email",
        func=email_sender.run,
        description="Send an email to a customer"
    )
]
An agent with these tools can autonomously: search for information, perform calculations on the results, query relevant database records, and send personalized emails, all without explicit programming of the workflow.
Agents: Autonomous Decision-Making
While chains follow predefined paths, agents make dynamic decisions about which actions to take. They observe, reason, and act in a loop until they achieve their goal.
The ReAct Pattern
Most modern agents use the ReAct (Reasoning + Acting) pattern:
- Thought: The agent reasons about what to do next
- Action: The agent selects and executes a tool
- Observation: The agent receives the tool’s output
- Repeat: Continue until the goal is achieved
from langchain.agents import initialize_agent, AgentType
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

result = agent.run(
    "What's the current price of Bitcoin? Calculate how much 2.5 BTC would be worth."
)
Agent execution trace:
Thought: I need to find the current Bitcoin price
Action: Search
Action Input: "current Bitcoin price USD"
Observation: Bitcoin is currently trading at $43,250
Thought: Now I need to calculate 2.5 times this price
Action: Calculator
Action Input: 2.5 * 43250
Observation: 108125
Thought: I have the answer
Final Answer: 2.5 BTC would be worth $108,125
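Stripped of framework machinery, the loop behind that trace is small. In this sketch, `llm_step` is a stand-in for "call the model, parse its Thought/Action output" (a real agent parses `Action:`/`Action Input:` lines from the model's text), and the tool results are canned for the demo.

```python
# The ReAct loop in plain Python: think -> act -> observe, until the model
# signals a final answer or the iteration cap is hit.
def react_loop(goal, llm_step, tools, max_iterations=10):
    """llm_step(goal, history) returns ('final', answer) or (tool_name, tool_input)."""
    history = []
    for _ in range(max_iterations):
        action, payload = llm_step(goal, history)       # Thought + Action
        if action == "final":
            return payload                              # Final Answer
        observation = tools[action](payload)            # Act
        history.append((action, payload, observation))  # Observe, then repeat
    return None  # cap hit; a production agent would return a best-effort answer

# Scripted example mirroring the Bitcoin trace above:
def scripted_llm(goal, history):
    if not history:
        return ("Search", "current Bitcoin price USD")
    if len(history) == 1:
        return ("Calculator", "2.5 * 43250")
    return ("final", f"2.5 BTC would be worth ${history[-1][2]}")

tools = {"Search": lambda q: "43250", "Calculator": lambda e: 108125}  # canned
result = react_loop("value of 2.5 BTC", scripted_llm, tools)
```

The `history` list is what makes this a loop rather than a chain: each tool observation becomes context for the next model call, which is exactly what frameworks serialize into the prompt for you.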
Agent Types and When to Use Them
Zero-Shot ReAct Agent: Best for general-purpose tasks with diverse tools
- Decides actions based solely on tool descriptions
- No examples needed
- Good for: Customer support, research tasks, general Q&A
Conversational ReAct Agent: Maintains conversation history
- Remembers previous interactions
- Good for: Multi-turn dialogues, iterative problem-solving
Structured Chat Agent: Better at handling complex tool inputs
- Uses structured output parsing
- Good for: Tools with multiple parameters, API integrations
OpenAI Functions Agent: Leverages native function calling
- More reliable tool selection
- Lower latency
- Good for: Production systems, cost-sensitive applications
from langchain.agents import AgentType
from langchain.chat_models import ChatOpenAI

# For production: Use OpenAI Functions when available
agent = initialize_agent(
    tools=tools,
    llm=ChatOpenAI(model="gpt-4", temperature=0),
    agent=AgentType.OPENAI_FUNCTIONS,
    verbose=True
)
Agent Design Patterns
Limit tool count: Agents struggle with too many tools. Keep it under 10-15 tools per agent:
# Instead of one agent with 30 tools, use specialized agents
customer_service_agent = initialize_agent(
    tools=[search_customers, update_ticket, send_email],
    llm=llm,
    agent=AgentType.OPENAI_FUNCTIONS
)

technical_support_agent = initialize_agent(
    tools=[check_logs, restart_service, escalate_issue],
    llm=llm,
    agent=AgentType.OPENAI_FUNCTIONS
)
# Route to appropriate agent based on query type
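That routing comment can be as simple as a keyword check; an LLM classifier slots into the same shape. The keyword list here is an illustrative assumption, and the agents are passed as plain callables (e.g. `customer_service_agent.run`).

```python
# Dispatch between specialized agents. The keyword heuristic is a crude
# stand-in for a proper classifier, but the dispatch structure is the same.
TECHNICAL_KEYWORDS = ("error", "crash", "log", "restart", "outage")

def route_query(query, customer_service, technical_support):
    """Operational issues go to technical support; everything else to CS."""
    if any(word in query.lower() for word in TECHNICAL_KEYWORDS):
        return technical_support(query)
    return customer_service(query)

# Usage with stub agents:
result = route_query(
    "The app crashes on login",
    customer_service=lambda q: f"CS handling: {q}",
    technical_support=lambda q: f"Tech handling: {q}",
)
```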
Set iteration limits: Prevent infinite loops:
agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    max_iterations=10,  # Prevent runaway execution
    max_execution_time=60,  # Timeout after 60 seconds
    early_stopping_method="generate"  # Return best effort if limit reached
)
Implement guardrails: Validate agent actions before execution:
from langchain.agents import AgentExecutor

# Sketch of a guarded executor; _should_execute_action and _get_action_count
# are hooks you implement (e.g., a per-tool call counter), not built-ins.
class GuardedAgentExecutor(AgentExecutor):
    def _should_execute_action(self, action):
        # Prevent dangerous operations
        if action.tool == "Database Delete" and "production" in action.tool_input:
            return False, "Cannot delete from production database"
        # Rate limiting
        if self._get_action_count(action.tool) > 5:
            return False, f"Too many calls to {action.tool}"
        return True, None
Provide clear objectives: Vague goals lead to poor agent performance:
# Vague
agent.run("Help with customer issue")
# Clear
agent.run("""
Customer email: [email protected]
Issue: Cannot access account after password reset
Goal:
1. Look up customer account status
2. Check recent password reset attempts
3. If account is locked, unlock it
4. Send confirmation email with next steps
""")
When to Use Agents vs. Chains
Use chains when:
- The workflow is well-defined and predictable
- Determinism is critical (compliance, legal)
- You need precise control over execution
- Cost and latency are primary concerns
- The task is simple and linear
Use agents when:
- The workflow depends on dynamic information
- You need flexibility in problem-solving approaches
- The task requires multiple tools in unpredictable order
- You’re building conversational interfaces
- The problem space is too complex to enumerate all paths
Hybrid approach: Use chains within agent tools:
# Complex analysis as a chain
analysis_chain = SequentialChain(...)

# Expose chain as a tool to the agent
analysis_tool = Tool(
    name="Detailed Analysis",
    func=analysis_chain.run,
    description="Performs comprehensive analysis including sentiment, entities, and topics"
)

# Agent can decide when to use the complex analysis
agent = initialize_agent(
    tools=[search_tool, calculator_tool, analysis_tool],
    llm=llm,
    agent=AgentType.OPENAI_FUNCTIONS
)
Memory: Maintaining Context
LLMs are stateless: they don’t remember previous interactions. Memory systems solve this by managing conversation history and relevant context.
Conversation Buffer Memory
The simplest memory: store all messages in a buffer.
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain

memory = ConversationBufferMemory()
conversation = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=True
)

conversation.predict(input="Hi, I'm working on a Python project")
# Memory: Human: Hi, I'm working on a Python project
# AI: Great! I'd be happy to help...

conversation.predict(input="What language did I mention?")
# AI can reference the previous message: "You mentioned Python"
When to use:
- Short conversations (< 10 exchanges)
- When full context is essential
- Debugging and development
Limitations:
- Token limits: Long conversations exceed context windows
- Cost: Every message increases token usage
- Latency: More tokens = slower responses
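One mitigation short of switching memory types is trimming the buffer to a token budget before each call. A sketch, where the whitespace split is a crude stand-in for a real tokenizer such as `tiktoken`:

```python
# Keep only the newest suffix of the conversation that fits a token budget.
def trim_history(messages, max_tokens=1000):
    """messages: list of message strings, oldest first."""
    kept, total = [], 0
    for msg in reversed(messages):          # walk newest -> oldest
        cost = len(msg.split())             # crude token estimate
        if total + cost > max_tokens and kept:
            break                           # budget exhausted; drop the rest
        kept.append(msg)
        total += cost
    return list(reversed(kept))             # restore chronological order
```

The `and kept` guard ensures the newest message is always retained even if it alone exceeds the budget, so the model never sees an empty history.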
Conversation Buffer Window Memory
Keep only the last N messages:
from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(k=5)  # Keep last 5 exchanges
conversation = ConversationChain(
    llm=llm,
    memory=memory
)
When to use:
- Longer conversations with recent context priority
- Cost-sensitive applications
- When older context becomes irrelevant
Trade-off: Loses older context that might still be relevant.
Conversation Summary Memory
Periodically summarize conversation history:
from langchain.memory import ConversationSummaryMemory

memory = ConversationSummaryMemory(llm=llm)
conversation = ConversationChain(
    llm=llm,
    memory=memory
)

# After several exchanges, memory contains:
# "The human is working on a Python web scraping project using BeautifulSoup.
# They encountered an issue with dynamic content and we discussed using Selenium.
# They prefer Chrome as their browser."
When to use:
- Long-running conversations
- When key facts matter more than exact wording
- Customer support sessions spanning multiple interactions
Trade-off: Summarization costs tokens and may lose nuance.
Vector Store Memory
Store conversation in a vector database, retrieve relevant context:
from langchain.memory import VectorStoreRetrieverMemory
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(embedding_function=embeddings)

# Create retriever-based memory
memory = VectorStoreRetrieverMemory(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3})
)

# Add memories
memory.save_context(
    {"input": "My favorite programming language is Python"},
    {"output": "That's great! Python is versatile..."}
)
memory.save_context(
    {"input": "I work in machine learning"},
    {"output": "Python is excellent for ML..."}
)

# Later, relevant memories are retrieved
conversation.predict(input="What do you know about my work?")
# Retrieves: "I work in machine learning" and "favorite language is Python"
When to use:
- Very long conversations or sessions
- When specific facts need retrieval (customer preferences, project details)
- Multi-session applications (returning users)
Best for: Customer profiles, personalized assistants, knowledge workers.
Entity Memory
Track specific entities (people, places, concepts) mentioned in conversation:
from langchain.memory import ConversationEntityMemory

memory = ConversationEntityMemory(llm=llm)
conversation = ConversationChain(
    llm=llm,
    memory=memory
)

conversation.predict(input="John Smith is our lead developer. He prefers TypeScript.")
conversation.predict(input="Sarah Johnson handles DevOps. She uses Kubernetes.")

# Memory maintains entity knowledge:
# John Smith: lead developer, prefers TypeScript
# Sarah Johnson: handles DevOps, uses Kubernetes

conversation.predict(input="What does John prefer?")
# AI: "John Smith prefers TypeScript"
When to use:
- Tracking multiple people, projects, or concepts
- CRM-style applications
- Complex multi-entity scenarios
Combining Memory Types
Production systems often combine multiple memory strategies:
from langchain.memory import CombinedMemory, ConversationBufferWindowMemory, VectorStoreRetrieverMemory

# Recent context
short_term = ConversationBufferWindowMemory(k=3, memory_key="chat_history")

# Long-term facts
long_term = VectorStoreRetrieverMemory(
    retriever=vectorstore.as_retriever(),
    memory_key="long_term_context"
)

# Combine both
memory = CombinedMemory(memories=[short_term, long_term])
conversation = ConversationChain(
    llm=llm,
    memory=memory
)
This gives you:
- Immediate context from recent messages
- Relevant historical facts from vector search
- Optimal balance of cost, latency, and context quality
Memory Design Principles
Choose memory based on conversation length:
- < 10 exchanges: Buffer memory
- 10-50 exchanges: Window or summary memory
- 50+ exchanges or multi-session: Vector store memory
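That rule of thumb can live in code, so every new feature makes the same choice. The cutoffs below are this article's heuristics, not hard limits, and the return values name LangChain's memory classes:

```python
# Map expected conversation shape to a memory strategy.
def pick_memory(expected_exchanges, multi_session=False):
    """Return the memory class name suited to the conversation profile."""
    if multi_session or expected_exchanges > 50:
        return "VectorStoreRetrieverMemory"
    if expected_exchanges > 10:
        return "ConversationSummaryMemory"  # or a buffer window, if recency rules
    return "ConversationBufferMemory"
```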
Implement memory persistence:
import json
from langchain.schema import messages_from_dict, messages_to_dict

# Save memory state (buffer-style memories expose chat_memory.messages;
# messages_to_dict/messages_from_dict are LangChain's message serializers)
with open('memory_state.json', 'w') as f:
    json.dump(messages_to_dict(memory.chat_memory.messages), f)

# Restore memory state
with open('memory_state.json', 'r') as f:
    memory.chat_memory.messages = messages_from_dict(json.load(f))
Clear memory strategically:
# Clear when context switches
if user_starts_new_topic:
    memory.clear()

# Or snapshot first and start fresh
old_messages = list(memory.chat_memory.messages)
memory.clear()
# Store old_messages for potential retrieval
Monitor memory costs: Track token usage from memory:
from langchain.callbacks import get_openai_callback

with get_openai_callback() as cb:
    response = conversation.predict(input="...")
    print(f"Memory tokens: {cb.prompt_tokens}")
    print(f"Cost: ${cb.total_cost}")
Putting It All Together: Architecture Patterns
Pattern 1: Simple RAG (Retrieval-Augmented Generation)
# Chain-based: Deterministic retrieval + generation
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    memory=ConversationBufferWindowMemory(k=3)
)
Use when: Document Q&A, knowledge bases, FAQ systems
Pattern 2: Conversational Agent with Tools
# Agent with memory and tools
from langchain.agents import AgentExecutor

memory = ConversationSummaryMemory(llm=llm, memory_key="chat_history")

agent = initialize_agent(
    tools=[search_tool, calculator_tool, database_tool],
    llm=llm,
    agent=AgentType.OPENAI_FUNCTIONS,
    memory=memory,
    verbose=True
)
Use when: Customer support, personal assistants, research tools
Pattern 3: Multi-Agent System
# Specialized agents coordinated by a supervisor
supervisor_agent = initialize_agent(
    tools=[
        Tool(name="Research", func=research_agent.run,
             description="Gather information on a topic"),
        Tool(name="Analysis", func=analysis_agent.run,
             description="Analyze gathered information"),
        Tool(name="Writing", func=writing_agent.run,
             description="Draft prose from analyzed material")
    ],
    llm=llm,
    agent=AgentType.OPENAI_FUNCTIONS
)
Use when: Complex workflows, specialized domains, team simulation
Pattern 4: Chain-of-Thought with Validation
# Chain with self-correction
generation_chain = LLMChain(llm=llm, prompt=generation_prompt)
validation_chain = LLMChain(llm=llm, prompt=validation_prompt)
correction_chain = LLMChain(llm=llm, prompt=correction_prompt)

def generate_with_validation(input_text):
    output = generation_chain.run(input_text)
    verdict = validation_chain.run(output)
    # The validator returns text, so inspect the verdict string rather than
    # relying on truthiness (any non-empty string is truthy)
    if "invalid" in verdict.lower():
        output = correction_chain.run({"original": output, "input": input_text})
    return output
Use when: High-stakes outputs, compliance requirements, quality-critical applications
Decision Framework: Choosing Your Architecture
Ask yourself these questions:
1. Is the workflow predictable?
- Yes → Start with chains
- No → Consider agents
2. Do you need external data or actions?
- Yes → Add tools
- No → Pure LLM chains may suffice
3. How long are conversations?
- < 10 exchanges → Buffer memory
- 10-50 exchanges → Window/summary memory
- 50+ or multi-session → Vector store memory
4. What’s your error tolerance?
- Low → Use chains with validation
- Medium → Agents with guardrails
- High → Agents with retry logic
5. What’s your latency budget?
- < 2s → Simple chains, minimal memory
- 2-10s → Agents with few tools
- > 10s → Complex agents, rich memory
6. What’s your cost sensitivity?
- High → Chains, window memory, smaller models
- Medium → Agents with tool limits, summary memory
- Low → Full agents, vector memory, GPT-4
Conclusion
LLM orchestration is about choosing the right abstraction for your problem:
- Chains give you control and predictability, perfect for well-defined workflows
- Tools extend capabilities beyond text generation, essential for real-world integration
- Agents provide flexibility and autonomy, powerful but requiring careful design
- Memory maintains context, critical for conversational and personalized experiences
Start simple: build a chain, add tools as needed, introduce agents when workflows become dynamic, and layer in memory when context matters. Test extensively, monitor costs, and iterate based on real usage patterns.
The frameworks are powerful, but the architecture is yours to design. Understanding these primitives empowers you to build LLM applications that are not just impressive demos, but production-ready systems that solve real problems reliably and efficiently.
Now go build something amazing.