LLM Orchestration Patterns: Chains, Agents, Tools, and Memory
Building production LLM applications requires more than just API calls to GPT-4 or Claude. You need orchestration: the ability to compose multiple LLM interactions, integrate external tools, maintain context, and make autonomous decisions. This is where frameworks like LangChain and LlamaIndex shine.
But with great power comes complexity. Should you use a simple chain or a full agent? When does memory become essential? How do you design tools that LLMs can reliably use? This guide explores the core building blocks of LLM orchestration, helping you make informed architectural decisions for your AI systems.
Understanding the Orchestration Landscape
Before diving into specifics, let’s establish a mental model. LLM orchestration frameworks provide four fundamental primitives:
- Chains: Deterministic sequences of operations (LLM calls, data transformations, API requests)
- Agents: Autonomous systems that decide which actions to take based on observations
- Tools: Functions that extend LLM capabilities (search, calculation, database queries)
- Memory: Mechanisms for maintaining context across interactions
The key insight: start simple with chains, add tools when you need external capabilities, introduce agents when you need dynamic decision-making, and layer in memory when context matters.
Chains: The Foundation of Orchestration
Chains are the simplest orchestration pattern: a predefined sequence of steps executed in order. Think of them as pipelines where each step’s output feeds into the next.
Simple Chains: Linear Execution
The most basic chain is a single LLM call with a prompt template:
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.llms import OpenAI

# Define a prompt template
prompt = PromptTemplate(
    input_variables=["product"],
    template="Generate 5 creative marketing slogans for {product}"
)

# Create a chain
chain = LLMChain(llm=OpenAI(temperature=0.7), prompt=prompt)

# Execute
result = chain.run(product="eco-friendly water bottles")
When to use simple chains:
- Single-purpose tasks with predictable inputs
- Content generation with consistent structure
- Data transformation pipelines
- Situations where determinism is critical
Sequential Chains: Multi-Step Processing
Sequential chains connect multiple LLM calls, passing outputs forward:
from langchain.chains import SimpleSequentialChain

# Chain 1: Generate a product description
description_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        input_variables=["product"],
        template="Write a detailed product description for {product}"
    )
)

# Chain 2: Extract key features from description
features_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        input_variables=["description"],
        template="Extract 5 key features from this description:\n{description}"
    )
)

# Combine into sequential chain
overall_chain = SimpleSequentialChain(
    chains=[description_chain, features_chain],
    verbose=True
)

result = overall_chain.run("wireless noise-canceling headphones")
When to use sequential chains:
- Multi-stage content pipelines (draft → refine → format)
- Analysis workflows (extract → summarize → categorize)
- Data enrichment processes
- When each step depends on the previous output
Parallel Chains: Concurrent Execution
For independent operations, parallel execution improves performance:
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
import asyncio

async def parallel_analysis(text):
    # Define multiple independent analyses
    sentiment_chain = LLMChain(
        llm=llm,
        prompt=PromptTemplate(
            input_variables=["text"],
            template="Analyze the sentiment of: {text}"
        )
    )
    entities_chain = LLMChain(
        llm=llm,
        prompt=PromptTemplate(
            input_variables=["text"],
            template="Extract named entities from: {text}"
        )
    )
    topics_chain = LLMChain(
        llm=llm,
        prompt=PromptTemplate(
            input_variables=["text"],
            template="Identify main topics in: {text}"
        )
    )

    # Execute in parallel
    results = await asyncio.gather(
        sentiment_chain.arun(text=text),
        entities_chain.arun(text=text),
        topics_chain.arun(text=text)
    )

    return {
        "sentiment": results[0],
        "entities": results[1],
        "topics": results[2]
    }
When to use parallel chains:
- Independent analyses on the same input
- Multiple perspectives on a problem
- Performance-critical applications
- Ensemble approaches (combining multiple model outputs)
Router Chains: Conditional Logic
Router chains select different paths based on input characteristics:
from langchain.chains.router import MultiPromptChain
from langchain.chains import ConversationChain

# Define specialized chains
physics_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        template="As a physics expert, answer: {input}",
        input_variables=["input"]
    )
)

programming_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        template="As a senior developer, answer: {input}",
        input_variables=["input"]
    )
)

# Router decides which chain to use
router_chain = MultiPromptChain(
    router_chain=...,  # LLM-based router
    destination_chains={
        "physics": physics_chain,
        "programming": programming_chain
    },
    default_chain=ConversationChain(llm=llm)
)
When to use router chains:
- Domain-specific expertise routing
- Multi-tenant applications with different behaviors
- Complexity-based routing (simple vs. complex queries)
- Cost optimization (route simple queries to cheaper models)
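If you'd rather not reach for `MultiPromptChain`, the routing idea itself is small enough to hand-roll. In this sketch, `classify` stands in for an LLM call and the category names are illustrative assumptions, not LangChain APIs; in practice the destination values would be chain callables like `physics_chain.run`.

```python
# A hand-rolled router: a classifier (any callable, typically an LLM call)
# labels the query, then a dict dispatches to the matching chain.
ROUTER_PROMPT = """Classify the question into exactly one category:
physics, programming, or general.

Question: {question}
Category:"""

def route(question, classify, destinations, default):
    """Dispatch `question` to the chain matching the classifier's answer."""
    category = classify(ROUTER_PROMPT.format(question=question)).strip().lower()
    return destinations.get(category, default)(question)

# Usage with stub chains; swap in e.g. physics_chain.run in practice.
destinations = {
    "physics": lambda q: f"[physics] {q}",
    "programming": lambda q: f"[programming] {q}",
}
answer = route(
    "Why is the sky blue?",
    classify=lambda prompt: "physics",  # stand-in for an LLM call
    destinations=destinations,
    default=lambda q: f"[general] {q}",
)
```

The `default` branch matters: LLM classifiers occasionally return labels outside your taxonomy, and falling back to a generalist chain beats raising a KeyError.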
Chain Design Principles
Keep chains focused: Each chain should have a single, clear purpose. Avoid monolithic chains that try to do everything.
Handle errors gracefully: Chains can fail at any step. Implement retry logic and fallbacks:
from langchain.chains import LLMChain
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10))
def run_chain_with_retry(chain, input_data):
    try:
        return chain.run(input_data)
    except Exception as e:
        print(f"Chain failed: {e}")
        raise
Optimize for latency: Use streaming for long outputs, parallel execution for independent operations, and caching for repeated queries.
Test deterministically: Use temperature=0 for testing to ensure reproducible outputs.
Tools: Extending LLM Capabilities
LLMs are powerful but limited: they can’t browse the web, query databases, or perform precise calculations. Tools bridge this gap by giving LLMs access to external functions.
Anatomy of a Tool
A tool consists of three components:
- Name: A clear, descriptive identifier
- Description: Explains what the tool does and when to use it (critical for agent decision-making)
- Function: The actual implementation
from langchain.tools import Tool
from langchain.utilities import GoogleSearchAPIWrapper

# Example: Search tool
search = GoogleSearchAPIWrapper()

search_tool = Tool(
    name="Google Search",
    description="Useful for finding current information about events, people, or facts. Input should be a search query.",
    func=search.run
)
Tool Design Principles
Write excellent descriptions: The LLM uses descriptions to decide when to call tools. Be specific about inputs and use cases:
# Bad description
description = "Searches the database"
# Good description
description = """
Searches the customer database by email or customer ID.
Input should be either:
- An email address (e.g., [email protected])
- A customer ID (e.g., CUST-12345)
Returns customer details including name, purchase history, and support tickets.
Use this when you need to look up specific customer information.
"""
Validate inputs: LLMs can generate malformed inputs. Always validate and sanitize:
from pydantic import BaseModel, validator
from langchain.tools import StructuredTool

class SearchInput(BaseModel):
    query: str
    max_results: int = 5

    @validator('query')
    def query_must_not_be_empty(cls, v):
        if not v or not v.strip():
            raise ValueError('Query cannot be empty')
        return v.strip()

    @validator('max_results')
    def max_results_must_be_reasonable(cls, v):
        if v < 1 or v > 20:
            raise ValueError('max_results must be between 1 and 20')
        return v

def search_function(query: str, max_results: int) -> str:
    # Implementation
    pass

search_tool = StructuredTool.from_function(
    func=search_function,
    name="search",
    description="Search the web for information",
    args_schema=SearchInput
)
Handle errors gracefully: Tools can fail. Return informative error messages that help the agent recover:
def robust_calculator(expression: str) -> str:
    """Evaluates mathematical expressions safely."""
    try:
        # Use a safe eval library, never the built-in eval
        result = safe_eval(expression)
        return f"Result: {result}"
    except ZeroDivisionError:
        return "Error: Division by zero. Please modify the expression."
    except SyntaxError:
        return "Error: Invalid mathematical expression. Please check syntax."
    except Exception as e:
        return f"Error: Could not evaluate expression. {str(e)}"
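The `safe_eval` called above is left abstract; libraries like `numexpr` or `asteval` fill that role. A dependency-free sketch using Python's `ast` module shows the idea: parse the expression and walk only arithmetic nodes, so arbitrary code can never execute.

```python
import ast
import operator

# Whitelist of arithmetic operations; anything else is rejected.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def safe_eval(expression: str) -> float:
    """Evaluate a purely arithmetic expression without exec/eval risks."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        # Function calls, attribute access, names, etc. all land here
        raise SyntaxError(f"Unsupported expression: {expression!r}")
    return _eval(ast.parse(expression, mode="eval"))
```

Division by zero still raises `ZeroDivisionError` from `operator.truediv`, so the `except` branches in `robust_calculator` line up with this implementation.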
Keep tools focused: Each tool should do one thing well. Avoid Swiss Army knife tools:
# Bad: One tool that does everything
def database_tool(action, table, query, data):
    if action == "select":
        # ...
    elif action == "insert":
        # ...
    elif action == "update":
        # ...

# Good: Separate tools for different operations
def query_customers(email: str) -> str:
    """Query customer information by email."""
    # ...

def update_customer(email: str, field: str, value: str) -> str:
    """Update a specific customer field."""
    # ...
Tool Composition
Complex capabilities emerge from composing simple tools:
from langchain.agents import Tool

tools = [
    Tool(
        name="Search",
        func=search.run,
        description="Search for current information"
    ),
    Tool(
        name="Calculator",
        func=calculator.run,
        description="Perform mathematical calculations"
    ),
    Tool(
        name="Database Query",
        func=db_query.run,
        description="Query the customer database"
    ),
    Tool(
        name="Send Email",
        func=email_sender.run,
        description="Send an email to a customer"
    )
]
An agent with these tools can autonomously: search for information, perform calculations on the results, query relevant database records, and send personalized emails, all without explicit programming of the workflow.
Agents: Autonomous Decision-Making
While chains follow predefined paths, agents make dynamic decisions about which actions to take. They observe, reason, and act in a loop until they achieve their goal.
The ReAct Pattern
Most modern agents use the ReAct (Reasoning + Acting) pattern:
- Thought: The agent reasons about what to do next
- Action: The agent selects and executes a tool
- Observation: The agent receives the tool’s output
- Repeat: Continue until the goal is achieved
from langchain.agents import initialize_agent, AgentType
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

result = agent.run(
    "What's the current price of Bitcoin? Calculate how much 2.5 BTC would be worth."
)
Agent execution trace:
Thought: I need to find the current Bitcoin price
Action: Search
Action Input: "current Bitcoin price USD"
Observation: Bitcoin is currently trading at $43,250
Thought: Now I need to calculate 2.5 times this price
Action: Calculator
Action Input: 2.5 * 43250
Observation: 108125
Thought: I have the answer
Final Answer: 2.5 BTC would be worth $108,125
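Stripped of framework machinery, the loop behind that trace is small. In this sketch, `llm_step` is a stand-in for "call the model, parse its Thought/Action output" (a real agent parses `Action:`/`Action Input:` lines from the model's text), and the tool results are canned for the demo.

```python
# The ReAct loop in plain Python: think -> act -> observe, until the model
# signals a final answer or the iteration cap is hit.
def react_loop(goal, llm_step, tools, max_iterations=10):
    """llm_step(goal, history) returns ('final', answer) or (tool_name, tool_input)."""
    history = []
    for _ in range(max_iterations):
        action, payload = llm_step(goal, history)       # Thought + Action
        if action == "final":
            return payload                              # Final Answer
        observation = tools[action](payload)            # Act
        history.append((action, payload, observation))  # Observe, then repeat
    return None  # cap hit; a production agent would return a best-effort answer

# Scripted example mirroring the Bitcoin trace above:
def scripted_llm(goal, history):
    if not history:
        return ("Search", "current Bitcoin price USD")
    if len(history) == 1:
        return ("Calculator", "2.5 * 43250")
    return ("final", f"2.5 BTC would be worth ${history[-1][2]}")

tools = {"Search": lambda q: "43250", "Calculator": lambda e: 108125}  # canned
result = react_loop("value of 2.5 BTC", scripted_llm, tools)
```

The `history` list is what makes this a loop rather than a chain: each tool observation becomes context for the next model call, which is exactly what frameworks serialize into the prompt for you.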
Agent Types and When to Use Them
Zero-Shot ReAct Agent: Best for general-purpose tasks with diverse tools
- Decides actions based solely on tool descriptions
- No examples needed
- Good for: Customer support, research tasks, general Q&A
Conversational ReAct Agent: Maintains conversation history
- Remembers previous interactions
- Good for: Multi-turn dialogues, iterative problem-solving
Structured Chat Agent: Better at handling complex tool inputs
- Uses structured output parsing
- Good for: Tools with multiple parameters, API integrations
OpenAI Functions Agent: Leverages native function calling
- More reliable tool selection
- Lower latency
- Good for: Production systems, cost-sensitive applications
from langchain.agents import AgentType
from langchain.chat_models import ChatOpenAI

# For production: Use OpenAI Functions when available
agent = initialize_agent(
    tools=tools,
    llm=ChatOpenAI(model="gpt-4", temperature=0),
    agent=AgentType.OPENAI_FUNCTIONS,
    verbose=True
)
Agent Design Patterns
Limit tool count: Agents struggle with too many tools. Keep it under 10-15 tools per agent:
# Instead of one agent with 30 tools, use specialized agents
customer_service_agent = initialize_agent(
    tools=[search_customers, update_ticket, send_email],
    llm=llm,
    agent=AgentType.OPENAI_FUNCTIONS
)

technical_support_agent = initialize_agent(
    tools=[check_logs, restart_service, escalate_issue],
    llm=llm,
    agent=AgentType.OPENAI_FUNCTIONS
)
# Route to appropriate agent based on query type
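That routing comment can be as simple as a keyword check; an LLM classifier slots into the same shape. The keyword list here is an illustrative assumption, and the agents are passed as plain callables (e.g. `customer_service_agent.run`).

```python
# Dispatch between specialized agents. The keyword heuristic is a crude
# stand-in for a proper classifier, but the dispatch structure is the same.
TECHNICAL_KEYWORDS = ("error", "crash", "log", "restart", "outage")

def route_query(query, customer_service, technical_support):
    """Operational issues go to technical support; everything else to CS."""
    if any(word in query.lower() for word in TECHNICAL_KEYWORDS):
        return technical_support(query)
    return customer_service(query)

# Usage with stub agents:
result = route_query(
    "The app crashes on login",
    customer_service=lambda q: f"CS handling: {q}",
    technical_support=lambda q: f"Tech handling: {q}",
)
```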
Set iteration limits: Prevent infinite loops:
agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    max_iterations=10,  # Prevent runaway execution
    max_execution_time=60,  # Timeout after 60 seconds
    early_stopping_method="generate"  # Return best effort if limit reached
)
Implement guardrails: Validate agent actions before execution:
from langchain.agents import AgentExecutor

# Sketch of a guarded executor; _should_execute_action and _get_action_count
# are hooks you implement (e.g., a per-tool call counter), not built-ins.
class GuardedAgentExecutor(AgentExecutor):
    def _should_execute_action(self, action):
        # Prevent dangerous operations
        if action.tool == "Database Delete" and "production" in action.tool_input:
            return False, "Cannot delete from production database"
        # Rate limiting
        if self._get_action_count(action.tool) > 5:
            return False, f"Too many calls to {action.tool}"
        return True, None
Provide clear objectives: Vague goals lead to poor agent performance:
# Vague
agent.run("Help with customer issue")
# Clear
agent.run("""
Customer email: [email protected]
Issue: Cannot access account after password reset
Goal:
1. Look up customer account status
2. Check recent password reset attempts
3. If account is locked, unlock it
4. Send confirmation email with next steps
""")
When to Use Agents vs. Chains
Use chains when:
- The workflow is well-defined and predictable
- Determinism is critical (compliance, legal)
- You need precise control over execution
- Cost and latency are primary concerns
- The task is simple and linear
Use agents when:
- The workflow depends on dynamic information
- You need flexibility in problem-solving approaches
- The task requires multiple tools in unpredictable order
- You’re building conversational interfaces
- The problem space is too complex to enumerate all paths
Hybrid approach: Use chains within agent tools:
# Complex analysis as a chain
analysis_chain = SequentialChain(...)

# Expose chain as a tool to the agent
analysis_tool = Tool(
    name="Detailed Analysis",
    func=analysis_chain.run,
    description="Performs comprehensive analysis including sentiment, entities, and topics"
)

# Agent can decide when to use the complex analysis
agent = initialize_agent(
    tools=[search_tool, calculator_tool, analysis_tool],
    llm=llm,
    agent=AgentType.OPENAI_FUNCTIONS
)
Memory: Maintaining Context
LLMs are stateless: they don’t remember previous interactions. Memory systems solve this by managing conversation history and relevant context.
Conversation Buffer Memory
The simplest memory: store all messages in a buffer.
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain

memory = ConversationBufferMemory()
conversation = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=True
)

conversation.predict(input="Hi, I'm working on a Python project")
# Memory: Human: Hi, I'm working on a Python project
# AI: Great! I'd be happy to help...

conversation.predict(input="What language did I mention?")
# AI can reference the previous message: "You mentioned Python"
When to use:
- Short conversations (< 10 exchanges)
- When full context is essential
- Debugging and development
Limitations:
- Token limits: Long conversations exceed context windows
- Cost: Every message increases token usage
- Latency: More tokens = slower responses
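One mitigation short of switching memory types is trimming the buffer to a token budget before each call. A sketch, where the whitespace split is a crude stand-in for a real tokenizer such as `tiktoken`:

```python
# Keep only the newest suffix of the conversation that fits a token budget.
def trim_history(messages, max_tokens=1000):
    """messages: list of message strings, oldest first."""
    kept, total = [], 0
    for msg in reversed(messages):          # walk newest -> oldest
        cost = len(msg.split())             # crude token estimate
        if total + cost > max_tokens and kept:
            break                           # budget exhausted; drop the rest
        kept.append(msg)
        total += cost
    return list(reversed(kept))             # restore chronological order
```

The `and kept` guard ensures the newest message is always retained even if it alone exceeds the budget, so the model never sees an empty history.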
Conversation Buffer Window Memory
Keep only the last N messages:
from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(k=5)  # Keep last 5 exchanges
conversation = ConversationChain(
    llm=llm,
    memory=memory
)
When to use:
- Longer conversations with recent context priority
- Cost-sensitive applications
- When older context becomes irrelevant
Trade-off: Loses older context that might still be relevant.
Conversation Summary Memory
Periodically summarize conversation history:
from langchain.memory import ConversationSummaryMemory

memory = ConversationSummaryMemory(llm=llm)
conversation = ConversationChain(
    llm=llm,
    memory=memory
)

# After several exchanges, memory contains:
# "The human is working on a Python web scraping project using BeautifulSoup.
# They encountered an issue with dynamic content and we discussed using Selenium.
# They prefer Chrome as their browser."
When to use:
- Long-running conversations
- When key facts matter more than exact wording
- Customer support sessions spanning multiple interactions
Trade-off: Summarization costs tokens and may lose nuance.
Vector Store Memory
Store conversation in a vector database, retrieve relevant context:
from langchain.memory import VectorStoreRetrieverMemory
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(embedding_function=embeddings)

# Create retriever-based memory
memory = VectorStoreRetrieverMemory(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3})
)

# Add memories
memory.save_context(
    {"input": "My favorite programming language is Python"},
    {"output": "That's great! Python is versatile..."}
)
memory.save_context(
    {"input": "I work in machine learning"},
    {"output": "Python is excellent for ML..."}
)

# Later, relevant memories are retrieved
conversation.predict(input="What do you know about my work?")
# Retrieves: "I work in machine learning" and "favorite language is Python"
When to use:
- Very long conversations or sessions
- When specific facts need retrieval (customer preferences, project details)
- Multi-session applications (returning users)
Best for: Customer profiles, personalized assistants, knowledge workers.
Entity Memory
Track specific entities (people, places, concepts) mentioned in conversation:
from langchain.memory import ConversationEntityMemory

memory = ConversationEntityMemory(llm=llm)
conversation = ConversationChain(
    llm=llm,
    memory=memory
)

conversation.predict(input="John Smith is our lead developer. He prefers TypeScript.")
conversation.predict(input="Sarah Johnson handles DevOps. She uses Kubernetes.")

# Memory maintains entity knowledge:
# John Smith: lead developer, prefers TypeScript
# Sarah Johnson: handles DevOps, uses Kubernetes

conversation.predict(input="What does John prefer?")
# AI: "John Smith prefers TypeScript"
When to use:
- Tracking multiple people, projects, or concepts
- CRM-style applications
- Complex multi-entity scenarios
Combining Memory Types
Production systems often combine multiple memory strategies:
from langchain.memory import CombinedMemory, ConversationBufferWindowMemory, VectorStoreRetrieverMemory

# Recent context
short_term = ConversationBufferWindowMemory(k=3, memory_key="chat_history")

# Long-term facts
long_term = VectorStoreRetrieverMemory(
    retriever=vectorstore.as_retriever(),
    memory_key="long_term_context"
)

# Combine both
memory = CombinedMemory(memories=[short_term, long_term])
conversation = ConversationChain(
    llm=llm,
    memory=memory
)
This gives you:
- Immediate context from recent messages
- Relevant historical facts from vector search
- Optimal balance of cost, latency, and context quality
Memory Design Principles
Choose memory based on conversation length:
- < 10 exchanges: Buffer memory
- 10-50 exchanges: Window or summary memory
- 50+ exchanges or multi-session: Vector store memory
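That rule of thumb can live in code, so every new feature makes the same choice. The cutoffs below are this article's heuristics, not hard limits, and the return values name LangChain's memory classes:

```python
# Map expected conversation shape to a memory strategy.
def pick_memory(expected_exchanges, multi_session=False):
    """Return the memory class name suited to the conversation profile."""
    if multi_session or expected_exchanges > 50:
        return "VectorStoreRetrieverMemory"
    if expected_exchanges > 10:
        return "ConversationSummaryMemory"  # or a buffer window, if recency rules
    return "ConversationBufferMemory"
```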
Implement memory persistence:
import json
from langchain.schema import messages_from_dict, messages_to_dict

# Save memory state (buffer-style memories expose chat_memory.messages;
# messages_to_dict/messages_from_dict are LangChain's message serializers)
with open('memory_state.json', 'w') as f:
    json.dump(messages_to_dict(memory.chat_memory.messages), f)

# Restore memory state
with open('memory_state.json', 'r') as f:
    memory.chat_memory.messages = messages_from_dict(json.load(f))
Clear memory strategically:
# Clear when context switches
if user_starts_new_topic:
    memory.clear()

# Or snapshot first and start fresh
old_messages = list(memory.chat_memory.messages)
memory.clear()
# Store old_messages for potential retrieval
Monitor memory costs: Track token usage from memory:
from langchain.callbacks import get_openai_callback

with get_openai_callback() as cb:
    response = conversation.predict(input="...")
    print(f"Memory tokens: {cb.prompt_tokens}")
    print(f"Cost: ${cb.total_cost}")
Putting It All Together: Architecture Patterns
Pattern 1: Simple RAG (Retrieval-Augmented Generation)
# Chain-based: Deterministic retrieval + generation
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    memory=ConversationBufferWindowMemory(k=3)
)
Use when: Document Q&A, knowledge bases, FAQ systems
Pattern 2: Conversational Agent with Tools
# Agent with memory and tools
from langchain.agents import AgentExecutor

memory = ConversationSummaryMemory(llm=llm, memory_key="chat_history")

agent = initialize_agent(
    tools=[search_tool, calculator_tool, database_tool],
    llm=llm,
    agent=AgentType.OPENAI_FUNCTIONS,
    memory=memory,
    verbose=True
)
Use when: Customer support, personal assistants, research tools
Pattern 3: Multi-Agent System
# Specialized agents coordinated by a supervisor
supervisor_agent = initialize_agent(
    tools=[
        Tool(name="Research", func=research_agent.run,
             description="Gather information on a topic"),
        Tool(name="Analysis", func=analysis_agent.run,
             description="Analyze gathered information"),
        Tool(name="Writing", func=writing_agent.run,
             description="Draft prose from analyzed material")
    ],
    llm=llm,
    agent=AgentType.OPENAI_FUNCTIONS
)
Use when: Complex workflows, specialized domains, team simulation
Pattern 4: Chain-of-Thought with Validation
# Chain with self-correction
generation_chain = LLMChain(llm=llm, prompt=generation_prompt)
validation_chain = LLMChain(llm=llm, prompt=validation_prompt)
correction_chain = LLMChain(llm=llm, prompt=correction_prompt)

def generate_with_validation(input_text):
    output = generation_chain.run(input_text)
    verdict = validation_chain.run(output)
    # The validator returns text, so inspect the verdict string rather than
    # relying on truthiness (any non-empty string is truthy)
    if "invalid" in verdict.lower():
        output = correction_chain.run({"original": output, "input": input_text})
    return output
Use when: High-stakes outputs, compliance requirements, quality-critical applications
Decision Framework: Choosing Your Architecture
Ask yourself these questions:
1. Is the workflow predictable?
- Yes → Start with chains
- No → Consider agents
2. Do you need external data or actions?
- Yes → Add tools
- No → Pure LLM chains may suffice
3. How long are conversations?
- < 10 exchanges → Buffer memory
- 10-50 exchanges → Window/summary memory
- 50+ or multi-session → Vector store memory
4. What’s your error tolerance?
- Low → Use chains with validation
- Medium → Agents with guardrails
- High → Agents with retry logic
5. What’s your latency budget?
- < 2s → Simple chains, minimal memory
- 2-10s → Agents with few tools
- > 10s → Complex agents, rich memory
6. What’s your cost sensitivity?
- High → Chains, window memory, smaller models
- Medium → Agents with tool limits, summary memory
- Low → Full agents, vector memory, GPT-4
Conclusion
LLM orchestration is about choosing the right abstraction for your problem:
- Chains give you control and predictability, perfect for well-defined workflows
- Tools extend capabilities beyond text generation, essential for real-world integration
- Agents provide flexibility and autonomy, powerful but requiring careful design
- Memory maintains context, critical for conversational and personalized experiences
Start simple: build a chain, add tools as needed, introduce agents when workflows become dynamic, and layer in memory when context matters. Test extensively, monitor costs, and iterate based on real usage patterns.
The frameworks are powerful, but the architecture is yours to design. Understanding these primitives empowers you to build LLM applications that are not just impressive demos, but production-ready systems that solve real problems reliably and efficiently.
Now go build something amazing.