Introduction
Retrieval-Augmented Generation (RAG) has become the standard architecture for building AI applications that need access to specific knowledge. By combining the power of large language models with targeted information retrieval, RAG enables AI systems that are accurate, verifiable, and grounded in your data.
This comprehensive guide covers RAG architecture, implementation, and optimization.
Why RAG?
The Problem with Pure LLMs
Limitations of pure LLMs:

- Factual hallucinations: LLMs can generate incorrect information.
- Outdated knowledge: the training cutoff means no awareness of recent events.
- No access to private data: they can't query your documents, databases, or APIs.
- Can't verify sources: there is no way to cite or reference information.
- Context window limits: you can't include all relevant documents in a prompt.
How RAG Solves These
RAG Architecture:

Query → Embed Model → Vector Database (search) → Relevant Docs → Context + Retrieved Docs → LLM (generate) → Answer
Core Components
1. Document Processing Pipeline
from pathlib import Path
from typing import Any, Dict, List

import PyPDF2
import docx


class DocumentProcessor:
    """Process various document formats into chunks."""

    def __init__(self, chunk_size: int = 1000, overlap: int = 200):
        self.chunk_size = chunk_size
        self.overlap = overlap

    def process(self, file_path: str) -> List[Dict[str, Any]]:
        """Process a file and return chunks."""
        ext = Path(file_path).suffix.lower()
        if ext == '.pdf':
            text = self.read_pdf(file_path)
        elif ext == '.docx':
            text = self.read_docx(file_path)
        elif ext == '.txt':
            text = self.read_txt(file_path)
        else:
            raise ValueError(f"Unsupported file type: {ext}")
        return self.chunk_text(text, file_path)

    def read_pdf(self, path: str) -> str:
        text = []
        with open(path, 'rb') as f:
            reader = PyPDF2.PdfReader(f)
            for page in reader.pages:
                text.append(page.extract_text())
        return '\n\n'.join(text)

    def read_docx(self, path: str) -> str:
        doc = docx.Document(path)
        return '\n\n'.join(p.text for p in doc.paragraphs)

    def read_txt(self, path: str) -> str:
        with open(path, 'r', encoding='utf-8') as f:
            return f.read()

    def chunk_text(self, text: str, source: str) -> List[Dict[str, Any]]:
        """Split text into overlapping chunks."""
        chunks = []
        start = 0
        while start < len(text):
            end = start + self.chunk_size
            # Try to break at a sentence boundary
            if end < len(text):
                for sep in ['. ', '! ', '? ', '\n']:
                    last_sep = text.rfind(sep, start, end)
                    if last_sep > start:
                        end = last_sep + len(sep)
                        break
            chunk = text[start:end].strip()
            if chunk:
                chunks.append({
                    "text": chunk,
                    "source": source,
                    "start": start,
                    "end": end
                })
            # Advance by at least one character so the loop cannot stall
            # when a sentence break lands inside the overlap window
            start = max(end - self.overlap, start + 1)
        return chunks
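The trickiest part of the chunker above is the sliding window. Here is a minimal standalone sketch of the same idea (sentence-boundary logic omitted) run on a toy string:

```python
def chunk(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        # Advance by at least one character so the loop always terminates
        start = max(start + size - overlap, start + 1)
    return chunks

chunk("abcdefghij", size=4, overlap=2)
# → ["abcd", "cdef", "efgh", "ghij", "ij"]
```

Each window starts `overlap` characters before the previous one ended, so a sentence split across a chunk boundary still appears whole in at least one chunk.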
2. Embedding Generation
from typing import List


class Embedder:
    """Generate embeddings for text chunks."""

    def __init__(self, model_name: str = "text-embedding-3-small"):
        self.model_name = model_name
        # Initialize embedding model
        self.model = self._load_model(model_name)

    def embed(self, texts: List[str]) -> List[List[float]]:
        """Generate embeddings for a list of texts."""
        response = self.model.encode(texts)
        return [embedding.tolist() for embedding in response]

    def embed_query(self, query: str) -> List[float]:
        """Embed a single query."""
        return self.embed([query])[0]

    def _load_model(self, model_name: str):
        # Load model - could be OpenAI, Cohere, HuggingFace, etc.
        # Left as a stub: plug in your provider's client here.
        raise NotImplementedError
3. Vector Database
from typing import Dict, List

import numpy as np


class VectorStore:
    """Vector database for similarity search."""

    def __init__(self, dimension: int = 1536):
        self.dimension = dimension
        self.vectors = []    # List of embeddings
        self.metadata = []   # List of metadata
        self.documents = []  # List of text chunks

    def add(self, chunks: List[Dict], embeddings: List[List[float]]):
        """Add chunks with embeddings to the store."""
        for chunk, embedding in zip(chunks, embeddings):
            self.vectors.append(np.array(embedding))
            self.metadata.append({
                "source": chunk["source"],
                "start": chunk.get("start"),
                "end": chunk.get("end")
            })
            self.documents.append(chunk["text"])

    def search(self, query_embedding: List[float], k: int = 5) -> List[Dict]:
        """Find the k most similar chunks."""
        query = np.array(query_embedding)
        similarities = []
        for i, vec in enumerate(self.vectors):
            # Cosine similarity
            sim = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
            similarities.append((i, sim))
        # Sort by similarity (descending) and keep the top k
        similarities.sort(key=lambda x: x[1], reverse=True)
        results = []
        for idx, score in similarities[:k]:
            results.append({
                "text": self.documents[idx],
                "metadata": self.metadata[idx],
                "score": float(score)
            })
        return results

    def save(self, path: str):
        """Save to disk."""
        np.save(f"{path}/vectors.npy", np.array(self.vectors))
        # Save metadata and documents (e.g. as JSON) alongside the vectors

    def load(self, path: str):
        """Load from disk."""
        self.vectors = list(np.load(f"{path}/vectors.npy"))
        # Load metadata and documents
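The search method above is a brute-force cosine-similarity scan. That scoring step can be isolated into a small sketch, here on made-up 2-dimensional vectors:

```python
import numpy as np

def cosine_top_k(query: np.ndarray, vectors: list, k: int) -> list:
    """Return indices of the k vectors most similar to the query (cosine)."""
    sims = [float(np.dot(query, v) / (np.linalg.norm(query) * np.linalg.norm(v)))
            for v in vectors]
    # Rank indices by similarity, highest first
    return sorted(range(len(vectors)), key=lambda i: sims[i], reverse=True)[:k]

docs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.7, 0.7])]
cosine_top_k(np.array([1.0, 0.1]), docs, k=2)  # → [0, 2]
```

For large collections, production systems replace this linear scan with an approximate nearest-neighbor index (FAISS, HNSW, or a hosted vector database), but the similarity measure is the same.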
RAG Pipeline
Complete Implementation
from typing import Any, Dict, List


class RAGSystem:
    """Complete RAG system."""

    def __init__(
        self,
        chunk_size: int = 1000,
        embedding_model: str = "text-embedding-3-small",
        llm_model: str = "gpt-4"
    ):
        self.processor = DocumentProcessor(chunk_size=chunk_size)
        self.embedder = Embedder(embedding_model)
        self.vector_store = VectorStore()
        # LLM is assumed to be a thin wrapper around your chat model's
        # completion API, exposing a complete(prompt) -> str method
        self.llm = LLM(llm_model)

    def index_documents(self, file_paths: List[str]):
        """Process and index documents."""
        all_chunks = []
        for path in file_paths:
            print(f"Processing {path}...")
            chunks = self.processor.process(path)
            all_chunks.extend(chunks)
        print(f"Generating embeddings for {len(all_chunks)} chunks...")
        texts = [c["text"] for c in all_chunks]
        embeddings = self.embedder.embed(texts)
        print("Adding to vector store...")
        self.vector_store.add(all_chunks, embeddings)
        print(f"Indexed {len(all_chunks)} chunks")

    def query(self, question: str, k: int = 5) -> Dict[str, Any]:
        """Answer a question using RAG."""
        # 1. Embed the query
        query_embedding = self.embedder.embed_query(question)
        # 2. Retrieve relevant chunks
        results = self.vector_store.search(query_embedding, k=k)
        # 3. Build context
        context = "\n\n".join(
            f"[Source {i+1}]: {r['text']}"
            for i, r in enumerate(results)
        )
        # 4. Generate the answer
        prompt = f"""Answer the question based on the provided context.
If the answer cannot be determined from the context, say so.

Context:
{context}

Question: {question}

Answer:"""
        answer = self.llm.complete(prompt)
        return {
            "answer": answer,
            "sources": [
                {
                    "text": r["text"][:200] + "...",
                    "score": r["score"],
                    "metadata": r["metadata"]
                }
                for r in results
            ]
        }
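Step 3 of query(), building the numbered context block, is worth isolating: the [Source N] labels are what let the model (and the final response) point back at specific chunks. A standalone sketch:

```python
def build_context(results: list) -> str:
    """Number each retrieved chunk so answers can cite specific sources."""
    return "\n\n".join(
        f"[Source {i + 1}]: {r['text']}" for i, r in enumerate(results)
    )

hits = [{"text": "RAG combines retrieval with generation."},
        {"text": "Chunks are embedded and stored in a vector DB."}]
print(build_context(hits))
# [Source 1]: RAG combines retrieval with generation.
#
# [Source 2]: Chunks are embedded and stored in a vector DB.
```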
Advanced Techniques
1. Hybrid Search
class HybridRetriever:
    """Combine semantic and keyword search."""

    def __init__(self, vector_store, bm25, embedder):
        self.vector_store = vector_store  # Semantic search
        self.bm25 = bm25                  # Keyword search
        self.embedder = embedder          # Query embedder

    def search(self, query: str, k: int = 5, alpha: float = 0.5):
        """Combine semantic and keyword results with alpha weighting."""
        # Run both retrievers, over-fetching so the merge has candidates
        query_emb = self.embedder.embed_query(query)
        semantic_results = self.vector_store.search(query_emb, k=k * 2)
        keyword_results = self.bm25.search(query, k=k * 2)
        # Key results by a text prefix so the two lists can be matched up
        by_key = {}
        for r in semantic_results + keyword_results:
            by_key.setdefault(r["text"][:50], r)
        semantic_scores = {r["text"][:50]: r["score"] for r in semantic_results}
        keyword_scores = {r["text"][:50]: r["score"] for r in keyword_results}
        sem_max = max(semantic_scores.values(), default=1.0) or 1.0
        key_max = max(keyword_scores.values(), default=1.0) or 1.0
        # Merge: max-normalize each score list, then weight by alpha
        combined = {}
        for key in set(semantic_scores) | set(keyword_scores):
            sem_norm = semantic_scores.get(key, 0) / sem_max
            key_norm = keyword_scores.get(key, 0) / key_max
            combined[key] = alpha * sem_norm + (1 - alpha) * key_norm
        # Return the top k results ordered by combined score,
        # keeping hits found by either retriever
        top_keys = sorted(combined, key=combined.get, reverse=True)[:k]
        return [by_key[key] for key in top_keys]
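The score-fusion step is easier to see in isolation. A minimal sketch with made-up scores; note the two score scales differ (cosine similarities vs. BM25 scores), which is exactly why each list is max-normalized before the alpha weighting:

```python
def fuse(semantic: dict, keyword: dict, alpha: float = 0.5) -> dict:
    """Merge two score maps: each is max-normalized, then alpha-weighted."""
    sem_max = max(semantic.values(), default=1.0) or 1.0
    key_max = max(keyword.values(), default=1.0) or 1.0
    return {
        doc: alpha * semantic.get(doc, 0.0) / sem_max
             + (1 - alpha) * keyword.get(doc, 0.0) / key_max
        for doc in set(semantic) | set(keyword)
    }

scores = fuse({"doc_a": 0.9, "doc_b": 0.3}, {"doc_b": 12.0, "doc_c": 6.0})
# doc_a: 0.5 * 1.0 = 0.5
# doc_b: 0.5 * (0.3 / 0.9) + 0.5 * 1.0 ≈ 0.667  (found by both, ranks first)
# doc_c: 0.5 * 0.5 = 0.25
```

Reciprocal rank fusion is a common alternative that sidesteps score normalization entirely by combining ranks instead of raw scores.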
2. Re-ranking
from typing import Dict, List


class ReRanker:
    """Re-rank retrieved results for better relevance."""

    def __init__(self, rerank_model: str = "cohere-rerank"):
        # Placeholder: load a cross-encoder client for this model name
        # (e.g. Cohere's rerank API or a sentence-transformers CrossEncoder)
        self.model = self._load_model(rerank_model)

    def _load_model(self, model_name: str):
        raise NotImplementedError

    def rerank(self, query: str, results: List[Dict], top_n: int = 3):
        """Re-rank results using a cross-encoder."""
        # Prepare (query, document) pairs
        pairs = [(query, r["text"]) for r in results]
        # Get relevance scores
        scores = self.model.predict(pairs)
        # Attach scores and re-sort
        for r, score in zip(results, scores):
            r["rerank_score"] = score
        results.sort(key=lambda x: x["rerank_score"], reverse=True)
        return results[:top_n]
3. Query Expansion
from typing import List


class QueryExpander:
    """Expand queries for better retrieval."""

    def __init__(self, llm):
        self.llm = llm

    def expand(self, query: str) -> List[str]:
        """Generate query variations."""
        prompt = f"""Generate 3 different versions of this search query that capture the same intent:

Original: {query}

Variations should:
- Use different words with similar meaning
- Include possible synonyms
- Consider different ways the question might be asked

List one variation per line:"""
        response = self.llm.complete(prompt)
        # Parse variations (one per line), keeping the original query first
        variations = [line.strip() for line in response.split('\n') if line.strip()]
        return [query] + variations[:3]
4. Parent Document Retrieval
from typing import Dict, List


class ParentDocumentRetriever:
    """Retrieve larger document sections, not just chunks."""

    def __init__(self, child_chunk_size=500, parent_chunk_size=2000):
        self.child_chunk_size = child_chunk_size
        self.parent_chunk_size = parent_chunk_size
        self.child_store = VectorStore()
        self.parent_store = VectorStore()

    def index(self, documents: List[Dict]):
        """Index at both parent and child levels."""
        for doc in documents:
            # Create parent chunks (stored, returned at query time)
            parent_chunks = self.chunk(doc["text"], self.parent_chunk_size)
            # Create child chunks (embedded and searched), each tagged with
            # the parent_id of the parent chunk it falls inside
            child_chunks = self.chunk(doc["text"], self.child_chunk_size)
            # Add to the respective stores
            # ... (similar to basic RAG)

    def search(self, query_embedding: List[float], k: int = 3):
        """Search children, then return their parents."""
        # Over-fetch child chunks, since several may share a parent
        child_results = self.child_store.search(query_embedding, k=k * 5)
        # Deduplicate to the parent documents they belong to
        parent_ids = {r["metadata"]["parent_id"] for r in child_results}
        # Return the parent documents (chunk() and get_by_id() are
        # assumed helpers, not defined in the basic VectorStore above)
        return [self.parent_store.get_by_id(pid) for pid in parent_ids]
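The child-to-parent step at query time is just a deduplication that preserves hit order. A standalone sketch, with hypothetical parent_id tags on the child hits:

```python
def parents_for(child_hits: list, parents: dict) -> list:
    """Map retrieved child chunks back to their deduplicated parent sections,
    preserving the order in which each parent is first hit."""
    seen, out = set(), []
    for hit in child_hits:
        pid = hit["parent_id"]
        if pid not in seen:
            seen.add(pid)
            out.append(parents[pid])
    return out

parents = {"p1": "Section one full text...", "p2": "Section two full text..."}
hits = [{"parent_id": "p1"}, {"parent_id": "p2"}, {"parent_id": "p1"}]
parents_for(hits, parents)
# → ["Section one full text...", "Section two full text..."]
```

Small chunks give precise matches; returning their parents gives the LLM enough surrounding context to actually answer.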
Evaluation
Metrics
from typing import Dict, List


class RAGEvaluator:
    """Evaluate RAG system performance."""

    def evaluate(self, rag_system, test_cases: List[Dict]) -> Dict:
        results = []
        for case in test_cases:
            answer = rag_system.query(case["question"])
            results.append({
                "question": case["question"],
                "expected": case["expected_answer"],
                "actual": answer["answer"],
                "retrieved_sources": len(answer["sources"]),
                "source_relevance": self.check_source_relevance(
                    case["question"],
                    answer["sources"]
                )
            })
        return self.summarize(results)

    def check_source_relevance(self, question: str, sources: List[Dict]) -> float:
        """Check if retrieved sources are relevant to the question."""
        # Simple heuristic: count word overlap between question and source
        question_words = set(question.lower().split())
        relevant_count = 0
        for source in sources:
            source_words = set(source["text"].lower().split())
            overlap = len(question_words & source_words)
            if overlap > 3:
                relevant_count += 1
        return relevant_count / len(sources) if sources else 0

    def summarize(self, results: List[Dict]) -> Dict:
        return {
            "total_cases": len(results),
            "avg_sources_retrieved": sum(r["retrieved_sources"] for r in results) / len(results),
            "avg_source_relevance": sum(r["source_relevance"] for r in results) / len(results),
            "example_failures": [r for r in results if r["source_relevance"] < 0.3]
        }
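check_source_relevance is a crude word-overlap heuristic, but it is easy to test in isolation. A standalone version (the "more than 3 shared words" threshold is the same arbitrary cutoff used above; real evaluations typically use an LLM judge or labeled relevance data instead):

```python
def source_relevance(question: str, sources: list, min_overlap: int = 3) -> float:
    """Fraction of sources sharing more than `min_overlap` words with the question."""
    q_words = set(question.lower().split())
    relevant = sum(
        1 for s in sources
        if len(q_words & set(s.lower().split())) > min_overlap
    )
    return relevant / len(sources) if sources else 0.0

q = "how does retrieval augmented generation handle private data"
source_relevance(q, [
    "retrieval augmented generation can handle private data safely",
    "bananas are yellow",
])  # → 0.5 (one of two sources clears the overlap threshold)
```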
Best Practices
1. Chunking Strategy
# Best practices for chunking
chunking_strategies = {
    "fixed_size": {
        "pros": "Simple, predictable",
        "cons": "May break semantic units",
        "best_for": "General purpose"
    },
    "sentence_aware": {
        "pros": "Preserves meaning",
        "cons": "More complex",
        "best_for": "Natural language content"
    },
    "recursive": {
        "pros": "Multiple granularity levels",
        "cons": "Complex implementation",
        "best_for": "Large documents"
    },
    "semantic": {
        "pros": "Meaningful chunks",
        "cons": "Requires embedding model",
        "best_for": "Structured content"
    }
}
2. Embedding Selection
| Model | Dimensions | Cost | Quality | Use Case |
|---|---|---|---|---|
| text-embedding-3-small | 1536 | Low | Good | Production |
| text-embedding-3-large | 3072 | Medium | Best | High accuracy |
| Cohere-multilingual | 1024 | Medium | Good | Multi-language |
| BGE-large | 1024 | Free (self-hosted) | Very good | Cost-sensitive |
3. Retrieval Optimization
# Optimization techniques
optimization_tips = [
    "Use hybrid search (semantic + keyword)",
    "Re-rank results for better relevance",
    "Experiment with chunk sizes",
    "Add query expansion",
    "Use parent-document retrieval for context",
    "Filter by metadata when possible",
    "Monitor and tune recall/precision"
]
Conclusion
RAG has become essential for building AI applications that need access to specific knowledge. Key takeaways:
- Separate indexing from retrieval - Build once, query many times
- Chunk thoughtfully - Consider semantic boundaries
- Hybrid approaches win - Combine semantic and keyword search
- Evaluate continuously - Measure retrieval and generation quality
- Iterate - RAG optimization is empirical
With these patterns, you can build RAG systems that are accurate, efficient, and production-ready.