Introduction
Retrieval-Augmented Generation (RAG) has become the standard architecture for building AI applications that need access to specific knowledge. By combining the power of large language models with targeted information retrieval, RAG enables AI systems that are accurate, verifiable, and grounded in your data.
This comprehensive guide covers RAG architecture, implementation, and optimization.
Why RAG?
The Problem with Pure LLMs
Limitations of pure LLMs:

- Factual hallucinations: LLMs can generate incorrect information.
- Outdated knowledge: the training cutoff means no awareness of recent events.
- No access to private data: they can't query your documents, databases, or APIs.
- Can't verify sources: there is no way to cite or reference information.
- Context window limits: you can't include all relevant documents in a prompt.
How RAG Solves These
RAG Architecture:

Query → Embed Model → Vector Database (search) → Relevant Docs → Context + Retrieved Docs → LLM (generate) → Answer
Core Components
1. Document Processing Pipeline
from pathlib import Path
from typing import Any, Dict, List

import PyPDF2
import docx


class DocumentProcessor:
    """Process various document formats into chunks."""

    def __init__(self, chunk_size: int = 1000, overlap: int = 200):
        self.chunk_size = chunk_size
        self.overlap = overlap

    def process(self, file_path: str) -> List[Dict[str, Any]]:
        """Process a file and return chunks."""
        ext = Path(file_path).suffix.lower()
        if ext == '.pdf':
            text = self.read_pdf(file_path)
        elif ext == '.docx':
            text = self.read_docx(file_path)
        elif ext == '.txt':
            text = self.read_txt(file_path)
        else:
            raise ValueError(f"Unsupported file type: {ext}")
        return self.chunk_text(text, file_path)

    def read_pdf(self, path: str) -> str:
        text = []
        with open(path, 'rb') as f:
            reader = PyPDF2.PdfReader(f)
            for page in reader.pages:
                text.append(page.extract_text())
        return '\n\n'.join(text)

    def read_docx(self, path: str) -> str:
        doc = docx.Document(path)
        return '\n\n'.join(p.text for p in doc.paragraphs)

    def read_txt(self, path: str) -> str:
        with open(path, 'r', encoding='utf-8') as f:
            return f.read()

    def chunk_text(self, text: str, source: str) -> List[Dict[str, Any]]:
        """Split text into overlapping chunks."""
        chunks = []
        start = 0
        while start < len(text):
            end = start + self.chunk_size
            # Try to break at a sentence boundary
            if end < len(text):
                for sep in ['. ', '! ', '? ', '\n']:
                    last_sep = text.rfind(sep, start, end)
                    if last_sep > start:
                        end = last_sep + len(sep)
                        break
            chunk = text[start:end].strip()
            if chunk:
                chunks.append({
                    "text": chunk,
                    "source": source,
                    "start": start,
                    "end": end
                })
            # Advance by at least one character so the loop cannot stall
            # when a sentence break lands inside the overlap window
            start = max(end - self.overlap, start + 1)
        return chunks
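The trickiest part of the chunker above is the sliding window. Here is a minimal standalone sketch of the same idea (sentence-boundary logic omitted) run on a toy string:

```python
def chunk(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        # Advance by at least one character so the loop always terminates
        start = max(start + size - overlap, start + 1)
    return chunks

chunk("abcdefghij", size=4, overlap=2)
# → ["abcd", "cdef", "efgh", "ghij", "ij"]
```

Each window starts `overlap` characters before the previous one ended, so a sentence split across a chunk boundary still appears whole in at least one chunk.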
2. Embedding Generation
from typing import List


class Embedder:
    """Generate embeddings for text chunks."""

    def __init__(self, model_name: str = "text-embedding-3-small"):
        self.model_name = model_name
        # Initialize embedding model
        self.model = self._load_model(model_name)

    def embed(self, texts: List[str]) -> List[List[float]]:
        """Generate embeddings for a list of texts."""
        response = self.model.encode(texts)
        return [embedding.tolist() for embedding in response]

    def embed_query(self, query: str) -> List[float]:
        """Embed a single query."""
        return self.embed([query])[0]

    def _load_model(self, model_name: str):
        # Load model - could be OpenAI, Cohere, HuggingFace, etc.
        # Left as a stub: plug in your provider's client here.
        raise NotImplementedError
3. Vector Database
from typing import Dict, List

import numpy as np


class VectorStore:
    """Vector database for similarity search."""

    def __init__(self, dimension: int = 1536):
        self.dimension = dimension
        self.vectors = []    # List of embeddings
        self.metadata = []   # List of metadata
        self.documents = []  # List of text chunks

    def add(self, chunks: List[Dict], embeddings: List[List[float]]):
        """Add chunks with embeddings to the store."""
        for chunk, embedding in zip(chunks, embeddings):
            self.vectors.append(np.array(embedding))
            self.metadata.append({
                "source": chunk["source"],
                "start": chunk.get("start"),
                "end": chunk.get("end")
            })
            self.documents.append(chunk["text"])

    def search(self, query_embedding: List[float], k: int = 5) -> List[Dict]:
        """Find the k most similar chunks."""
        query = np.array(query_embedding)
        similarities = []
        for i, vec in enumerate(self.vectors):
            # Cosine similarity
            sim = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
            similarities.append((i, sim))
        # Sort by similarity (descending) and keep the top k
        similarities.sort(key=lambda x: x[1], reverse=True)
        results = []
        for idx, score in similarities[:k]:
            results.append({
                "text": self.documents[idx],
                "metadata": self.metadata[idx],
                "score": float(score)
            })
        return results

    def save(self, path: str):
        """Save to disk."""
        np.save(f"{path}/vectors.npy", np.array(self.vectors))
        # Save metadata and documents (e.g. as JSON) alongside the vectors

    def load(self, path: str):
        """Load from disk."""
        self.vectors = list(np.load(f"{path}/vectors.npy"))
        # Load metadata and documents
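The search method above is a brute-force cosine-similarity scan. That scoring step can be isolated into a small sketch, here on made-up 2-dimensional vectors:

```python
import numpy as np

def cosine_top_k(query: np.ndarray, vectors: list, k: int) -> list:
    """Return indices of the k vectors most similar to the query (cosine)."""
    sims = [float(np.dot(query, v) / (np.linalg.norm(query) * np.linalg.norm(v)))
            for v in vectors]
    # Rank indices by similarity, highest first
    return sorted(range(len(vectors)), key=lambda i: sims[i], reverse=True)[:k]

docs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.7, 0.7])]
cosine_top_k(np.array([1.0, 0.1]), docs, k=2)  # → [0, 2]
```

For large collections, production systems replace this linear scan with an approximate nearest-neighbor index (FAISS, HNSW, or a hosted vector database), but the similarity measure is the same.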
RAG Pipeline
Complete Implementation
from typing import Any, Dict, List


class RAGSystem:
    """Complete RAG system."""

    def __init__(
        self,
        chunk_size: int = 1000,
        embedding_model: str = "text-embedding-3-small",
        llm_model: str = "gpt-4"
    ):
        self.processor = DocumentProcessor(chunk_size=chunk_size)
        self.embedder = Embedder(embedding_model)
        self.vector_store = VectorStore()
        # LLM is assumed to be a thin wrapper around your chat model's
        # completion API, exposing a complete(prompt) -> str method
        self.llm = LLM(llm_model)

    def index_documents(self, file_paths: List[str]):
        """Process and index documents."""
        all_chunks = []
        for path in file_paths:
            print(f"Processing {path}...")
            chunks = self.processor.process(path)
            all_chunks.extend(chunks)
        print(f"Generating embeddings for {len(all_chunks)} chunks...")
        texts = [c["text"] for c in all_chunks]
        embeddings = self.embedder.embed(texts)
        print("Adding to vector store...")
        self.vector_store.add(all_chunks, embeddings)
        print(f"Indexed {len(all_chunks)} chunks")

    def query(self, question: str, k: int = 5) -> Dict[str, Any]:
        """Answer a question using RAG."""
        # 1. Embed the query
        query_embedding = self.embedder.embed_query(question)
        # 2. Retrieve relevant chunks
        results = self.vector_store.search(query_embedding, k=k)
        # 3. Build context
        context = "\n\n".join(
            f"[Source {i+1}]: {r['text']}"
            for i, r in enumerate(results)
        )
        # 4. Generate the answer
        prompt = f"""Answer the question based on the provided context.
If the answer cannot be determined from the context, say so.

Context:
{context}

Question: {question}

Answer:"""
        answer = self.llm.complete(prompt)
        return {
            "answer": answer,
            "sources": [
                {
                    "text": r["text"][:200] + "...",
                    "score": r["score"],
                    "metadata": r["metadata"]
                }
                for r in results
            ]
        }
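Step 3 of query(), building the numbered context block, is worth isolating: the [Source N] labels are what let the model (and the final response) point back at specific chunks. A standalone sketch:

```python
def build_context(results: list) -> str:
    """Number each retrieved chunk so answers can cite specific sources."""
    return "\n\n".join(
        f"[Source {i + 1}]: {r['text']}" for i, r in enumerate(results)
    )

hits = [{"text": "RAG combines retrieval with generation."},
        {"text": "Chunks are embedded and stored in a vector DB."}]
print(build_context(hits))
# [Source 1]: RAG combines retrieval with generation.
#
# [Source 2]: Chunks are embedded and stored in a vector DB.
```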
Advanced Techniques
1. Hybrid Search
class HybridRetriever:
    """Combine semantic and keyword search."""

    def __init__(self, vector_store, bm25, embedder):
        self.vector_store = vector_store  # Semantic search
        self.bm25 = bm25                  # Keyword search
        self.embedder = embedder          # Query embedder

    def search(self, query: str, k: int = 5, alpha: float = 0.5):
        """Combine semantic and keyword results with alpha weighting."""
        # Run both retrievers, over-fetching so the merge has candidates
        query_emb = self.embedder.embed_query(query)
        semantic_results = self.vector_store.search(query_emb, k=k * 2)
        keyword_results = self.bm25.search(query, k=k * 2)
        # Key results by a text prefix so the two lists can be matched up
        by_key = {}
        for r in semantic_results + keyword_results:
            by_key.setdefault(r["text"][:50], r)
        semantic_scores = {r["text"][:50]: r["score"] for r in semantic_results}
        keyword_scores = {r["text"][:50]: r["score"] for r in keyword_results}
        sem_max = max(semantic_scores.values(), default=1.0) or 1.0
        key_max = max(keyword_scores.values(), default=1.0) or 1.0
        # Merge: max-normalize each score list, then weight by alpha
        combined = {}
        for key in set(semantic_scores) | set(keyword_scores):
            sem_norm = semantic_scores.get(key, 0) / sem_max
            key_norm = keyword_scores.get(key, 0) / key_max
            combined[key] = alpha * sem_norm + (1 - alpha) * key_norm
        # Return the top k results ordered by combined score,
        # keeping hits found by either retriever
        top_keys = sorted(combined, key=combined.get, reverse=True)[:k]
        return [by_key[key] for key in top_keys]
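The score-fusion step is easier to see in isolation. A minimal sketch with made-up scores; note the two score scales differ (cosine similarities vs. BM25 scores), which is exactly why each list is max-normalized before the alpha weighting:

```python
def fuse(semantic: dict, keyword: dict, alpha: float = 0.5) -> dict:
    """Merge two score maps: each is max-normalized, then alpha-weighted."""
    sem_max = max(semantic.values(), default=1.0) or 1.0
    key_max = max(keyword.values(), default=1.0) or 1.0
    return {
        doc: alpha * semantic.get(doc, 0.0) / sem_max
             + (1 - alpha) * keyword.get(doc, 0.0) / key_max
        for doc in set(semantic) | set(keyword)
    }

scores = fuse({"doc_a": 0.9, "doc_b": 0.3}, {"doc_b": 12.0, "doc_c": 6.0})
# doc_a: 0.5 * 1.0 = 0.5
# doc_b: 0.5 * (0.3 / 0.9) + 0.5 * 1.0 ≈ 0.667  (found by both, ranks first)
# doc_c: 0.5 * 0.5 = 0.25
```

Reciprocal rank fusion is a common alternative that sidesteps score normalization entirely by combining ranks instead of raw scores.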
2. Re-ranking
from typing import Dict, List


class ReRanker:
    """Re-rank retrieved results for better relevance."""

    def __init__(self, rerank_model: str = "cohere-rerank"):
        # Placeholder: load a cross-encoder client for this model name
        # (e.g. Cohere's rerank API or a sentence-transformers CrossEncoder)
        self.model = self._load_model(rerank_model)

    def _load_model(self, model_name: str):
        raise NotImplementedError

    def rerank(self, query: str, results: List[Dict], top_n: int = 3):
        """Re-rank results using a cross-encoder."""
        # Prepare (query, document) pairs
        pairs = [(query, r["text"]) for r in results]
        # Get relevance scores
        scores = self.model.predict(pairs)
        # Attach scores and re-sort
        for r, score in zip(results, scores):
            r["rerank_score"] = score
        results.sort(key=lambda x: x["rerank_score"], reverse=True)
        return results[:top_n]
3. Query Expansion
from typing import List


class QueryExpander:
    """Expand queries for better retrieval."""

    def __init__(self, llm):
        self.llm = llm

    def expand(self, query: str) -> List[str]:
        """Generate query variations."""
        prompt = f"""Generate 3 different versions of this search query that capture the same intent:

Original: {query}

Variations should:
- Use different words with similar meaning
- Include possible synonyms
- Consider different ways the question might be asked

List one variation per line:"""
        response = self.llm.complete(prompt)
        # Parse variations (one per line), keeping the original query first
        variations = [line.strip() for line in response.split('\n') if line.strip()]
        return [query] + variations[:3]
4. Parent Document Retrieval
from typing import Dict, List


class ParentDocumentRetriever:
    """Retrieve larger document sections, not just chunks."""

    def __init__(self, child_chunk_size=500, parent_chunk_size=2000):
        self.child_chunk_size = child_chunk_size
        self.parent_chunk_size = parent_chunk_size
        self.child_store = VectorStore()
        self.parent_store = VectorStore()

    def index(self, documents: List[Dict]):
        """Index at both parent and child levels."""
        for doc in documents:
            # Create parent chunks (stored, returned at query time)
            parent_chunks = self.chunk(doc["text"], self.parent_chunk_size)
            # Create child chunks (embedded and searched), each tagged with
            # the parent_id of the parent chunk it falls inside
            child_chunks = self.chunk(doc["text"], self.child_chunk_size)
            # Add to the respective stores
            # ... (similar to basic RAG)

    def search(self, query_embedding: List[float], k: int = 3):
        """Search children, then return their parents."""
        # Over-fetch child chunks, since several may share a parent
        child_results = self.child_store.search(query_embedding, k=k * 5)
        # Deduplicate to the parent documents they belong to
        parent_ids = {r["metadata"]["parent_id"] for r in child_results}
        # Return the parent documents (chunk() and get_by_id() are
        # assumed helpers, not defined in the basic VectorStore above)
        return [self.parent_store.get_by_id(pid) for pid in parent_ids]
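The child-to-parent step at query time is just a deduplication that preserves hit order. A standalone sketch, with hypothetical parent_id tags on the child hits:

```python
def parents_for(child_hits: list, parents: dict) -> list:
    """Map retrieved child chunks back to their deduplicated parent sections,
    preserving the order in which each parent is first hit."""
    seen, out = set(), []
    for hit in child_hits:
        pid = hit["parent_id"]
        if pid not in seen:
            seen.add(pid)
            out.append(parents[pid])
    return out

parents = {"p1": "Section one full text...", "p2": "Section two full text..."}
hits = [{"parent_id": "p1"}, {"parent_id": "p2"}, {"parent_id": "p1"}]
parents_for(hits, parents)
# → ["Section one full text...", "Section two full text..."]
```

Small chunks give precise matches; returning their parents gives the LLM enough surrounding context to actually answer.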
Evaluation
Metrics
from typing import Dict, List


class RAGEvaluator:
    """Evaluate RAG system performance."""

    def evaluate(self, rag_system, test_cases: List[Dict]) -> Dict:
        results = []
        for case in test_cases:
            answer = rag_system.query(case["question"])
            results.append({
                "question": case["question"],
                "expected": case["expected_answer"],
                "actual": answer["answer"],
                "retrieved_sources": len(answer["sources"]),
                "source_relevance": self.check_source_relevance(
                    case["question"],
                    answer["sources"]
                )
            })
        return self.summarize(results)

    def check_source_relevance(self, question: str, sources: List[Dict]) -> float:
        """Check if retrieved sources are relevant to the question."""
        # Simple heuristic: count word overlap between question and source
        question_words = set(question.lower().split())
        relevant_count = 0
        for source in sources:
            source_words = set(source["text"].lower().split())
            overlap = len(question_words & source_words)
            if overlap > 3:
                relevant_count += 1
        return relevant_count / len(sources) if sources else 0

    def summarize(self, results: List[Dict]) -> Dict:
        return {
            "total_cases": len(results),
            "avg_sources_retrieved": sum(r["retrieved_sources"] for r in results) / len(results),
            "avg_source_relevance": sum(r["source_relevance"] for r in results) / len(results),
            "example_failures": [r for r in results if r["source_relevance"] < 0.3]
        }
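check_source_relevance is a crude word-overlap heuristic, but it is easy to test in isolation. A standalone version (the "more than 3 shared words" threshold is the same arbitrary cutoff used above; real evaluations typically use an LLM judge or labeled relevance data instead):

```python
def source_relevance(question: str, sources: list, min_overlap: int = 3) -> float:
    """Fraction of sources sharing more than `min_overlap` words with the question."""
    q_words = set(question.lower().split())
    relevant = sum(
        1 for s in sources
        if len(q_words & set(s.lower().split())) > min_overlap
    )
    return relevant / len(sources) if sources else 0.0

q = "how does retrieval augmented generation handle private data"
source_relevance(q, [
    "retrieval augmented generation can handle private data safely",
    "bananas are yellow",
])  # → 0.5 (one of two sources clears the overlap threshold)
```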
Best Practices
1. Chunking Strategy
# Best practices for chunking
chunking_strategies = {
    "fixed_size": {
        "pros": "Simple, predictable",
        "cons": "May break semantic units",
        "best_for": "General purpose"
    },
    "sentence_aware": {
        "pros": "Preserves meaning",
        "cons": "More complex",
        "best_for": "Natural language content"
    },
    "recursive": {
        "pros": "Multiple granularity levels",
        "cons": "Complex implementation",
        "best_for": "Large documents"
    },
    "semantic": {
        "pros": "Meaningful chunks",
        "cons": "Requires embedding model",
        "best_for": "Structured content"
    }
}
2. Embedding Selection
| Model | Dimensions | Cost | Quality | Use Case |
|---|---|---|---|---|
| text-embedding-3-small | 1536 | Low | Good | Production |
| text-embedding-3-large | 3072 | Medium | Best | High accuracy |
| Cohere-multilingual | 1024 | Medium | Good | Multi-language |
| BGE-large | 1024 | Free (self-hosted) | Very good | Cost-sensitive |
3. Retrieval Optimization
# Optimization techniques
optimization_tips = [
    "Use hybrid search (semantic + keyword)",
    "Re-rank results for better relevance",
    "Experiment with chunk sizes",
    "Add query expansion",
    "Use parent-document retrieval for context",
    "Filter by metadata when possible",
    "Monitor and tune recall/precision"
]
Conclusion
RAG has become essential for building AI applications that need access to specific knowledge. Key takeaways:
- Separate indexing from retrieval - Build once, query many times
- Chunk thoughtfully - Consider semantic boundaries
- Hybrid approaches win - Combine semantic and keyword search
- Evaluate continuously - Measure retrieval and generation quality
- Iterate - RAG optimization is empirical
With these patterns, you can build RAG systems that are accurate, efficient, and production-ready.