Introduction
Retrieval-Augmented Generation (RAG) has transformed how we build AI applications that require access to external knowledge. At the heart of any RAG system lies the retrieval mechanism—the component responsible for finding the most relevant information to feed into the language model. For years, the dominant approach has been dense vector search, where documents are converted into embeddings and retrieved based on semantic similarity. While powerful, this approach has notable limitations that can significantly impact retrieval quality.
Hybrid search emerges as the solution that addresses these limitations by combining multiple retrieval strategies. Rather than relying on a single approach, hybrid systems intelligently combine dense vector embeddings, sparse keyword-based retrieval (like BM25), and increasingly, graph-based traversal. This multi-strategy approach captures different aspects of relevance—semantic meaning, exact term matching, and structural relationships—that no single method can fully capture.
In this comprehensive guide, we’ll explore hybrid search from fundamentals to production implementation. You’ll understand why single-strategy retrieval often fails, how different search algorithms complement each other, practical implementation approaches using modern frameworks, and strategies for optimizing hybrid RAG systems in 2026.
The Problem with Single-Strategy Retrieval
Limitations of Pure Vector Search
Dense vector search revolutionized information retrieval by enabling semantic understanding. Instead of matching exact keywords, embeddings capture meaning, allowing systems to find relevant documents even when they don’t share vocabulary. A query about “vehicles that don’t need gasoline” can successfully retrieve documents about “electric cars” because the embeddings place these concepts close in vector space.
However, vector search has significant blind spots. The first major issue is vocabulary mismatch. Embeddings represent documents as points in a high-dimensional space, but this compression inevitably loses information. A document about “artificial intelligence” might be embedded near “machine learning” yet not reliably near the abbreviation “AI” or domain-specific jargon. Users often search using different words than those in relevant documents, and semantic similarity alone cannot bridge every vocabulary gap.
The second issue is sensitivity to embedding quality. Embedding models are trained on specific data distributions and perform variably across domains. A model trained primarily on technical documentation might struggle with legal contracts or medical records. Poor-quality embeddings lead to poor retrieval, yet evaluating embedding quality is challenging and often overlooked.
The third problem is lack of interpretability. When vector search retrieves documents, it’s difficult to understand why a particular document was selected. You can examine the similarity score, but this doesn’t explain which parts of the document matched or how the semantic matching worked. This opacity makes debugging retrieval failures challenging.
Finally, there’s the computational cost. Exhaustive vector search computes similarity between the query embedding and every document embedding in the collection. Approximate nearest neighbor (ANN) algorithms speed this up dramatically, but vector retrieval remains memory- and compute-intensive, especially for large-scale systems.
Limitations of Pure Keyword Search
Keyword-based search, exemplified by algorithms like BM25, has served information retrieval for decades. It excels at exact matching—finding documents that contain specific terms. This makes it particularly effective for queries with specific identifiers, technical terms, or proper nouns where semantic understanding would be less reliable.
However, keyword search has its own severe limitations. The most critical is the vocabulary problem—you must know the exact terms used in relevant documents. Searching for “vehicle propulsion” won’t find documents discussing “car engines” or “automotive motors.” This limitation is particularly problematic in domains with diverse terminology, synonyms, and abbreviations.
Keyword search also struggles with understanding context. The same word can have different meanings in different contexts. A query for “Java” might match documents about the programming language, the island, or the coffee—but keyword search treats them all equally without understanding intent.
Additionally, keyword search cannot handle natural language queries well. Users increasingly search using full sentences and questions rather than keyword lists. While techniques like query expansion can help, they’re fundamentally limited without semantic understanding.
The Hybrid Solution
Hybrid search addresses these limitations by combining multiple retrieval strategies. Each approach compensates for the other’s weaknesses:
Vector search provides semantic understanding, capturing meaning beyond exact word matches. It can find related concepts even when vocabulary differs.
Keyword search ensures exact term matching, critical for technical queries, specific names, and domain-specific terminology. It guarantees that documents containing the exact query terms remain retrievable, which semantic similarity alone cannot promise.
Graph traversal adds relationship understanding, enabling multi-hop reasoning and finding documents connected through explicit relationships.
The key insight is that different queries benefit from different strategies. A query like “What is ChatGPT?” might be answered well by semantic search, while “OpenAI API pricing” benefits from exact keyword matching. Hybrid systems can detect query characteristics and weight strategies accordingly.
Research consistently shows hybrid approaches outperforming single-strategy retrieval, with published evaluations commonly reporting gains on the order of 10-30% in retrieval accuracy, though the improvement varies by dataset and domain. This makes hybrid search the default choice for production RAG systems.
Understanding Search Algorithms
Dense Vector Search
Dense vector search represents documents and queries as dense vectors—arrays of floating-point numbers with most values non-zero. These vectors are produced by embedding models that encode semantic meaning.
The process works as follows: First, an embedding model converts text into a fixed-dimensional vector, typically 384 to 4096 dimensions depending on the model. Modern models like OpenAI’s text-embedding-3, Cohere’s embed-multilingual, or open-source options like BGE produce high-quality embeddings.
At query time, the same model converts the query into a vector. The system then computes similarity between the query vector and all document vectors. Similarity is typically measured using cosine similarity, dot product, or Euclidean distance.
For efficiency, approximate nearest neighbor (ANN) algorithms index vectors to enable fast searching. Popular options include FAISS (Facebook AI Similarity Search), hnswlib (an implementation of Hierarchical Navigable Small World graphs), and ScaNN. These algorithms sacrifice some accuracy for dramatic speed improvements, making search over millions of documents feasible.
Key parameters in vector search include the embedding model choice (which determines what relationships are captured), the number of dimensions (more dimensions capture more nuance but require more memory), the similarity metric (cosine works well for text), and the index type (HNSW provides good recall-speed tradeoffs).
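To make the mechanics concrete, here is a minimal pure-Python sketch of brute-force cosine-similarity search over toy 4-dimensional "embeddings" (real systems use much higher dimensions and an ANN index instead of this linear scan):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Indices of the k documents most similar to the query."""
    sims = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    sims.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in sims[:k]]

# Toy embeddings: direction, not magnitude, is what cosine compares
docs = [
    [1.0, 0.0, 0.0, 0.0],  # doc 0
    [0.9, 0.1, 0.0, 0.0],  # doc 1: points almost the same direction as the query
    [0.0, 1.0, 0.0, 0.0],  # doc 2
    [0.0, 0.0, 1.0, 0.0],  # doc 3
]
query = [1.0, 0.1, 0.0, 0.0]
print(top_k(query, docs))  # → [1, 0]
```

The linear scan here is O(corpus size) per query; ANN indexes like HNSW trade a small amount of recall for sublinear search time.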
Sparse Keyword Search (BM25)
BM25 (Best Matching 25) is a ranking function used in information retrieval. It ranks documents based on the frequency of query terms appearing in each document, with length normalization to prevent bias toward longer documents.
BM25 originated from probabilistic information retrieval models and remains highly effective despite its age. Its enduring popularity stems from several advantages: it’s fast, requires no training data, is interpretable (you can see exactly which terms matched), and excels at exact term matching.
The BM25 scoring formula considers term frequency (how often the query term appears in the document), inverse document frequency (how rare the term is across the corpus), and document length normalization. Terms that appear frequently in a document but rarely across the corpus score highest.
Implementation is straightforward. Most search engines, including Elasticsearch and Solr, use BM25 as their default ranking function. In Python, the rank_bm25 library provides a pure-Python implementation, while PyLucene exposes Lucene’s faster Java implementation.
Key parameters include k1, which controls term frequency saturation (values around 1.2 to 2.0 are common, with 1.5 a reasonable default), and b, which controls length normalization (typically 0.75). Both can be tuned for specific corpora.
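The scoring formula described above can be written out directly. The following is a simplified sketch of a single term's BM25 contribution, using one common IDF variant (exact details differ slightly between implementations, so treat this as illustrative):

```python
import math

def bm25_term_score(tf: int, doc_len: int, avg_doc_len: float,
                    n_docs: int, df: int, k1: float = 1.5, b: float = 0.75) -> float:
    """BM25 contribution of one query term to one document's score.

    tf: term frequency in the document; df: number of documents containing the term.
    """
    # IDF: terms that are rare across the corpus score higher
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # TF with saturation (k1) and document-length normalization (b)
    norm_tf = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf

# A term appearing 3 times in an average-length document,
# present in only 5 of 1000 documents in the corpus
score = bm25_term_score(tf=3, doc_len=100, avg_doc_len=100.0, n_docs=1000, df=5)
print(round(score, 3))
```

A full BM25 score is just the sum of this quantity over all query terms, which is why the method is fast and fully interpretable: each term's contribution is visible.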
Learning to Rank (LTR)
Learning to Rank (LTR) applies machine learning to combine multiple relevance signals into a single ranking function. Rather than hand-tuning combination weights, you train a model to predict relevance scores.
A typical LTR pipeline collects training data where query-document pairs are labeled with relevance scores (often through explicit judgments or implicit feedback like clicks). Features are extracted from each query-document pair—including BM25 scores, vector similarity scores, document metadata, and query-document overlap metrics. A ranking model (often a gradient boosted decision tree or neural network) is trained to predict relevance from these features.
At inference time, features are computed for new query-document pairs, and the model outputs a combined relevance score.
LTR is particularly powerful when you have significant training data and want to optimize for specific metrics. However, it requires more setup than simple score combination.
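To illustrate the pipeline end to end without a real judgment set, here is a toy pointwise LTR sketch: a linear model trained by plain gradient descent on hand-made features and relevance labels. All numbers and feature names are illustrative assumptions; production systems typically use gradient-boosted trees (e.g. LambdaMART-style models) trained on far more data:

```python
def train_ltr(features, labels, lr=0.01, epochs=2000):
    """Learn linear weights mapping feature vectors to relevance scores (least squares via SGD)."""
    w = [0.0] * len(features[0])
    for _ in range(epochs):
        for x, y in zip(features, labels):
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

# Features per query-document pair: [bm25_score (scaled), vector_similarity, term_overlap (scaled)]
train_X = [[1.2, 0.82, 0.8], [0.2, 0.41, 0.2], [0.9, 0.77, 0.6], [0.1, 0.22, 0.0]]
train_y = [2.0, 0.0, 2.0, 0.0]  # graded relevance judgments
w = train_ltr(train_X, train_y)

# Inference: score candidate documents for a new query and sort by predicted relevance
candidates = [[1.0, 0.80, 0.7], [0.15, 0.30, 0.1]]
scores = [sum(wi * xi for wi, xi in zip(w, c)) for c in candidates]
print(scores[0] > scores[1])  # the candidate with stronger signals ranks higher
```

The key idea survives the simplification: the model, not a hand-tuned formula, decides how BM25 and vector-similarity signals combine.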
Reciprocal Rank Fusion
Reciprocal Rank Fusion (RRF) is a simple yet effective technique for combining ranked lists from different retrieval strategies. Rather than combining raw scores (which aren’t directly comparable between systems), RRF combines rankings.
The RRF score for a document is computed as:
RRF_score(d) = sum over strategies s of 1 / (k + rank_s(d))
Where rank is the document’s 1-based position in each strategy’s ranked list, and k is a smoothing constant (typically 60) that prevents top-ranked documents from dominating the fused score.
RRF is appealing because it requires no training, handles different score ranges automatically, and often performs nearly as well as more complex methods. It’s an excellent starting point for hybrid search.
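The computation fits in a few lines. This sketch fuses two ranked lists of document IDs (the IDs are illustrative) using 1-based ranks:

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it returns
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_c", "doc_a", "doc_d"]
print(rrf_fuse([vector_hits, keyword_hits]))  # → ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Note that doc_a, ranked highly by both lists, beats doc_c even though doc_c topped one list: agreement between strategies is rewarded.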
Implementing Hybrid Search RAG
Architecture Overview
A production hybrid search RAG system consists of several components working together:
The document processing pipeline ingests documents, splits them into chunks appropriate for retrieval, and prepares them for indexing. Chunk size significantly impacts retrieval quality—too small loses context, too large dilutes relevance.
The indexing pipeline builds multiple indexes in parallel: a vector index for semantic search, a BM25 index for keyword search, and potentially a graph index for relationship-based retrieval. Each index must be kept synchronized with document updates.
The retrieval pipeline receives queries, executes multiple retrieval strategies, combines results, and returns a unified ranked list. This is where hybrid logic lives.
The generation pipeline takes retrieved context and generates answers using an LLM. It handles prompt construction, answer generation, and potentially citation generation.
Let’s build a complete implementation.
Setting Up the Environment
# Requirements
# pip install qdrant-client rank-bm25 langchain langchain-openai sentence-transformers python-dotenv
import os
from dotenv import load_dotenv
load_dotenv()
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY", "")
Document Processing
The first step is preparing documents for indexing:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from typing import List, Dict, Any
import hashlib
class DocumentProcessor:
"""Processes documents for hybrid indexing."""
def __init__(self, chunk_size: int = 500, chunk_overlap: int = 50):
self.text_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
separators=["\n\n", "\n", ". ", " ", ""]
)
def process(self, text: str, metadata: Dict[str, Any] = None) -> List[Document]:
"""Split text into chunks and create documents."""
chunks = self.text_splitter.split_text(text)
documents = []
for i, chunk in enumerate(chunks):
chunk_metadata = (metadata or {}).copy()
chunk_metadata["chunk_id"] = i
chunk_metadata["chunk_hash"] = hashlib.md5(chunk.encode()).hexdigest()
documents.append(Document(
page_content=chunk,
metadata=chunk_metadata
))
return documents
def process_batch(self, documents: List[Dict]) -> List[Document]:
"""Process multiple documents."""
all_chunks = []
for doc in documents:
chunks = self.process(doc.get("text", ""), doc.get("metadata", {}))
all_chunks.extend(chunks)
return all_chunks
Vector Index Implementation
Let’s create a vector search index using Qdrant (though similar patterns apply to Pinecone, Milvus, or Weaviate):
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from sentence_transformers import SentenceTransformer
from typing import List, Optional
import numpy as np
class VectorIndex:
"""Vector search index using Qdrant."""
def __init__(self, collection_name: str = "documents", embedding_model: str = "BAAI/bge-small-en"):
self.collection_name = collection_name
self.client = QdrantClient(":memory:") # Use ":memory:" for demo, URL for production
self.embedding_model = SentenceTransformer(embedding_model)
self.dimension = self.embedding_model.get_sentence_embedding_dimension()
self._create_collection()
def _create_collection(self):
"""Create collection with appropriate settings."""
collections = self.client.get_collections().collections
if self.collection_name not in [c.name for c in collections]:
self.client.create_collection(
collection_name=self.collection_name,
vectors_config=VectorParams(
size=self.dimension,
distance=Distance.COSINE
)
)
def index_documents(self, documents: List[Document], batch_size: int = 100):
"""Index documents with their embeddings."""
points = []
for i, doc in enumerate(documents):
embedding = self.embedding_model.encode(doc.page_content)
point = PointStruct(
id=i,
vector=embedding.tolist(),
payload={
"text": doc.page_content,
"metadata": doc.metadata
}
)
points.append(point)
if len(points) >= batch_size:
self.client.upsert(
collection_name=self.collection_name,
points=points
)
points = []
if points:
self.client.upsert(
collection_name=self.collection_name,
points=points
)
def search(self, query: str, top_k: int = 10) -> List[Dict]:
"""Search for similar documents."""
query_embedding = self.embedding_model.encode(query)
results = self.client.search(
collection_name=self.collection_name,
query_vector=query_embedding.tolist(),
limit=top_k
)
return [
{
"id": r.id,
"text": r.payload["text"],
"metadata": r.payload["metadata"],
"score": r.score
}
for r in results
]
BM25 Index Implementation
Now let’s implement the BM25 index:
import rank_bm25
import re
from typing import List, Dict
from collections import Counter
class BM25Index:
"""BM25-based keyword search index."""
def __init__(self, k1: float = 1.5, b: float = 0.75):
self.k1 = k1
self.b = b
self.corpus = []
self.corpus_tokenized = []
self.bm25 = None
self.doc_metadata = []
def _tokenize(self, text: str) -> List[str]:
"""Simple tokenization: lowercase and split on non-alphanumeric."""
text = text.lower()
tokens = re.findall(r'\b\w+\b', text)
return tokens
def index_documents(self, documents: List[Document]):
"""Index documents for BM25 search."""
self.corpus = [doc.page_content for doc in documents]
self.corpus_tokenized = [self._tokenize(doc.page_content) for doc in documents]
self.doc_metadata = [doc.metadata for doc in documents]
self.bm25 = rank_bm25.BM25Okapi(self.corpus_tokenized)
def search(self, query: str, top_k: int = 10) -> List[Dict]:
"""Search for documents matching query terms."""
query_tokens = self._tokenize(query)
scores = self.bm25.get_scores(query_tokens)
# Get top-k document indices
top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
results = []
for idx in top_indices:
if scores[idx] > 0:
results.append({
"id": idx,
"text": self.corpus[idx],
"metadata": self.doc_metadata[idx],
"score": scores[idx]
})
return results
def update(self, documents: List[Document]):
"""Update index with new documents."""
self.index_documents(documents)
Hybrid Retrieval Implementation
Now let’s combine both approaches:
from typing import List, Dict, Tuple
import numpy as np
class HybridRetriever:
"""Combines vector and keyword search with flexible weighting."""
def __init__(
self,
vector_index: VectorIndex,
bm25_index: BM25Index,
vector_weight: float = 0.5,
rrf_k: int = 60
):
self.vector_index = vector_index
self.bm25_index = bm25_index
self.vector_weight = vector_weight
self.rrf_k = rrf_k
def search(
self,
query: str,
top_k: int = 10,
alpha: float = None
) -> List[Dict]:
"""Perform hybrid search combining vector and keyword results."""
# Determine alpha (weight) if not provided
if alpha is None:
alpha = self._determine_weight(query)
# Get results from each method
vector_results = self.vector_index.search(query, top_k * 2)
bm25_results = self.bm25_index.search(query, top_k * 2)
# Combine using weighted RRF
combined = self._weighted_rrf(
vector_results,
bm25_results,
alpha
)
return combined[:top_k]
def _determine_weight(self, query: str) -> float:
"""
Determine search strategy weight based on query characteristics.
Returns alpha where:
- alpha = 1.0 means pure keyword search
- alpha = 0.0 means pure vector search
- alpha = 0.5 means equal weighting
"""
query_lower = query.lower()
# Count indicators of keyword vs semantic search intent
keyword_indicators = [
"exact", "specific", "named", "called", "term",
"definition", "what is the", "how to", "syntax"
]
semantic_indicators = [
"related to", "similar to", "like", "kinds of",
"examples of", "explain", "describe", "about"
]
keyword_score = sum(1 for ind in keyword_indicators if ind in query_lower)
semantic_score = sum(1 for ind in semantic_indicators if ind in query_lower)
if keyword_score > semantic_score:
return 0.7 # Favor keyword search
elif semantic_score > keyword_score:
return 0.3 # Favor vector search
else:
return 0.5 # Equal weighting
def _weighted_rrf(
self,
vector_results: List[Dict],
bm25_results: List[Dict],
alpha: float
) -> List[Dict]:
"""Combine results using weighted reciprocal rank fusion."""
# Create rank maps (1-based, matching the RRF formula)
vector_ranks = {r["id"]: rank for rank, r in enumerate(vector_results, 1)}
bm25_ranks = {r["id"]: rank for rank, r in enumerate(bm25_results, 1)}
# Get all unique document IDs
all_ids = set(vector_ranks.keys()) | set(bm25_ranks.keys())
# Calculate RRF scores
scores = {}
for doc_id in all_ids:
vector_rank = vector_ranks.get(doc_id)
bm25_rank = bm25_ranks.get(doc_id)
vector_score = 1.0 / (vector_rank + self.rrf_k) if vector_rank is not None else 0.0
bm25_score = 1.0 / (bm25_rank + self.rrf_k) if bm25_rank is not None else 0.0
# Weighted combination: alpha weights keyword, (1 - alpha) weights vector
rrf_score = (1 - alpha) * vector_score + alpha * bm25_score
scores[doc_id] = rrf_score
# Sort by score
sorted_ids = sorted(scores.items(), key=lambda x: x[1], reverse=True)
# Build result list with combined scores
id_to_doc = {r["id"]: r for r in vector_results + bm25_results}
results = []
for doc_id, score in sorted_ids:
if doc_id in id_to_doc:
result = id_to_doc[doc_id].copy()
result["hybrid_score"] = score
result["alpha"] = alpha
results.append(result)
return results
def search_with_fallback(
self,
query: str,
primary_method: str = "hybrid",
top_k: int = 10
) -> List[Dict]:
"""Search with fallback strategy."""
if primary_method == "hybrid":
return self.search(query, top_k)
elif primary_method == "vector":
results = self.vector_index.search(query, top_k)
if len(results) >= top_k or self._has_high_confidence(results):
return results
# Fallback to hybrid if vector results insufficient
return self.search(query, top_k, alpha=0.3)
elif primary_method == "keyword":
results = self.bm25_index.search(query, top_k)
if len(results) >= top_k:
return results
# Fallback to hybrid if keyword results insufficient
return self.search(query, top_k, alpha=0.7)
return []
def _has_high_confidence(self, results: List[Dict], threshold: float = 0.8) -> bool:
"""Check if vector results have high confidence."""
if not results:
return False
return results[0].get("score", 0) >= threshold
Complete RAG Integration
Now let’s integrate with an LLM for complete RAG:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
class HybridRAG:
"""Complete RAG system with hybrid search."""
def __init__(
self,
retriever: HybridRetriever,
llm: ChatOpenAI = None
):
self.retriever = retriever
self.llm = llm or ChatOpenAI(model="gpt-4o")
def query(self, query: str, top_k: int = 5, return_context: bool = False) -> Dict:
"""Query the RAG system."""
# Retrieve relevant documents
results = self.retriever.search(query, top_k=top_k)
if not results:
return {
"answer": "I couldn't find relevant information to answer your question.",
"sources": [],
"context": []
}
# Build context from retrieved documents
context = self._build_context(results)
# Generate answer
answer = self._generate(query, context)
response = {
"answer": answer,
"sources": [
{
"text": r["text"][:200] + "..." if len(r["text"]) > 200 else r["text"],
"score": r.get("hybrid_score", r.get("score", 0)),
"method": "vector" if r.get("alpha", 0.5) < 0.5 else "keyword" if r.get("alpha", 0.5) > 0.5 else "hybrid"
}
for r in results
]
}
if return_context:
response["context"] = context
return response
def _build_context(self, results: List[Dict]) -> str:
"""Build context string from retrieved documents."""
context_parts = []
for i, result in enumerate(results, 1):
context_parts.append(
f"[Document {i}]\n{result['text']}\n"
)
return "\n\n".join(context_parts)
def _generate(self, query: str, context: str) -> str:
"""Generate answer using LLM."""
prompt = ChatPromptTemplate.from_template("""You are a helpful AI assistant. Use the provided context to answer the question.
Context:
{context}
Question: {question}
Instructions:
- Use only information from the provided context
- If the context doesn't contain enough information, say so
- Be concise but thorough
- Cite relevant information when possible
""")
chain = prompt | self.llm
response = chain.invoke({
"context": context,
"question": query
})
return response.content
Complete Example
Here’s how to put everything together:
# Initialize components
processor = DocumentProcessor(chunk_size=400, chunk_overlap=50)
vector_index = VectorIndex(collection_name="demo")
bm25_index = BM25Index(k1=1.5, b=0.75)
retriever = HybridRetriever(vector_index, bm25_index, vector_weight=0.5)
rag = HybridRAG(retriever)
# Sample documents
documents = [
{
"text": """
Artificial Intelligence (AI) is a broad field of computer science focused on creating
intelligent machines that can think and learn. AI includes machine learning, where
systems learn from data, and deep learning, which uses neural networks with many layers.
""",
"metadata": {"source": "ai_intro.txt", "topic": "AI"}
},
{
"text": """
Large Language Models (LLMs) are AI models trained on vast amounts of text data.
Examples include GPT-4, Claude, and Gemini. These models can generate human-like text,
answer questions, and perform various natural language tasks.
""",
"metadata": {"source": "llms.txt", "topic": "LLM"}
},
{
"text": """
Retrieval-Augmented Generation (RAG) combines language models with external knowledge
bases. When a query arrives, RAG first retrieves relevant documents, then uses them
to enhance the model's generation. This improves accuracy and allows access to
up-to-date information.
""",
"metadata": {"source": "rag.txt", "topic": "RAG"}
},
{
"text": """
Vector embeddings represent text as numerical vectors that capture semantic meaning.
Documents with similar meanings have vectors close together in the embedding space.
This enables semantic similarity search beyond keyword matching.
""",
"metadata": {"source": "embeddings.txt", "topic": "Embeddings"}
},
{
"text": """
BM25 is a ranking function used in information retrieval. It ranks documents based
on term frequency and inverse document frequency. Unlike semantic search, BM25
excels at finding documents with exact term matches.
""",
"metadata": {"source": "bm25.txt", "topic": "BM25"}
}
]
# Process and index documents
chunks = processor.process_batch(documents)
vector_index.index_documents(chunks)
bm25_index.index_documents(chunks)
# Query the system
query = "How do language models work?"
result = rag.query(query, top_k=3)
print(f"Query: {query}\n")
print(f"Answer: {result['answer']}\n")
print("Sources:")
for source in result['sources']:
print(f" - {source['method']}: {source['text'][:80]}...")
Advanced Patterns
Query Understanding and Routing
Production systems should analyze queries to route them appropriately:
class QueryRouter:
"""Routes queries to appropriate retrieval strategies."""
def __init__(self, hybrid_retriever: HybridRetriever):
self.retriever = hybrid_retriever
def analyze_query(self, query: str) -> Dict:
"""Analyze query characteristics to determine best strategy."""
query_lower = query.lower()
analysis = {
"query": query,
"length": len(query.split()),
"has_numbers": any(c.isdigit() for c in query),
"has_quotes": '"' in query or "'" in query,
"is_question": query_lower.startswith(("what", "how", "why", "when", "where", "which")),
"has_specific_terms": self._has_specific_terms(query)
}
# Determine recommended alpha
if analysis["has_quotes"] or analysis["has_numbers"]:
analysis["recommended_alpha"] = 0.8 # Favor keyword
elif analysis["has_specific_terms"]:
analysis["recommended_alpha"] = 0.6
elif analysis["is_question"] and not analysis["has_specific_terms"]:
analysis["recommended_alpha"] = 0.3 # Favor semantic
else:
analysis["recommended_alpha"] = 0.5 # Balanced
return analysis
def _has_specific_terms(self, query: str) -> bool:
"""Check if query contains specific technical terms."""
technical_patterns = [
r'\b(API|SDK|Library|Function|Method|Class)\b',
r'\b[A-Z]{2,}\b', # Acronyms
r'\b\d+\.\d+\b', # Version numbers
r'\b(error|exception|bug|issue)\b'
]
import re
return any(re.search(pattern, query) for pattern in technical_patterns)
def route_query(self, query: str, top_k: int = 10) -> Dict:
"""Route query with appropriate strategy."""
analysis = self.analyze_query(query)
results = self.retriever.search(
query,
top_k=top_k,
alpha=analysis["recommended_alpha"]
)
return {
"results": results,
"analysis": analysis
}
Multi-Stage Retrieval
Advanced systems often use multiple retrieval stages:
class MultiStageRetriever:
"""Implements retrieval with multiple stages."""
def __init__(self, retriever: HybridRetriever, reranker=None):
self.retriever = retriever
self.reranker = reranker
def retrieve(
self,
query: str,
initial_k: int = 20,
final_k: int = 5,
stages: int = 2
) -> List[Dict]:
"""Perform multi-stage retrieval."""
# Stage 1: Initial retrieval
initial_results = self.retriever.search(query, top_k=initial_k)
if stages == 1:
return initial_results[:final_k]
# Stage 2: Reranking (if reranker available)
if self.reranker:
reranked = self.reranker.rerank(query, initial_results, top_k=final_k)
return reranked
# Fallback: Use reciprocal rank fusion with more results
more_results = self.retriever.search(query, top_k=initial_k * 2)
return more_results[:final_k]
Adaptive Hybrid Search
For complex queries, decompose and solve subtasks:
class AdaptiveHybridSearch:
"""Adapts search strategy based on query complexity."""
def __init__(self, retriever: HybridRetriever):
self.retriever = retriever
def decompose_and_search(self, query: str) -> List[Dict]:
"""Decompose complex queries into simpler subqueries."""
# Simple decomposition: extract key concepts
stop_words = {"the", "a", "an", "is", "are", "was", "were", "of", "in", "on", "at", "to", "for"}
concepts = [w for w in query.lower().split() if w not in stop_words and len(w) > 2]
if len(concepts) <= 2:
# Simple query - use standard search
return self.retriever.search(query, top_k=10)
# Complex query - search each concept and combine
sub_results = []
for concept in concepts:
sub_result = self.retriever.search(
concept,
top_k=5,
alpha=0.5 # Balanced
)
sub_results.extend(sub_result)
# Deduplicate and re-rank
return self._deduplicate_and_rerank(sub_results, query)
def _deduplicate_and_rerank(self, results: List[Dict], query: str) -> List[Dict]:
"""Deduplicate results and rerank by query relevance."""
# Deduplicate by content
seen_texts = set()
unique_results = []
for result in results:
text_hash = hash(result["text"])
if text_hash not in seen_texts:
seen_texts.add(text_hash)
unique_results.append(result)
# Simple rerank: boost results matching more query terms
query_terms = set(query.lower().split())
for result in unique_results:
text_terms = set(result["text"].lower().split())
overlap = len(query_terms & text_terms)
result["query_overlap"] = overlap
# Sort by combination of scores
reranked = sorted(
unique_results,
key=lambda r: r.get("hybrid_score", 0) * (1 + 0.1 * r.get("query_overlap", 0)),
reverse=True
)
return reranked[:10]
Performance Optimization
Index Optimization
Optimize your indexes for better performance:
class IndexOptimizer:
"""Optimizes hybrid search indexes."""
def __init__(self, vector_index: VectorIndex, bm25_index: BM25Index):
self.vector_index = vector_index
self.bm25_index = bm25_index
def optimize_vector_index(self):
"""Optimize vector index for production."""
# Most vector databases support quantization
# This reduces memory usage and improves speed with minimal accuracy loss
pass # Database-specific implementation
def optimize_bm25_index(self):
"""Optimize BM25 parameters."""
# BM25 parameters can be tuned for specific corpora
# Use cross-validation to find optimal k1 and b
pass
def monitor_performance(self, queries: List[str]) -> Dict:
"""Monitor retrieval performance."""
import time
results = {
"vector_latency": [],
"bm25_latency": [],
"hybrid_latency": [],
"result_overlap": []
}
for query in queries:
# Time each method
start = time.time()
v_results = self.vector_index.search(query, top_k=10)
results["vector_latency"].append(time.time() - start)
start = time.time()
b_results = self.bm25_index.search(query, top_k=10)
results["bm25_latency"].append(time.time() - start)
# Calculate overlap
v_ids = set(r["id"] for r in v_results)
b_ids = set(r["id"] for r in b_results)
overlap = len(v_ids & b_ids) / max(len(v_ids | b_ids), 1)
results["result_overlap"].append(overlap)
return {
"avg_vector_latency": sum(results["vector_latency"]) / len(results["vector_latency"]),
"avg_bm25_latency": sum(results["bm25_latency"]) / len(results["bm25_latency"]),
"avg_overlap": sum(results["result_overlap"]) / len(results["result_overlap"])
}
Caching Strategies
Implement caching for frequently asked queries:
import hashlib
class QueryCache:
"""Caches retrieval results for common queries."""
def __init__(self, retriever: HybridRetriever, max_size: int = 1000):
self.retriever = retriever
self.max_size = max_size
self.cache = {}
self.access_order = []
def _get_cache_key(self, query: str, top_k: int, alpha: float) -> str:
"""Generate cache key for query."""
key_string = f"{query}|{top_k}|{alpha}"
return hashlib.md5(key_string.encode()).hexdigest()
def search(self, query: str, top_k: int = 10, alpha: float = None) -> List[Dict]:
"""Search with caching."""
cache_key = self._get_cache_key(query, top_k, alpha or 0.5)
if cache_key in self.cache:
# Update access order
self.access_order.remove(cache_key)
self.access_order.append(cache_key)
return self.cache[cache_key]
# Cache miss - retrieve results
results = self.retriever.search(query, top_k, alpha)
# Add to cache
if len(self.cache) >= self.max_size:
# Evict least recently used
oldest = self.access_order.pop(0)
del self.cache[oldest]
self.cache[cache_key] = results
self.access_order.append(cache_key)
return results
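To see the eviction behavior in isolation, here is a minimal sketch of the same LRU bookkeeping against a stub retriever. `StubRetriever` and `TinyLRUCache` are hypothetical stand-ins that only count how often the underlying retriever is actually called:

```python
import hashlib

class StubRetriever:
    """Hypothetical retriever that counts real (non-cached) calls."""
    def __init__(self):
        self.calls = 0
    def search(self, query, top_k, alpha):
        self.calls += 1
        return [{"id": f"{query}-{i}"} for i in range(top_k)]

class TinyLRUCache:
    """Minimal cache mirroring QueryCache's LRU bookkeeping."""
    def __init__(self, retriever, max_size=2):
        self.retriever = retriever
        self.max_size = max_size
        self.cache = {}
        self.access_order = []

    def search(self, query, top_k=3, alpha=0.5):
        key = hashlib.md5(f"{query}|{top_k}|{alpha}".encode()).hexdigest()
        if key in self.cache:
            self.access_order.remove(key)   # refresh recency on a hit
            self.access_order.append(key)
            return self.cache[key]
        results = self.retriever.search(query, top_k, alpha)
        if len(self.cache) >= self.max_size:
            del self.cache[self.access_order.pop(0)]  # evict least recently used
        self.cache[key] = results
        self.access_order.append(key)
        return results

retriever = StubRetriever()
cache = TinyLRUCache(retriever, max_size=2)
cache.search("a"); cache.search("a")   # second call is a cache hit
cache.search("b"); cache.search("c")   # "a" is evicted (least recently used)
cache.search("a")                      # miss again
print(retriever.calls)  # 4 underlying retrievals for 5 searches
```

In production you would likely also add a TTL so stale results expire after index updates.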
Evaluation
Measuring Retrieval Quality
Evaluate your hybrid system:
import math

class RetrievalEvaluator:
    """Evaluates hybrid retrieval quality."""

    def __init__(self, retriever: HybridRetriever):
        self.retriever = retriever

    def evaluate(
        self,
        queries: List[Dict],
        metrics: List[str] = None
    ) -> Dict:
        """Evaluate retrieval on a test set."""
        metrics = metrics or ["precision", "recall", "mrr", "ndcg"]
        results = {m: [] for m in metrics}

        for query_data in queries:
            query = query_data["query"]
            relevant_ids = set(query_data["relevant_ids"])
            retrieved = self.retriever.search(query, top_k=20)
            retrieved_ids = set(r["id"] for r in retrieved)

            # Calculate metrics
            if "precision" in metrics:
                precision = len(retrieved_ids & relevant_ids) / len(retrieved_ids) if retrieved_ids else 0
                results["precision"].append(precision)
            if "recall" in metrics:
                recall = len(retrieved_ids & relevant_ids) / len(relevant_ids) if relevant_ids else 0
                results["recall"].append(recall)
            if "mrr" in metrics:
                results["mrr"].append(self._mrr(retrieved, relevant_ids))
            if "ndcg" in metrics:
                results["ndcg"].append(self._ndcg(retrieved, relevant_ids))

        # Average each metric across queries
        return {m: sum(vals) / len(vals) for m, vals in results.items()}

    def _mrr(self, retrieved: List[Dict], relevant: set) -> float:
        """Reciprocal rank of the first relevant result."""
        for rank, doc in enumerate(retrieved, 1):
            if doc["id"] in relevant:
                return 1.0 / rank
        return 0.0

    def _ndcg(self, retrieved: List[Dict], relevant: set, k: int = 10) -> float:
        """Normalized Discounted Cumulative Gain with binary relevance."""
        dcg = 0.0
        for rank, doc in enumerate(retrieved[:k], 1):
            if doc["id"] in relevant:
                dcg += 1.0 / math.log2(rank + 1)  # logarithmic discount
        # Ideal DCG: all relevant documents ranked first
        idcg = sum(1.0 / math.log2(r + 1) for r in range(1, min(len(relevant), k) + 1))
        return dcg / idcg if idcg > 0 else 0.0
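A quick hand-check on a toy ranking makes the definitions concrete. The IDs below are hypothetical, and the NDCG uses the standard log2 discount with binary relevance:

```python
import math

retrieved = ["d2", "d9", "d1", "d7", "d3"]   # ranked list from a retriever
relevant = {"d1", "d3", "d5"}                # ground-truth relevant IDs

hits = [d for d in retrieved if d in relevant]
precision = len(hits) / len(retrieved)       # 2 of 5 retrieved are relevant -> 0.4
recall = len(hits) / len(relevant)           # 2 of 3 relevant were found -> 2/3

# Reciprocal rank of the first relevant document (d1 at rank 3)
mrr = next((1 / r for r, d in enumerate(retrieved, 1) if d in relevant), 0.0)

# DCG over the actual ranking vs. the ideal ranking (all relevant docs first)
dcg = sum(1 / math.log2(r + 1) for r, d in enumerate(retrieved, 1) if d in relevant)
idcg = sum(1 / math.log2(r + 1) for r in range(1, min(len(relevant), len(retrieved)) + 1))
ndcg = dcg / idcg

print(round(precision, 2), round(recall, 2), round(mrr, 2), round(ndcg, 2))
```

Note that precision and recall here are set-based and ignore order, while MRR and NDCG reward placing relevant documents early, which is why both kinds of metric are worth tracking.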
A/B Testing
Compare different strategies in production:
class ABTestFramework:
    """Framework for A/B testing retrieval strategies."""

    def __init__(self):
        self.experiments = {}

    def create_experiment(
        self,
        name: str,
        strategy_a: Dict,
        strategy_b: Dict
    ):
        """Create an A/B test experiment."""
        self.experiments[name] = {
            "strategy_a": strategy_a,
            "strategy_b": strategy_b,
            "results_a": [],
            "results_b": []
        }

    def record_result(
        self,
        experiment_name: str,
        strategy: str,
        query: str,
        retrieved: List[Dict],
        user_feedback: int = None
    ):
        """Record a single experiment result."""
        exp = self.experiments[experiment_name]
        key = "results_a" if strategy == "a" else "results_b"
        exp[key].append({
            "query": query,
            "retrieved": retrieved,
            "feedback": user_feedback
        })

    def analyze_experiment(self, experiment_name: str) -> Dict:
        """Analyze A/B test results."""
        exp = self.experiments[experiment_name]

        def calc_metrics(results):
            if not results:
                return {"count": 0, "avg_feedback": 0}
            feedbacks = [r["feedback"] for r in results if r["feedback"] is not None]
            return {
                "count": len(results),
                "avg_feedback": sum(feedbacks) / len(feedbacks) if feedbacks else 0
            }

        return {
            "strategy_a": calc_metrics(exp["results_a"]),
            "strategy_b": calc_metrics(exp["results_b"])
        }
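The averaging logic is easy to sanity-check on its own. This sketch feeds hypothetical click-feedback (1 = helpful, 0 = not, None = no feedback given) into the same `calc_metrics` computation:

```python
def calc_metrics(results):
    """Average user feedback, ignoring queries with no feedback."""
    if not results:
        return {"count": 0, "avg_feedback": 0}
    feedbacks = [r["feedback"] for r in results if r["feedback"] is not None]
    return {
        "count": len(results),
        "avg_feedback": sum(feedbacks) / len(feedbacks) if feedbacks else 0,
    }

# Hypothetical feedback logs for the two arms
results_a = [{"feedback": 1}, {"feedback": 0}, {"feedback": None}, {"feedback": 1}]
results_b = [{"feedback": 1}, {"feedback": 1}, {"feedback": 1}]

print(calc_metrics(results_a))  # count 4, avg over the 3 rated queries (2/3)
print(calc_metrics(results_b))  # count 3, avg 1.0
```

With sample sizes this small the difference means nothing; in a real experiment, run a significance test before declaring a winning strategy.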
External Resources
- Qdrant Vector Database - High-performance vector search with hybrid search support
- BM25 Algorithm Explanation - Elasticsearch’s BM25 documentation
- LangChain RAG Documentation - Official LangChain RAG concepts
- BGE Embeddings - Open-source embedding models
- Faiss Library - Facebook’s similarity search library
- Hybrid Search Research Paper - Academic paper on dense and sparse retrieval
Conclusion
Hybrid search represents a significant advancement in retrieval technology, combining the semantic understanding of vector search with the precision of keyword matching. By implementing the patterns in this guide, you can build RAG systems that significantly outperform single-strategy approaches.
Key takeaways include understanding when each search strategy excels—vector search for semantic queries, keyword search for exact matches—and using hybrid approaches to get the best of both worlds. The implementation patterns provided offer a foundation you can adapt to your specific requirements.
Start with the simple weighted RRF approach, measure your results, and iterate. As you understand your query patterns and corpus characteristics, you can add sophistication through query routing, multi-stage retrieval, and learned ranking. The investment in building a robust hybrid search system will pay dividends in retrieval quality and user satisfaction.
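As a reminder of that starting point, weighted Reciprocal Rank Fusion fits in a few lines. This is a minimal sketch; the weights and the `k=60` constant are common defaults, not requirements:

```python
def weighted_rrf(rankings, weights, k=60):
    """Fuse several ranked ID lists with weighted Reciprocal Rank Fusion."""
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, 1):
            # Each list contributes weight / (k + rank) for the docs it ranks
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_ranking = ["d1", "d2", "d3"]   # hypothetical semantic results
bm25_ranking = ["d3", "d4", "d1"]     # hypothetical keyword results

# d1 and d3 appear in both lists, so they rise to the top of the fusion
print(weighted_rrf([vector_ranking, bm25_ranking], weights=[0.6, 0.4]))
```

Because RRF operates on ranks rather than raw scores, it sidesteps the score-normalization problem of mixing cosine similarities with BM25 scores, which is why it makes a robust default.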
Remember that the best hybrid strategy depends on your specific use case, query patterns, and data characteristics. Continuously evaluate and optimize based on real-world performance, and your RAG system will deliver superior results.