Text Retrieval Fundamentals
Text retrieval, finding documents relevant to a user's information need in a large collection, is the foundation of search engines, question answering, and RAG pipelines. Every retrieval system faces the same core challenge: given a query, return the most relevant documents ranked by usefulness.
This article covers the essential concepts, algorithms, and implementations: from Boolean retrieval and TF-IDF to modern neural methods like Dense Passage Retrieval and ColBERT.
The Inverted Index
The central data structure in text retrieval is the inverted index. Instead of scanning every document for each query, the index maps each unique term to the list of documents containing it.
```python
from collections import defaultdict
import math

def build_inverted_index(docs):
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].append(doc_id)
    return index

docs = [
    "the quick brown fox",
    "the lazy dog",
    "the quick fox jumps",
]
index = build_inverted_index(docs)
print(dict(index))
# => {'the': [0, 1, 2], 'quick': [0, 2], 'brown': [0],
#     'fox': [0, 2], 'lazy': [1], 'dog': [1], 'jumps': [2]}
```
A production index stores not just document IDs but also term frequency, position, and field metadata to support phrase queries and scoring.
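As an illustrative sketch of that richer structure (the names here are invented, not taken from any particular engine), the postings can map each term to per-document position lists, which makes term frequency and the adjacency checks behind phrase queries cheap:

```python
from collections import defaultdict

def build_positional_index(docs):
    # term -> {doc_id: [positions]}; term frequency falls out as len(positions)
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index

def phrase_match(index, first, second):
    # documents where `second` occurs immediately after `first`
    return [doc_id for doc_id, positions in index[first].items()
            if any(p + 1 in index[second].get(doc_id, []) for p in positions)]

sample = ["the quick brown fox", "the quick fox jumps"]
pidx = build_positional_index(sample)
print(dict(pidx["quick"]))                 # => {0: [1], 1: [1]}
print(phrase_match(pidx, "quick", "fox"))  # => [1]  (adjacent only in the second doc)
```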
Tokenization and Stemming
Raw text must be normalized before indexing. Tokenization splits text into terms; stemming reduces words to their root form.
```python
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def stem_tokens(tokens):
    return [stemmer.stem(t) for t in tokens]

raw = "The cats are running swiftly"
tokens = tokenize(raw)
stems = stem_tokens(tokens)
print(stems)
# => ['the', 'cat', 'are', 'run', 'swiftli']
```
Common stemming algorithms include Porter, Snowball (Porter2), and Lancaster (the most aggressive). For better linguistic accuracy, many modern systems use lemmatization (e.g., spaCy, Stanford CoreNLP), which returns valid dictionary words.
Boolean Retrieval Model
The simplest retrieval model treats queries as Boolean expressions of terms connected by AND, OR, and NOT. Documents either match or they don't; there is no ranking.
```python
def boolean_search(query_tokens, index, operator="AND"):
    if not query_tokens:
        return set()
    if operator == "AND":
        result = set(index.get(query_tokens[0], []))
        for t in query_tokens[1:]:
            result &= set(index.get(t, []))
        return result
    elif operator == "OR":
        result = set()
        for t in query_tokens:
            result |= set(index.get(t, []))
        return result
    raise ValueError(f"unsupported operator: {operator}")

query = ["quick", "fox"]
print(boolean_search(query, index, "AND"))
# => {0, 2}
```
Boolean retrieval is precise but inflexible: a single missing term returns nothing. Most modern systems blend Boolean operators into ranked retrieval.
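The NOT case the snippet above omits can be covered by subtracting a term's postings from the full document set. A minimal sketch (rebuilding the toy index so it stands alone; the collection size is passed in because the index alone doesn't know it):

```python
from collections import defaultdict

def build_inverted_index(docs):
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].append(doc_id)
    return index

def boolean_not(query_tokens, index, num_docs):
    # documents that contain none of the query terms
    excluded = set()
    for t in query_tokens:
        excluded |= set(index.get(t, []))
    return set(range(num_docs)) - excluded

docs = ["the quick brown fox", "the lazy dog", "the quick fox jumps"]
index = build_inverted_index(docs)
print(boolean_not(["quick"], index, len(docs)))  # => {1}
```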
TF-IDF: Term Frequency–Inverse Document Frequency
TF-IDF scores a document against a query by weighing term frequency against how rare the term is across the collection.
```python
def compute_tfidf(docs):
    N = len(docs)
    tf = []
    for text in docs:
        term_counts = defaultdict(int)
        for t in text.lower().split():
            term_counts[t] += 1
        tf.append(term_counts)
    idf = {}
    for text in docs:
        for t in set(text.lower().split()):
            df = sum(1 for d in docs if t in d.lower().split())
            idf[t] = math.log(N / df)
    return tf, idf

def tfidf_score(query, doc_idx, tf, idf):
    score = 0.0
    for term in query.lower().split():
        if term in idf:
            score += tf[doc_idx].get(term, 0) * idf[term]
    return score

tf, idf = compute_tfidf(docs)
query = "quick fox"
for i in range(len(docs)):
    print(f"Doc {i}: {tfidf_score(query, i, tf, idf):.3f}")
# => Doc 0: 0.811, Doc 1: 0.000, Doc 2: 0.811
```
TF-IDF remains the baseline for many retrieval tasks. It is simple, interpretable, and effective on clean, well-structured text.
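A common refinement worth knowing is sublinear TF scaling, which replaces the raw count with 1 + log(tf) so repeated occurrences contribute diminishing weight. A quick numeric sketch:

```python
import math

def sublinear_tf(tf):
    # dampened term frequency: the 100th occurrence adds far less than the 1st
    return 1 + math.log(tf) if tf > 0 else 0.0

for count in [1, 2, 10, 100]:
    print(count, round(sublinear_tf(count), 3))
# => 1 1.0
#    2 1.693
#    10 3.303
#    100 5.605
```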
BM25: The Industry Standard
BM25 (Best Matching 25) improves on TF-IDF with saturation and document-length normalization. It powers Lucene, Elasticsearch, and Solr.
```python
def bm25_score(query, doc_idx, tf, doc_lengths, avg_dl, idf, k1=1.5, b=0.75):
    score = 0.0
    dl = doc_lengths[doc_idx]
    for term in query.lower().split():
        if term in idf:
            tf_td = tf[doc_idx].get(term, 0)
            numerator = tf_td * (k1 + 1)
            denominator = tf_td + k1 * (1 - b + b * (dl / avg_dl))
            score += idf[term] * numerator / denominator
    return score

doc_lengths = [len(d.lower().split()) for d in docs]
avg_dl = sum(doc_lengths) / len(doc_lengths)

def bm25_idf(term):
    df = sum(1 for d in docs if term in d.lower().split())
    return math.log((len(docs) - df + 0.5) / (df + 0.5) + 1)

idf_bm25 = {t: bm25_idf(t) for t in set(" ".join(docs).split())}
for i in range(len(docs)):
    print(f"Doc {i}: {bm25_score(query, i, tf, doc_lengths, avg_dl, idf_bm25):.3f}")
# => Doc 0: 0.903, Doc 1: 0.000, Doc 2: 0.903
```
BM25’s key insight: term frequency saturates (repeating “the” 100 times doesn’t help), and shorter documents are preferred over long ones when they match equally.
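That saturation is easy to see by evaluating the BM25 term weight (before the IDF multiplier) at increasing term frequencies, holding the document-length ratio at 1:

```python
def bm25_term_weight(tf, dl_ratio=1.0, k1=1.5, b=0.75):
    # single-term contribution to the BM25 score, before multiplying by IDF
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl_ratio))

for tf in [1, 2, 5, 10, 100]:
    print(tf, round(bm25_term_weight(tf), 3))
# => 1 1.0
#    2 1.429
#    5 1.923
#    10 2.174
#    100 2.463   (the weight can never exceed k1 + 1 = 2.5)
```

Unlike raw TF-IDF, the contribution is bounded: going from 1 to 2 occurrences matters far more than going from 10 to 100.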
Vector Space Model and Cosine Similarity
Documents and queries are represented as vectors in a high-dimensional term space. Relevance is the cosine of the angle between them.
```python
import numpy as np

def build_vector(doc_idx, tf, idf, vocab):
    vec = np.zeros(len(vocab))
    for j, term in enumerate(vocab):
        vec[j] = tf[doc_idx].get(term, 0) * idf.get(term, 0)
    return vec

def cosine_similarity(vec_a, vec_b):
    dot = np.dot(vec_a, vec_b)
    norm = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    return dot / norm if norm else 0.0

vocab = sorted(idf.keys())
query_tf = [{t: 1 for t in query.lower().split()}]  # treat the query as a one-doc collection
query_vec = build_vector(0, query_tf, idf, vocab)
for i in range(len(docs)):
    doc_vec = build_vector(i, tf, idf, vocab)
    print(f"Doc {i} cosine: {cosine_similarity(query_vec, doc_vec):.3f}")
# => Doc 0: 0.463, Doc 1: 0.000, Doc 2: 0.463
```
Cosine similarity normalizes for document length: two documents with the same content proportions score equally regardless of size.
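A quick sanity check of that claim, using raw term-count vectors: doubling a document's vector (equivalent to concatenating the document with itself) leaves the cosine untouched:

```python
import numpy as np

def cosine(a, b):
    n = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / n) if n else 0.0

query = np.array([1.0, 1.0, 0.0])  # toy term-count vectors over a 3-term vocab
doc = np.array([2.0, 1.0, 1.0])
print(round(cosine(query, doc), 3))      # => 0.866
print(round(cosine(query, 2 * doc), 3))  # document repeated twice: => 0.866 again
```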
Building a Search Engine with Whoosh
Whoosh is a pure-Python search library that implements BM25, an inverted index, and a query parser out of the box.
```python
import os, shutil
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT
from whoosh.qparser import QueryParser

schema = Schema(title=TEXT(stored=True), content=TEXT)
index_dir = "indexdir"
if os.path.exists(index_dir):
    shutil.rmtree(index_dir)
os.mkdir(index_dir)
ix = create_in(index_dir, schema)

writer = ix.writer()
writer.add_document(title="Doc 1", content="the quick brown fox")
writer.add_document(title="Doc 2", content="the lazy dog")
writer.add_document(title="Doc 3", content="the quick fox jumps")
writer.commit()

with ix.searcher() as searcher:
    parser = QueryParser("content", ix.schema)
    query_obj = parser.parse("quick fox")
    results = searcher.search(query_obj, limit=3)
    for r in results:
        print(f"{r['title']} (score: {r.score:.3f})")
# => Doc 1 (score: 0.242), Doc 3 (score: 0.242)
```
Whoosh is ideal for prototyping and small-scale applications. For production, Elasticsearch provides the same BM25-based ranking at web scale.
Semantic Search with Embeddings
Lexical methods fail on vocabulary mismatch: "automobile" doesn't match "car". Dense embeddings solve this by mapping queries and documents into a shared semantic space.
```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = [
    "A cat chases mice",
    "Dogs are loyal pets",
    "Automobiles require fuel",
]
doc_embeddings = model.encode(docs)
query = "car engine maintenance"
query_embedding = model.encode(query)

# cosine similarity between the query and every document
scores = np.dot(doc_embeddings, query_embedding) / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)
for doc, score in zip(docs, scores):
    print(f"{score:.3f} {doc}")
# => 0.136 A cat chases mice
# => 0.087 Dogs are loyal pets
# => 0.341 Automobiles require fuel
```
The embedding model captures semantic relatedness: "car" and "automobile" are close in vector space even though they share no tokens.
Neural Retrieval: DPR and ColBERT
Dense Passage Retrieval (DPR) uses dual encoders: one BERT encoder for queries and another for passages. Documents are pre-encoded, and retrieval is a nearest-neighbor search with FAISS.
ColBERT uses late interaction: each query token attends to each document token independently, producing a more nuanced relevance score at the cost of more computation.
```python
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

# DPR-style dual encoder (simplified: full DPR pairs this question encoder
# with a separate context encoder, facebook/dpr-ctx_encoder-multiset-base)
tokenizer = AutoTokenizer.from_pretrained("facebook/dpr-question_encoder-multiset-base")
model = AutoModel.from_pretrained("facebook/dpr-question_encoder-multiset-base")

def encode(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.pooler_output.numpy()

query_vec = encode("car repair costs")
passages = ["automobile maintenance expenses", "dog walking tips"]
passage_vecs = [encode(p) for p in passages]
for p, v in zip(passages, passage_vecs):
    sim = np.dot(query_vec, v.T) / (np.linalg.norm(query_vec) * np.linalg.norm(v))
    print(f"{sim[0][0]:.3f} {p[:40]}")
# => 0.642 automobile maintenance expenses
# => 0.218 dog walking tips
```
Trained neural retrievers lead the MS MARCO leaderboards, though zero-shot benchmarks like BEIR show BM25 remains a surprisingly strong baseline out of domain. They require GPUs for training but are practical to serve with FAISS for fast approximate nearest-neighbor search.
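ColBERT's late-interaction scoring (MaxSim) can be sketched in plain NumPy, with random unit vectors standing in for real token embeddings: each query token keeps its best similarity over the document's tokens, and those maxima are summed.

```python
import numpy as np

def maxsim_score(query_emb, doc_emb):
    # query_emb: (q_tokens, dim), doc_emb: (d_tokens, dim); rows L2-normalized,
    # so dot products are cosines
    sim = query_emb @ doc_emb.T          # all query-token x doc-token similarities
    return float(sim.max(axis=1).sum())  # best doc token for each query token

rng = np.random.default_rng(0)

def unit_rows(n, dim=8):
    m = rng.normal(size=(n, dim))
    return m / np.linalg.norm(m, axis=1, keepdims=True)

q = unit_rows(3)                      # 3 query tokens
d1, d2 = unit_rows(5), unit_rows(7)   # two documents of different lengths
print(maxsim_score(q, d1), maxsim_score(q, d2))
print(maxsim_score(q, q))  # identical query/doc: each token matches itself, score ~ 3
```

Because every query token is scored independently, MaxSim rewards documents that cover all query terms, not just documents whose single pooled vector is nearby.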
Hybrid Search: Lexical + Semantic
The best retrieval systems combine lexical (BM25) and semantic (dense) signals. BM25 excels at exact term matching and handles rare terms well; dense retrieval captures synonyms and paraphrases.
```python
def hybrid_score(bm25, dense, alpha=0.3):
    # alpha near 1.0 favors exact match; near 0.0 favors semantics
    return alpha * bm25 + (1 - alpha) * dense
```
Elasticsearch supports hybrid search natively (kNN + BM25), and libraries like ranx make it easy to fuse results from separate retrievers.
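An alternative to weighted score blending is reciprocal rank fusion (RRF), which combines rank positions rather than raw scores and so needs no score normalization (k = 60 is the conventional constant; the rankings here are hypothetical):

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: one ranked list of doc ids per retriever
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d1", "d3", "d2"]    # hypothetical lexical results
dense_ranking = ["d2", "d1", "d4"]   # hypothetical dense results
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
# => ['d1', 'd2', 'd3', 'd4']
```

RRF is robust precisely because BM25 and dense scores live on incomparable scales; ranks are always comparable.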
Evaluation Metrics
Offline evaluation measures how well a system retrieves relevant documents.
```python
def precision_at_k(relevant, retrieved, k):
    return len(set(relevant) & set(retrieved[:k])) / k

def recall_at_k(relevant, retrieved, k):
    return len(set(relevant) & set(retrieved[:k])) / len(relevant)

def average_precision(relevant, retrieved):
    ap = 0.0
    hits = 0
    for i, doc in enumerate(retrieved):
        if doc in relevant:
            hits += 1
            ap += hits / (i + 1)
    return ap / len(relevant) if relevant else 0.0

def reciprocal_rank(relevant, retrieved):
    for i, doc in enumerate(retrieved):
        if doc in relevant:
            return 1.0 / (i + 1)
    return 0.0

relevant = {0, 2}
retrieved = [2, 3, 0, 1]
print(f"P@1: {precision_at_k(relevant, retrieved, 1)}")      # 1.0
print(f"P@3: {precision_at_k(relevant, retrieved, 3):.3f}")  # 0.667
print(f"R@3: {recall_at_k(relevant, retrieved, 3)}")         # 1.0
print(f"AP:  {average_precision(relevant, retrieved):.3f}")  # 0.833
print(f"RR:  {reciprocal_rank(relevant, retrieved):.3f}")    # 1.0
# MAP and MRR are AP and RR averaged over a set of queries
```
NDCG (Normalized Discounted Cumulative Gain) handles graded relevance (not just binary). It discounts gains from lower-ranked results and normalizes by the ideal ranking.
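A minimal NDCG@k, using the common 2^rel − 1 gain and log2 position discount:

```python
import math

def dcg_at_k(gains, k):
    # gains: graded relevance scores in rank order
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal else 0.0

# relevance grades of the retrieved list, in rank order (3 = perfect, 0 = irrelevant)
print(round(ndcg_at_k([3, 0, 2, 1], k=4), 3))  # => 0.951
print(ndcg_at_k([3, 2, 1, 0], k=4))            # ideal ordering => 1.0
```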
Relevance Feedback and Query Expansion
Rocchio relevance feedback adjusts the query vector toward relevant documents and away from non-relevant ones:
```python
def rocchio(query_vec, relevant_vecs, non_relevant_vecs, alpha=1.0, beta=0.75, gamma=0.15):
    # move the query toward the relevant centroid, away from the non-relevant one
    new_query = alpha * query_vec
    if len(relevant_vecs):
        new_query += beta * np.mean(relevant_vecs, axis=0)
    if len(non_relevant_vecs):
        new_query -= gamma * np.mean(non_relevant_vecs, axis=0)
    return new_query
```
Pseudo-relevance feedback assumes the top-k retrieved results are relevant and expands the query with their top terms. Query expansion using word embeddings or a thesaurus (e.g., WordNet) adds synonyms before retrieval.
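A minimal pseudo-relevance-feedback sketch (the helper name and term budget are illustrative): tally the terms of the assumed-relevant top results and append the most frequent new ones to the query. A real system would also filter stopwords such as "the" before expanding.

```python
from collections import Counter

def expand_query(query_tokens, top_docs, num_terms=2):
    # count terms from the assumed-relevant docs, skipping existing query terms
    counts = Counter()
    for text in top_docs:
        for tok in text.lower().split():
            if tok not in query_tokens:
                counts[tok] += 1
    return query_tokens + [t for t, _ in counts.most_common(num_terms)]

top_docs = ["the quick brown fox", "the quick fox jumps"]
print(expand_query(["fox"], top_docs))  # => ['fox', 'the', 'quick']
```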
Traditional vs Neural Retrieval
| Aspect | Traditional (BM25, TF-IDF) | Neural (DPR, ColBERT) |
|---|---|---|
| Matching | Exact term overlap | Semantic/vector similarity |
| Vocabulary gap | Fails on synonym mismatch | Handles naturally |
| Training data | None required | Requires relevance labels |
| Index size | Compact (inverted lists) | Large (dense vectors) |
| Query latency | Milliseconds | 10-100ms (ANN search) |
| Hardware | CPU | GPU recommended |
| Interpretability | High (see which terms match) | Low (black-box embeddings) |
| Rare terms | Excellent | Poor (rare terms have weak vectors) |
| Cross-lingual | No | Yes (multilingual models) |
| Domain adaptation | Zero-shot | Fine-tuning needed |
Resources
- Introduction to Information Retrieval – Manning, Raghavan, Schütze (textbook)
- BM25 Wikipedia – algorithm details and variants
- Dense Passage Retrieval (Karpukhin et al., 2020) – original DPR paper
- ColBERT (Khattab & Zaharia, 2020) – late interaction retrieval
- BEIR Benchmark – evaluation suite for zero-shot retrieval
- FAISS – Facebook's similarity search library
- Whoosh – pure-Python search library
- Elasticsearch BM25 – BM25 configuration in production
- Sentence-Transformers – dense embedding models
- TREC – Text REtrieval Conference datasets and tracks