
Some Text Retrieval Toolkits

Text Retrieval Fundamentals

Text retrieval, the task of finding documents relevant to a user’s information need in a large collection, is the foundation of search engines, question answering, and RAG pipelines. Every retrieval system faces the same core challenge: given a query, return the most relevant documents ranked by usefulness.

This article covers the essential concepts, algorithms, and implementations: from Boolean retrieval and TF-IDF to modern neural methods like Dense Passage Retrieval and ColBERT.

The Inverted Index

The central data structure in text retrieval is the inverted index. Instead of scanning every document for each query, the index maps each unique term to the list of documents containing it.

from collections import defaultdict
import math

def build_inverted_index(docs):
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].append(doc_id)
    return index

docs = [
    "the quick brown fox",
    "the lazy dog",
    "the quick fox jumps",
]
index = build_inverted_index(docs)
print(dict(index))
# => {'the': [0, 1, 2], 'quick': [0, 2], 'brown': [0],
#     'fox': [0, 2], 'lazy': [1], 'dog': [1], 'jumps': [2]}

A production index stores not just document IDs but also term frequency, position, and field metadata to support phrase queries and scoring.
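As a sketch, the postings can carry positions as well; the helper below (hypothetical, reusing the docs list and imports from above) stores per-document position lists, which is enough to answer a simple phrase query:

def build_positional_index(docs):
    # term -> {doc_id: [token positions]}
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index

pos_index = build_positional_index(docs)
# phrase "quick fox": fox must appear immediately after quick
matches = [d for d in pos_index["quick"]
           if any(p + 1 in pos_index["fox"][d] for p in pos_index["quick"][d])]
print(matches)
# => [2]  (doc 0 has "quick brown fox", so the phrase does not match)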

Tokenization and Stemming

Raw text must be normalized before indexing. Tokenization splits text into terms; stemming reduces words to their root form.

import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def stem_tokens(tokens):
    return [stemmer.stem(t) for t in tokens]

raw = "The cats are running swiftly"
tokens = tokenize(raw)
stems = stem_tokens(tokens)
print(stems)
# => ['the', 'cat', 'are', 'run', 'swiftli']

Common stemming algorithms include Porter (aggressive), Snowball, and Lancaster. For better linguistic accuracy, many modern systems use lemmatization (e.g., spaCy, Stanford CoreNLP), which returns valid dictionary words.
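For comparison, a minimal spaCy lemmatization sketch (assuming the en_core_web_sm model is downloaded; lemmas can vary slightly across model versions):

import spacy

nlp = spacy.load("en_core_web_sm")
print([token.lemma_ for token in nlp("The cats are running swiftly")])
# => ['the', 'cat', 'be', 'run', 'swiftly']

Unlike the stems above, every lemma is a valid dictionary word.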

Boolean Retrieval Model

The simplest retrieval model treats queries as Boolean expressions of terms connected by AND, OR, and NOT. Documents either match or they don’t: there is no ranking.

def boolean_search(query_tokens, index, operator="AND"):
    if operator == "AND":
        # intersect the posting lists; an empty query matches nothing
        if not query_tokens:
            return set()
        result = set(index.get(query_tokens[0], []))
        for t in query_tokens[1:]:
            result &= set(index.get(t, []))
        return result
    elif operator == "OR":
        # union of the posting lists
        result = set()
        for t in query_tokens:
            result |= set(index.get(t, []))
        return result
    raise ValueError(f"unsupported operator: {operator}")

query = ["quick", "fox"]
print(boolean_search(query, index, "AND"))
# => {0, 2}

Boolean retrieval is precise but inflexible: a single missing term returns nothing. Most modern systems blend Boolean operators into ranked retrieval.

TF-IDF: Term Frequency–Inverse Document Frequency

TF-IDF scores a document against a query by weighing term frequency against how rare the term is across the collection.

def compute_tfidf(docs):
    N = len(docs)
    tf = []                  # per-document term counts
    df = defaultdict(int)    # number of documents containing each term

    for text in docs:
        tokens = text.lower().split()
        term_counts = defaultdict(int)
        for t in tokens:
            term_counts[t] += 1
        tf.append(term_counts)
        for t in set(tokens):
            df[t] += 1

    idf = {t: math.log(N / df[t]) for t in df}
    return tf, idf

def tfidf_score(query, doc_idx, tf, idf):
    score = 0.0
    for term in query.lower().split():
        if term in idf:
            score += tf[doc_idx].get(term, 0) * idf[term]
    return score

tf, idf = compute_tfidf(docs)
query = "quick fox"
for i in range(len(docs)):
    print(f"Doc {i}: {tfidf_score(query, i, tf, idf):.3f}")

# => Doc 0: 0.811, Doc 1: 0.000, Doc 2: 0.811

TF-IDF remains the baseline for many retrieval tasks. It is simple, interpretable, and effective on clean, well-structured text.

BM25: The Industry Standard

BM25 (Best Matching 25) improves on TF-IDF with term-frequency saturation and document-length normalization. It is the default ranking function in Lucene, Elasticsearch, and Solr.

def bm25_score(query, doc_idx, tf, doc_lengths, avg_dl, idf, k1=1.5, b=0.75):
    score = 0.0
    dl = doc_lengths[doc_idx]
    for term in query.lower().split():
        if term in idf:
            tf_td = tf[doc_idx].get(term, 0)
            numerator = tf_td * (k1 + 1)
            denominator = tf_td + k1 * (1 - b + b * (dl / avg_dl))
            score += idf[term] * numerator / denominator
    return score

doc_lengths = [len(d.lower().split()) for d in docs]
avg_dl = sum(doc_lengths) / len(doc_lengths)
idf_bm25 = {t: math.log((len(docs) - sum(1 for d in docs if t in d.lower().split()) + 0.5)
                         / (sum(1 for d in docs if t in d.lower().split()) + 0.5) + 1)
            for t in set(" ".join(docs).split())}

for i in range(len(docs)):
    print(f"Doc {i}: {bm25_score(query, i, tf, doc_lengths, avg_dl, idf_bm25):.3f}")
# => Doc 0: 0.903, Doc 1: 0.000, Doc 2: 0.903

BM25’s key insight: term frequency saturates (repeating “the” 100 times doesn’t help), and shorter documents are preferred over long ones when they match equally.
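The saturation effect is easy to see in isolation. Holding the length-normalization factor at 1 and using k1 = 1.5 from above, the term-frequency component tf * (k1 + 1) / (tf + k1) approaches k1 + 1 = 2.5 no matter how often the term repeats:

print([round(t * 2.5 / (t + 1.5), 3) for t in (1, 2, 5, 10, 100)])
# => [1.0, 1.429, 1.923, 2.174, 2.463]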

Vector Space Model and Cosine Similarity

Documents and queries are represented as vectors in a high-dimensional term space. Relevance is the cosine of the angle between them.

import numpy as np

def build_vector(doc_idx, tf, idf, vocab):
    vec = np.zeros(len(vocab))
    for j, term in enumerate(vocab):
        vec[j] = tf[doc_idx].get(term, 0) * idf.get(term, 0)
    return vec

def cosine_similarity(vec_a, vec_b):
    dot = np.dot(vec_a, vec_b)
    norm = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    return dot / norm if norm else 0.0

vocab = sorted(idf.keys())
query_tf = [{t: 1 for t in query.lower().split()}]  # treat the query as a tiny document
query_vec = build_vector(0, query_tf, idf, vocab)
for i in range(len(docs)):
    doc_vec = build_vector(i, tf, idf, vocab)
    print(f"Doc {i} cosine: {cosine_similarity(query_vec, doc_vec):.3f}")
# => Doc 0 cosine: 0.463, Doc 1 cosine: 0.000, Doc 2 cosine: 0.463

Cosine similarity normalizes for document length: two documents with the same content proportions score equally regardless of size.
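A quick check of that claim: scaling a vector leaves its cosine unchanged.

v = np.array([1.0, 2.0, 0.0])
print(cosine_similarity(v, 3 * v))
# => 1.0 (up to float rounding): a document and its triple-length copy match identically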

Building a Search Engine with Whoosh

Whoosh is a pure-Python search library that implements BM25, an inverted index, and a query parser out of the box.

import os, shutil
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT
from whoosh.qparser import QueryParser

schema = Schema(title=TEXT(stored=True), content=TEXT)
index_dir = "indexdir"
if os.path.exists(index_dir):
    shutil.rmtree(index_dir)
os.mkdir(index_dir)

ix = create_in(index_dir, schema)
writer = ix.writer()
writer.add_document(title="Doc 1", content="the quick brown fox")
writer.add_document(title="Doc 2", content="the lazy dog")
writer.add_document(title="Doc 3", content="the quick fox jumps")
writer.commit()

with ix.searcher() as searcher:
    parser = QueryParser("content", ix.schema)
    query_obj = parser.parse("quick fox")
    results = searcher.search(query_obj, limit=3)
    for r in results:
        print(f"{r['title']} (score: {r.score:.3f})")
# => Doc 1 (score: 0.242), Doc 3 (score: 0.242)

Whoosh is ideal for prototyping and small-scale applications. For production, Elasticsearch provides the same BM25-based ranking at web scale.

Semantic Search with Embeddings

Lexical methods fail on vocabulary mismatch: “automobile” doesn’t match “car”. Dense embeddings solve this by mapping queries and documents into a shared semantic space.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "A cat chases mice",
    "Dogs are loyal pets",
    "Automobiles require fuel",
]
doc_embeddings = model.encode(docs)

query = "car engine maintenance"
query_embedding = model.encode(query)

scores = np.dot(doc_embeddings, query_embedding) / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)

for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
# => 0.136  A cat chases mice
# => 0.087  Dogs are loyal pets
# => 0.341  Automobiles require fuel

The embedding model captures semantic relatedness: “car” and “automobile” are close in vector space even though they share no tokens.
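To check that claim directly, embed the two words and compare them (the exact value varies by model version, but it is far above unrelated pairs):

pair = model.encode(["car", "automobile"])
sim = np.dot(pair[0], pair[1]) / (np.linalg.norm(pair[0]) * np.linalg.norm(pair[1]))
print(f"{sim:.3f}")  # high cosine despite zero token overlap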

Neural Retrieval: DPR and ColBERT

Dense Passage Retrieval (DPR) uses dual encoders: one BERT encoder for queries and another for passages. Documents are pre-encoded, and retrieval is a nearest-neighbor search with FAISS.

ColBERT uses late interaction: each query token attends to each document token independently, producing a more nuanced relevance score at the cost of more computation.
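A minimal numpy sketch of the MaxSim late-interaction score, assuming per-token embeddings are already computed and L2-normalized (real ColBERT produces them with a trained BERT encoder plus a linear projection):

import numpy as np

def maxsim_score(query_toks, doc_toks):
    # query_toks: (num_q, dim), doc_toks: (num_d, dim), rows L2-normalized
    sims = query_toks @ doc_toks.T       # every query token vs every doc token
    return sims.max(axis=1).sum()        # best doc token per query token, summed

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8));  q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(12, 8)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(round(float(maxsim_score(q, d)), 3))

The DPR-style dual encoder below shows the bi-encoder side: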

from transformers import AutoTokenizer, AutoModel
import torch

# DPR-style dual encoder (simplified)
tokenizer = AutoTokenizer.from_pretrained("facebook/dpr-question_encoder-multiset-base")
model = AutoModel.from_pretrained("facebook/dpr-question_encoder-multiset-base")

def encode(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():  # inference only; no gradients needed
        outputs = model(**inputs)
    return outputs.pooler_output.numpy()

query_vec = encode("car repair costs")
# NOTE: full DPR encodes passages with a separate context encoder
# (facebook/dpr-ctx_encoder-multiset-base); the question encoder is reused here for brevity
passages = ["automobile maintenance expenses", "dog walking tips"]
doc_vecs = [encode(p) for p in passages]

for p, v in zip(passages, doc_vecs):
    sim = np.dot(query_vec, v.T) / (np.linalg.norm(query_vec) * np.linalg.norm(v))
    print(f"{sim[0][0]:.3f}  {p}")
# => 0.642  automobile maintenance expenses
# => 0.218  dog walking tips

DPR and ColBERT post strong results on benchmarks like MS MARCO and BEIR. They require GPUs for training, but serving is practical with FAISS for fast approximate nearest-neighbor search.
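A minimal FAISS sketch of that nearest-neighbor step, reusing query_vec and doc_vecs from above (IndexFlatIP does exact inner-product search; IVF or HNSW indexes trade exactness for speed at scale):

import faiss

index = faiss.IndexFlatIP(query_vec.shape[1])     # DPR scores by dot product
index.add(np.vstack(doc_vecs).astype("float32"))
scores, ids = index.search(query_vec.astype("float32"), 2)
print(ids[0], scores[0])  # passage indices ranked by dot product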

Hybrid Search: Lexical + Semantic

The best retrieval systems combine lexical (BM25) and semantic (dense) signals. BM25 excels at exact term matching and handles rare terms well; dense retrieval captures synonyms and paraphrases.

def hybrid_score(lexical_score, semantic_score, alpha=0.3):
    # both inputs should first be normalized to a comparable range
    return alpha * lexical_score + (1 - alpha) * semantic_score

# alpha near 1.0 favors exact match; near 0.0 favors semantics

Elasticsearch supports hybrid search natively (kNN + BM25), and libraries like hybridse or ranx make it easy to fuse results from separate retrievers.
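One practical wrinkle: BM25 scores are unbounded while cosine scores live in [-1, 1], so normalize each list before interpolating. A hedged sketch with min-max scaling (the input numbers are illustrative):

def min_max(scores):
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def fuse(bm25_scores, dense_scores, alpha=0.3):
    # weighted sum of per-document normalized lexical and semantic scores
    b, d = min_max(bm25_scores), min_max(dense_scores)
    return [alpha * bi + (1 - alpha) * di for bi, di in zip(b, d)]

print(fuse([0.903, 0.0, 0.903], [0.14, 0.09, 0.34]))
# => approximately [0.44, 0.0, 1.0]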

Evaluation Metrics

Offline evaluation measures how well a system retrieves relevant documents.

def precision_at_k(relevant, retrieved, k):
    return len(set(relevant) & set(retrieved[:k])) / k

def recall_at_k(relevant, retrieved, k):
    return len(set(relevant) & set(retrieved[:k])) / len(relevant)

def average_precision(relevant, retrieved):
    ap = 0.0
    hits = 0
    for i, doc in enumerate(retrieved):
        if doc in relevant:
            hits += 1
            ap += hits / (i + 1)
    return ap / len(relevant) if relevant else 0.0

def reciprocal_rank(relevant, retrieved):
    for i, doc in enumerate(retrieved):
        if doc in relevant:
            return 1.0 / (i + 1)
    return 0.0

relevant = {0, 2}
retrieved = [2, 3, 0, 1]
print(f"P@1: {precision_at_k(relevant, retrieved, 1)}")   # 1.0
print(f"P@3: {precision_at_k(relevant, retrieved, 3)}")   # 0.667
print(f"R@3: {recall_at_k(relevant, retrieved, 3)}")      # 1.0
print(f"MAP: {average_precision(relevant, retrieved):.3f}")  # 0.833
print(f"MRR: {reciprocal_rank(relevant, retrieved):.3f}")    # 1.0

NDCG (Normalized Discounted Cumulative Gain) handles graded relevance (not just binary). It discounts gains from lower-ranked results and normalizes by the ideal ranking.
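A compact sketch (graded gains listed in retrieved order; this uses the common gain = relevance, log2-discount formulation):

def dcg(gains):
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(relevance, k):
    ideal = sorted(relevance, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(relevance[:k]) / denom if denom else 0.0

print(f"{ndcg_at_k([3, 0, 2, 1], 4):.3f}")
# => 0.930 (a perfect ordering would score 1.0)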

Relevance Feedback and Query Expansion

Rocchio relevance feedback adjusts the query vector toward relevant documents and away from non-relevant ones:

def rocchio(query_vec, relevant_vecs, non_relevant_vecs, alpha=1.0, beta=0.75, gamma=0.15):
    # relevant_vecs / non_relevant_vecs: arrays of document vectors, one per row
    new_query = alpha * query_vec
    if len(relevant_vecs):
        new_query += beta * np.mean(relevant_vecs, axis=0)
    if len(non_relevant_vecs):
        new_query -= gamma * np.mean(non_relevant_vecs, axis=0)
    return new_query

Pseudo-relevance feedback assumes the top-k retrieved results are relevant and expands the query with their top terms. Query expansion using word embeddings or a thesaurus (e.g., WordNet) adds synonyms before retrieval.
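A hedged sketch of pseudo-relevance feedback on top of tfidf_score and the tf/idf structures from the TF-IDF section (top_k and num_terms are illustrative choices):

def pseudo_relevance_expand(query, tf, idf, top_k=2, num_terms=2):
    # 1. initial retrieval: rank all documents against the original query
    ranked = sorted(range(len(tf)),
                    key=lambda i: tfidf_score(query, i, tf, idf), reverse=True)
    # 2. assume the top_k results are relevant and pool their terms
    candidates = defaultdict(float)
    for i in ranked[:top_k]:
        for term, count in tf[i].items():
            if term not in query.lower().split():
                candidates[term] += count * idf.get(term, 0.0)
    # 3. append the highest-weighted new terms to the query
    expansion = sorted(candidates, key=candidates.get, reverse=True)[:num_terms]
    return query + " " + " ".join(expansion)

print(pseudo_relevance_expand("quick fox", tf, idf))
# => 'quick fox brown jumps'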

Traditional vs Neural Retrieval

Aspect | Traditional (BM25, TF-IDF) | Neural (DPR, ColBERT)
Matching | Exact term overlap | Semantic/vector similarity
Vocabulary gap | Fails on synonym mismatch | Handles naturally
Training data | None required | Requires relevance labels
Index size | Compact (inverted lists) | Large (dense vectors)
Query latency | Milliseconds | 10-100 ms (ANN search)
Hardware | CPU | GPU recommended
Interpretability | High (matching terms are visible) | Low (black-box embeddings)
Rare terms | Excellent | Poor (rare terms have weak vectors)
Cross-lingual | No | Yes (multilingual models)
Domain adaptation | Zero-shot | Fine-tuning needed
