
RAG Database Architecture: Building Production AI Systems

Introduction

Retrieval-Augmented Generation gives large language models access to external knowledge at inference time. Instead of relying solely on parametric memory, a RAG system retrieves relevant documents from a database and injects them into the prompt as context. This approach cuts hallucination, grounds answers in verifiable sources, and lets organizations build AI on top of their proprietary data without fine-tuning.

The database layer determines whether a RAG system works in production. Embedding storage, index configuration, metadata filtering, and query latency directly affect retrieval quality and user experience. A system with a perfect embedding model and a powerful LLM still fails if the database returns irrelevant documents or takes five seconds to answer. This article walks through the complete data flow (document to chunk to embedding to index to retrieval to generation) and covers the architectural decisions at each stage.

Why Databases Matter for RAG

A RAG system’s effectiveness depends on what the retriever finds. The database defines the search space: if the right information isn’t indexed or isn’t reachable through the query, the LLM cannot use it. Three database properties drive RAG success.

Recall measures how many relevant documents the system retrieves. Low recall means the LLM works with incomplete context and may hallucinate or give wrong answers. Databases that support hybrid search, combining vector similarity with keyword matching and metadata filtering, achieve higher recall than pure vector search alone.

Latency determines whether the system feels interactive. Retrieval must complete in a few hundred milliseconds to leave room for the LLM’s generation time. Index types like HNSW (Hierarchical Navigable Small World) trade a small amount of recall for order-of-magnitude speed improvements over brute-force search.

Freshness ensures the database reflects the latest information. Stale embeddings or missing documents produce outdated answers. The ingestion pipeline must support incremental updates without rebuilding the entire index.

Document Ingestion Pipeline

The ingestion pipeline converts raw documents into searchable embeddings. Every step affects retrieval quality.

Document Loading

Documents arrive in many formats: PDF, HTML, Markdown, plain text, Word, Notion exports, database rows. LangChain and LlamaIndex provide document loaders for most common sources.

from langchain_community.document_loaders import PyPDFLoader, TextLoader, WebBaseLoader

pdf_docs = PyPDFLoader("annual-report-2025.pdf").load()
text_docs = TextLoader("notes.txt").load()
web_docs = WebBaseLoader("https://example.com/docs/guide").load()

print(f"Loaded {len(pdf_docs)} PDF pages, {len(text_docs)} text documents, {len(web_docs)} web pages")

Each loader returns Document objects with page content and metadata. Metadata carries the source URL, page number, load timestamp, and any extra fields you attach. This metadata becomes the foundation for filtering later.
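
For illustration, extra fields can be attached before chunking so they survive into the index. The field names and values here are arbitrary, not a LangChain convention:

from datetime import datetime, timezone

for doc in pdf_docs:
    doc.metadata.update({
        "source_url": "https://example.com/annual-report-2025.pdf",  # illustrative value
        "department": "finance",
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    })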

Text Cleaning and Normalization

Raw text often contains artifacts: headers, footers, HTML tags, repeated whitespace, Unicode variations. Cleaning removes noise that would otherwise pollute embeddings.

import re

def clean_text(text: str) -> str:
    text = re.sub(r"<[^>]+>", "", text)                # strip HTML
    text = re.sub(r"\s+", " ", text).strip()            # collapse whitespace
    text = re.sub(r"[^\x00-\x7F]+", "", text)           # remove non-ASCII
    return text

cleaned = [clean_text(doc.page_content) for doc in pdf_docs]

Aggressive cleaning can hurt domain-specific RAG systems: code snippets need their syntax preserved, and mathematical notation needs Unicode. Apply cleaning selectively based on document type.
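
A minimal sketch of type-aware cleaning, assuming a content_type metadata field set by your loaders (the field name is an assumption, not a library default):

def clean_for_type(doc):
    content_type = doc.metadata.get("content_type", "prose")
    if content_type in ("code", "math"):
        # Preserve indentation, symbols, and Unicode; only trim outer whitespace.
        return doc.page_content.strip()
    return clean_text(doc.page_content)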

Chunking Strategies

Chunking splits documents into pieces small enough for embedding and retrieval. The chunk size directly affects both recall and relevance. Too small: each chunk lacks context and the retriever returns fragments that confuse the LLM. Too large: chunks contain multiple topics and the retriever returns irrelevant content mixed with relevant content.

Fixed-Size Chunking

The simplest approach splits text into equal-sized chunks with optional overlap. The overlap prevents ideas from being split across a boundary.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=128,
    separators=["\n\n", "\n", ".", " ", ""],
)

chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks from {len(documents)} documents")

The recursive splitter tries to break on paragraph boundaries first, then sentence boundaries, then word boundaries. This produces more coherent chunks than a naive fixed-size split.

Semantic Chunking

Semantic chunking uses document structure (headings, sections, paragraphs) as split points. It produces chunks that map to meaningful units of content.

from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]

splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
chunks = splitter.split_text(markdown_text)

for chunk in chunks:
    print(f"Section: {chunk.metadata.get('h1', '')} > {chunk.metadata.get('h2', '')}")

Semantic chunking preserves the document hierarchy in metadata, enabling queries that target specific sections. It works well for structured content like documentation, specifications, and reports.
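
As a small illustration, the header metadata makes it easy to restrict downstream steps to a single section before anything is embedded (the section title here is hypothetical):

def chunks_in_section(chunks, h2_title: str):
    # Keep only chunks whose h2 header metadata matches the requested section.
    return [c for c in chunks if c.metadata.get("h2") == h2_title]

install_chunks = chunks_in_section(chunks, "Installation")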

Recursive Chunking with Custom Logic

Production systems often combine strategies. A common pattern: split by section (semantic), then apply recursive splitting on sections that exceed the maximum chunk size.

import re
from langchain.text_splitter import RecursiveCharacterTextSplitter

def split_by_headings(text: str) -> list[str]:
    # Placeholder section splitter: breaks on Markdown-style headings.
    # Swap in whatever structural splitter fits your documents.
    return [s.strip() for s in re.split(r"\n(?=#{1,3} )", text) if s.strip()]

def chunk_document(doc, max_size=512, min_size=100):
    sections = split_by_headings(doc.page_content)

    chunks = []
    for section in sections:
        if len(section) < min_size:
            continue  # drop fragments too short to carry meaning
        if len(section) <= max_size:
            chunks.append(section)
        else:
            sub_chunks = RecursiveCharacterTextSplitter(
                chunk_size=max_size,
                chunk_overlap=max_size // 4,
            ).split_text(section)
            chunks.extend(sub_chunks)

    return chunks

Embedding Generation

Embedding models convert text chunks into dense vector representations. The quality of these vectors determines whether semantically similar documents cluster together in vector space.

Choosing an Embedding Model

The embedding model’s capabilities should match your content domain and query patterns. OpenAI’s text-embedding-3-small offers 1536 dimensions with strong general-purpose performance. Cohere’s embed-english-v3.0 supports 1024 dimensions with multilingual variants. Open-source models like BAAI/bge-large-en-v1.5 and intfloat/e5-mistral-7b-instruct provide competitive quality without API dependencies.

from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings(
    model="text-embedding-3-small",
    dimensions=1536,
)

vectors = embeddings_model.embed_documents([c.page_content for c in chunks])
print(f"Generated {len(vectors)} embeddings, each {len(vectors[0])} dimensions")

Batch Processing for Scale

Embedding APIs have rate limits and cost per token. Batch processing reduces API calls and improves throughput.

import time
from typing import List

def embed_batch(texts: List[str], model, batch_size=64, delay=0.5):
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        embeddings = model.embed_documents(batch)
        all_embeddings.extend(embeddings)
        print(f"Embedded {min(i + batch_size, len(texts))}/{len(texts)}")
        time.sleep(delay)
    return all_embeddings
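
A usage sketch that keeps chunks and vectors aligned for insertion into the database:

texts = [c.page_content for c in chunks]
vectors = embed_batch(texts, embeddings_model, batch_size=64)
records = list(zip(chunks, vectors))  # (chunk, embedding) pairs, ready to insert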

Vector Database Selection

The vector database stores embeddings and executes similarity search. Each database makes different trade-offs between speed, accuracy, scalability, and operational complexity.

Vector Database Comparison

| Feature | pgvector (PostgreSQL) | Pinecone | Qdrant | Weaviate | Milvus |
|---|---|---|---|---|---|
| Index type | IVFFlat, HNSW | HNSW, (PQ) | HNSW, (SQ) | HNSW | IVF, HNSW, (PQ, SQ) |
| Max dimensions | 2000 (HNSW) / 8000 (IVFFlat) | 20000 | 65536 | 1024 | 32768 |
| Hosting | Self-hosted | Managed | Self-hosted / Managed | Self-hosted / Cloud | Self-hosted / Managed |
| Hybrid search | Yes (native SQL) | No | Yes | Yes (hybrid) | Yes (hybrid) |
| Metadata filtering | Full SQL | Pre-filter only | Full filter | Filter + BM25 | Filter + scalar index |
| Horizontal scaling | Read replicas, Citus | Auto-scaling | Cluster mode | Replication | Distributed (K8s) |
| Consistency model | Strong | Eventual | Configurable | Strong | Configurable |
| Open source | Yes (extension) | No | Yes | Yes (BSL) | Yes |
| Cloud offering | RDS, Cloud SQL | Yes | Qdrant Cloud | Weaviate Cloud | Zilliz Cloud |

pgvector works well when you already run PostgreSQL and want to avoid another database. It supports full SQL hybrid search: vector similarity, keyword matching with full-text search, and metadata filtering in a single query. Pinecone is the easiest managed option but costs more at scale and lacks native hybrid search. Qdrant offers strong filtering performance with a clean API. Weaviate includes built-in vectorization modules that simplify the pipeline. Milvus scales to billions of vectors with a distributed architecture.

Storing Embeddings with pgvector

pgvector brings vector search into PostgreSQL by adding a vector data type and index support.

-- Enable the extension
CREATE EXTENSION vector;

-- Create a table with vector column and metadata
CREATE TABLE document_chunks (
    id BIGSERIAL PRIMARY KEY,
    document_id TEXT NOT NULL,
    chunk_index INT NOT NULL,
    content TEXT NOT NULL,
    tokens INT NOT NULL DEFAULT 0,
    source_url TEXT,
    section_path TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    embedding vector(1536)
);

-- Create an HNSW index for approximate nearest neighbor search
CREATE INDEX ON document_chunks USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 200);

-- Create indexes for filtering
CREATE INDEX idx_document_id ON document_chunks (document_id);
CREATE INDEX idx_source_url ON document_chunks (source_url);

The HNSW parameters m (max connections per node) and ef_construction (search scope during build) control the trade-off between recall and index build time. Start with m=16, ef_construction=200 and tune based on your data.

Vector Search Queries

Find the most semantically similar chunks using cosine distance.

SELECT
    id,
    content,
    source_url,
    section_path,
    1 - (embedding <=> $query_vector) AS similarity
FROM document_chunks
ORDER BY embedding <=> $query_vector
LIMIT 10;

The <=> operator computes cosine distance. Use vector_cosine_ops with the index for the fastest results. For L2 distance, use <-> and vector_l2_ops. For inner product, use <#> and vector_ip_ops; note that <#> returns the negative inner product, so ascending order still puts the best matches first.
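
For reference, a minimal sketch of running this query from Python, assuming psycopg and the pgvector Python package (register_vector adapts numpy arrays to the vector type; the connection string is illustrative):

import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=rag")
register_vector(conn)

query_vector = np.array(embeddings_model.embed_query("How do I deploy RAG on Kubernetes?"))
rows = conn.execute(
    """
    SELECT id, content, source_url, 1 - (embedding <=> %s) AS similarity
    FROM document_chunks
    ORDER BY embedding <=> %s
    LIMIT 10
    """,
    (query_vector, query_vector),
).fetchall()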

Hybrid Search: Vector + Keyword + Metadata

Combine semantic similarity with full-text search and metadata filtering in a single query.

SELECT
    id,
    content,
    source_url,
    -- Hybrid score: weighted combination
    (1 - (embedding <=> $query_vector)) * 0.6
        + ts_rank(to_tsvector('english', content), plainto_tsquery('english', $query_text)) * 0.4
    AS hybrid_score
FROM document_chunks
WHERE
    source_url = $source_filter                      -- metadata filter
    AND created_at >= $date_filter                    -- date range filter
    AND to_tsvector('english', content) @@ plainto_tsquery('english', $query_text)  -- keyword filter
ORDER BY hybrid_score DESC
LIMIT 10;

Weighting vector similarity and keyword match requires tuning. A 60/40 split works as a starting point. Boost the vector weight for conceptual queries and boost keyword weight for exact-match queries like product names or error codes.
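
To experiment with the weights outside SQL, the same blend can be computed in Python once both scores are available per candidate (the values below are illustrative):

def hybrid_score(vector_similarity: float, keyword_score: float, vector_weight: float = 0.6) -> float:
    # Weighted blend of semantic and keyword signals; tune vector_weight per query type.
    return vector_weight * vector_similarity + (1 - vector_weight) * keyword_score

conceptual = hybrid_score(0.82, 0.10, vector_weight=0.8)   # lean on the embedding
exact_match = hybrid_score(0.55, 0.90, vector_weight=0.3)  # lean on keywords (error codes, SKUs)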

Re-Ranking

The initial vector search returns candidates that are “close” in embedding space, but the top results may not be the most relevant. Re-ranking applies a more expensive, higher-quality model to reorder the candidates.

Cross-Encoder Re-Ranking

Cross-encoder models score a query-document pair together (as opposed to bi-encoders that encode them separately). This produces more accurate relevance scores at the cost of more computation.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, documents: list, top_k: int = 5):
    pairs = [(query, doc.page_content) for doc in documents]
    scores = reranker.predict(pairs)

    scored = list(zip(documents, scores))
    scored.sort(key=lambda x: x[1], reverse=True)

    return [doc for doc, score in scored[:top_k]]

reranked_chunks = rerank("How do I deploy RAG on Kubernetes?", retrieved_chunks)

Apply re-ranking on the top 20-50 results from the vector search. Running a cross-encoder on thousands of candidates would be too slow, but scoring 50 pairs adds only a few hundred milliseconds.

Reciprocal Rank Fusion

When running multiple retrieval strategies (vector search, keyword search, different embedding models), RRF combines their rankings into a single result list.

def reciprocal_rank_fusion(results_by_strategy: list[list], k: int = 60):
    scores = {}
    for strategy_results in results_by_strategy:
        for rank, doc in enumerate(strategy_results):
            doc_id = doc.metadata.get("id") or doc.page_content[:50]
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)

    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, score in ranked]

RRF is model-agnostic and works with any combination of retrievers. It consistently outperforms any single retriever when the retrievers have complementary strengths.
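
A usage sketch, fusing rankings from a vector retriever and a keyword retriever (both hypothetical LangChain-style retriever objects):

query = "How do I deploy RAG on Kubernetes?"
vector_results = vector_retriever.get_relevant_documents(query)    # ranked by embedding similarity
keyword_results = keyword_retriever.get_relevant_documents(query)  # ranked by BM25 / full-text score

fused_ids = reciprocal_rank_fusion([vector_results, keyword_results])
print(fused_ids[:5])  # best document ids after fusion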

Context Assembly and Prompt Construction

Retrieved documents must be assembled into a prompt that the LLM can use effectively.

Context Assembly

Select the most relevant documents, trim them to fit the model’s context window, and preserve their source information for citation.

def assemble_context(retrieved_chunks, max_tokens=4000):
    context_parts = []
    total_tokens = 0

    for chunk in retrieved_chunks:
        chunk_tokens = len(chunk.page_content.split())  # rough token estimate via word count

        if total_tokens + chunk_tokens > max_tokens:
            break

        context_parts.append({
            "text": chunk.page_content,
            "source": chunk.metadata.get("source_url", ""),
            "section": chunk.metadata.get("section_path", ""),
        })
        total_tokens += chunk_tokens

    return context_parts

Prompt Construction

Structure the prompt so the LLM uses the provided context and cites sources.

def build_rag_prompt(query: str, context: list) -> str:
    context_text = "\n\n".join(
        f"[Source: {c['source']} - {c['section']}]\n{c['text']}"
        for c in context
    )

    prompt = f"""You are a technical assistant. Answer the question based on the provided context. If the context does not contain the answer, say so. Cite the source for each claim.

Context:
{context_text}

Question: {query}

Answer:"""
    return prompt

End-to-End RAG Pipeline

def rag_pipeline(query: str, embedding_model, vector_db, reranker, llm, top_k=20, rerank_k=5):
    # 1. Embed the query
    query_vector = embedding_model.embed_query(query)

    # 2. Retrieve candidates
    candidates = vector_db.similarity_search(query_vector, k=top_k)

    # 3. Re-rank
    top_docs = reranker.rerank(query, candidates, top_k=rerank_k)

    # 4. Assemble context
    context = assemble_context(top_docs)

    # 5. Build prompt
    prompt = build_rag_prompt(query, context)

    # 6. Generate
    response = llm.invoke(prompt)

    return {
        "response": response,
        "sources": [{"source": c["source"], "section": c["section"]} for c in context],
        "tokens_used": sum(len(c["text"].split()) for c in context),  # rough word-count estimate
    }

Evaluation Metrics

Measuring RAG quality requires metrics for both retrieval and generation.

Retrieval metrics compare retrieved documents against a ground-truth relevance set.

def retrieval_metrics(retrieved_ids, relevant_ids):
    retrieved_set = set(retrieved_ids)
    relevant_set = set(relevant_ids)

    true_positives = retrieved_set & relevant_set

    precision = len(true_positives) / len(retrieved_set) if retrieved_set else 0
    recall = len(true_positives) / len(relevant_set) if relevant_set else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0

    return {"precision": precision, "recall": recall, "f1": f1}

# Example: 10 retrieved, 8 relevant, 6 overlap
print(retrieval_metrics([0, 1, 2, 3, 4, 5, 10, 11, 12, 13], list(range(8))))
# => precision 0.6, recall 0.75, f1 ≈ 0.67

Generation metrics require evaluating whether the LLM answer is correct and grounded in the retrieved context. Automated evaluation approaches include:

def faithfulness_score(response: str, context: str, eval_llm) -> float:
    prompt = f"""Rate the faithfulness of this response on a scale of 0-1.
A response is faithful if every claim in the response is supported by the context.

Context:
{context}

Response:
{response}

Faithfulness score (0.0 - 1.0):"""
    score_text = eval_llm.invoke(prompt)  # assumes the eval LLM returns a bare numeric string
    return float(score_text.strip())

def answer_relevance(response: str, query: str, eval_llm) -> float:
    prompt = f"""Rate how well this response answers the query on a scale of 0-1.

Query: {query}
Response: {response}

Relevance score (0.0 - 1.0):"""
    score_text = eval_llm.invoke(prompt)
    return float(score_text.strip())

Track these metrics in production. A sudden drop in recall may indicate a stale index. A drop in faithfulness may indicate a shift in document content that requires re-embedding.

Common Pitfalls

Chunking without considering document semantics produces incoherent chunks that confuse both the retriever and the LLM. Always validate chunk quality by reading sample chunks before building the index.

Use the same embedding model for indexing and search, and don't skip normalization: unnormalized embeddings lead to incorrect distance calculations, because most cosine-similarity-based indexes assume unit-length vectors.
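
A minimal normalization step with numpy, applied to the embedding list before insertion (assuming the vectors list from the embedding step):

import numpy as np

def normalize(vectors):
    arr = np.asarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(arr, axis=1, keepdims=True)
    return arr / np.clip(norms, 1e-12, None)  # guard against division by zero

unit_vectors = normalize(vectors)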

Relying on a single retrieval strategy misses relevant documents. Hybrid search consistently outperforms pure vector or pure keyword search across diverse query types. Always combine at least vector and keyword methods.

Ignoring metadata filtering means every query searches the entire collection. Applying filters reduces the search space, improves latency, and prevents irrelevant results from surfacing.

Skipping re-ranking degrades result quality at the top of the list. The vector database returns candidates that are close in embedding space, but the closest vectors aren’t always the most relevant. A lightweight cross-encoder on the top 20-50 results costs little and improves quality measurably.

Not monitoring retrieval quality in production means degradation goes unnoticed until users complain. Log queries, retrieved documents, and user feedback. Set up alerts for recall drops and latency spikes.
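
A structured log line per query is often enough to start; the fields here are suggestions, not a fixed schema:

import json
import logging
import time
from typing import Optional

logger = logging.getLogger("rag.retrieval")

def log_retrieval(query: str, chunk_ids: list, latency_ms: float, feedback: Optional[str] = None):
    # One JSON record per retrieval, ready for downstream aggregation and alerting.
    logger.info(json.dumps({
        "ts": time.time(),
        "query": query,
        "retrieved_ids": chunk_ids,
        "latency_ms": round(latency_ms, 1),
        "user_feedback": feedback,
    }))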

Optimization Tips

Tune HNSW parameters for your latency requirements. Lower ef_search speeds up queries at the cost of recall. Start with ef_search=40 and increase until recall meets your threshold.
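
In pgvector, ef_search is a session-level setting, so a recall sweep can be driven from application code (reusing the psycopg connection sketched earlier):

for ef in (40, 80, 120):
    conn.execute(f"SET hnsw.ef_search = {ef}")
    # Run your recall benchmark at this setting and record recall vs. latency.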

Pre-compute and cache embeddings for static documents. Embedding generation is the slowest part of ingestion. Incremental updates should only re-embed changed documents.
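
One way to detect changed documents is a content hash per document, sketched here with a hypothetical stored_hashes mapping of document id to last-seen hash:

import hashlib

def needs_reembedding(doc_id: str, text: str, stored_hashes: dict) -> bool:
    # Re-embed only when the content hash differs from the last ingested version.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if stored_hashes.get(doc_id) == digest:
        return False
    stored_hashes[doc_id] = digest
    return True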

Use batched ingestion for initial loads. A single-threaded loop embedding 100,000 documents one at a time takes hours. Batch size of 64-128 with concurrent workers cuts ingestion time by 10x.
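
A sketch of concurrent batch embedding with a thread pool; worker count and batch size are illustrative, and your provider's rate limits still apply:

from concurrent.futures import ThreadPoolExecutor

def embed_concurrently(texts, model, batch_size=128, workers=4):
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(model.embed_documents, batches))
    # Flatten per-batch results back into one list aligned with the input order.
    return [vec for batch in results for vec in batch]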

Partition large collections by date, source, or category. Partition pruning limits search to relevant subsets and improves query performance. pgvector supports table partitioning natively.

Monitor vector index build times. Rebuilding after every update is expensive. Use incremental indexing (when available) or schedule rebuilds during low-traffic periods.
