Introduction
When you need an LLM to know about your company’s products, internal docs, or recent events, you have two main options: RAG (Retrieval-Augmented Generation) or Fine-Tuning. They solve different problems. Choosing the wrong one wastes time and money.
Quick decision:
- Knowledge changes frequently → RAG
- Behavior/style needs to change → Fine-Tuning
- Both → Hybrid
What Each Approach Does
RAG:
User query → search knowledge base → inject relevant docs into prompt → LLM answers
The model's weights don't change. Knowledge lives in a vector database.
Fine-Tuning:
Training examples → update model weights → model "knows" the new information
The knowledge is baked into the model. No external retrieval needed.
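Stripped of frameworks, the RAG loop is only a few lines. A minimal sketch for intuition, with a toy word-overlap scorer standing in for real embedding similarity:

```python
def score(query: str, doc: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k highest-scoring documents."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Stuff retrieved docs into the prompt; the model's weights never change."""
    context = "\n\n".join(retrieve(query, docs, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = [
    "The refund policy is 14 days from purchase",
    "The office is closed on public holidays",
]
prompt = build_prompt("What is the refund policy", docs, k=1)
```

The production pipeline below swaps the toy scorer for embeddings plus a vector database, but the shape is identical: score, select top-k, inject into the prompt.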
RAG: Implementation
Basic RAG Pipeline
# pip install langchain langchain-community langchain-openai chromadb
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import DirectoryLoader
# 1. Load documents
loader = DirectoryLoader('./docs', glob="**/*.md")
documents = loader.load()
# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = splitter.split_documents(documents)
# 3. Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
# 4. Create retrieval chain
llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,
)
# 5. Query
result = qa_chain.invoke({"query": "What is our refund policy?"})
print(result["result"])
print("Sources:", [doc.metadata["source"] for doc in result["source_documents"]])
Advanced RAG: Hybrid Search + Reranking
# pip install rank_bm25 sentence-transformers  (extra deps for BM25 + reranking)
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
# Hybrid: vector search + BM25 keyword search
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 10
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
# Combine both retrievers
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]  # weight keyword vs semantic
)
# Rerank results with a cross-encoder
model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
compressor = CrossEncoderReranker(model=model, top_n=4)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=ensemble_retriever
)
# Use in chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=compression_retriever,
)
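For intuition, EnsembleRetriever merges the two ranked lists with weighted Reciprocal Rank Fusion: each list contributes weight / (c + rank) for every document it returns. A minimal sketch with made-up doc IDs:

```python
def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Weighted Reciprocal Rank Fusion: each ranked list contributes
    weight / (c + rank) for every document it contains."""
    scores: dict[str, float] = {}
    for ranked, w in zip(rankings, weights):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (c + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)

bm25_ranked = ["doc_a", "doc_b", "doc_c"]    # keyword ranking
vector_ranked = ["doc_b", "doc_d", "doc_a"]  # semantic ranking
fused = weighted_rrf([bm25_ranked, vector_ranked], weights=[0.4, 0.6])
# doc_b wins: it appears near the top of both lists
```

The constant c (60 by convention) damps the advantage of top ranks so a single list cannot dominate the fused ordering.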
RAG Evaluation
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
# Evaluate RAG quality
test_questions = [
    "What is our return policy?",
    "How do I reset my password?",
]
# Generate answers with context
results = []
for q in test_questions:
    result = qa_chain.invoke({"query": q})
    results.append({
        "question": q,
        "answer": result["result"],
        "contexts": [doc.page_content for doc in result["source_documents"]],
    })
# Score with RAGAS
from datasets import Dataset
dataset = Dataset.from_list(results)
# Note: context_precision also expects a ground-truth reference column in recent RAGAS versions
scores = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(scores)
# faithfulness: 0.92 (answer grounded in retrieved docs)
# answer_relevancy: 0.88 (answer relevant to question)
# context_precision: 0.85 (retrieved docs are relevant)
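RAGAS computes these scores with an LLM judge (extracting claims from the answer and verifying each against the retrieved context), not with string matching. Purely for intuition, a toy word-overlap proxy for faithfulness might look like:

```python
def toy_faithfulness(answer: str, contexts: list[str]) -> float:
    """Fraction of answer sentences whose every word appears in the
    retrieved contexts. A crude stand-in for RAGAS's LLM-judged
    claim verification; for intuition only."""
    context_words = set(" ".join(contexts).lower().split())
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return 0.0
    grounded = sum(1 for s in sentences
                   if set(s.lower().split()) <= context_words)
    return grounded / len(sentences)

grounded_score = toy_faithfulness(
    "refunds take 14 days", ["refunds take 14 days from purchase"]
)
ungrounded_score = toy_faithfulness(
    "we ship to mars", ["refunds take 14 days from purchase"]
)
```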
Fine-Tuning: Implementation
When Fine-Tuning Makes Sense
✅ You need consistent output format (always return JSON, always use specific structure)
✅ You need domain-specific tone/style (legal language, medical terminology)
✅ You have 100+ high-quality examples of the exact behavior you want
✅ The knowledge is stable (doesn't change weekly)
✅ Latency matters (no retrieval step)
❌ Your knowledge base changes frequently
❌ You need source citations
❌ You have < 50 examples
❌ You need to update knowledge without retraining
OpenAI Fine-Tuning
from openai import OpenAI
import json
client = OpenAI()
# 1. Prepare training data (JSONL format)
training_examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a customer support agent for Acme Corp. Always be concise and professional."},
            {"role": "user", "content": "How do I cancel my subscription?"},
            {"role": "assistant", "content": "To cancel your subscription: 1) Log into your account, 2) Go to Settings > Billing, 3) Click 'Cancel Subscription', 4) Confirm cancellation. Your access continues until the end of the billing period."}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are a customer support agent for Acme Corp. Always be concise and professional."},
            {"role": "user", "content": "What payment methods do you accept?"},
            {"role": "assistant", "content": "We accept Visa, Mastercard, American Express, and PayPal. All payments are processed securely via Stripe."}
        ]
    },
    # ... need at least 10 examples, ideally 50-100+
]
# Save to JSONL
with open("training_data.jsonl", "w") as f:
    for example in training_examples:
        f.write(json.dumps(example) + "\n")
# 2. Upload training file
with open("training_data.jsonl", "rb") as f:
    response = client.files.create(file=f, purpose="fine-tune")
training_file_id = response.id
print(f"Training file: {training_file_id}")
# 3. Start fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file_id,
    model="gpt-4o-mini-2024-07-18",  # cheaper base model
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 4,
        "learning_rate_multiplier": 1.8,
    }
)
print(f"Fine-tuning job: {job.id}")
# 4. Monitor progress
import time
while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(f"Status: {job.status}")
    if job.status in ["succeeded", "failed"]:
        break
    time.sleep(30)
# 5. Use the fine-tuned model
fine_tuned_model = job.fine_tuned_model
print(f"Fine-tuned model: {fine_tuned_model}")
response = client.chat.completions.create(
    model=fine_tuned_model,
    messages=[
        {"role": "system", "content": "You are a customer support agent for Acme Corp."},
        {"role": "user", "content": "How do I update my billing address?"}
    ]
)
print(response.choices[0].message.content)
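One malformed line fails the whole fine-tuning job, so a local sanity check before uploading is cheap insurance. A minimal validator sketch (the function name is ours; pass it the lines of training_data.jsonl):

```python
import json

def validate_examples(lines) -> list[str]:
    """Check each JSONL line parses and holds a chat example that ends
    with an assistant turn; returns error strings (empty list = OK)."""
    errors = []
    roles = {"system", "user", "assistant"}
    for i, line in enumerate(lines, start=1):
        try:
            example = json.loads(line)
        except json.JSONDecodeError:
            errors.append(f"line {i}: invalid JSON")
            continue
        messages = example.get("messages")
        if not isinstance(messages, list) or not messages:
            errors.append(f"line {i}: missing 'messages' list")
            continue
        if any(m.get("role") not in roles for m in messages):
            errors.append(f"line {i}: unknown role")
        if messages[-1].get("role") != "assistant":
            errors.append(f"line {i}: must end with an assistant turn")
    return errors

# Quick self-check on in-memory lines; in practice pass the open file
sample = [
    '{"messages": [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}]}',
    '{"messages": [{"role": "user", "content": "hi"}]}',
]
problems = validate_examples(sample)
```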
LoRA Fine-Tuning (Open Source Models)
# pip install transformers peft datasets trl bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
from datasets import Dataset
# Load base model in 4-bit (requires bitsandbytes and a CUDA GPU)
model_name = "meta-llama/Llama-3.2-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# LoRA config → train only small adapter layers
lora_config = LoraConfig(
    r=16,           # rank of adapter matrices
    lora_alpha=32,  # scaling factor
    target_modules=["q_proj", "v_proj"],  # which layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 3,215,093,760 || trainable%: 0.13%
# Only 0.13% of parameters are trained → much cheaper!
# Prepare dataset
def format_example(example):
    # Use the model's own chat template rather than hand-rolled tags
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": example["input"]},
        {"role": "assistant", "content": example["output"]},
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False)
dataset = Dataset.from_list([
    {"input": "What is 2+2?", "output": "4"},
    # ... your training examples
])
dataset = dataset.map(lambda x: {"text": format_example(x)})
# Train (newer trl versions move dataset_text_field / max_seq_length into SFTConfig)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    args=TrainingArguments(
        output_dir="./lora-output",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
        fp16=True,
    ),
)
trainer.train()
# Save adapter
model.save_pretrained("./my-lora-adapter")
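The tiny trainable fraction printed above falls straight out of the low-rank math: LoRA freezes the d×d weight W and learns only the update (alpha/r)·B·A, where A is r×d and B is d×r. A small numpy illustration (d shrunk from 4096 for speed; B initializes to zero, so the adapted model starts identical to the base):

```python
import numpy as np

d, r, alpha = 1024, 16, 32              # d shrunk for speed; illustrative only
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))         # frozen base weight (not trained)
A = rng.standard_normal((r, d)) * 0.01  # trainable, r x d
B = np.zeros((d, r))                    # trainable, d x r, zero-initialized

# Effective weight: base plus scaled low-rank update
W_effective = W + (alpha / r) * (B @ A)

full_params = W.size
lora_params = A.size + B.size           # 2*r*d instead of d*d
fraction = lora_params / full_params
print(f"trainable fraction: {fraction:.4%}")  # trainable fraction: 3.1250%
```

The fraction shrinks further as d grows, which is why the 3B model above trains only 0.13% of its weights.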
Hybrid: RAG + Fine-Tuning
The best production systems often combine both:
# Fine-tuned model handles format/style
# RAG provides current knowledge
from openai import OpenAI
client = OpenAI()
def hybrid_query(question: str, vectorstore) -> str:
    # Step 1: Retrieve relevant context (RAG)
    docs = vectorstore.similarity_search(question, k=4)
    context = "\n\n".join([doc.page_content for doc in docs])
    # Step 2: Use fine-tuned model with retrieved context
    response = client.chat.completions.create(
        model="ft:gpt-4o-mini:your-org:your-model-id",  # fine-tuned model
        messages=[
            {
                "role": "system",
                "content": "You are Acme Corp's support agent. Use the provided context to answer questions. Always cite the source document."
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ],
        temperature=0,
    )
    return response.choices[0].message.content
Decision Framework
1. Does your knowledge change more than monthly?
   YES → RAG (update the vector DB, not the model)
   NO → Fine-tuning might work
2. Do you need source citations?
   YES → RAG (can return source documents)
   NO → Either works
3. Do you need consistent output format/style?
   YES → Fine-tuning (or system prompt + RAG)
   NO → RAG is simpler
4. How many training examples do you have?
   < 50 → RAG (fine-tuning needs more data)
   50+ → Fine-tuning is viable
   500+ → Fine-tuning will work well
5. Is latency critical?
   YES → Fine-tuning (no retrieval step)
   NO → Either works
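The five questions above collapse into a toy routing function (the signature and thresholds are ours, mirroring the framework, not an official API):

```python
def choose_approach(knowledge_changes_monthly: bool,
                    needs_citations: bool,
                    needs_fixed_style: bool,
                    num_examples: int,
                    latency_critical: bool) -> str:
    """Toy mirror of the decision framework; names/thresholds are ours."""
    wants_rag = knowledge_changes_monthly or needs_citations or num_examples < 50
    wants_ft = (needs_fixed_style or latency_critical) and num_examples >= 50
    if wants_rag and wants_ft:
        return "hybrid"      # RAG for knowledge, fine-tuning for style
    if wants_ft:
        return "fine-tuning"
    return "rag"             # default: simplest to operate and update

# Fresh knowledge + citations + fixed style + enough examples → hybrid
choice = choose_approach(True, True, True, 200, False)
```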
Cost Comparison
RAG costs:
- Vector DB hosting: $50-500/month (Pinecone, Weaviate, Chroma Cloud)
- Embedding API calls: ~$0.0001 per 1K tokens
- LLM inference: standard API pricing
- Total: mostly LLM inference cost
Fine-tuning costs:
- Training: $0.008/1K tokens (GPT-4o-mini) = ~$8 for 1M tokens
- Inference: 2x standard pricing for fine-tuned models
- Total: higher per-query cost, but no vector DB
Rule of thumb:
- Low query volume + large knowledge base → RAG
- High query volume + stable knowledge → Fine-tuning
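A back-of-envelope model of why high volume favors fine-tuning: RAG pays a fixed vector-DB bill and inflates every prompt with retrieved context, while fine-tuning pays roughly double the token price on much shorter prompts. Every number below is an assumption, not a quote:

```python
def rag_cost(queries: int, price_per_1k: float = 0.001,
             prompt_toks: int = 2500, vector_db: float = 200.0) -> float:
    """Monthly $: fixed vector-DB bill + long, context-stuffed prompts."""
    return vector_db + queries * (prompt_toks / 1000) * price_per_1k

def ft_cost(queries: int, price_per_1k: float = 0.002,
            prompt_toks: int = 500, training: float = 8.0) -> float:
    """Monthly $: amortized training + 2x token price on short prompts."""
    return training + queries * (prompt_toks / 1000) * price_per_1k

rag_high = rag_cost(200_000)  # 200 + 200k * 2.5 * 0.001 = 700.0
ft_high = ft_cost(200_000)    # 8 + 200k * 0.5 * 0.002 = 208.0
```

Cost rarely decides the low-volume case on its own; there the knowledge-freshness and citation questions above matter more than the bill.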
Resources
- LangChain RAG Tutorial
- OpenAI Fine-Tuning Guide
- RAGAS Evaluation Framework
- Hugging Face PEFT (LoRA)
- Pinecone Vector Database