RAG vs Fine-Tuning: When to Use Each and How to Implement Both

March 3, 2026 · Larry Qu · 6 min read

Introduction

When you need an LLM to know about your company’s products, internal docs, or recent events, you have two main options: RAG (Retrieval-Augmented Generation) or Fine-Tuning. They solve different problems. Choosing the wrong one wastes time and money.

Quick decision:

  • Knowledge changes frequently → RAG
  • Behavior/style needs to change → Fine-Tuning
  • Both → Hybrid

What Each Approach Does

RAG:
  User query → search knowledge base → inject relevant docs into prompt → LLM answers
  The model's weights don't change. Knowledge lives in a vector database.

Fine-Tuning:
  Training examples → update model weights → model "knows" the new information
  The knowledge is baked into the model. No external retrieval needed.

RAG: Implementation

Basic RAG Pipeline

# pip install langchain langchain-community langchain-openai chromadb unstructured
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import DirectoryLoader

# 1. Load documents
loader = DirectoryLoader('./docs', glob="**/*.md")
documents = loader.load()

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = splitter.split_documents(documents)

# 3. Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# 4. Create retrieval chain
llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,
)

# 5. Query
result = qa_chain.invoke({"query": "What is our refund policy?"})
print(result["result"])
print("Sources:", [doc.metadata["source"] for doc in result["source_documents"]])

Advanced RAG: Hybrid Search + Reranking

# pip install rank_bm25 sentence-transformers   (BM25 + cross-encoder dependencies)
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever, ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Hybrid: vector search + BM25 keyword search
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 10

vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Combine both retrievers
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]  # weight keyword vs semantic
)

# Rerank results with a cross-encoder
model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
compressor = CrossEncoderReranker(model=model, top_n=4)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=ensemble_retriever
)

# Use in chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=compression_retriever,
    return_source_documents=True,  # needed for the evaluation loop below
)
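
The upgraded chain is a drop-in replacement, so querying looks exactly like before:

result = qa_chain.invoke({"query": "What is our refund policy?"})
print(result["result"])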

RAG Evaluation

# pip install ragas datasets
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Evaluate RAG quality
test_questions = [
    "What is our return policy?",
    "How do I reset my password?",
]

# Generate answers with context
results = []
for q in test_questions:
    result = qa_chain.invoke({"query": q})
    results.append({
        "question": q,
        "answer": result["result"],
        "contexts": [doc.page_content for doc in result["source_documents"]],
    })

# Score with RAGAS
from datasets import Dataset
dataset = Dataset.from_list(results)
# Note: depending on your ragas version, some metrics also expect a ground-truth column
scores = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(scores)
# Example output (scores vary by corpus and model):
# faithfulness: 0.92        (answer grounded in retrieved docs)
# answer_relevancy: 0.88    (answer relevant to question)
# context_precision: 0.85   (retrieved docs are relevant)

Fine-Tuning: Implementation

When Fine-Tuning Makes Sense

✓ You need consistent output format (always return JSON, always use specific structure)
✓ You need domain-specific tone/style (legal language, medical terminology)
✓ You have 100+ high-quality examples of the exact behavior you want
✓ The knowledge is stable (doesn't change weekly)
✓ Latency matters (no retrieval step)

✗ Your knowledge base changes frequently
✗ You need source citations
✗ You have < 50 examples
✗ You need to update knowledge without retraining

OpenAI Fine-Tuning

from openai import OpenAI
import json

client = OpenAI()

# 1. Prepare training data (JSONL format)
training_examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a customer support agent for Acme Corp. Always be concise and professional."},
            {"role": "user", "content": "How do I cancel my subscription?"},
            {"role": "assistant", "content": "To cancel your subscription: 1) Log into your account, 2) Go to Settings > Billing, 3) Click 'Cancel Subscription', 4) Confirm cancellation. Your access continues until the end of the billing period."}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are a customer support agent for Acme Corp. Always be concise and professional."},
            {"role": "user", "content": "What payment methods do you accept?"},
            {"role": "assistant", "content": "We accept Visa, Mastercard, American Express, and PayPal. All payments are processed securely via Stripe."}
        ]
    },
    # ... need at least 10 examples, ideally 50-100+
]

# Save to JSONL
with open("training_data.jsonl", "w") as f:
    for example in training_examples:
        f.write(json.dumps(example) + "\n")
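
# Optional sanity check before uploading. A minimal sketch, not OpenAI's
# official validator: every line must be valid JSON with a messages list.
with open("training_data.jsonl") as f:
    for i, line in enumerate(f, 1):
        example = json.loads(line)
        assert isinstance(example.get("messages"), list), f"Bad example on line {i}"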

# 2. Upload training file
with open("training_data.jsonl", "rb") as f:
    response = client.files.create(file=f, purpose="fine-tune")
training_file_id = response.id
print(f"Training file: {training_file_id}")

# 3. Start fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file_id,
    model="gpt-4o-mini-2024-07-18",  # cheaper base model
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 4,
        "learning_rate_multiplier": 1.8,
    }
)
print(f"Fine-tuning job: {job.id}")

# 4. Monitor progress
import time
while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(f"Status: {job.status}")
    if job.status in ["succeeded", "failed", "cancelled"]:
        break
    time.sleep(30)

# 5. Use the fine-tuned model (only available once the job succeeds)
if job.status != "succeeded":
    raise RuntimeError(f"Fine-tuning did not succeed: {job.status}")
fine_tuned_model = job.fine_tuned_model
print(f"Fine-tuned model: {fine_tuned_model}")

response = client.chat.completions.create(
    model=fine_tuned_model,
    messages=[
        {"role": "system", "content": "You are a customer support agent for Acme Corp."},
        {"role": "user", "content": "How do I update my billing address?"}
    ]
)
print(response.choices[0].message.content)

LoRA Fine-Tuning (Open Source Models)

# pip install transformers peft datasets trl bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import Dataset

# Load base model in 4-bit (QLoRA-style) to cut memory usage
model_name = "meta-llama/Llama-3.2-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# LoRA config — train only small adapter layers
lora_config = LoraConfig(
    r=16,              # rank of adapter matrices
    lora_alpha=32,     # scaling factor
    target_modules=["q_proj", "v_proj"],  # which layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = prepare_model_for_kbit_training(model)  # standard prep for 4-bit training
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 3,215,093,760 || trainable%: 0.13%
# Only 0.13% of parameters are trained — much cheaper!

# Prepare dataset. This uses a simplified prompt format for illustration;
# for Llama chat models, tokenizer.apply_chat_template matches the model's
# actual training format more closely.
def format_example(example):
    return f"<|system|>You are a helpful assistant.\n<|user|>{example['input']}\n<|assistant|>{example['output']}"

dataset = Dataset.from_list([
    {"input": "What is 2+2?", "output": "4"},
    # ... your training examples
])
dataset = dataset.map(lambda x: {"text": format_example(x)})

# Train (note: newer trl releases move dataset_text_field / max_seq_length into SFTConfig)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    args=TrainingArguments(
        output_dir="./lora-output",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
        fp16=True,
    ),
)
trainer.train()

# Save adapter
model.save_pretrained("./my-lora-adapter")
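
To serve the adapter later, load it on top of the same base checkpoint with peft's PeftModel. A sketch (default greedy decoding, no sampling options):

from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(model_name)
model = PeftModel.from_pretrained(base, "./my-lora-adapter")

inputs = tokenizer("What is 2+2?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))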

Hybrid: RAG + Fine-Tuning

The best production systems often combine both:

# Fine-tuned model handles format/style
# RAG provides current knowledge

from openai import OpenAI

client = OpenAI()

def hybrid_query(question: str, vectorstore) -> str:
    # Step 1: Retrieve relevant context (RAG)
    docs = vectorstore.similarity_search(question, k=4)
    context = "\n\n".join([doc.page_content for doc in docs])

    # Step 2: Use fine-tuned model with retrieved context
    response = client.chat.completions.create(
        model="ft:gpt-4o-mini:your-org:your-model-id",  # fine-tuned model
        messages=[
            {
                "role": "system",
                "content": "You are Acme Corp's support agent. Use the provided context to answer questions. Always cite the source document."
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ],
        temperature=0,
    )

    return response.choices[0].message.content
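
From the caller's perspective it is an ordinary function; the retrieval step and the fine-tuned style are both hidden inside:

answer = hybrid_query("What is our refund policy?", vectorstore)
print(answer)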

Decision Framework

1. Does your knowledge change more than monthly?
   YES → RAG (update the vector DB, not the model)
   NO  → Fine-tuning might work

2. Do you need source citations?
   YES → RAG (can return source documents)
   NO  → Either works

3. Do you need consistent output format/style?
   YES → Fine-tuning (or system prompt + RAG)
   NO  → RAG is simpler

4. How many training examples do you have?
   < 50  → RAG (fine-tuning needs more data)
   50+   → Fine-tuning is viable
   500+  → Fine-tuning will work well

5. Is latency critical?
   YES → Fine-tuning (no retrieval step)
   NO  → Either works
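
The same framework condensed into a toy helper. A sketch only: the thresholds mirror the questions above, so adjust them to your situation:

def recommend(knowledge_changes_often: bool, needs_citations: bool,
              needs_fixed_style: bool, n_examples: int) -> str:
    needs_rag = knowledge_changes_often or needs_citations
    can_fine_tune = needs_fixed_style and n_examples >= 50
    if needs_rag and can_fine_tune:
        return "hybrid"
    if can_fine_tune:
        return "fine-tuning"
    return "rag"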

Cost Comparison

RAG costs:
  - Vector DB hosting: $50-500/month (Pinecone, Weaviate, Chroma Cloud)
  - Embedding API calls: ~$0.0001 per 1K tokens
  - LLM inference: standard API pricing
  - Total: mostly LLM inference cost

Fine-tuning costs:
  - Training: $0.008/1K tokens (GPT-4o-mini) = ~$8 for 1M tokens
  - Inference: 2x standard pricing for fine-tuned models
  - Total: higher per-query cost, but no vector DB

Rule of thumb:
  - Low query volume + large knowledge base → RAG
  - High query volume + stable knowledge → Fine-tuning
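
To make the rule of thumb concrete, here is a back-of-the-envelope comparison. All numbers are illustrative placeholders; plug in your own pricing and volumes:

queries_per_month = 100_000
tokens_per_query = 2_000              # long, context-stuffed RAG prompts

# RAG: vector DB hosting + standard inference on bigger prompts
rag_cost = 200 + queries_per_month * tokens_per_query / 1000 * 0.001
# assumes $200/month vector DB and a $0.001/1K blended LLM rate

# Fine-tuned: ~2x inference rate, but short prompts (no injected context)
ft_tokens_per_query = 500
ft_cost = queries_per_month * ft_tokens_per_query / 1000 * 0.002

print(f"RAG:        ${rag_cost:,.0f}/month")   # $400
print(f"Fine-tuned: ${ft_cost:,.0f}/month")    # $100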
