
RAG vs Fine-Tuning: When to Use Each and How to Implement Both

Published: August 13, 2025 | Updated: June 22, 2026 | Larry Qu | 6 min read

Introduction

When you need an LLM to know about your company’s products, internal docs, or recent events, you have two main options: RAG (Retrieval-Augmented Generation) or Fine-Tuning. They solve different problems. Choosing the wrong one wastes time and money.

Quick decision:

  • Knowledge changes frequently → RAG
  • Behavior/style needs to change → Fine-Tuning
  • Both → Hybrid

What Each Approach Does

RAG: User query → search knowledge base → inject relevant docs into prompt → LLM answers. The model’s weights don’t change. Knowledge lives in a vector database.

Fine-Tuning: Training examples → update model weights → model “knows” the new information. The knowledge is baked into the model. No external retrieval needed at inference time.
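To make the RAG flow concrete, here is a minimal, dependency-free sketch of the retrieve-then-inject step. It assumes a toy in-memory knowledge base and uses word-count cosine similarity as a stand-in for real embeddings; `KNOWLEDGE_BASE`, `retrieve`, and `build_prompt` are hypothetical names for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]


def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Inject the retrieved documents into the prompt sent to the LLM."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Toy knowledge base (assumption: real systems store embeddings in a vector DB)
KNOWLEDGE_BASE = [
    "Refund policy: refunds are available within 30 days of purchase.",
    "Support hours are Monday to Friday, 9am to 5pm.",
    "Shipping takes 3 to 5 business days.",
]

prompt = build_prompt("What is the refund policy?", KNOWLEDGE_BASE, k=1)
print(prompt)
```

Updating what the system knows means editing `KNOWLEDGE_BASE`; the model itself is untouched, which is the key contrast with fine-tuning.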

RAG: Implementation

Basic RAG Pipeline

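# pip install langchain langchain-community langchain-openai chromadb unstructured
# (DirectoryLoader's default file loader depends on the unstructured package)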
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import DirectoryLoader

# 1. Load documents
loader = DirectoryLoader('./docs', glob="**/*.md")
documents = loader.load()

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = splitter.split_documents(documents)

# 3. Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# 4. Create retrieval chain
llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,
)

# 5. Query
result = qa_chain.invoke({"query": "What is our refund policy?"})
print(result["result"])
print("Sources:", [doc.metadata["source"] for doc in result["source_documents"]])
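To show what chunk_size and chunk_overlap control, here is a simplified sliding-window chunker. It is only a sketch: RecursiveCharacterTextSplitter also tries to split on separators such as paragraphs and sentences, which this hypothetical `chunk_text` helper ignores.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. Larger overlap improves recall at the cost of more chunks to embed and store.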

Advanced RAG: Hybrid Search + Reranking

# pip install rank_bm25 sentence-transformers
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Hybrid: vector search + BM25 keyword search
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 10

vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Combine both retrievers
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]  # weight keyword vs semantic
)

# Rerank results with a cross-encoder
model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
compressor = CrossEncoderReranker(model=model, top_n=4)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=ensemble_retriever
)

# Use in chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=compression_retriever,
    return_source_documents=True,  # needed to read result["source_documents"] later
)
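The weights passed to EnsembleRetriever control how the two ranked lists are merged. LangChain's documentation describes the fusion as Reciprocal Rank Fusion; the sketch below is a simplified standalone version of weighted RRF, not the library's code. The constant c=60 and the document IDs are assumptions for illustration.

```python
def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists: each document scores weight / (rank + c) per list."""
    scores: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


bm25_ranking = ["doc_a", "doc_b", "doc_c"]    # keyword search results, best first
vector_ranking = ["doc_b", "doc_d", "doc_a"]  # semantic search results, best first

fused = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.4, 0.6])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that appear in both lists (doc_a, doc_b) rise to the top, and the 0.6 weight lets the vector ranking break the tie in favor of doc_b.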

RAG Evaluation

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics.collections import Faithfulness, ResponseRelevancy, ContextPrecision
from ragas import evaluate, EvaluationDataset
from ragas.llms import llm_factory
from openai import AsyncOpenAI

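# Named evaluator_llm so it does not overwrite the ChatOpenAI `llm` used by the QA chain
evaluator_llm = llm_factory("gpt-4o-mini", client=AsyncOpenAI())

# Placeholder questions; replace with a test set drawn from your own documents
test_questions = ["What is our refund policy?"]

# Build samples from your QA chain outputs
# (the chain must be created with return_source_documents=True)
samples = []
for q in test_questions:
    result = qa_chain.invoke({"query": q})
    samples.append(SingleTurnSample(
        user_input=q,
        response=result["result"],
        retrieved_contexts=[doc.page_content for doc in result["source_documents"]],
    ))

dataset = EvaluationDataset(samples=samples)
scores = evaluate(
    dataset,
    metrics=[
        Faithfulness(llm=evaluator_llm),
        ResponseRelevancy(llm=evaluator_llm),
        ContextPrecision(llm=evaluator_llm),
    ],
)
print(scores)
# Illustrative output (your scores will differ):
# faithfulness: 0.92  (answer grounded in retrieved docs)
# answer_relevancy: 0.88  (answer relevant to question)
# context_precision: 0.85  (retrieved docs are relevant)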
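For intuition about what a context-precision score rewards, here is a hand-rolled sketch of a rank-aware precision metric. It assumes you have already labeled each retrieved chunk as relevant or not; Ragas uses an LLM judge for that step, so treat this as an approximation of the idea rather than the library's implementation.

```python
def context_precision(relevance: list[bool]) -> float:
    """Average of precision@k over the positions k that hold a relevant chunk."""
    hits = 0
    weighted_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / k  # precision@k at this relevant position
    return weighted_sum / hits if hits else 0.0


# Same two relevant chunks, different positions in the retrieved list
early = context_precision([True, True, False])
late = context_precision([False, True, True])
print(round(early, 3), round(late, 3))  # 1.0 0.583
```

The score drops when relevant chunks are ranked below irrelevant ones, which is why adding a reranker tends to move this metric.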

Fine-Tuning: Implementation

When Fine-Tuning Makes Sense

Fine-tuning is the right choice when:

  • You need consistent output format (always return JSON, always use a specific structure)
  • You need domain-specific tone or style (legal language, medical terminology)
  • You have 100+ high-quality examples of the exact behavior you want
  • The knowledge is stable and doesn’t change weekly
  • Latency matters and you want to eliminate the retrieval step

Fine-tuning is not the right choice when your knowledge base changes frequently, you need source citations, you have fewer than 50 examples, or you need to update knowledge without retraining.

OpenAI Fine-Tuning

from openai import OpenAI
import json

client = OpenAI()

# 1. Prepare training data (JSONL format)
training_examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a customer support agent for Acme Corp. Always be concise and professional."},
            {"role": "user", "content": "How do I cancel my subscription?"},
            {"role": "assistant", "content": "To cancel your subscription: 1) Log into your account, 2) Go to Settings > Billing, 3) Click 'Cancel Subscription', 4) Confirm cancellation. Your access continues until the end of the billing period."}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are a customer support agent for Acme Corp. Always be concise and professional."},
            {"role": "user", "content": "What payment methods do you accept?"},
            {"role": "assistant", "content": "We accept Visa, Mastercard, American Express, and PayPal. All payments are processed securely via Stripe."}
        ]
    },
    # ... need at least 10 examples, ideally 50-100+
]

# Save to JSONL
with open("training_data.jsonl", "w") as f:
    for example in training_examples:
        f.write(json.dumps(example) + "\n")

# 2. Upload training file
with open("training_data.jsonl", "rb") as f:
    response = client.files.create(file=f, purpose="fine-tune")
training_file_id = response.id
print(f"Training file: {training_file_id}")

# 3. Start fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file_id,
    model="gpt-4o-mini-2024-07-18",  # cheaper base model
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 4,
        "learning_rate_multiplier": 1.8,
    }
)
print(f"Fine-tuning job: {job.id}")

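# 4. Monitor progress
import time
while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(f"Status: {job.status}")
    # "cancelled" is also a terminal state; without it a cancelled job would loop forever
    if job.status in ["succeeded", "failed", "cancelled"]:
        break
    time.sleep(30)

# 5. Use the fine-tuned model (fine_tuned_model is only set when the job succeeded)
if job.status != "succeeded":
    raise RuntimeError(f"Fine-tuning ended with status: {job.status}")
fine_tuned_model = job.fine_tuned_model
print(f"Fine-tuned model: {fine_tuned_model}")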

response = client.chat.completions.create(
    model=fine_tuned_model,
    messages=[
        # Use the same system prompt as in the training examples
        {"role": "system", "content": "You are a customer support agent for Acme Corp. Always be concise and professional."},
        {"role": "user", "content": "How do I update my billing address?"}
    ]
)
print(response.choices[0].message.content)
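Before uploading, it can help to sanity-check the JSONL file locally, since a malformed line fails the job only after upload. The sketch below checks only the simple system/user/assistant chat format used in this article; `validate_jsonl_lines` is a hypothetical helper, and the rules are a simplification of what the API accepts.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def validate_jsonl_lines(lines: list[str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means all lines passed."""
    errors = []
    for i, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            errors.append(f"line {i}: not valid JSON")
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            errors.append(f"line {i}: missing or empty 'messages' list")
            continue
        for message in messages:
            if (
                not isinstance(message, dict)
                or message.get("role") not in VALID_ROLES
                or not isinstance(message.get("content"), str)
            ):
                errors.append(f"line {i}: malformed message {message!r}")
                break
        else:
            # Every message is well-formed; the example must end with the target reply
            if messages[-1]["role"] != "assistant":
                errors.append(f"line {i}: last message must come from the assistant")
    return errors


good = json.dumps({"messages": [
    {"role": "user", "content": "How do I cancel?"},
    {"role": "assistant", "content": "Go to Settings > Billing."},
]})
no_reply = json.dumps({"messages": [{"role": "user", "content": "Hello?"}]})

problems = validate_jsonl_lines([good, "not json", no_reply])
for problem in problems:
    print(problem)
```

To check a real file, pass `open("training_data.jsonl").read().splitlines()` to the function.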

LoRA Fine-Tuning (Open Source Models)

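# pip install transformers peft datasets trl bitsandbytes accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import Dataset

# Load base model in 4-bit (requires a CUDA GPU and bitsandbytes)
model_name = "meta-llama/Llama-3.2-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
model = prepare_model_for_kbit_training(model)  # make the quantized model safe to train
tokenizer = AutoTokenizer.from_pretrained(model_name)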

# LoRA config — train only small adapter layers
lora_config = LoraConfig(
    r=16,              # rank of adapter matrices
    lora_alpha=32,     # scaling factor
    target_modules=["q_proj", "v_proj"],  # which layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts. Exact numbers depend on the model and
# quantization, but with this config well under 1% of parameters are trainable.

# Prepare dataset
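def format_example(example):
    # Use the model's own chat template instead of hand-written role tags
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": example["input"]},
        {"role": "assistant", "content": example["output"]},
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False)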

dataset = Dataset.from_list([
    {"input": "What is 2+2?", "output": "4"},
    # ... your training examples
])
dataset = dataset.map(lambda x: {"text": format_example(x)})

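# Train
# Note: argument names follow older TRL releases. Newer releases expect dataset_text_field
# and the sequence-length limit on SFTConfig, so check your installed version.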
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    args=TrainingArguments(
        output_dir="./lora-output",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
        fp16=True,
    ),
)
trainer.train()

# Save adapter
model.save_pretrained("./my-lora-adapter")
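The small trainable fraction follows from simple arithmetic: for each adapted weight matrix of shape d_in × d_out, LoRA adds two low-rank matrices with r × (d_in + d_out) parameters in total. The sketch below computes that count; the layer count and projection shapes in the example are hypothetical, so substitute the values from your model's config.

```python
def lora_trainable_params(rank: int, module_shapes: list[tuple[int, int]], num_layers: int) -> int:
    """Count LoRA parameters: rank * (d_in + d_out) per adapted module, per layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in module_shapes)
    return per_layer * num_layers


# Hypothetical model: 32 layers, q_proj and v_proj both 4096 x 4096
count = lora_trainable_params(
    rank=16,
    module_shapes=[(4096, 4096), (4096, 4096)],
    num_layers=32,
)
print(f"{count:,}")  # 8,388,608
```

Doubling r doubles the adapter size, and adding more target_modules grows it further, so the count is a quick way to compare configurations before training.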

Hybrid: RAG + Fine-Tuning

The best production systems often combine both:

# Fine-tuned model handles format/style
# RAG provides current knowledge

from openai import OpenAI

client = OpenAI()

def hybrid_query(question: str, vectorstore) -> str:
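    # Step 1: Retrieve relevant context (RAG), keeping each chunk's source so the model can cite it
    docs = vectorstore.similarity_search(question, k=4)
    context = "\n\n".join(
        f"[Source: {doc.metadata.get('source', 'unknown')}]\n{doc.page_content}" for doc in docs
    )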

    # Step 2: Use fine-tuned model with retrieved context
    response = client.chat.completions.create(
        model="ft:gpt-4o-mini:your-org:your-model-id",  # fine-tuned model
        messages=[
            {
                "role": "system",
                "content": "You are Acme Corp's support agent. Use the provided context to answer questions. Always cite the source document."
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ],
        temperature=0,
    )

    return response.choices[0].message.content

Decision Framework

Use these questions to pick the right approach:

  1. Does your knowledge change more often than monthly? Yes → RAG. No → fine-tuning might work.
  2. Do you need source citations? Yes → RAG. No → either works.
  3. Do you need consistent output format/style? Yes → fine-tuning (or system prompt + RAG). No → RAG is simpler.
  4. How many training examples do you have? Under 50 → RAG. 50+ → fine-tuning is viable. 500+ → fine-tuning will work well.
  5. Is latency critical? Yes → fine-tuning (no retrieval step). No → either works.
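The five questions can be encoded as a small helper. This is one possible reading of the framework, not a definitive rule set: the `Requirements` fields, the 50-example threshold, and the precedence between answers are assumptions drawn from the list above.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    knowledge_changes_often: bool   # more often than monthly
    needs_citations: bool
    needs_consistent_style: bool
    num_examples: int               # high-quality training examples available
    latency_critical: bool


def recommend(req: Requirements) -> str:
    """Return 'rag', 'fine-tuning', or 'hybrid' for a set of requirements."""
    needs_rag = req.knowledge_changes_often or req.needs_citations
    can_fine_tune = req.num_examples >= 50
    if needs_rag and req.needs_consistent_style and can_fine_tune:
        return "hybrid"
    if needs_rag:
        return "rag"
    if can_fine_tune and (req.needs_consistent_style or req.latency_critical):
        return "fine-tuning"
    return "rag"  # the simpler default when nothing argues for fine-tuning


support_bot = Requirements(
    knowledge_changes_often=True,
    needs_citations=True,
    needs_consistent_style=True,
    num_examples=300,
    latency_critical=False,
)
print(recommend(support_bot))  # hybrid
```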

Cost Comparison

| Approach | Cost drivers | Notes |
|---|---|---|
| RAG | Vector DB hosting ($50–500/month); embedding API calls (~$0.0001/1K tokens); LLM inference at standard pricing | Mostly LLM inference cost; DB cost is fixed |
| Fine-tuning | Training: $0.008/1K tokens (GPT-4o-mini) ≈ $8 for 1M tokens; inference: ~2× standard pricing | Higher per-query cost but no vector DB |

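Rule of thumb: low query volume with a large, changing knowledge base → RAG, since there is nothing to retrain when documents change. High query volume with stable knowledge → fine-tuning can pay off over time, because prompts no longer carry retrieved context even though each token costs more.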
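A rough monthly cost model makes the trade-off easier to see. The sketch below is a back-of-the-envelope estimate: it counts only input tokens, and every number in the example (token counts, per-token price, database fee, retraining budget) is a hypothetical placeholder rather than real pricing.

```python
def monthly_cost_rag(queries: int, question_tokens: int, context_tokens: int,
                     price_per_1k: float, vector_db_monthly: float) -> float:
    """Fixed vector DB fee plus inference on question + retrieved context."""
    tokens_per_query = question_tokens + context_tokens
    return vector_db_monthly + queries * tokens_per_query / 1000 * price_per_1k


def monthly_cost_fine_tuned(queries: int, question_tokens: int, price_per_1k: float,
                            retraining_monthly: float, price_multiplier: float = 2.0) -> float:
    """Amortized retraining plus inference at a higher per-token price, no context."""
    inference = queries * question_tokens / 1000 * price_per_1k * price_multiplier
    return retraining_monthly + inference


# Hypothetical inputs shared by both scenarios
PRICE = 0.001          # dollars per 1K input tokens
QUESTION_TOKENS = 200
CONTEXT_TOKENS = 2000  # retrieved chunks added to each RAG prompt

for queries in (1_000, 100_000):
    rag = monthly_cost_rag(queries, QUESTION_TOKENS, CONTEXT_TOKENS, PRICE, vector_db_monthly=100)
    ft = monthly_cost_fine_tuned(queries, QUESTION_TOKENS, PRICE, retraining_monthly=200)
    cheaper = "RAG" if rag < ft else "fine-tuning"
    print(f"{queries:>7} queries/month: RAG ${rag:.2f}, fine-tuning ${ft:.2f} -> {cheaper}")
```

Under these assumed numbers RAG is cheaper at 1,000 queries per month and fine-tuning is cheaper at 100,000, because the per-query cost of carrying retrieved context eventually outweighs the fixed retraining budget.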
