Introduction
When you need an LLM to know about your company’s products, internal docs, or recent events, you have two main options: RAG (Retrieval-Augmented Generation) or Fine-Tuning. They solve different problems. Choosing the wrong one wastes time and money.
Quick decision:
- Knowledge changes frequently → RAG
- Behavior/style needs to change → Fine-Tuning
- Both → Hybrid
What Each Approach Does
RAG:
User query → search knowledge base → inject relevant docs into prompt → LLM answers
The model's weights don't change. Knowledge lives in a vector database.
Fine-Tuning:
Training examples → update model weights → model "knows" the new information
The knowledge is baked into the model. No external retrieval needed.
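Stripped of frameworks, the RAG loop is only a few lines. A minimal sketch for intuition, with a toy word-overlap scorer standing in for real embedding similarity:

```python
def score(query: str, doc: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k highest-scoring documents."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Stuff retrieved docs into the prompt; the model's weights never change."""
    context = "\n\n".join(retrieve(query, docs, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = [
    "The refund policy is 14 days from purchase",
    "The office is closed on public holidays",
]
prompt = build_prompt("What is the refund policy", docs, k=1)
```

The production pipeline below swaps the toy scorer for embeddings plus a vector database, but the shape is identical: score, select top-k, inject into the prompt.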
RAG: Implementation
Basic RAG Pipeline
# pip install langchain langchain-community langchain-openai chromadb
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import DirectoryLoader
# 1. Load documents
loader = DirectoryLoader('./docs', glob="**/*.md")
documents = loader.load()
# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = splitter.split_documents(documents)
# 3. Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
# 4. Create retrieval chain
llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,
)
# 5. Query
result = qa_chain.invoke({"query": "What is our refund policy?"})
print(result["result"])
print("Sources:", [doc.metadata["source"] for doc in result["source_documents"]])
Advanced RAG: Hybrid Search + Reranking
# pip install rank_bm25 sentence-transformers  (extra deps for BM25 + reranking)
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
# Hybrid: vector search + BM25 keyword search
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 10
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
# Combine both retrievers
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]  # weight keyword vs semantic
)
# Rerank results with a cross-encoder
model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
compressor = CrossEncoderReranker(model=model, top_n=4)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=ensemble_retriever
)
# Use in chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=compression_retriever,
)
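For intuition, EnsembleRetriever merges the two ranked lists with weighted Reciprocal Rank Fusion: each list contributes weight / (c + rank) for every document it returns. A minimal sketch with made-up doc IDs:

```python
def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Weighted Reciprocal Rank Fusion: each ranked list contributes
    weight / (c + rank) for every document it contains."""
    scores: dict[str, float] = {}
    for ranked, w in zip(rankings, weights):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (c + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)

bm25_ranked = ["doc_a", "doc_b", "doc_c"]    # keyword ranking
vector_ranked = ["doc_b", "doc_d", "doc_a"]  # semantic ranking
fused = weighted_rrf([bm25_ranked, vector_ranked], weights=[0.4, 0.6])
# doc_b wins: it appears near the top of both lists
```

The constant c (60 by convention) damps the advantage of top ranks so a single list cannot dominate the fused ordering.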
RAG Evaluation
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
# Evaluate RAG quality
test_questions = [
    "What is our return policy?",
    "How do I reset my password?",
]
# Generate answers with context
results = []
for q in test_questions:
    result = qa_chain.invoke({"query": q})
    results.append({
        "question": q,
        "answer": result["result"],
        "contexts": [doc.page_content for doc in result["source_documents"]],
    })
# Score with RAGAS
from datasets import Dataset
dataset = Dataset.from_list(results)
# Note: context_precision also expects a ground-truth reference column in recent RAGAS versions
scores = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(scores)
# faithfulness: 0.92 (answer grounded in retrieved docs)
# answer_relevancy: 0.88 (answer relevant to question)
# context_precision: 0.85 (retrieved docs are relevant)
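RAGAS computes these scores with an LLM judge (extracting claims from the answer and verifying each against the retrieved context), not with string matching. Purely for intuition, a toy word-overlap proxy for faithfulness might look like:

```python
def toy_faithfulness(answer: str, contexts: list[str]) -> float:
    """Fraction of answer sentences whose every word appears in the
    retrieved contexts. A crude stand-in for RAGAS's LLM-judged
    claim verification; for intuition only."""
    context_words = set(" ".join(contexts).lower().split())
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return 0.0
    grounded = sum(1 for s in sentences
                   if set(s.lower().split()) <= context_words)
    return grounded / len(sentences)

grounded_score = toy_faithfulness(
    "refunds take 14 days", ["refunds take 14 days from purchase"]
)
ungrounded_score = toy_faithfulness(
    "we ship to mars", ["refunds take 14 days from purchase"]
)
```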
Fine-Tuning: Implementation
When Fine-Tuning Makes Sense
✅ You need consistent output format (always return JSON, always use specific structure)
✅ You need domain-specific tone/style (legal language, medical terminology)
✅ You have 100+ high-quality examples of the exact behavior you want
✅ The knowledge is stable (doesn't change weekly)
✅ Latency matters (no retrieval step)
❌ Your knowledge base changes frequently
❌ You need source citations
❌ You have < 50 examples
❌ You need to update knowledge without retraining
OpenAI Fine-Tuning
from openai import OpenAI
import json
client = OpenAI()
# 1. Prepare training data (JSONL format)
training_examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a customer support agent for Acme Corp. Always be concise and professional."},
            {"role": "user", "content": "How do I cancel my subscription?"},
            {"role": "assistant", "content": "To cancel your subscription: 1) Log into your account, 2) Go to Settings > Billing, 3) Click 'Cancel Subscription', 4) Confirm cancellation. Your access continues until the end of the billing period."}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are a customer support agent for Acme Corp. Always be concise and professional."},
            {"role": "user", "content": "What payment methods do you accept?"},
            {"role": "assistant", "content": "We accept Visa, Mastercard, American Express, and PayPal. All payments are processed securely via Stripe."}
        ]
    },
    # ... need at least 10 examples, ideally 50-100+
]
# Save to JSONL
with open("training_data.jsonl", "w") as f:
    for example in training_examples:
        f.write(json.dumps(example) + "\n")
# 2. Upload training file
with open("training_data.jsonl", "rb") as f:
    response = client.files.create(file=f, purpose="fine-tune")
training_file_id = response.id
print(f"Training file: {training_file_id}")
# 3. Start fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file_id,
    model="gpt-4o-mini-2024-07-18",  # cheaper base model
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 4,
        "learning_rate_multiplier": 1.8,
    }
)
print(f"Fine-tuning job: {job.id}")
# 4. Monitor progress
import time
while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(f"Status: {job.status}")
    if job.status in ["succeeded", "failed"]:
        break
    time.sleep(30)
# 5. Use the fine-tuned model
fine_tuned_model = job.fine_tuned_model
print(f"Fine-tuned model: {fine_tuned_model}")
response = client.chat.completions.create(
    model=fine_tuned_model,
    messages=[
        {"role": "system", "content": "You are a customer support agent for Acme Corp."},
        {"role": "user", "content": "How do I update my billing address?"}
    ]
)
print(response.choices[0].message.content)
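One malformed line fails the whole fine-tuning job, so a local sanity check before uploading is cheap insurance. A minimal validator sketch (the function name is ours; pass it the lines of training_data.jsonl):

```python
import json

def validate_examples(lines) -> list[str]:
    """Check each JSONL line parses and holds a chat example that ends
    with an assistant turn; returns error strings (empty list = OK)."""
    errors = []
    roles = {"system", "user", "assistant"}
    for i, line in enumerate(lines, start=1):
        try:
            example = json.loads(line)
        except json.JSONDecodeError:
            errors.append(f"line {i}: invalid JSON")
            continue
        messages = example.get("messages")
        if not isinstance(messages, list) or not messages:
            errors.append(f"line {i}: missing 'messages' list")
            continue
        if any(m.get("role") not in roles for m in messages):
            errors.append(f"line {i}: unknown role")
        if messages[-1].get("role") != "assistant":
            errors.append(f"line {i}: must end with an assistant turn")
    return errors

# Quick self-check on in-memory lines; in practice pass the open file
sample = [
    '{"messages": [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}]}',
    '{"messages": [{"role": "user", "content": "hi"}]}',
]
problems = validate_examples(sample)
```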
LoRA Fine-Tuning (Open Source Models)
# pip install transformers peft datasets trl bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
from datasets import Dataset
# Load base model in 4-bit (requires bitsandbytes and a CUDA GPU)
model_name = "meta-llama/Llama-3.2-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# LoRA config → train only small adapter layers
lora_config = LoraConfig(
    r=16,           # rank of adapter matrices
    lora_alpha=32,  # scaling factor
    target_modules=["q_proj", "v_proj"],  # which layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 3,215,093,760 || trainable%: 0.13%
# Only 0.13% of parameters are trained → much cheaper!
# Prepare dataset
def format_example(example):
    # Use the model's own chat template rather than hand-rolled tags
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": example["input"]},
        {"role": "assistant", "content": example["output"]},
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False)
dataset = Dataset.from_list([
    {"input": "What is 2+2?", "output": "4"},
    # ... your training examples
])
dataset = dataset.map(lambda x: {"text": format_example(x)})
# Train (newer trl versions move dataset_text_field / max_seq_length into SFTConfig)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    args=TrainingArguments(
        output_dir="./lora-output",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
        fp16=True,
    ),
)
trainer.train()
# Save adapter
model.save_pretrained("./my-lora-adapter")
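The tiny trainable fraction printed above falls straight out of the low-rank math: LoRA freezes the d×d weight W and learns only the update (alpha/r)·B·A, where A is r×d and B is d×r. A small numpy illustration (d shrunk from 4096 for speed; B initializes to zero, so the adapted model starts identical to the base):

```python
import numpy as np

d, r, alpha = 1024, 16, 32              # d shrunk for speed; illustrative only
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))         # frozen base weight (not trained)
A = rng.standard_normal((r, d)) * 0.01  # trainable, r x d
B = np.zeros((d, r))                    # trainable, d x r, zero-initialized

# Effective weight: base plus scaled low-rank update
W_effective = W + (alpha / r) * (B @ A)

full_params = W.size
lora_params = A.size + B.size           # 2*r*d instead of d*d
fraction = lora_params / full_params
print(f"trainable fraction: {fraction:.4%}")  # trainable fraction: 3.1250%
```

The fraction shrinks further as d grows, which is why the 3B model above trains only 0.13% of its weights.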
Hybrid: RAG + Fine-Tuning
The best production systems often combine both:
# Fine-tuned model handles format/style
# RAG provides current knowledge
from openai import OpenAI
client = OpenAI()
def hybrid_query(question: str, vectorstore) -> str:
    # Step 1: Retrieve relevant context (RAG)
    docs = vectorstore.similarity_search(question, k=4)
    context = "\n\n".join([doc.page_content for doc in docs])
    # Step 2: Use fine-tuned model with retrieved context
    response = client.chat.completions.create(
        model="ft:gpt-4o-mini:your-org:your-model-id",  # fine-tuned model
        messages=[
            {
                "role": "system",
                "content": "You are Acme Corp's support agent. Use the provided context to answer questions. Always cite the source document."
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ],
        temperature=0,
    )
    return response.choices[0].message.content
Decision Framework
1. Does your knowledge change more than monthly?
   YES → RAG (update the vector DB, not the model)
   NO → Fine-tuning might work
2. Do you need source citations?
   YES → RAG (can return source documents)
   NO → Either works
3. Do you need consistent output format/style?
   YES → Fine-tuning (or system prompt + RAG)
   NO → RAG is simpler
4. How many training examples do you have?
   < 50 → RAG (fine-tuning needs more data)
   50+ → Fine-tuning is viable
   500+ → Fine-tuning will work well
5. Is latency critical?
   YES → Fine-tuning (no retrieval step)
   NO → Either works
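The five questions above collapse into a toy routing function (the signature and thresholds are ours, mirroring the framework, not an official API):

```python
def choose_approach(knowledge_changes_monthly: bool,
                    needs_citations: bool,
                    needs_fixed_style: bool,
                    num_examples: int,
                    latency_critical: bool) -> str:
    """Toy mirror of the decision framework; names/thresholds are ours."""
    wants_rag = knowledge_changes_monthly or needs_citations or num_examples < 50
    wants_ft = (needs_fixed_style or latency_critical) and num_examples >= 50
    if wants_rag and wants_ft:
        return "hybrid"      # RAG for knowledge, fine-tuning for style
    if wants_ft:
        return "fine-tuning"
    return "rag"             # default: simplest to operate and update

# Fresh knowledge + citations + fixed style + enough examples → hybrid
choice = choose_approach(True, True, True, 200, False)
```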
Cost Comparison
RAG costs:
- Vector DB hosting: $50-500/month (Pinecone, Weaviate, Chroma Cloud)
- Embedding API calls: ~$0.0001 per 1K tokens
- LLM inference: standard API pricing
- Total: mostly LLM inference cost
Fine-tuning costs:
- Training: $0.008/1K tokens (GPT-4o-mini) = ~$8 for 1M tokens
- Inference: 2x standard pricing for fine-tuned models
- Total: higher per-query cost, but no vector DB
Rule of thumb:
- Low query volume + large knowledge base → RAG
- High query volume + stable knowledge → Fine-tuning
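A back-of-envelope model of why high volume favors fine-tuning: RAG pays a fixed vector-DB bill and inflates every prompt with retrieved context, while fine-tuning pays roughly double the token price on much shorter prompts. Every number below is an assumption, not a quote:

```python
def rag_cost(queries: int, price_per_1k: float = 0.001,
             prompt_toks: int = 2500, vector_db: float = 200.0) -> float:
    """Monthly $: fixed vector-DB bill + long, context-stuffed prompts."""
    return vector_db + queries * (prompt_toks / 1000) * price_per_1k

def ft_cost(queries: int, price_per_1k: float = 0.002,
            prompt_toks: int = 500, training: float = 8.0) -> float:
    """Monthly $: amortized training + 2x token price on short prompts."""
    return training + queries * (prompt_toks / 1000) * price_per_1k

rag_high = rag_cost(200_000)  # 200 + 200k * 2.5 * 0.001 = 700.0
ft_high = ft_cost(200_000)    # 8 + 200k * 0.5 * 0.002 = 208.0
```

Cost rarely decides the low-volume case on its own; there the knowledge-freshness and citation questions above matter more than the bill.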
Resources
- LangChain RAG Tutorial
- OpenAI Fine-Tuning Guide
- RAGAS Evaluation Framework
- Hugging Face PEFT (LoRA)
- Pinecone Vector Database