Introduction
Fine-tuning large language models has become essential for building specialized AI applications. In 2025, with models like Llama 3, Mistral, and Phi-3 becoming openly available, fine-tuning has never been more accessible. This guide covers everything from choosing the right approach to deploying your custom model in production.
Understanding Fine-Tuning
Why Fine-Tune?
Pre-training vs Fine-tuning

Pre-training:
- Learn language from a massive text corpus
- General knowledge, patterns, grammar
- 1T+ tokens, millions of dollars in compute

Fine-tuning:
- Adapt to specific tasks or domains
- Learn specialized knowledge
- 10K-100K tokens, $100-$10K in compute

Result: a specialized model outperforms general models on its target tasks.
When to Fine-Tune
# Fine-tuning decision matrix
scenarios:
  fine_tune:
    - name: "Domain-specific knowledge"
      example: "Legal documents, medical texts"
      reason: "Base model lacks specialized vocabulary"
    - name: "Specific output format"
      example: "JSON, code, structured responses"
      reason: "Need consistent structured output"
    - name: "Custom tone/style"
      example: "Brand voice, writing style"
      reason: "Consistent persona required"
    - name: "Task-specific behavior"
      example: "Classification, extraction"
      reason: "Better task performance than prompting"
  dont_fine_tune:
    - name: "General question answering"
      reason: "Base model sufficient"
    - name: "Quick prototyping"
      reason: "Use prompting first"
    - name: "Limited data"
      reason: "Few-shot prompting may work better"
Fine-Tuning Approaches
1. Full Fine-Tuning
# Full fine-tuning with PyTorch
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

def full_fine_tune(model_name, train_dataset, eval_dataset):
    """Full parameter fine-tuning"""
    # Load model and tokenizer
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Configure training
    training_args = TrainingArguments(
        output_dir="./model_output",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-5,
        bf16=True,  # match the bfloat16 model weights
        save_strategy="epoch",
        evaluation_strategy="epoch",
        logging_steps=10,
    )

    # Create trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=lambda data: {
            'input_ids': torch.stack([f['input_ids'] for f in data]),
            'attention_mask': torch.stack([f['attention_mask'] for f in data]),
            'labels': torch.stack([f['labels'] for f in data])
        }
    )

    # Train
    trainer.train()

    # Save model and tokenizer together
    model.save_pretrained("./final_model")
    tokenizer.save_pretrained("./final_model")
    return model
2. LoRA (Low-Rank Adaptation)
# LoRA fine-tuning with PEFT
from peft import LoraConfig, get_peft_model, TaskType

def lora_fine_tune(model_name, train_dataset):
    """Parameter-efficient fine-tuning with LoRA"""
    # Load base model
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )

    # Configure LoRA
    lora_config = LoraConfig(
        r=16,              # Rank of adaptation matrices
        lora_alpha=32,     # Scaling factor
        target_modules=[   # Which layers to adapt
            "q_proj", "k_proj",
            "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj"
        ],
        lora_dropout=0.05,
        bias="none",
        task_type=TaskType.CAUSAL_LM
    )

    # Apply LoRA
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    # Typically well under 1% of parameters are trainable
    # (the exact fraction depends on rank and target modules)

    # Train with LoRA
    trainer = Trainer(
        model=model,
        train_dataset=train_dataset,
        # ... other args
    )
    trainer.train()

    # Merge LoRA weights for inference
    model = model.merge_and_unload()
    return model
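The trainable-parameter fraction that `print_trainable_parameters()` reports can be sanity-checked by hand: each adapted weight matrix of shape `d_out × d_in` gains two low-rank factors totaling `r * (d_in + d_out)` parameters. A minimal sketch, assuming illustrative Llama-2-7B-like dimensions (hidden size 4096, MLP size 11008, 32 layers) and the `r=16`, all-projections config above:

```python
# Rough LoRA parameter count for a Llama-7B-like model (assumed dimensions)
hidden, mlp, layers, r = 4096, 11008, 32, 16

# (d_in, d_out) for each adapted projection in one decoder layer
targets = [
    (hidden, hidden),  # q_proj
    (hidden, hidden),  # k_proj
    (hidden, hidden),  # v_proj
    (hidden, hidden),  # o_proj
    (hidden, mlp),     # gate_proj
    (hidden, mlp),     # up_proj
    (mlp, hidden),     # down_proj
]

# Each adapted matrix W gains A (r x d_in) and B (d_out x r)
lora_params = layers * sum(r * (d_in + d_out) for d_in, d_out in targets)
base_params = layers * sum(d_in * d_out for d_in, d_out in targets)

print(f"LoRA params: {lora_params / 1e6:.1f}M "
      f"({100 * lora_params / base_params:.2f}% of adapted base weights)")
# → LoRA params: 40.0M (0.62% of adapted base weights)
```

So with these assumptions roughly 40M of ~6.5B parameters train, which is why LoRA checkpoints are tens of megabytes instead of tens of gigabytes.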
3. QLoRA (Quantized LoRA)
# QLoRA with 4-bit quantization
from transformers import BitsAndBytesConfig  # lives in transformers, not bitsandbytes
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

def qlora_fine_tune(model_name, train_dataset):
    """Fine-tune with 4-bit quantized base model"""
    # Quantization config
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True
    )

    # Load quantized model
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto"
    )

    # Prepare for training (casts norms, enables input gradients)
    model = prepare_model_for_kbit_training(model)

    # LoRA config
    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        task_type=TaskType.CAUSAL_LM
    )
    model = get_peft_model(model, lora_config)

    # Train (uses much less GPU memory)
    trainer = Trainer(
        model=model,
        train_dataset=train_dataset,
        # ... args optimized for 4-bit
    )
    trainer.train()
    return model
Dataset Preparation
Dataset Formats
# Training data formats
training_formats = {
    "completion_format": {
        "description": "Simple text completion",
        "example": {
            "text": "The capital of France is Paris, which is known for the Eiffel Tower."
        }
    },
    "instruction_format": {
        "description": "Instruction-response pairs",
        "example": {
            "instruction": "Summarize this article:",
            "input": "Long article text...",
            "output": "Concise summary..."
        }
    },
    "chat_format": {
        "description": "Multi-turn conversation",
        "example": {
            "messages": [
                {"role": "user", "content": "Hello!"},
                {"role": "assistant", "content": "Hi! How can I help?"},
                {"role": "user", "content": "What's AI?"},
                {"role": "assistant", "content": "AI is..."}
            ]
        }
    }
}
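Whichever format you store, each example ultimately has to be flattened into a single training string before tokenization. A minimal sketch handling all three formats above; the `### Instruction` and `<|role|>` template strings are illustrative conventions, not a standard:

```python
def to_training_text(example: dict) -> str:
    """Flatten one example (completion, chat, or instruction format) into text."""
    if "text" in example:
        # Completion format: already a single string
        return example["text"]
    if "messages" in example:
        # Chat format: tag each turn with its role
        return "\n".join(
            f"<|{m['role']}|>\n{m['content']}" for m in example["messages"]
        )
    # Instruction format: optional input section between instruction and response
    parts = [f"### Instruction\n{example['instruction']}"]
    if example.get("input"):
        parts.append(f"### Input\n{example['input']}")
    parts.append(f"### Response\n{example['output']}")
    return "\n\n".join(parts)

print(to_training_text({"instruction": "Summarize:", "output": "Done."}))
```

In practice, chat models ship a tokenizer-level chat template (e.g. `tokenizer.apply_chat_template` in transformers) that should be preferred over a hand-rolled one, so the fine-tuning format matches what the base model saw.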
Data Collection Strategies
# Data collection and curation
class DatasetCurator:
    def __init__(self):
        self.examples = []

    def add_synthetic_examples(self, base_model, num_examples=1000):
        """Generate synthetic training data"""
        # Assumes a simple text-in/text-out generate helper on base_model
        prompts = [
            "Generate a legal contract clause for:",
            "Write a medical summary for:",
            "Create a technical support response for:",
        ]
        for prompt in prompts:
            for _ in range(num_examples):
                # Generate example
                example = base_model.generate(prompt)
                self.examples.append({
                    "instruction": prompt,
                    "output": example,
                    "source": "synthetic"
                })

    def add_human_examples(self, examples):
        """Add human-labeled examples"""
        for ex in examples:
            self.examples.append({
                "instruction": ex["prompt"],
                "output": ex["response"],
                "source": "human",
                "quality_score": ex.get("rating", 5)
            })

    def filter_by_quality(self, min_quality=4):
        """Filter low-quality examples"""
        self.examples = [
            ex for ex in self.examples
            if ex.get("quality_score", 5) >= min_quality
        ]

    def deduplicate(self):
        """Remove duplicate examples"""
        seen = set()
        unique = []
        for ex in self.examples:
            key = (ex["instruction"], ex["output"])
            if key not in seen:
                seen.add(key)
                unique.append(ex)
        self.examples = unique

    def export(self, format="instruction"):
        """Export for training"""
        if format == "instruction":
            return [
                f"### Instruction\n{ex['instruction']}\n\n### Response\n{ex['output']}"
                for ex in self.examples
            ]
        raise ValueError(f"Unsupported format: {format}")
Dataset Split
# Recommended dataset splits
dataset_split:
  training: 80-90%
  validation: 5-10%
  test: 5-10%

rules:
  - "Keep validation/test representative of production use cases"
  - "Ensure no data leakage between splits"
  - "Balance classes for classification tasks"
  - "Minimum 100-500 examples for meaningful fine-tuning"
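The "no data leakage" rule is easiest to enforce with a deterministic split keyed on example content: re-exporting or re-shuffling the dataset can never move an example between buckets. A minimal sketch using a stdlib hash; the 90/5/5 ratio is one choice from the range above:

```python
import hashlib

def assign_split(example_key: str, train: float = 0.90,
                 validation: float = 0.05) -> str:
    """Deterministically bucket an example by hashing its content."""
    digest = hashlib.sha256(example_key.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    if bucket < train:
        return "training"
    if bucket < train + validation:
        return "validation"
    return "test"

# Same key always lands in the same split, across runs and machines
splits = [assign_split(f"example-{i}") for i in range(10_000)]
print({name: splits.count(name) for name in ("training", "validation", "test")})
```

Hashing the instruction text (rather than an index) also keeps duplicated or near-duplicated prompts from straddling the train/test boundary.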
Training Infrastructure
Hardware Requirements
# GPU memory requirements (approximate)
model_sizes:
  7B_params:
    full_ft: "80GB+ (8x A100)"
    lora: "24GB (1x A100)"
    qlora: "10GB (1x A100)"
  13B_params:
    full_ft: "160GB+ (8x A100)"
    lora: "40GB (2x A100)"
    qlora: "16GB (1x A100)"
  70B_params:
    full_ft: "640GB+ (8x H100)"
    lora: "160GB (8x A100)"
    qlora: "48GB (2x A100)"
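The full fine-tuning numbers above are dominated by optimizer state, not weights: with Adam, each trainable parameter costs a bf16 weight, a bf16 gradient, and two fp32 optimizer moments. A back-of-envelope sketch (activations and framework overhead excluded, so real usage is higher):

```python
def full_ft_memory_gb(params_billion: float) -> float:
    """Rough GPU memory for full fine-tuning with Adam, excluding activations."""
    params = params_billion * 1e9
    weights = params * 2       # bf16 weights: 2 bytes each
    grads = params * 2         # bf16 gradients: 2 bytes each
    adam_states = params * 8   # two fp32 moments: 4 + 4 bytes each
    return (weights + grads + adam_states) / 1024**3

for size in (7, 13, 70):
    print(f"{size}B: ~{full_ft_memory_gb(size):.0f} GB before activations")
```

This gives roughly 78 GB for a 7B model, matching the "80GB+" row; LoRA and QLoRA shrink the gradient and optimizer terms to the adapter's tiny parameter count, which is where their memory savings come from.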
Cloud Training Options
# Training on cloud GPUs
cloud_providers = {
    "AWS": {
        "service": "SageMaker",
        "gpus": ["p4d.24xlarge (8x A100)"],
        "spot": True,
        "estimated_cost": "$30-40/hour"
    },
    "Lambda Labs": {
        "gpus": ["A100 80GB", "H100"],
        "spot": True,
        "estimated_cost": "$0.50-1.00/GPU/hour"
    },
    "Paperspace": {
        "gpus": ["A100", "H100"],
        "gradient": True,
        "estimated_cost": "$0.70-1.20/GPU/hour"
    },
    "RunPod": {
        "gpus": ["A100", "4090"],
        "spot": True,
        "estimated_cost": "$0.40-0.80/GPU/hour"
    }
}
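Given an hourly rate from the table above, a rough cost estimate only needs your dataset size and training throughput. A sketch with illustrative numbers; the 3,000 tokens/sec/GPU figure is an assumption for a small QLoRA run, not a benchmark:

```python
def training_cost(num_tokens: float, tokens_per_sec_per_gpu: float,
                  num_gpus: int, usd_per_gpu_hour: float,
                  epochs: int = 3) -> float:
    """Estimated dollar cost of a fine-tuning run."""
    # Wall-clock seconds: total tokens processed / aggregate throughput
    seconds = epochs * num_tokens / (tokens_per_sec_per_gpu * num_gpus)
    # Billed GPU-hours scale back up with the number of GPUs
    return seconds / 3600 * num_gpus * usd_per_gpu_hour

# e.g. 50M tokens, 3 epochs, one $0.80/hr GPU at an assumed 3,000 tok/s
print(f"~${training_cost(50e6, 3000, 1, 0.80):.0f}")
```

Note that (ignoring scaling inefficiency) adding GPUs shortens the run but leaves the total bill unchanged, so spot-priced single-GPU QLoRA is often the cheapest path for small datasets.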
Training Process
Training Configuration
# Optimal training config for LoRA
training_config = {
    "epochs": 3,
    "batch_size": 8,
    "gradient_accumulation": 4,
    "learning_rate": 2e-4,
    "warmup_ratio": 0.1,
    "lr_scheduler": "cosine",
    "weight_decay": 0.01,
    "max_grad_norm": 1.0,
    # LoRA specific
    "lora_rank": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    # Data
    "train_on_eos": False,
    "append_eos_token": True,
    # Optimization
    "use_flash_attention": True,
    "gradient_checkpointing": True,
}
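The `warmup_ratio` and cosine scheduler combine into a curve that ramps linearly up to the peak learning rate, then decays smoothly to zero. A minimal sketch of that curve, mirroring the 2e-4 peak and 0.1 warmup ratio above:

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float = 2e-4,
          warmup_ratio: float = 0.1) -> float:
    """Linear warmup followed by cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 to peak_lr
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr to 0 over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

total = 1000
print(lr_at(0, total), lr_at(100, total), lr_at(1000, total))
```

The warmup protects the randomly initialized adapter weights from large early updates; the decay lets the model settle into a minimum near the end of training.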
Training Loop
# Custom training loop with logging
def train(
    model,
    train_loader,
    eval_loader,
    optimizer,
    scheduler,
    device,
    num_epochs,
    gradient_accumulation=4,
):
    """Training loop with monitoring"""
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0
        for step, batch in enumerate(train_loader):
            batch = {k: v.to(device) for k, v in batch.items()}

            # Forward pass (loss scaled down for gradient accumulation)
            outputs = model(**batch)
            loss = outputs.loss / gradient_accumulation

            # Backward pass
            loss.backward()

            if (step + 1) % gradient_accumulation == 0:
                # Gradient clipping
                torch.nn.utils.clip_grad_norm_(
                    model.parameters(),
                    max_norm=1.0
                )
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()

            # Undo the accumulation scaling for logging
            total_loss += loss.item() * gradient_accumulation

            # Log metrics
            if step % 100 == 0:
                print(f"Step {step}: Loss = {loss.item() * gradient_accumulation:.4f}")

        # Evaluate (assumes an `evaluate` helper defined elsewhere)
        eval_loss = evaluate(model, eval_loader)
        print(f"Epoch {epoch}: Train Loss = {total_loss/len(train_loader):.4f}, Eval Loss = {eval_loss:.4f}")
Evaluation
Evaluation Metrics
# LLM evaluation metrics
metrics:
  automatic:
    - name: "Perplexity"
      description: "Language modeling quality"
    - name: "BLEU"
      description: "N-gram overlap with reference"
    - name: "ROUGE"
      description: "Recall-oriented n-gram overlap with reference"
    - name: "BERTScore"
      description: "Semantic similarity"
  human:
    - name: "Helpfulness"
      description: "Does the response help the user?"
    - name: "Accuracy"
      description: "Is the information correct?"
    - name: "Coherence"
      description: "Is the response well-structured?"
    - name: "Safety"
      description: "Any harmful content?"
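Of the automatic metrics, perplexity is the one you can compute straight from the training objective: it is the exponential of the mean negative log-likelihood per token. A minimal sketch on hand-made token log-probabilities:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model assigning probability 0.25 to every token has perplexity 4:
# it is as uncertain as a uniform choice among 4 options
print(perplexity([math.log(0.25)] * 8))
```

A drop in validation perplexity after fine-tuning means the model assigns higher probability to in-domain text, but it says nothing about helpfulness or safety, which is why the human metrics above still matter.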
Benchmarking
# Evaluate on standard benchmarks with the lm-evaluation-harness
import lm_eval
from lm_eval.models.huggingface import HFLM

def evaluate_model(model, tokenizer):
    """Evaluate on multiple benchmarks"""
    # Wrap the in-memory model for the harness
    lm = HFLM(pretrained=model, tokenizer=tokenizer)

    # Run evaluations (task names follow the harness task registry)
    results = lm_eval.simple_evaluate(
        model=lm,
        tasks=["mmlu", "arc_challenge", "humaneval", "truthfulqa_mc2"]
    )

    # Print results
    for task, scores in results["results"].items():
        print(f"{task}: {scores}")
    return results
Merging and Deployment
Merge LoRA Weights
# Merge LoRA adapter with base model
from peft import PeftModel

def merge_and_export(base_model_path, adapter_path, output_path):
    """Merge adapter weights for deployment"""
    # Load base model
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_path,
        torch_dtype=torch.float16,
        device_map="cpu"
    )

    # Load and merge adapter
    model = PeftModel.from_pretrained(
        base_model,
        adapter_path
    )
    model = model.merge_and_unload()

    # Save merged model
    model.save_pretrained(output_path)
    print(f"Merged model saved to {output_path}")
Quantization for Deployment
# Convert to quantized format for inference
from transformers import BitsAndBytesConfig

def quantize_for_inference(model_path, output_path):
    """Quantize model to 4-bit for efficient inference"""
    # Load model with 4-bit weights
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4"
        ),
        device_map="auto"
    )

    # Save quantized
    model.save_pretrained(output_path)
Common Pitfalls
1. Overfitting
Wrong:
# Too many epochs, too little data
epochs = 50
learning_rate = 1e-3
# Result: Model memorizes training data
Correct:
# Proper regularization
epochs = 3
learning_rate = 2e-4
weight_decay = 0.01
# Use validation set to monitor overfitting
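"Use a validation set to monitor overfitting" concretely means stopping once validation loss stops improving. A minimal early-stopping sketch; the patience value of 2 is an illustrative choice:

```python
class EarlyStopping:
    """Stop training when validation loss hasn't improved for `patience` evals."""

    def __init__(self, patience: int = 2):
        self.patience = patience
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best:
            # New best: reset the counter
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience

stopper = EarlyStopping(patience=2)
for loss in [1.9, 1.5, 1.4, 1.45, 1.5, 1.6]:
    if stopper.should_stop(loss):
        print(f"stopping at val loss {loss}")  # → stopping at val loss 1.5
        break
```

With the Trainer API the equivalent is transformers' `EarlyStoppingCallback`; the point is the same either way: the *validation* curve, not the training curve, decides when to stop.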
2. Catastrophic Forgetting
Wrong:
# Only train on new data
# Result: Model forgets general capabilities
Correct:
# Use instruction tuning + keep general examples
# Or use LoRA which preserves base model knowledge
# Or blend with original model outputs
3. Data Quality Issues
Wrong:
# Use any available data
# Don't filter noise
# Result: Poor model quality
Correct:
# Curate high-quality data
# Remove duplicates
# Balance examples
# Include diverse cases
Key Takeaways
- Start with LoRA/QLoRA - 99%+ parameter efficiency, lower cost
- Quality over quantity - 1K high-quality examples often beat 100K mediocre ones
- Use appropriate data format - Instruction format for chat models
- Monitor validation loss - Prevent overfitting
- Merge for deployment - Combine adapter with base model
- Quantize for inference - 4-bit reduces memory 4x with minimal quality loss
- Evaluate properly - Use both automatic and human metrics