Introduction
Fine-tuning large language models (LLMs) has become essential for creating domain-specific AI applications. However, full fine-tuning of billion-parameter models requires enormous computational resources. This guide covers the three main approaches: LoRA (Low-Rank Adaptation), QLoRA (Quantized LoRA), and RLHF (Reinforcement Learning from Human Feedback).
These techniques enable you to adapt models like LLaMA, Mistral, and Falcon efficiently while preserving their core capabilities.
Understanding LoRA
LoRA adds small trainable matrices to each transformer layer, dramatically reducing the number of parameters that need updating during training.
How LoRA Works
Instead of updating all model weights, LoRA introduces low-rank decomposition:
Original: W ∈ R^(d×k)
LoRA: W + ΔW = W + BA
Where: B ∈ R^(d×r), A ∈ R^(r×k), r << min(d, k)
The rank r is typically 8-64; with r=16 on a 7B model, the trainable LoRA parameters amount to well under 1% of the total.
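To make the savings concrete, here is a back-of-the-envelope count for a single 4096×4096 projection matrix with r=16 (the dimensions are illustrative; they match the attention projections of a Llama-2-7B-sized model):

```python
# Trainable-parameter count for LoRA on one d x k weight matrix
d, k, r = 4096, 4096, 16

full = d * k                  # full fine-tuning: every weight is trainable
lora = (d * r) + (r * k)      # LoRA: only B (d x r) and A (r x k) are trained

print(full)                   # 16777216 weights in the frozen matrix
print(lora)                   # 131072 trainable LoRA parameters
print(f"{lora / full:.2%}")   # 0.78% of the original matrix
```

Summed over every targeted projection in every layer, this is how the trainable fraction ends up well below 1% of the full model.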
LoRA Implementation with PEFT
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
# Load base model
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
model_name,
load_in_8bit=True,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Configure LoRA
lora_config = LoraConfig(
r=16, # Rank
lora_alpha=32, # Scaling factor
target_modules=[
"q_proj", "k_proj", "v_proj",
"o_proj", "gate_proj", "up_proj", "down_proj"
],
lora_dropout=0.05,
bias="none",
task_type=TaskType.CAUSAL_LM
)
# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Example output: trainable params: ~40M || all params: ~6.7B || trainable%: ~0.6
# Training data (format and tokenize into a Dataset before passing to Trainer)
train_data = [
{"instruction": "Summarize this article", "input": "Long article text...", "output": "Brief summary..."}
]
# Fine-tune
training_args = TrainingArguments(
output_dir="./lora-output",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
logging_steps=10,
save_strategy="epoch",
fp16=True
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_data,
tokenizer=tokenizer
)
trainer.train()
LoRA Weights Merging
from peft import PeftModel
from transformers import AutoModelForCausalLM
# Load base model in full precision (adapters cannot be merged into a quantized model)
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
# Load LoRA adapters
model = PeftModel.from_pretrained(base_model, "./lora-output")
# Merge and save (for inference without LoRA overhead)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-llama-2-7b-custom")
Understanding QLoRA
QLoRA combines LoRA with model quantization, enabling fine-tuning of 65B+ parameter models on a single GPU.
QLoRA Key Techniques
- 4-bit Quantization: Store weights in 4-bit format using NF4
- LoRA on Quantized Weights: Apply LoRA to quantized model
- Gradient Checkpointing: Trade compute for memory
- Frozen Gradients: Don’t compute gradients for quantized weights
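A rough weights-only memory estimate makes the 4-bit savings concrete (the helper below is a back-of-the-envelope sketch; gradients, optimizer state, and activations add more on top):

```python
def weight_memory_gb(n_params, bits_per_param):
    # Memory for the weights alone, in gigabytes (1e9 bytes)
    return n_params * bits_per_param / 8 / 1e9

n = 70e9  # a 70B-parameter model
print(weight_memory_gb(n, 16))  # 140.0 GB in fp16
print(weight_memory_gb(n, 8))   # 70.0 GB in int8
print(weight_memory_gb(n, 4))   # 35.0 GB in 4-bit NF4
```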
QLoRA Implementation
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
# Quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True
)
# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-70b-hf",
quantization_config=bnb_config,
device_map="auto"
)
# Prepare for training
model = prepare_model_for_kbit_training(model)
# LoRA config (a higher rank such as r=64 is common for quantized models)
lora_config = LoraConfig(
r=64,
lora_alpha=128,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
lora_dropout=0.1,
bias="none",
task_type=TaskType.CAUSAL_LM
)
model = get_peft_model(model, lora_config)
# Memory comparison
print(f"Model memory: {model.get_memory_footprint() / 1e9:.2f} GB")
# Typical: a 70B model's weights fit in ~35 GB with 4-bit QLoRA
QLoRA with transformers
# Using the latest transformers with QLoRA
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
# 4-bit config with bfloat16 compute for better numerical stability
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True
)
# Gradient checkpointing for memory savings
training_args = TrainingArguments(
output_dir="./qlora-70b",
per_device_train_batch_size=1,
gradient_accumulation_steps=16,
learning_rate=1e-4,
num_train_epochs=3,
max_steps=1000,  # overrides num_train_epochs when both are set
logging_steps=10,
save_steps=500,
save_total_limit=2,
bf16=True,
tf32=True,
gradient_checkpointing=True,
gradient_checkpointing_kwargs={"use_reentrant": False},
optim="paged_adamw_32bit",
lr_scheduler_type="cosine",
warmup_ratio=0.03,
weight_decay=0.001,
max_grad_norm=0.3
)
Understanding RLHF
RLHF aligns language models with human preferences through reinforcement learning. It’s the technique behind ChatGPT and Claude.
RLHF Three-Stage Process
- Supervised Fine-Tuning (SFT): Fine-tune on human-written responses
- Reward Modeling: Train a reward model to score responses
- Reinforcement Learning (PPO): Optimize the policy using the reward model
Stage 1: Supervised Fine-Tuning
# Prepare SFT data
sft_data = [
{
"prompt": "Explain quantum computing to a 5-year-old",
"completion": """Quantum computing is like having a magical computer that can be in many states at once.
Imagine you have a coin. A regular computer says it's either heads OR tails.
But a quantum computer can be in a special state where it's both heads AND tails at the same time - until you look at it!
This magic trick lets quantum computers solve certain problems super fast..."""
},
# ... more examples
]
# Format for training
def format_prompt(example):
return f"""### Instruction:
{example['prompt']}
### Response:
{example['completion']}"""
# Tokenize
from datasets import Dataset
train_dataset = Dataset.from_list(sft_data)
train_dataset = train_dataset.map(
lambda x: {"text": format_prompt(x)},
remove_columns=train_dataset.column_names
)
# SFT Training
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling, Trainer
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
load_in_8bit=True,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
trainer = Trainer(
model=model,
train_dataset=train_dataset,
tokenizer=tokenizer,
data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
trainer.train()
Stage 2: Reward Modeling
# Reward model training
from transformers import AutoModelForSequenceClassification
# Load pretrained model as reward model
reward_model = AutoModelForSequenceClassification.from_pretrained(
"meta-llama/Llama-2-7b-hf",
num_labels=1, # Single score
load_in_8bit=True
)
# Reward training data (chosen > rejected)
reward_data = [
{
"prompt": "Write a haiku about winter",
"chosen": "Snow blankets the earth\nSilent white crystals fall\nWinter's peaceful sleep",
"rejected": "winter is cold and i dont like it because its cold and snow is wet and"
}
]
# Contrastive loss for reward model
import torch

def compute_reward_loss(chosen_scores, rejected_scores):
# Chosen should have higher score than rejected
loss = -torch.log(torch.sigmoid(chosen_scores - rejected_scores)).mean()
return loss
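This pairwise objective can be sanity-checked with plain floats (the scores below are illustrative; the math mirrors compute_reward_loss above):

```python
import math

def pairwise_reward_loss(chosen_score, rejected_score):
    # -log(sigmoid(chosen - rejected)): small when chosen outscores rejected
    margin = chosen_score - rejected_score
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(round(pairwise_reward_loss(2.0, -1.0), 3))  # 0.049 -- chosen correctly ranked higher
print(round(pairwise_reward_loss(-1.0, 2.0), 3))  # 3.049 -- inverted ranking is heavily penalized
```

The asymmetry is the whole point: the reward model is pushed to assign a higher scalar score to the preferred response in every pair.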
Stage 3: PPO Training
from trl import PPOTrainer, PPOConfig
from trl.core import LengthSampler
# PPO Configuration
ppo_config = PPOConfig(
model_name="meta-llama/Llama-2-7b-hf",
learning_rate=1.4e-5,
batch_size=512,
mini_batch_size=1,
gradient_accumulation_steps=16,
ppo_epochs=4,
target_kl=0.1,
init_kl_coef=0.2
)
# Initialize PPO Trainer
ppo_trainer = PPOTrainer(
config=ppo_config,
model=model,
ref_model=ref_model,
tokenizer=tokenizer,
dataset=dataset,
data_collator=data_collator
)
# Training loop
for epoch in range(num_epochs):
for batch in ppo_trainer.dataloader:
# Generate responses
query_tensors = batch["input_ids"]
response_tensors = ppo_trainer.generate(
query_tensors,
return_prompt=False,
length_sampler=LengthSampler(4, 32)
)
# Get rewards: tokenize responses, then score with the reward model
texts = tokenizer.batch_decode(response_tensors)
inputs = tokenizer(texts, return_tensors="pt", padding=True).to(reward_model.device)
rewards = [score for score in reward_model(**inputs).logits.squeeze(-1)]
# PPO step
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
Comparison: LoRA vs QLoRA vs RLHF
| Aspect | LoRA | QLoRA | RLHF |
|---|---|---|---|
| Parameters Updated | 1-3% | < 1% | 100% or LoRA |
| GPU Memory (weights) | ~70GB for 70B (8-bit) | ~35GB for 70B (4-bit) | ~80GB for 7B |
| Training Time | Hours | Hours | Days |
| Alignment Quality | Good | Good | Best |
| Use Case | Domain adaptation | Resource-constrained | Chatbot alignment |
| Complexity | Low | Medium | High |
When to Use Each Technique
Use LoRA When:
- You have access to GPUs with 40-80GB VRAM
- You need to fine-tune for specific domains
- You want a balance of quality and efficiency
# Good: LoRA for domain adaptation
lora_config = LoraConfig(r=16, target_modules=["q_proj", "v_proj"])
# Works well for: Legal, Medical, Technical domains
Use QLoRA When:
- You have limited GPU resources
- You want to fine-tune large models (70B+)
- You’re experimenting with different base models
# Good: QLoRA for resource-constrained environments
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
# 70B model fits on single A100 (40GB)
Use RLHF When:
- Building conversational AI
- Need human-like responses
- Have training data with preferences
- Resources for multi-stage training
# Good: RLHF for chatbot training
# Stage 1: SFT on instruction data
# Stage 2: Reward model on preferences
# Stage 3: PPO optimization
Bad Practices to Avoid
Bad Practice 1: Using Too High Rank
# Bad: Rank too high defeats the purpose
lora_config = LoraConfig(r=128) # Too many parameters
# Should be: r=8-32 for most use cases
Bad Practice 2: Wrong Target Modules
# Bad: Missing key modules for causal LMs
target_modules=["q_proj"] # Incomplete
# Good: Include all key projection layers
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]
Bad Practice 3: No Data Quality Check
# Bad: Training on noisy/incorrect data
train_data = load_any_data() # No filtering
# Good: Filter and validate training data
train_data = filter_by_quality(train_data, min_score=4.0)
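filter_by_quality above is a hypothetical helper, not a library function; a minimal sketch of what it might look like, assuming each example carries a numeric quality_score (e.g. a 1-5 rating from human review or an LLM judge):

```python
def filter_by_quality(examples, min_score=4.0):
    # Keep examples whose quality_score meets the threshold;
    # unscored examples are treated as score 0 and dropped
    return [ex for ex in examples if ex.get("quality_score", 0.0) >= min_score]

raw = [
    {"instruction": "Summarize...", "output": "A clear, accurate summary.", "quality_score": 4.5},
    {"instruction": "Summarize...", "output": "asdf", "quality_score": 1.0},
]
train_data = filter_by_quality(raw, min_score=4.0)
print(len(train_data))  # 1 -- the noisy example is dropped
```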
Good Practices Summary
LoRA Best Practices
- Target all projection layers: Include q, k, v, o projections
- Use appropriate rank: 8-32 for most tasks
- Apply to base model first: Then fine-tune LoRA weights
# Good: Comprehensive LoRA config
lora_config = LoraConfig(
r=16,
lora_alpha=2 * 16, # Rule of thumb
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_dropout=0.05,
bias="none"
)
QLoRA Best Practices
- Use NF4 quantization for better accuracy
- Enable double quantization for memory savings
- Use paged optimizers to prevent memory spikes
# Good: Optimized QLoRA config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True
)
RLHF Best Practices
- High-quality SFT data is crucial
- Diverse preference data for reward model
- KL penalty to prevent mode collapse
# Good: Balanced PPO training
ppo_config = PPOConfig(
target_kl=0.1, # Control deviation from reference
init_kl_coef=0.2 # Initial KL penalty
)
External Resources
- LoRA Paper - Microsoft Research
- QLoRA Paper
- InstructGPT Paper (RLHF)
- PEFT Library Documentation
- TRL Library - Transformers Reinforcement Learning
- LoRAX - Multi-tenant LoRA Serving
- DeepSpeed RLHF Pipeline
- Axolotl - Easy LoRA/QLoRA Training