Introduction
Fine-tuning large language models (LLMs) has become essential for creating domain-specific AI applications. However, full fine-tuning of billion-parameter models requires enormous computational resources. This guide covers the three main approaches: LoRA (Low-Rank Adaptation), QLoRA (Quantized LoRA), and RLHF (Reinforcement Learning from Human Feedback).
These techniques enable you to adapt models like LLaMA, Mistral, and Falcon efficiently while preserving their core capabilities.
Understanding LoRA
LoRA adds small trainable matrices to each transformer layer, dramatically reducing the number of parameters that need updating during training.
How LoRA Works
Instead of updating all model weights, LoRA introduces low-rank decomposition:
Original: W ∈ R^(d×k)
LoRA: W + ΔW = W + BA
Where: B ∈ R^(d×r), A ∈ R^(r×k), r << min(d, k)
The rank r is typically 8-64; with r=16 on a 7B model, the trainable LoRA parameters amount to well under 1% of the total.
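To make the savings concrete, here is a back-of-the-envelope count for a single 4096×4096 projection matrix with r=16 (the dimensions are illustrative; they match the attention projections of a Llama-2-7B-sized model):

```python
# Trainable-parameter count for LoRA on one d x k weight matrix
d, k, r = 4096, 4096, 16

full = d * k                  # full fine-tuning: every weight is trainable
lora = (d * r) + (r * k)      # LoRA: only B (d x r) and A (r x k) are trained

print(full)                   # 16777216 weights in the frozen matrix
print(lora)                   # 131072 trainable LoRA parameters
print(f"{lora / full:.2%}")   # 0.78% of the original matrix
```

Summed over every targeted projection in every layer, this is how the trainable fraction ends up well below 1% of the full model.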
LoRA Implementation with PEFT
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
# Load base model
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
model_name,
load_in_8bit=True,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Configure LoRA
lora_config = LoraConfig(
r=16, # Rank
lora_alpha=32, # Scaling factor
target_modules=[
"q_proj", "k_proj", "v_proj",
"o_proj", "gate_proj", "up_proj", "down_proj"
],
lora_dropout=0.05,
bias="none",
task_type=TaskType.CAUSAL_LM
)
# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Example output: trainable params: ~40M || all params: ~6.7B || trainable%: ~0.6
# Training data (format and tokenize into a Dataset before passing to Trainer)
train_data = [
{"instruction": "Summarize this article", "input": "Long article text...", "output": "Brief summary..."}
]
# Fine-tune
training_args = TrainingArguments(
output_dir="./lora-output",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
logging_steps=10,
save_strategy="epoch",
fp16=True
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_data,
tokenizer=tokenizer
)
trainer.train()
LoRA Weights Merging
from peft import PeftModel
from transformers import AutoModelForCausalLM
# Load base model in full precision (adapters cannot be merged into a quantized model)
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
# Load LoRA adapters
model = PeftModel.from_pretrained(base_model, "./lora-output")
# Merge and save (for inference without LoRA overhead)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-llama-2-7b-custom")
Understanding QLoRA
QLoRA combines LoRA with model quantization, enabling fine-tuning of 65B+ parameter models on a single GPU.
QLoRA Key Techniques
- 4-bit Quantization: Store weights in 4-bit format using NF4
- LoRA on Quantized Weights: Apply LoRA to quantized model
- Gradient Checkpointing: Trade compute for memory
- Frozen Gradients: Don’t compute gradients for quantized weights
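A rough weights-only memory estimate makes the 4-bit savings concrete (the helper below is a back-of-the-envelope sketch; gradients, optimizer state, and activations add more on top):

```python
def weight_memory_gb(n_params, bits_per_param):
    # Memory for the weights alone, in gigabytes (1e9 bytes)
    return n_params * bits_per_param / 8 / 1e9

n = 70e9  # a 70B-parameter model
print(weight_memory_gb(n, 16))  # 140.0 GB in fp16
print(weight_memory_gb(n, 8))   # 70.0 GB in int8
print(weight_memory_gb(n, 4))   # 35.0 GB in 4-bit NF4
```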
QLoRA Implementation
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
# Quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True
)
# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-70b-hf",
quantization_config=bnb_config,
device_map="auto"
)
# Prepare for training
model = prepare_model_for_kbit_training(model)
# LoRA config (a higher rank such as r=64 is common for quantized models)
lora_config = LoraConfig(
r=64,
lora_alpha=128,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
lora_dropout=0.1,
bias="none",
task_type=TaskType.CAUSAL_LM
)
model = get_peft_model(model, lora_config)
# Memory comparison
print(f"Model memory: {model.get_memory_footprint() / 1e9:.2f} GB")
# Typical: a 70B model's weights fit in ~35 GB with 4-bit QLoRA
QLoRA with transformers
# Using the latest transformers with QLoRA
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
# 4-bit config with bfloat16 compute for better numerical stability
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True
)
# Gradient checkpointing for memory savings
training_args = TrainingArguments(
output_dir="./qlora-70b",
per_device_train_batch_size=1,
gradient_accumulation_steps=16,
learning_rate=1e-4,
num_train_epochs=3,
max_steps=1000,  # overrides num_train_epochs when both are set
logging_steps=10,
save_steps=500,
save_total_limit=2,
bf16=True,
tf32=True,
gradient_checkpointing=True,
gradient_checkpointing_kwargs={"use_reentrant": False},
optim="paged_adamw_32bit",
lr_scheduler_type="cosine",
warmup_ratio=0.03,
weight_decay=0.001,
max_grad_norm=0.3
)
Understanding RLHF
RLHF aligns language models with human preferences through reinforcement learning. It’s the technique behind ChatGPT and Claude.
RLHF Three-Stage Process
- Supervised Fine-Tuning (SFT): Fine-tune on human-written responses
- Reward Modeling: Train a reward model to score responses
- Reinforcement Learning (PPO): Optimize the policy using the reward model
Stage 1: Supervised Fine-Tuning
# Prepare SFT data
sft_data = [
{
"prompt": "Explain quantum computing to a 5-year-old",
"completion": """Quantum computing is like having a magical computer that can be in many states at once.
Imagine you have a coin. A regular computer says it's either heads OR tails.
But a quantum computer can be in a special state where it's both heads AND tails at the same time - until you look at it!
This magic trick lets quantum computers solve certain problems super fast..."""
},
# ... more examples
]
# Format for training
def format_prompt(example):
return f"""### Instruction:
{example['prompt']}
### Response:
{example['completion']}"""
# Tokenize
from datasets import Dataset
train_dataset = Dataset.from_list(sft_data)
train_dataset = train_dataset.map(
lambda x: {"text": format_prompt(x)},
remove_columns=train_dataset.column_names
)
# SFT Training
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling, Trainer
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
load_in_8bit=True,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
trainer = Trainer(
model=model,
train_dataset=train_dataset,
tokenizer=tokenizer,
data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
trainer.train()
Stage 2: Reward Modeling
# Reward model training
from transformers import AutoModelForSequenceClassification
# Load pretrained model as reward model
reward_model = AutoModelForSequenceClassification.from_pretrained(
"meta-llama/Llama-2-7b-hf",
num_labels=1, # Single score
load_in_8bit=True
)
# Reward training data (chosen > rejected)
reward_data = [
{
"prompt": "Write a haiku about winter",
"chosen": "Snow blankets the earth\nSilent white crystals fall\nWinter's peaceful sleep",
"rejected": "winter is cold and i dont like it because its cold and snow is wet and"
}
]
# Contrastive loss for reward model
import torch

def compute_reward_loss(chosen_scores, rejected_scores):
# Chosen should have higher score than rejected
loss = -torch.log(torch.sigmoid(chosen_scores - rejected_scores)).mean()
return loss
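This pairwise objective can be sanity-checked with plain floats (the scores below are illustrative; the math mirrors compute_reward_loss above):

```python
import math

def pairwise_reward_loss(chosen_score, rejected_score):
    # -log(sigmoid(chosen - rejected)): small when chosen outscores rejected
    margin = chosen_score - rejected_score
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(round(pairwise_reward_loss(2.0, -1.0), 3))  # 0.049 -- chosen correctly ranked higher
print(round(pairwise_reward_loss(-1.0, 2.0), 3))  # 3.049 -- inverted ranking is heavily penalized
```

The asymmetry is the whole point: the reward model is pushed to assign a higher scalar score to the preferred response in every pair.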
Stage 3: PPO Training
from trl import PPOTrainer, PPOConfig
from trl.core import LengthSampler
# PPO Configuration
ppo_config = PPOConfig(
model_name="meta-llama/Llama-2-7b-hf",
learning_rate=1.4e-5,
batch_size=512,
mini_batch_size=1,
gradient_accumulation_steps=16,
ppo_epochs=4,
target_kl=0.1,
init_kl_coef=0.2
)
# Initialize PPO Trainer
ppo_trainer = PPOTrainer(
config=ppo_config,
model=model,
ref_model=ref_model,
tokenizer=tokenizer,
dataset=dataset,
data_collator=data_collator
)
# Training loop
for epoch in range(num_epochs):
for batch in ppo_trainer.dataloader:
# Generate responses
query_tensors = batch["input_ids"]
response_tensors = ppo_trainer.generate(
query_tensors,
return_prompt=False,
length_sampler=LengthSampler(4, 32)
)
# Get rewards: tokenize responses, then score with the reward model
texts = tokenizer.batch_decode(response_tensors)
inputs = tokenizer(texts, return_tensors="pt", padding=True).to(reward_model.device)
rewards = [score for score in reward_model(**inputs).logits.squeeze(-1)]
# PPO step
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
Comparison: LoRA vs QLoRA vs RLHF
| Aspect | LoRA | QLoRA | RLHF |
|---|---|---|---|
| Parameters Updated | 1-3% | < 1% | 100% or LoRA |
| GPU Memory (weights) | ~70GB for 70B (8-bit) | ~35GB for 70B (4-bit) | ~80GB for 7B |
| Training Time | Hours | Hours | Days |
| Alignment Quality | Good | Good | Best |
| Use Case | Domain adaptation | Resource-constrained | Chatbot alignment |
| Complexity | Low | Medium | High |
When to Use Each Technique
Use LoRA When:
- You have access to GPUs with 40-80GB VRAM
- You need to fine-tune for specific domains
- You want a balance of quality and efficiency
# Good: LoRA for domain adaptation
lora_config = LoraConfig(r=16, target_modules=["q_proj", "v_proj"])
# Works well for: Legal, Medical, Technical domains
Use QLoRA When:
- You have limited GPU resources
- You want to fine-tune large models (70B+)
- You’re experimenting with different base models
# Good: QLoRA for resource-constrained environments
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
# 70B model fits on single A100 (40GB)
Use RLHF When:
- Building conversational AI
- Need human-like responses
- Have training data with preferences
- Resources for multi-stage training
# Good: RLHF for chatbot training
# Stage 1: SFT on instruction data
# Stage 2: Reward model on preferences
# Stage 3: PPO optimization
Bad Practices to Avoid
Bad Practice 1: Using Too High Rank
# Bad: Rank too high defeats the purpose
lora_config = LoraConfig(r=128) # Too many parameters
# Should be: r=8-32 for most use cases
Bad Practice 2: Wrong Target Modules
# Bad: Missing key modules for causal LMs
target_modules=["q_proj"] # Incomplete
# Good: Include all key projection layers
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]
Bad Practice 3: No Data Quality Check
# Bad: Training on noisy/incorrect data
train_data = load_any_data() # No filtering
# Good: Filter and validate training data
train_data = filter_by_quality(train_data, min_score=4.0)
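filter_by_quality above is a hypothetical helper, not a library function; a minimal sketch of what it might look like, assuming each example carries a numeric quality_score (e.g. a 1-5 rating from human review or an LLM judge):

```python
def filter_by_quality(examples, min_score=4.0):
    # Keep examples whose quality_score meets the threshold;
    # unscored examples are treated as score 0 and dropped
    return [ex for ex in examples if ex.get("quality_score", 0.0) >= min_score]

raw = [
    {"instruction": "Summarize...", "output": "A clear, accurate summary.", "quality_score": 4.5},
    {"instruction": "Summarize...", "output": "asdf", "quality_score": 1.0},
]
train_data = filter_by_quality(raw, min_score=4.0)
print(len(train_data))  # 1 -- the noisy example is dropped
```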
Good Practices Summary
LoRA Best Practices
- Target all projection layers: Include q, k, v, o projections
- Use appropriate rank: 8-32 for most tasks
- Apply to base model first: Then fine-tune LoRA weights
# Good: Comprehensive LoRA config
lora_config = LoraConfig(
r=16,
lora_alpha=2 * 16, # Rule of thumb
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_dropout=0.05,
bias="none"
)
QLoRA Best Practices
- Use NF4 quantization for better accuracy
- Enable double quantization for memory savings
- Use paged optimizers to prevent memory spikes
# Good: Optimized QLoRA config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True
)
RLHF Best Practices
- High-quality SFT data is crucial
- Diverse preference data for reward model
- KL penalty to prevent mode collapse
# Good: Balanced PPO training
ppo_config = PPOConfig(
target_kl=0.1, # Control deviation from reference
init_kl_coef=0.2 # Initial KL penalty
)
External Resources
- LoRA Paper - Microsoft Research
- QLoRA Paper
- InstructGPT Paper (RLHF)
- PEFT Library Documentation
- TRL Library - Transformers Reinforcement Learning
- LoRAX - Multi-tenant LoRA Serving
- DeepSpeed RLHF Pipeline
- Axolotl - Easy LoRA/QLoRA Training