
Fine-Tuning LLMs: Custom Model Training in 2026

Introduction

Fine-tuning large language models has become essential for building specialized AI applications. With open-weight models like Llama 3, Mistral, and Phi-3 widely available, fine-tuning has never been more accessible. This guide covers everything from choosing the right approach to deploying your custom model in production.


Understanding Fine-Tuning

Why Fine-Tune?

Pre-training vs fine-tuning:

Pre-training:
  • Learns language from a massive text corpus
  • General knowledge, patterns, grammar
  • 1T+ tokens, millions of dollars of compute

Fine-tuning:
  • Adapts the model to specific tasks or domains
  • Learns specialized knowledge
  • 10K-100K tokens, $100-$10K of compute

Result: a specialized model outperforms general models on its target tasks.

When to Fine-Tune

# Fine-tuning decision matrix
scenarios:
  fine_tune:
    - name: "Domain-specific knowledge"
      example: "Legal documents, medical texts"
      reason: "Base model lacks specialized vocabulary"
      
    - name: "Specific output format"
      example: "JSON, code, structured responses"
      reason: "Need consistent structured output"
      
    - name: "Custom tone/style"
      example: "Brand voice, writing style"
      reason: "Consistent persona required"
      
    - name: "Task-specific behavior"
      example: "Classification, extraction"
      reason: "Better task performance than prompting"
      
  dont_fine_tune:
    - name: "General question answering"
      reason: "Base model sufficient"
      
    - name: "Quick prototyping"
      reason: "Use prompting first"
      
    - name: "Limited data"
      reason: "Few-shot prompting may work better"

Fine-Tuning Approaches

1. Full Fine-Tuning

# Full fine-tuning with PyTorch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

def full_fine_tune(model_name, train_dataset, eval_dataset):
    """Full parameter fine-tuning"""
    
    # Load model and tokenizer
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    # Configure training
    training_args = TrainingArguments(
        output_dir="./model_output",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-5,
        bf16=True,   # match the bfloat16 weights loaded above
        save_strategy="epoch",
        evaluation_strategy="epoch",
        logging_steps=10,
    )
    
    # Create trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=lambda data: {
            'input_ids': torch.stack([f['input_ids'] for f in data]),
            'attention_mask': torch.stack([f['attention_mask'] for f in data]),
            'labels': torch.stack([f['labels'] for f in data])
        }
    )
    
    # Train
    trainer.train()
    
    # Save
    model.save_pretrained("./final_model")
    
    return model

2. LoRA (Low-Rank Adaptation)

# LoRA fine-tuning with PEFT
from peft import LoraConfig, get_peft_model, TaskType

def lora_fine_tune(model_name, train_dataset):
    """Parameter-efficient fine-tuning with LoRA"""
    
    # Load base model
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )
    
    # Configure LoRA
    lora_config = LoraConfig(
        r=16,                    # Rank of adaptation matrices
        lora_alpha=32,           # Scaling factor
        target_modules=[         # Which layers to adapt
            "q_proj", "k_proj", 
            "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj"
        ],
        lora_dropout=0.05,
        bias="none",
        task_type=TaskType.CAUSAL_LM
    )
    
    # Apply LoRA
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    # Prints the trainable parameter count (typically well under 1% of the total)
    
    # Train with LoRA
    trainer = Trainer(
        model=model,
        train_dataset=train_dataset,
        # ... other args
    )
    
    trainer.train()
    
    # Merge LoRA weights for inference
    model = model.merge_and_unload()
    
    return model

3. QLoRA (Quantized LoRA)

# QLoRA with 4-bit quantization
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
from transformers import BitsAndBytesConfig

def qlora_fine_tune(model_name, train_dataset):
    """Fine-tune with 4-bit quantized base model"""
    
    # Quantization config
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True
    )
    
    # Load quantized model
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto"
    )
    
    # Prepare for training
    model = prepare_model_for_kbit_training(model)
    
    # LoRA config
    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        task_type=TaskType.CAUSAL_LM
    )
    
    model = get_peft_model(model, lora_config)
    
    # Train (uses much less GPU memory)
    trainer = Trainer(
        model=model,
        train_dataset=train_dataset,
        # ... args optimized for 4-bit
    )
    
    trainer.train()
    
    return model

Dataset Preparation

Dataset Formats

# Training data formats
training_formats = {
    "completion_format": {
        "description": "Simple text completion",
        "example": {
            "text": "The capital of France is Paris, which is known for the Eiffel Tower."
        }
    },
    
    "instruction_format": {
        "description": "Instruction-response pairs",
        "example": {
            "instruction": "Summarize this article:",
            "input": "Long article text...",
            "output": "Concise summary..."
        }
    },
    
    "chat_format": {
        "description": "Multi-turn conversation",
        "example": {
            "messages": [
                {"role": "user", "content": "Hello!"},
                {"role": "assistant", "content": "Hi! How can I help?"},
                {"role": "user", "content": "What's AI?"},
                {"role": "assistant", "content": "AI is..."}
            ]
        }
    }
}
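
Whichever format you choose, chat-style examples usually need to be rendered into the model's own prompt template before tokenization. Below is a minimal sketch using the tokenizer's built-in chat template; the model name and messages are illustrative placeholders.

# Sketch: rendering a chat-format example into training text
# (model name and messages are illustrative placeholders)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help?"},
]

# apply_chat_template inserts the model-specific role markers and special tokens
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)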

Data Collection Strategies

# Data collection and curation
class DatasetCurator:
    def __init__(self):
        self.examples = []
    
    def add_synthetic_examples(self, base_model, num_examples=1000):
        """Generate synthetic training data"""
        prompts = [
            "Generate a legal contract clause for:",
            "Write a medical summary for:",
            "Create a technical support response for:",
        ]
        
        for prompt in prompts:
            for _ in range(num_examples):
                # Generate example
                example = base_model.generate(prompt)
                self.examples.append({
                    "instruction": prompt,
                    "output": example,
                    "source": "synthetic"
                })
    
    def add_human_examples(self, examples):
        """Add human-labeled examples"""
        for ex in examples:
            self.examples.append({
                "instruction": ex["prompt"],
                "output": ex["response"],
                "source": "human",
                "quality_score": ex.get("rating", 5)
            })
    
    def filter_by_quality(self, min_quality=4):
        """Filter low-quality examples"""
        self.examples = [
            ex for ex in self.examples
            if ex.get("quality_score", 5) >= min_quality
        ]
    
    def deduplicate(self):
        """Remove duplicate examples"""
        seen = set()
        unique = []
        
        for ex in self.examples:
            key = hash(ex["instruction"] + ex["output"])
            if key not in seen:
                seen.add(key)
                unique.append(ex)
        
        self.examples = unique
    
    def export(self, format="instruction"):
        """Export for training"""
        if format == "instruction":
            return [
                f"### Instruction\n{ex['instruction']}\n\n### Response\n{ex['output']}"
                for ex in self.examples
            ]
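
A typical curation pass chains these steps before exporting; the example below uses illustrative values.

# Example usage of the curator above (values are illustrative)
curator = DatasetCurator()
curator.add_human_examples([
    {"prompt": "Summarize this support ticket:", "response": "Customer reports...", "rating": 5},
])
curator.filter_by_quality(min_quality=4)
curator.deduplicate()
train_texts = curator.export(format="instruction")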

Dataset Split

# Recommended dataset splits
dataset_split:
  training: 80-90%
  validation: 5-10%
  test: 5-10%
  
rules:
  - "Keep validation/test representative of production use cases"
  - "Ensure no data leakage between splits"
  - "Balance classes for classification tasks"
  - "Minimum 100-500 examples for meaningful fine-tuning"

Training Infrastructure

Hardware Requirements

# GPU memory requirements (approximate)
model_sizes:
  7B_params:
    full_ft: "80GB+ (8x A100)"
    lora: "24GB (1x A100)"
    qlora: "10GB (1x A100)"
    
  13B_params:
    full_ft: "160GB+ (8x A100)"
    lora: "40GB (2x A100)"
    qlora: "16GB (1x A100)"
    
  70B_params:
    full_ft: "640GB+ (8x H100)"
    lora: "160GB (8x A100)"
    qlora: "48GB (2x A100)"

Cloud Training Options

# Training on cloud GPUs
cloud_providers = {
    "AWS": {
        "service": "SageMaker",
        "gpus": ["p4d.24xlarge (8x A100)"],
        "spot": True,
        "estimated_cost": "$30-40/hour"
    },
    
    "Lambda Labs": {
        "gpus": ["A100 80GB", "H100"],
        "spot": True,
        "estimated_cost": "$0.50-1.00/GPU/hour"
    },
    
    "Paperspace": {
        "gpus": ["A100", "H100"],
        "gradient": True,
        "estimated_cost": "$0.70-1.20/GPU/hour"
    },
    
    "RunPod": {
        "gpus": ["A100", "4090"],
        "spot": True,
        "estimated_cost": "$0.40-0.80/GPU/hour"
    }
}

Training Process

Training Configuration

# Optimal training config for LoRA
training_config = {
    "epochs": 3,
    "batch_size": 8,
    "gradient_accumulation": 4,
    "learning_rate": 2e-4,
    "warmup_ratio": 0.1,
    "lr_scheduler": "cosine",
    "weight_decay": 0.01,
    "max_grad_norm": 1.0,
    
    # LoRA specific
    "lora_rank": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    
    # Data
    "train_on_eos": False,
    "append_eos_token": True,
    
    # Optimization
    "use_flash_attention": True,
    "gradient_checkpointing": True,
}
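
Most of these settings map one-to-one onto transformers' TrainingArguments; the sketch below shows that mapping, with the output directory and batch size treated as placeholders.

# Sketch: translating the config above into TrainingArguments (paths are placeholders)
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./lora_output",
    num_train_epochs=training_config["epochs"],
    per_device_train_batch_size=training_config["batch_size"],
    gradient_accumulation_steps=training_config["gradient_accumulation"],
    learning_rate=training_config["learning_rate"],
    warmup_ratio=training_config["warmup_ratio"],
    lr_scheduler_type=training_config["lr_scheduler"],
    weight_decay=training_config["weight_decay"],
    max_grad_norm=training_config["max_grad_norm"],
    gradient_checkpointing=training_config["gradient_checkpointing"],
    bf16=True,
)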

Training Loop

# Custom training loop with logging
def train(
    model,
    train_loader,
    eval_loader,
    optimizer,
    scheduler,
    device,
    num_epochs,
    gradient_accumulation=4
):
    """Training loop with monitoring"""
    
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0
        
        for step, batch in enumerate(train_loader):
            batch = {k: v.to(device) for k, v in batch.items()}
            
            # Forward pass
            outputs = model(**batch)
            loss = outputs.loss / gradient_accumulation
            
            # Backward pass
            loss.backward()
            
            if (step + 1) % gradient_accumulation == 0:
                # Gradient clipping
                torch.nn.utils.clip_grad_norm_(
                    model.parameters(), 
                    max_norm=1.0
                )
                
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()
            
            total_loss += loss.item()
            
            # Log metrics
            if step % 100 == 0:
                print(f"Step {step}: Loss = {loss.item():.4f}")
        
        # Evaluate
        eval_loss = evaluate(model, eval_loader)
        print(f"Epoch {epoch}: Train Loss = {total_loss/len(train_loader):.4f}, Eval Loss = {eval_loss:.4f}")

Evaluation

Evaluation Metrics

# LLM evaluation metrics
metrics:
  automatic:
    - name: "Perplexity"
      description: "Language modeling quality"
      
    - name: "BLEU"
      description: "N-gram overlap with reference"
      
    - name: "ROUGE"
      description: "Recall-oriented generation"
      
    - name: "BERTScore"
      description: "Semantic similarity"
      
  human:
    - name: "Helpfulness"
      description: "Does the response help the user?"
      
    - name: "Accuracy"
      description: "Is the information correct?"
      
    - name: "Coherence"
      description: "Is the response well-structured?"
      
    - name: "Safety"
      description: "Any harmful content?"

Benchmarking

# Evaluate on standard benchmarks with EleutherAI's lm-evaluation-harness
import lm_eval
from lm_eval.models.huggingface import HFLM

def evaluate_model(model, tokenizer):
    """Evaluate on multiple benchmarks"""
    
    # Wrap the fine-tuned model for the harness
    lm = HFLM(pretrained=model, tokenizer=tokenizer)
    
    # Run evaluations (exact task names depend on the installed harness version;
    # "humaneval" is also available but requires explicitly enabling code execution)
    results = lm_eval.simple_evaluate(
        model=lm,
        tasks=["mmlu", "arc_challenge", "truthfulqa_mc2"]
    )
    
    # Print per-task metrics
    for task, scores in results["results"].items():
        print(f"{task}: {scores}")
    
    return results

Merging and Deployment

Merge LoRA Weights

# Merge LoRA adapter with base model
from peft import PeftModel

def merge_and_export(base_model_path, adapter_path, output_path):
    """Merge adapter weights for deployment"""
    
    # Load base model
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_path,
        torch_dtype=torch.float16,
        device_map="cpu"
    )
    
    # Load and merge adapter
    model = PeftModel.from_pretrained(
        base_model,
        adapter_path
    )
    model = model.merge_and_unload()
    
    # Save merged model
    model.save_pretrained(output_path)
    
    print(f"Merged model saved to {output_path}")

Quantization for Deployment

# Convert to quantized format for inference
from transformers import BitsAndBytesConfig

def quantize_for_inference(model_path, output_path):
    """Quantize model to 4-bit for efficient inference"""
    
    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4"
        ),
        device_map="auto"
    )
    
    # Save quantized
    model.save_pretrained(output_path)
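
Before shipping, it is worth a quick generation smoke test against the quantized model. The path and prompt below are placeholders, and the tokenizer must be saved to (or copied into) the same directory as the model.

# Quick smoke test of the quantized model (path and prompt are placeholders)
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./quantized_model")
model = AutoModelForCausalLM.from_pretrained("./quantized_model", device_map="auto")

inputs = tokenizer("Summarize: the quarterly report shows...", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))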

Common Pitfalls

1. Overfitting

Wrong:

# Too many epochs, too little data
epochs = 50
learning_rate = 1e-3
# Result: Model memorizes training data

Correct:

# Proper regularization
epochs = 3
learning_rate = 2e-4
weight_decay = 0.01
# Use validation set to monitor overfitting

2. Catastrophic Forgetting

Wrong:

# Only train on new data
# Result: Model forgets general capabilities

Correct:

# Use instruction tuning + keep general examples
# Or use LoRA which preserves base model knowledge
# Or blend with original model outputs
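
One concrete way to blend in general data is to interleave your domain dataset with a public instruction dataset, as sketched below; the dataset names and mixing ratio are illustrative, and both datasets need matching columns before interleaving.

# Sketch: mixing domain data with general instruction data to limit forgetting
from datasets import load_dataset, interleave_datasets

domain_ds = load_dataset("json", data_files="domain_train.jsonl", split="train")  # your curated data
general_ds = load_dataset("tatsu-lab/alpaca", split="train")                      # general instructions

# Roughly 80% domain / 20% general; both datasets must share the same columns
mixed = interleave_datasets([domain_ds, general_ds], probabilities=[0.8, 0.2], seed=42)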

3. Data Quality Issues

Wrong:

# Use any available data
# Don't filter noise
# Result: Poor model quality

Correct:

# Curate high-quality data
# Remove duplicates
# Balance examples
# Include diverse cases

Key Takeaways

  • Start with LoRA/QLoRA - 99%+ parameter efficiency, lower cost
  • Quality over quantity - 1K high-quality examples often beats 100K mediocre
  • Use appropriate data format - Instruction format for chat models
  • Monitor validation loss - Prevent overfitting
  • Merge for deployment - Combine adapter with base model
  • Quantize for inference - 4-bit reduces memory 4x with minimal quality loss
  • Evaluate properly - Use both automatic and human metrics
