How to Fine-Tune an Open Source LLM Step-by-Step (2025 Edition) Part-1
Fine-tuning an open-source Large Language Model (LLM) unlocks the ability to personalize language understanding and generation for specific tasks, domains, or industries. Whether you're building a legal assistant, coding helper, medical Q&A bot, or document summarizer — fine-tuning offers control, consistency, and domain expertise.
In this detailed guide, we’ll walk through the complete journey from beginner to expert — including:
🧩 Step-by-Step Mini Task Breakdown
Each section builds on the previous one and covers all the essential building blocks.
- Understanding the core idea of fine-tuning
- Selecting the right model (Mistral, LLaMA, Gemma)
- Preparing your dataset for instruction/causal tuning
- Explaining Tokenization and Tokenizer Alignment
- Setting up environment (Colab, GPU, or local)
- Understanding QLoRA, PEFT, and bitsandbytes
- Writing a training script from scratch
- Monitoring training with metrics/logs (WandB, TensorBoard)
- Evaluating the model output (quantitative and qualitative)
- Saving, exporting, and deploying models (Hugging Face, FastAPI)
- Advanced topics: mixed precision, LoRA strategies, RAG + Fine-tune combo
- Common issues and debugging tips
We’ll also provide multiple code examples, JSON schemas, real-world datasets, and visualization diagrams throughout.
🧠 1. Understanding the Core Idea of Fine-Tuning
Fine-tuning refers to the process of taking a pre-trained LLM and training it further on a domain-specific or task-specific dataset. While foundational models (like GPT, LLaMA, Mistral) are trained on diverse internet data, they are often too general for niche tasks like legal contract review or medical diagnosis.
Fine-tuning adapts the model to:
- Use domain-specific vocabulary (e.g., legal, scientific terms)
- Follow specific instruction patterns (e.g., Q&A, summarization)
- Output in a desired tone or format (e.g., structured JSON or bullet points)
📊 Analogy
Imagine GPT-4o as a brilliant college graduate — it knows a lot but lacks job experience. Fine-tuning is like on-the-job training for a specific role, such as a paralegal or customer service agent.
🧬 What Happens Internally?
During fine-tuning:
- The model’s internal weights are updated using gradient descent
- These updates reflect the structure and semantics of your dataset
- Depending on strategy (full fine-tuning vs LoRA), some or all of the weights are modified
🔁 Pretraining vs Fine-Tuning
Aspect | Pretraining | Fine-Tuning |
---|---|---|
Data | Billions of tokens (general) | Few thousand to millions (task/domain) |
Goal | Learn language + world knowledge | Adapt to specific task or behavior |
Compute cost | Extremely high | Manageable (can be done on a single GPU) |
🔍 Why Not Just Use Prompt Engineering?
While prompting is fast, it hits limits when:
- You want deterministic, consistent output
- Prompts get long and unmanageable
- Model doesn’t understand your domain well
Fine-tuning gives you a reliable and scalable solution.
🧠 Categories of Fine-Tuning
- Instruction Fine-Tuning: Teach the model to follow instructions (e.g., FLAN, OpenAssistant style)
- Causal Language Modeling (CLM): Continue a sequence of tokens (good for storytelling or completion)
- Multi-turn Chat Fine-Tuning: Learn from conversations between user and assistant
In the next section, we'll choose the best base model for your task in 2025 and explain trade-offs between Mistral, LLaMA, and Gemma.
🏗️ 2. Selecting the Right Base Model in 2025
Choosing the optimal base model for fine-tuning is a make-or-break decision—impacting performance, cost, compliance, and scalability. With rapid advancements in open-weight models, here’s how to navigate the landscape in 2025.
✅ Key Selection Criteria
Factor | Why It Matters | Tradeoffs to Consider |
---|---|---|
Model Size | 2B–7B: Fast inference, cost-efficient. 70B+: Higher accuracy. | Latency vs. precision for your use case. |
License | Commercially safe? (Apache/MIT > LLaMA-3’s custom license) | Legal risk vs. model capability. |
Architecture | Mistral’s grouped-query attention? LLaMA-3’s 8K context? | Hardware compatibility & quantization support. |
Pre-Tuning | Instruction-tuned (e.g., Mistral-7B-Instruct) vs. base models | Faster deployment vs. customization potential. |
Community | Active forks, vLLM/GGUF support, docs | Long-term maintainability & troubleshooting. |
🔥 Top Open-Source Models for Fine-Tuning (2025)
🏆 Best All-Around: Mistral 7B/12B
- Why: Balanced speed/accuracy, Apache 2.0 license, RAG-ready
- Use Case: Chatbots, enterprise QA
🔍 Precision-First: LLaMA 3 70B
- Why: SOTA reasoning, strong benchmarks
- Caution: Meta’s license restricts SaaS usage
- Use Case: Medical/legal summaries
⚡ Low-Cost/Edge: Google Gemma 2B/7B, Microsoft Phi-3
- Gemma: GPU-light, ideal for mobile
- Phi-3: Tiny but competitive in logic tasks
🧪 Experimental: OLMo 7B (Allen Institute)
- Why: Fully open weights + training data
- Use Case: Research and reproducibility
📦 Code: Loading Mistral-7B-Instruct
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"mistralai/Mistral-7B-Instruct-v0.2",
device_map="auto",
torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
💡 Tip: Use 4-bit quantization via bitsandbytes to reduce memory by roughly 70%.
🧪 Pro Tips for Model Selection
- Start Small: Fine-tune Phi-3 or Gemma 2B to validate pipeline first
- License Audit: Avoid LLaMA-3 for SaaS unless compliant
- Benchmark Early: Test perplexity on your domain data (see the sketch after this list)
- Hybrid Option: Fine-tune a small model + augment with RAG
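For the "benchmark early" tip, a quick perplexity check on a sample of your own text is usually enough to compare candidate base models. Below is a minimal sketch; domain_sample.txt is a placeholder file of raw domain text, and the model ID can be swapped for any candidate.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # swap in any candidate model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")
# "domain_sample.txt" is a placeholder for a small file of raw text from your domain
text = open("domain_sample.txt").read()
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).to(model.device)
with torch.no_grad():
    # With labels equal to input_ids, the model returns the mean cross-entropy loss
    loss = model(**inputs, labels=inputs["input_ids"]).loss
print(f"Perplexity on domain sample: {math.exp(loss.item()):.2f}")
Lower perplexity on your domain text is a rough signal that a base model will need less fine-tuning to adapt.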
🤖 Model Match Guide
You Are... | Recommended Setup |
---|---|
Startup | Mistral 7B + LoRA |
Enterprise | LLaMA 3 70B + full fine-tune |
Edge/Offline App | Phi-3 or TinyLLaMA |
🔥 Recommended Open Source Models
Model Name | Size | License | Use Case Example |
---|---|---|---|
Mistral 7B | 7B | Apache 2.0 | General-purpose fine-tuning |
Mixtral 8x7B (MoE) | 46.7B total / ~12.9B active | Apache 2.0 | High quality with efficient inference |
Meta LLaMA 3 8B/70B | 8B / 70B | Custom (Llama 3 Community License) | High accuracy (license limits) |
Google Gemma 2B/7B | 2B / 7B | Gemma license (custom, commercial use allowed) | Lightweight + good for mobile/dev |
Phi-2 | 2.7B | MIT | Very small footprint |
TinyLlama 1.1B | 1.1B | Apache 2.0 | Great for experimentation |
In the next section, we’ll prepare your custom dataset, including JSON formats for both instruction tuning and chat tuning.
🗃️ 3. Preparing Your Dataset for Instruction and Chat Tuning
Your dataset is the most critical component in fine-tuning. It directly affects how well your model learns, generalizes, and performs in your domain. Whether you're tuning a legal summarizer, a medical Q&A bot, or a code completion tool, data quality makes or breaks the outcome.
📂 Types of Fine-Tuning Datasets
There are two major formats used in open-source LLM fine-tuning:
- Instruction Format: Single-turn prompt/response pairs
- Chat Format: Multi-turn dialogues with alternating roles
Let’s explore both.
📄 Instruction Format (Single-Turn Examples)
This format teaches the model to complete specific instructions:
{
"instruction": "Translate to French",
"input": "How are you?",
"output": "Comment ça va ?"
}
This is perfect for:
- Summarization
- Translation
- Classification
- Structured output (e.g., JSON generation)
📌 Tips:
- Keep prompts clear and consistent
- Ensure high-quality, diverse outputs
- Vary the instruction types for generalization
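To turn records like the one above into training text, most pipelines render them through a prompt template. The Alpaca-style template below is a common choice, not something fixed by any model; treat it as a sketch and keep whatever template you pick consistent across the dataset.
def format_instruction(example: dict) -> str:
    # Render one instruction record into a single training string (Alpaca-style template)
    if example.get("input"):
        return (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
print(format_instruction({
    "instruction": "Translate to French",
    "input": "How are you?",
    "output": "Comment ça va ?",
}))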
💬 Chat Format (Multi-Turn Dialogues)
This format simulates real conversations with context:
{
"conversations": [
{"role": "user", "content": "What is quantum computing?"},
{"role": "assistant", "content": "Quantum computing uses quantum bits to..."},
{"role": "user", "content": "Why is it faster?"},
{"role": "assistant", "content": "Because qubits can exist in multiple states..."}
]
}
Ideal for:
- Customer support bots
- Agent-like assistants
- FAQ systems
📌 Tips:
- Alternate strictly between user and assistant roles
- Use real-world multi-turn conversations when possible
- Avoid hallucinations or toxic data
🔍 Where to Source Data
- Open Datasets: Alpaca, OpenAssistant, Dolly, ShareGPT
- Internal Docs: Manuals, support chat logs, team FAQs
- Synthetic Data: Bootstrap examples using GPT-4 or Mixtral
✅ Data Quality Checklist
Criteria | Why It’s Important |
---|---|
Diverse Instructions | Encourages generalization |
Consistent Format | Prevents tokenizer & parsing issues |
High Signal Outputs | Guides model behavior directly |
Domain Relevance | Aligns generation with user needs |
Clean Tokenization | Avoids garbage tokens during training |
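Once your examples are saved as JSONL (one JSON object per line), you can load and spot-check them with the datasets library. The path and field names below are placeholders that assume the instruction format shown earlier.
from datasets import load_dataset
# "data/train.jsonl" is a placeholder path; one JSON object per line
dataset = load_dataset("json", data_files={"train": "data/train.jsonl"})["train"]
# Spot-check the schema before training
required = {"instruction", "input", "output"}
missing = required - set(dataset.column_names)
assert not missing, f"Dataset is missing fields: {missing}"
print(dataset[0])
print(f"{len(dataset)} training examples")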
In the next section, we’ll dive into tokenization — the process of converting your dataset into input the model understands — and how to align tokenizers with your model architecture.
🔡 4. Tokenization and Tokenizer Alignment
Tokenization is the process of converting raw text into model-understandable tokens — typically integers mapped to subwords, words, or bytes.
Different models use different tokenizers. Aligning your tokenizer with the base model is essential to prevent token mismatch, loss of context, or wasted memory.
🧠 What Are Tokens?
Tokens are numeric representations of text. For example:
from transformers import AutoTokenizer
text = "Quantum computers are amazing!"
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokens = tokenizer.tokenize(text)
print(tokens) # ['▁Quantum', '▁computers', '▁are', '▁amazing', '!']
⚙️ Tokenizer Types
Tokenizer Type | Model Examples | Notes |
---|---|---|
BPE (Byte Pair Encoding) | GPT-2, GPT-Neo | Fast, widely adopted |
SentencePiece | T5, UL2, Gemma | Supports subwords & multilingual text |
Tokenizer + Special Tokens | Mistral, LLaMA, OpenChat | Custom prompt formatting e.g., [INST] |
📦 Hugging Face Tokenizer Alignment
Ensure you use the same tokenizer that your base model was trained on:
model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
🛑 Don’t train with a different tokenizer — your model will misinterpret input and degrade in performance.
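A quick sanity check catches mismatches early. This sketch assumes model and tokenizer are already loaded as shown in Section 2.
# The tokenizer's vocabulary must fit inside the model's input embedding matrix
embedding_size = model.get_input_embeddings().num_embeddings
assert len(tokenizer) <= embedding_size, (
    f"Tokenizer defines {len(tokenizer)} tokens but the model only embeds {embedding_size}"
)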
🧪 Pre-Tokenization Checks
Before training:
# Validate token count
inputs = tokenizer("Summarize this passage:", return_tensors="pt")
print(len(inputs["input_ids"][0]))
Use this to:
- Ensure prompt + input fits your model’s context window (e.g., 8K, 32K)
- Pad and truncate consistently
- Reserve tokens for generation (e.g., max_length = 2048 - 256), as shown in the example below
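For example, one consistent way to tokenize with truncation while reserving room for generation (the lengths below are illustrative, and the pad-token line is needed because many causal LMs ship without one):
tokenizer.pad_token = tokenizer.eos_token  # many causal LMs (e.g., Mistral) have no pad token by default
max_context = 2048             # context length you plan to train with (illustrative)
reserved_for_generation = 256  # leave room for the model's answer
inputs = tokenizer(
    "Summarize this passage: ...",
    return_tensors="pt",
    truncation=True,
    max_length=max_context - reserved_for_generation,
    padding="max_length",      # or "longest" when batching
)
print(inputs["input_ids"].shape)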
🔍 Special Tokens Handling
Most chat models, such as Mistral or LLaMA, expect prompts in a specific chat format; the exact markers vary by model, for example:
<|system|>
You are a helpful assistant.
<|user|>
Tell me a joke.
<|assistant|>
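Rather than hand-writing these markers, recent versions of transformers let the tokenizer apply the chat template stored in the model's own config, which is safer because the exact format differs between models. A short sketch using the Mistral-Instruct tokenizer loaded earlier:
messages = [
    {"role": "user", "content": "Tell me a joke."},
]
# Renders the conversation with the model's own markers, e.g. [INST] ... [/INST] for Mistral-Instruct
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)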
✅ Add tokens to tokenizer config if custom:
tokenizer.add_special_tokens({"additional_special_tokens": ["[INST]", "[/INST]"]})
model.resize_token_embeddings(len(tokenizer))
🔧 Advanced: Train Your Own Tokenizer (Optional)
For low-resource languages or custom formats:
pip install sentencepiece
from tokenizers import SentencePieceBPETokenizer
tokenizer = SentencePieceBPETokenizer()
tokenizer.train(files=["data.txt"], vocab_size=32000)
tokenizer.save_model("./tokenizer")
Use this only if your task/data diverges heavily from available vocabularies.
Up next: We’ll set up the training environment — including GPU requirements, libraries like transformers, accelerate, and peft, and quantization tools such as bitsandbytes.
🖥️ 5. Setting Up the Training Environment
A well-configured environment is the foundation of efficient fine-tuning. Whether you're running experiments on a local GPU or a cloud-based platform, the right setup can reduce costs, accelerate training, and prevent runtime errors. Below, we expand on hardware requirements, software tools, and best practices for smooth fine-tuning.
🧰 Required Tools & Libraries
Install these essential packages for fine-tuning LLMs (Large Language Models):
# Core libraries
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 # CUDA 11.8
pip install transformers datasets accelerate peft bitsandbytes wandb
# Optional but useful
pip install einops scipy sentencepiece protobuf==3.20.3 # Fixes some Hugging Face conflicts
pip install tensorboard # For training logs
Why These Packages?
- transformers (Hugging Face) → Load & fine-tune models
- datasets → Efficient data loading
- accelerate → Multi-GPU/TPU training
- peft (LoRA/QLoRA) → Memory-efficient fine-tuning
- bitsandbytes → 4-bit & 8-bit quantization
- wandb (Weights & Biases) → Track experiments
🧪 Hardware Requirements
Fine-tuning LLMs can be GPU-intensive, but optimizations like LoRA/QLoRA make it feasible on consumer hardware.
Model Size | Minimum GPU (FP16) | With LoRA/QLoRA (4-bit) |
---|---|---|
7B | 24GB (A10G/3090) | 10GB (3060/4090) |
13B | 48GB (A100) | 16GB (A4000/4090) |
30B+ | 80GB (A100/H100) | 24GB (A5000/2x4090) |
💡 Key Insight
- QLoRA (4-bit) + LoRA allows fine-tuning 7B models on a single 10-12GB GPU (e.g., RTX 3060/3080).
- Multi-GPU training (accelerate) helps with larger models.
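Before launching anything heavy, it's worth confirming what the runtime actually sees. A quick check:
import torch
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA GPU detected; training on CPU is impractical for 7B+ models")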
🧱 Recommended Directory Structure
Organizing files properly avoids confusion and ensures reproducibility.
project/
├── data/ # Training datasets (JSON, CSV, etc.)
│ ├── train.jsonl
│ └── test.jsonl
├── models/ # Pretrained & fine-tuned models
│ ├── base_model/ # Original (e.g., "mistral-7b")
│ └── lora_adapter/ # LoRA adapter weights
├── scripts/ # Training & inference scripts
│ ├── train.py
│ └── inference.py
├── logs/ # Training logs (W&B, TensorBoard)
└── requirements.txt # Python dependencies
⚡ Using Hugging Face Accelerate
accelerate simplifies multi-GPU/TPU training and mixed-precision training.
1. Configure Accelerate
Run:
accelerate config
Then select options like:
- mixed_precision: bf16 (best for Ampere GPUs like A100/4090)
- num_processes: 2 (for 2-GPU training)
- main_process_port: 29500 (avoid port conflicts on shared servers)
2. Launch Training
accelerate launch scripts/train.py # Uses your config
🔋 Quantization with bitsandbytes
4-bit/8-bit quantization drastically reduces memory usage.
Load a 4-bit Model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # NormalFloat4
bnb_4bit_compute_dtype="float16",
)
model = AutoModelForCausalLM.from_pretrained(
"mistralai/Mistral-7B-v0.1",
quantization_config=bnb_config, # Apply 4-bit
device_map="auto", # Auto-distribute across GPUs
)
Why Quantization?
- 4-bit cuts weight memory by roughly 4x versus FP16 (a 7B model’s weights fit in ~4-6GB instead of ~14GB).
- Almost no accuracy loss when combined with LoRA.
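You can verify the savings directly: transformers exposes get_memory_footprint(), which reports the memory used by the model's weights and buffers (a sketch assuming the 4-bit model loaded above):
# Expect roughly 4-5 GB for a 7B model loaded in 4-bit, vs ~14 GB in FP16
print(f"Model weights: {model.get_memory_footprint() / 1024**3:.2f} GB")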
📋 Environment YAML (Optional but Recommended)
For reproducible environments, use conda:
# environment.yml
name: llm-finetune
channels:
- pytorch
- conda-forge
dependencies:
- python=3.10
- pytorch=2.1.0
- cudatoolkit=11.8
- pip
- pip:
- transformers==4.36.0
- datasets==2.15.0
- peft==0.7.0
- bitsandbytes==0.41.1
- accelerate==0.25.0
- wandb==0.16.0
Apply it with:
conda env create -f environment.yml
conda activate llm-finetune
🚀 What’s Next?
In the next section, we’ll dive into:
- QLoRA vs. LoRA → Which one to use?
- PEFT (Parameter-Efficient Fine-Tuning) → Train only small adapters.
- Real-world fine-tuning examples → Code walkthroughs.
🧠 6. Understanding QLoRA, PEFT, and Parameter-Efficient Fine-Tuning
Let’s be real—fine-tuning massive LLMs like LLaMA or Mistral sounds great until you see the GPU requirements. Training a 7B model traditionally needs 24GB+ VRAM, and forget about 70B models unless you’ve got a server farm.
That’s where PEFT (Parameter-Efficient Fine-Tuning) comes in. Instead of updating all the model’s weights (which is slow and expensive), PEFT tweaks just a tiny fraction of parameters. The result? You can fine-tune a 7B model on a single consumer GPU (even a 10GB RTX 3080) with minimal performance loss.
💡 Why Use PEFT?
1. You Don’t Need a Supercomputer
- Full fine-tuning → Updates every single weight (billions of parameters).
- PEFT (LoRA/QLoRA) → Updates only 0.1% of weights, slashing GPU memory by 4-10x.
2. Keep Your Original Model Intact
- PEFT trains small adapter layers separately.
- Your base model stays unchanged—like swapping out a brain module instead of retraining the whole brain.
3. Merge Later If Needed
- Once trained, you can merge adapters back into the base model for a single, optimized file.
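For example, after training you can reload the adapter on top of the base model and fold it in with peft's merge_and_unload(). The paths below are placeholders, and the base is loaded unquantized because merging directly into a 4-bit model isn't supported; treat this as a sketch:
from transformers import AutoModelForCausalLM
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype="auto")
# "models/lora_adapter" is a placeholder for your trained adapter directory
model = PeftModel.from_pretrained(base, "models/lora_adapter")
merged = model.merge_and_unload()        # folds LoRA weights into the base weights
merged.save_pretrained("models/merged")  # standalone checkpoint; no peft needed at inference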
🔁 PEFT Techniques (Which One to Use?)
Method | Best For | GPU Savings | Difficulty |
---|---|---|---|
LoRA | General fine-tuning | 2-4x | Easy |
QLoRA | Extreme memory savings | 4-10x | Moderate |
Adapters | Task-specific tuning | 3-5x | Moderate |
Prefix Tuning | Lightweight prompts | 2-3x | Hard |
LoRA (Low-Rank Adaptation)
- How it works: Adds tiny "adapter" layers to the model (like training a mini-model on top).
- Pros: Simple, works well for most tasks.
- Cons: Still needs ~16GB VRAM for 7B models.
QLoRA (Quantized LoRA)
- How it works: 4-bit quantization + LoRA = 7B models on 10GB GPUs.
- Pros: Lets you fine-tune on cheap hardware.
- Cons: Slightly slower than pure LoRA.
🧪 QLoRA in Practice
QLoRA (Quantized LoRA) enables fine-tuning using 4-bit precision with LoRA layers on top.
Install the peft, bitsandbytes, and accelerate libraries:
pip install peft bitsandbytes accelerate
Basic setup (loading the model in 4-bit):
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
bnb_config = BitsAndBytesConfig(
load_in_4bit=True, # 4-bit quantization
bnb_4bit_quant_type="nf4", # Optimized 4-bit format
bnb_4bit_compute_dtype="float16" # Faster computation
)
model = AutoModelForCausalLM.from_pretrained(
"mistralai/Mistral-7B-v0.1",
quantization_config=bnb_config, # Apply 4-bit
device_map="auto" # Auto GPU/CPU placement
)
With the quantized base model loaded, the next step is to attach LoRA adapters so that only a tiny fraction of parameters becomes trainable.
Add LoRA Adapters
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
r=8, # Rank (smaller = less VRAM)
target_modules=["q_proj", "v_proj"], # Layers to tweak
lora_alpha=32, # Scaling factor
lora_dropout=0.05, # Prevents overfitting
bias="none", # Don't train bias terms
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters() # ~0.1% of params!
Example output:
trainable params: 4,194,304 (~0.1%)
all params: 7,000,000,000
🧠 How to Choose Between LoRA and QLoRA
Criteria | Use LoRA | Use QLoRA |
---|---|---|
GPU Memory ≥ 40GB | ✅ | ✅ |
GPU Memory < 24GB | ❌ (likely to OOM) | ✅ |
Model Size ≥ 13B | ❌ | ✅ |
Inference + Tuning | ✅ (easy to merge) | ✅ (but needs quantized inference) |
In the next article (Part 2), we’ll write a training script to fine-tune your LLM using the Hugging Face Trainer, LoRA adapters, and the quantization config together.