Fine-Tuning and Related Methods


Introduction to Fine-Tuning

Fine-tuning is a powerful technique in machine learning and deep learning that allows you to adapt a pre-trained model to a new, specialized task. This approach not only accelerates the development process but also enables high performance with limited data and computational resources.

What is Fine-Tuning?

  • Definition: Fine-tuning is the process where a pre-trained model (trained on a large, general dataset) is further trained using a smaller, task-specific dataset. The model’s existing knowledge is leveraged, with only selected parameters updated during training to adjust to the new task.

  • Purpose: It bridges the gap between a model’s general knowledge and the specific requirements of the desired application.

Why is Fine-Tuning Important?

  • Saves Resources: Requires less data and computing power compared to training from scratch.

  • Improves Performance: Achieves better results on niche or domain-specific tasks by leveraging broad patterns learned on larger datasets.

  • Enables Adaptability: Facilitates transfer of knowledge from one domain to another (transfer learning).

Typical Fine-Tuning Workflow

  1. Select a Pre-Trained Model: Choose a model that has already learned useful features from a large dataset (e.g., BERT for text, ResNet for images).

  2. Prepare Task-Specific Data: Collect and preprocess data relevant to the new task.

  3. Modify Model Layers: Often, the final layers are replaced or unfrozen for adaptation, while earlier layers may be kept fixed.

  4. Train on New Data: Retrain the model, usually with a lower learning rate to avoid catastrophic forgetting of previously learned features.

  5. Evaluate Performance: Validate the model on a held-out dataset and tune further if necessary.
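The freeze-and-retrain pattern in steps 3–4 can be sketched in PyTorch. The tiny network below is a stand-in for a real pre-trained model (layer sizes and class counts are arbitrary, for illustration only):

```python
# Sketch of steps 3-4: freeze early layers of a "pre-trained" model,
# replace the final head, and train only what remains trainable.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 32), nn.ReLU())
model = nn.Sequential(backbone, nn.Linear(32, 5))  # original 5-class head

# Step 3: freeze the backbone, replace the head for a new 2-class task
for p in backbone.parameters():
    p.requires_grad = False
model[1] = nn.Linear(32, 2)  # new task-specific head

# Step 4: optimize only parameters that still require gradients,
# with a low learning rate to limit catastrophic forgetting
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-5)
print(sum(p.numel() for p in trainable))  # only the new head: 32*2 + 2 = 66
```

Only 66 of the model's parameters are updated here; the frozen backbone keeps the general features it learned during pre-training.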

Common Applications

  • Natural Language Processing: Adapting general language models to sentiment analysis, question answering, or specialized text domains.

  • Computer Vision: Tailoring image classifiers for specific categories, medical imaging tasks, or unique visual datasets.

  • Speech Recognition & More: Customizing voice models for accents, languages, or noisy environments.

Key Benefits

  • Data Efficiency: Achieves strong performance with limited labeled examples.

  • Speed: Reduces model development and deployment time.

  • Customizability: Allows rapid prototyping for varied downstream tasks.

Fine-tuning forms the backbone of modern machine learning workflows, particularly for organizations seeking practical solutions with limited data or for tasks where obtaining large annotated datasets is challenging.


Transfer Learning

Transfer learning is a foundational concept in modern machine learning, enabling faster and more efficient development of high-performing models—especially when data or computational resources are limited.

What is Transfer Learning?

  • Definition: Transfer learning is a technique where a model trained for one task is repurposed for a different, yet related, task. Instead of training a new model from scratch, a pre-trained model’s knowledge (weights and features) is leveraged and adapted to the new context.

  • How it Works: The earlier layers of the pre-trained model—trained on vast, general datasets—are kept intact (often frozen), while new layers or a small subset of parameters are trained specifically for the new task.

Why Use Transfer Learning?

  • Data Efficiency: Requires less labeled data for the new task, as many useful features have already been learned.

  • Computational Savings: Reduces the need for extensive retraining, saving time and resources.

  • Performance: Models often achieve higher accuracy and robustness on new tasks, benefiting from “generalizable” features learned previously.

Workflow of Transfer Learning

  1. Select a Pre-Trained Model: Choose a model that has already learned from a large dataset (e.g., ImageNet for images, BERT for text).

  2. Prepare the New Dataset: Collect and preprocess data for the target task.

  3. Adapt Model Architecture: Replace or add new layers specific to the new objective.

  4. Train (Fine-Tune) the Model: Train only the new layers or a small subset of parameters using a lower learning rate, reducing the risk of losing valuable pre-learned knowledge.

  5. Evaluate and Deploy: Test the adapted model on the new task and deploy upon satisfactory performance.
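Step 4 above ("train only the new layers ... using a lower learning rate") is often implemented with optimizer parameter groups: the new head gets a normal learning rate while the pre-trained body gets a much smaller one. A minimal PyTorch sketch, with illustrative layer sizes and learning rates:

```python
# Discriminative learning rates: gentle updates for pre-trained layers,
# faster learning for the newly added task head.
import torch
import torch.nn as nn

body = nn.Linear(128, 64)  # stand-in for pre-trained layers
head = nn.Linear(64, 2)    # newly added task-specific layer

optimizer = torch.optim.AdamW([
    {"params": body.parameters(), "lr": 1e-5},  # preserve pre-learned knowledge
    {"params": head.parameters(), "lr": 1e-3},  # adapt quickly to the new task
])
print([g["lr"] for g in optimizer.param_groups])  # [1e-05, 0.001]
```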

Applications of Transfer Learning

  • Computer Vision: Adapting a model trained on general object detection to medical imaging or satellite imagery.

  • Natural Language Processing: Transferring knowledge from language models to tasks like sentiment analysis, chatbots, or domain-specific text classification.

  • Speech and Audio: Using pre-trained models for language identification, speaker recognition, or voice assistants.

Transfer Learning vs. Fine-Tuning

| Aspect | Transfer Learning | Fine-Tuning |
| --- | --- | --- |
| Layers Updated | Usually only new/final layers | Some or all layers may be updated |
| Data Needed | Works well with small datasets | Needs more data for effective retraining |
| Computation Cost | Lower (fewer parameters trained) | Higher (more layers updated) |
| Flexibility | Good for similar tasks | Better when tasks differ more |
| Adaptability | Limited (mainly modifies classifier layers) | More (can adjust feature extraction too) |

Fine-tuning is considered a step beyond transfer learning, where more layers of the pre-trained model are updated to better fit the new task, especially if the new data differs significantly from the original training data.

Key Benefits

  • Faster Development: Leverages existing work for new tasks.

  • Cost-Effective: Less need for large datasets or compute resources.

  • Better Generalization: More robust to data variability.

Transfer learning is a mainstay in fields where labeled data is scarce, rapid prototyping is valuable, or models must be tailored to specialized domains. Its success underpins the rapid progress made across computer vision, NLP, and other AI domains.


Types of Fine-Tuning

1. Categorization by Parameter Scope

This category focuses on how much of the model's internal structure is actually modified.

Full Fine-Tuning

The model’s entire set of weights is updated during training.

  • Best for: Massive domain shifts (e.g., teaching a general model to understand complex legal or medical jargon from scratch).

  • Pros: Maximum performance and adaptability.

  • Cons: Extremely high compute/memory costs; high risk of Catastrophic Forgetting (where the model loses its original general knowledge).

Parameter-Efficient Fine-Tuning (PEFT)

Only a tiny fraction (often <1%) of the parameters are trained, while the rest are "frozen."

  • LoRA (Low-Rank Adaptation): Injects small, trainable "rank" matrices into the model layers. It is currently the industry standard for fine-tuning LLMs on consumer hardware.

  • Adapters: Injects small new layers between existing transformer blocks.

  • Prompt/Prefix Tuning: Learns a "soft prompt" (a sequence of continuous vectors) that is prepended to the input, rather than changing the weights themselves.
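A typical LoRA setup with Hugging Face's peft library looks roughly like the sketch below. The model ID, rank, and target module names are illustrative (target modules vary by architecture), and running it requires downloading model weights:

```python
# Sketch: attach LoRA adapters to a frozen causal LM with the peft library.
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # rank of the injected matrices
    lora_alpha=16,                        # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (model-specific)
)

model = get_peft_model(model, config)     # freezes the base, adds adapters
model.print_trainable_parameters()        # typically well under 1% trainable
```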


2. Categorization by Objective

This category focuses on what the model is being taught to do.

Supervised Fine-Tuning (SFT)

The most common form of fine-tuning where the model is trained on labeled input-output pairs (e.g., "Question: X, Answer: Y").

  • Instruction Tuning: A subset of SFT where the dataset consists of instructions (e.g., "Summarize this text," "Write a Python script"). This transforms a base model into an assistant.

Alignment & Preference Fine-Tuning

Used to ensure the model's outputs are helpful, honest, and harmless (HHH).

  • RLHF (Reinforcement Learning from Human Feedback): A complex three-step process involving human rankings, a separate Reward Model, and PPO (Proximal Policy Optimization).

  • DPO (Direct Preference Optimization): A more efficient 2024-2026 favorite that replaces RLHF by directly optimizing the model on "Better vs. Worse" pairs without needing a separate reward model.
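The DPO objective is compact enough to sketch in plain Python: it is a logistic loss on the margin between the implicit rewards of the preferred and rejected answers. The log-probability values below are invented purely for illustration:

```python
# Minimal sketch of the DPO loss on one "better vs. worse" pair.
# Inputs are sequence log-probabilities of the chosen and rejected answers
# under the policy being trained and under a frozen reference model.
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Implicit rewards: how far the policy has drifted from the reference
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    # Logistic loss pushes the chosen answer's reward above the rejected one's
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(dpo_loss(-10.0, -40.0, -20.0, -35.0))  # policy prefers chosen: loss < log(2)
print(dpo_loss(-20.0, -20.0, -20.0, -20.0))  # no preference yet: loss = log(2)
```

No separate reward model appears anywhere: the preference signal comes directly from the two log-probability ratios, which is exactly why DPO is cheaper than RLHF.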

Domain Adaptation

Fine-tuning the model on a large corpus of unstructured text from a specific industry (e.g., financial reports, engineering manuals) to improve its internal "world knowledge" of that niche.


Comparison of Fine-Tuning Types

| Type | Data Required | Compute Need | Primary Use Case |
| --- | --- | --- | --- |
| Full Fine-Tuning | Large (10k+ samples) | Very High | Creating a domain-specific "Titan" model. |
| LoRA / QLoRA | Medium (500–5k samples) | Low | Building specialized chatbots or task-specific tools. |
| Instruction Tuning | High Quality (Varied) | Medium | Making a raw base model "chat-ready." |
| DPO / RLHF | Preference Pairs | Medium-High | Reducing hallucinations and improving safety. |
| Feature Extraction | Small | Minimal | Simple classification or embedding generation. |

Fine-Tuning Frameworks

Below are the leading frameworks categorized by their primary strength.


1. High-Performance / Speed-Focused

These frameworks are designed to make fine-tuning as fast and memory-efficient as possible, often utilizing custom kernels.

  • Unsloth:

    • What it is: A lightweight library that wraps Hugging Face’s TRL and PEFT.

    • Key Advantage: It uses manually written Triton kernels to speed up training by 2x–5x and reduce memory usage by up to 70%.

    • Best For: Individuals or researchers training on consumer GPUs (e.g., RTX 3090/4090) or free Google Colab instances.

    • Supported Models: Llama (3.1/3.2/4), Mistral, Phi-4, and Gemma.

  • Llama-Factory:

    • What it is: An "all-in-one" framework that provides a unified CLI and a Web UI (LlamaBoard).

    • Key Advantage: It supports over 100 models and integrates almost every modern technique (LoRA, QLoRA, DPO, ORPO, GaLore) without requiring you to write a single line of Python code.

    • Best For: Beginners who want a GUI or teams that need to experiment rapidly across different models and datasets.


2. Configuration & Reproducibility

These frameworks focus on making experiments easy to share, version, and scale using configuration files.

  • Axolotl:

    • What it is: A config-driven framework that uses YAML files to define every aspect of the training run.

    • Key Advantage: Excellent for reproducibility. You can share a single .yaml file, and anyone else can recreate your exact training environment. It handles complex data tokenization and multi-GPU setups (via FSDP or DeepSpeed) out of the box.

    • Best For: Serious practitioners and engineering teams building production-grade models.


3. The Foundation (Hugging Face Ecosystem)

Most high-level frameworks are built on top of these core libraries.

  • TRL (Transformer Reinforcement Learning): The go-to library for SFT (Supervised Fine-Tuning) and alignment techniques like DPO (Direct Preference Optimization) and RLHF.

  • PEFT (Parameter-Efficient Fine-Tuning): The industry standard for implementing LoRA, AdaLoRA, and Prefix Tuning.

  • Accelerate: A library that allows the same PyTorch code to run on a single CPU, a single GPU, or massive multi-GPU/TPU clusters.


4. Scalability & Distributed Training

If you are fine-tuning models larger than 70B parameters or using hundreds of GPUs, these "back-end" frameworks are essential.

| Framework | Best For | Key Logic |
| --- | --- | --- |
| DeepSpeed (ZeRO) | Massive Scale | Shards model states, gradients, and parameters across GPUs. Supports "Offloading" to CPU/NVMe if GPU memory is full. |
| PyTorch FSDP | Speed in 2026 | Native to PyTorch; often faster than DeepSpeed for models <70B due to deeper integration with the autograd engine. |

Summary of frameworks

| Framework | Difficulty | Hardware Needs | Primary Use Case |
| --- | --- | --- | --- |
| Unsloth | Easy | Single GPU (8GB+) | Fast, local experimentation. |
| Llama-Factory | Very Easy | Single/Multi GPU | Rapid prototyping via Web UI. |
| Axolotl | Medium | Multi-GPU / Cluster | Production pipelines / YAML configs. |
| Hugging Face (Direct) | Hard | Any | Researchers building custom architectures. |
| OpenAI API | Trivial | None (Cloud) | Teams with high budget & low infra expertise. |

Step-by-Step Process of Fine-Tuning

Fine-tuning is a practical and systematic procedure that adapts a pre-trained model to perform well on a new task. Below is a breakdown of each step involved:

1. Select a Pre-Trained Model

  • Choose a model that has been trained on a large, general dataset relevant to your domain.

    • Examples: BERT for text, ResNet for images, Whisper for audio.

2. Define the Target Task

  • Specify the new task you want the model to perform, such as:

    • Sentiment analysis

    • Medical image classification

    • Named entity recognition

3. Prepare Your Data

  • Collect and preprocess your domain-specific data.

    • Clean data and format it to match the model’s input requirements.

    • Split the dataset into training, validation, and test sets.

4. Freeze and Update Model Layers

  • Decide which layers to “freeze” (prevent from further training) and which to “unfreeze” (allow updates):

    • For similar tasks, freeze most layers and only update the top layers.

    • For less similar or more complex tasks, unfreeze and update more of the model.

5. Fine-Tuning Training

  • Train the model using the prepared data:

    • Set a smaller learning rate to avoid overwriting learned knowledge.

    • Tune other hyperparameters such as batch size and number of epochs.

    • Use techniques like early stopping to prevent overfitting.

6. Evaluate Model Performance

  • Assess the fine-tuned model’s performance on the validation and test sets.

    • Monitor metrics relevant to the new task (accuracy, F1 score, etc.).

    • Refine model parameters or unfreeze more layers if necessary.

7. Deployment and Monitoring

  • Deploy the fine-tuned model for real-world inference or integration.

  • Continuously monitor performance to detect potential model drift.

  • Plan for periodic updates or re-training if the target domain evolves.

Process Summary Table

| Step | Purpose | Notes |
| --- | --- | --- |
| Select Pre-Trained Model | Leverage powerful, general features | Saves time and resources |
| Define Target Task | Focus adaptation efforts | Clarifies data and evaluation methods |
| Prepare Data | Ensure data quality and task relevance | Essential for successful adaptation |
| Freeze/Update Layers | Control what the model learns anew | Prevents losing useful general knowledge |
| Fine-Tuning Training | Adapt to new task with low learning rate | Preserves earlier knowledge, minimizes overfitting |
| Evaluate Performance | Confirm model capability on new task | Guides further tuning |
| Deployment/Monitoring | Move to real-world usage and ongoing support | Maintains performance over time |


 
Example: BERT Model Fine-Tuning

Below is a clean, end-to-end example of fine-tuning BERT for a text classification task using Hugging Face Transformers. It is both interview-friendly and production-friendly, and easy to understand.


📌 Use case

Binary sentiment classification:

  • 1 → Positive

  • 0 → Negative


1️⃣ Install required libraries

pip install transformers datasets torch scikit-learn

2️⃣ Import libraries

import torch
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
from datasets import Dataset

3️⃣ Sample dataset

data = {
    "text": [
        "I love this product",
        "This is a bad experience",
        "Amazing service",
        "Worst purchase ever"
    ],
    "label": [1, 0, 1, 0]
}

dataset = Dataset.from_dict(data)

4️⃣ Load tokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

5️⃣ Tokenization function

def tokenize(batch):
    return tokenizer(
        batch["text"],
        padding="max_length",
        truncation=True,
        max_length=128
    )

dataset = dataset.map(tokenize, batched=True)
dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])

6️⃣ Load pre-trained BERT model

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2
)

7️⃣ Training arguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="no",  # note: newer transformers releases rename this to eval_strategy
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
)

8️⃣ Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset
)

9️⃣ Fine-tune BERT

trainer.train()

🔍 How fine-tuning works (important concept)

  • Pre-trained BERT already knows language

  • We add a classification head

  • Backprop updates:

    • BERT encoder weights (slightly)

    • Classification head (mostly)


10️⃣ Inference example

text = "I really enjoyed this movie"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    outputs = model(**inputs)

prediction = torch.argmax(outputs.logits, dim=1)
print(prediction.item())

🧠 Interview explanation (one-liner)

“I fine-tune BERT by adding a task-specific classification head and training it using labeled data with a small learning rate.”


⚡ Common fine-tuning tips

  • Learning rate: 2e-5 or 3e-5

  • Freeze BERT layers if dataset is small

  • Use GPU for speed

  • Use Trainer for quick setup


🔥 Bonus: Freeze BERT encoder (optional)

for param in model.bert.parameters():
    param.requires_grad = False

📌 When to use BERT fine-tuning

✔ Text classification
✔ Sentiment analysis
✔ Spam detection
✔ Intent classification


Knowledge Distillation

Knowledge Distillation (KD) is a model compression technique where a smaller, efficient "Student" model is trained to mimic the behavior and performance of a larger, more complex "Teacher" model.

The goal is not just to copy the final output, but to transfer the "dark knowledge"—the nuanced relationships between classes that the teacher has learned.


1. The Core Mechanism: Teacher-Student Framework

In traditional training, a model learns from "hard targets" (e.g., a label is either 0 or 1). In distillation, the student learns from the teacher's "soft targets" (probabilities).

  • Teacher Model: A large, pre-trained model (e.g., Llama-3 70B) with high accuracy but high latency.

  • Student Model: A smaller architecture (e.g., Llama-3 8B or a custom 1B model) designed for speed.

  • The Softmax Temperature ($T$): To reveal "dark knowledge," a temperature parameter $T > 1$ is used in the Softmax layer. This smooths the probability distribution, making the "incorrect" classes more visible so the student can learn why the teacher thought a "Cat" looked slightly like a "Dog."
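The effect of temperature can be seen in a small plain-Python sketch (the logits are invented for illustration):

```python
# How temperature reveals "dark knowledge": softening a teacher's distribution.
import math

def softmax(logits, T=1.0):
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [8.0, 5.0, 1.0]       # "cat", but somewhat dog-like
hard = softmax(teacher_logits, T=1.0)
soft = softmax(teacher_logits, T=4.0)  # T > 1 smooths the distribution
print([round(p, 3) for p in hard])     # the "dog" probability is nearly invisible
print([round(p, 3) for p in soft])     # now the cat/dog relationship is visible

# The student is trained to match the soft distribution, typically via a
# KL-divergence term alongside the usual cross-entropy on the true label.
def kl_div(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

At T=1 the teacher's output is almost a hard label; at T=4 the relative similarity between classes becomes a usable training signal for the student.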


2. Types of Knowledge Distillation

Distillation can be classified by what is being transferred from the teacher to the student:

| Type | What is Transferred? | Use Case |
| --- | --- | --- |
| Logit-based (Response) | The final output probabilities (logits). | Most common; used for general classification/text gen. |
| Feature-based | Internal representations (hidden layers/activations). | When you want the student to "think" like the teacher. |
| Relation-based | The relationships between different data points. | Useful for embedding models or contrastive learning. |
| API-based (Black-box) | Only the text outputs (since weights are hidden). | Distilling from proprietary models like GPT-4o or Claude. |

3. Distillation Training Schemes

How the teacher and student interact during the training process:

  1. Offline Distillation: The teacher is fully trained and frozen. We pre-calculate its outputs on a dataset and use them to train the student. (Most common).

  2. Online Distillation: Both the teacher and student are updated simultaneously. This is useful when a high-quality pre-trained teacher isn't available.

  3. Self-Distillation: A single model acts as its own teacher. Deep layers teach shallower layers, or the model learns from its own previous checkpoints to improve stability.


4. Distillation vs. Other Optimization Techniques

| Feature | Distillation | Quantization | Pruning |
| --- | --- | --- | --- |
| Strategy | Train a new, smaller model. | Reduce precision of weights (e.g., 16-bit to 4-bit). | Remove redundant neurons/connections. |
| Architecture | Changes (usually smaller). | Stays the same. | Stays the same (but sparser). |
| Training | Requires full re-training. | Minimal to no re-training (PTQ/QAT). | Requires fine-tuning after. |
| Inference Gain | High (fewer operations). | High (faster hardware math). | Moderate (requires sparse hardware). |

5. Modern Trends (2025–2026)

  • CoT (Chain-of-Thought) Distillation: Instead of just distilling the answer, the teacher (like OpenAI's o1) distills its "reasoning steps" into the student. This allows small models to gain advanced logic capabilities.

  • Step-Distillation (Diffusion): In image generation, distilling a 50-step diffusion process into a 1-to-4 step process (e.g., SDXL-Turbo) for near-instant generation.

  • RLAIF (RL from AI Feedback): Using a large model to rank the student's outputs, which are then used to fine-tune the student via DPO (Direct Preference Optimization).


6. Summary Checklist for Implementation

  • [ ] Select Teacher: High-performing, usually 10x larger than the student.

  • [ ] Define Loss: Use a combination of Distillation Loss (KL Divergence between teacher/student) and Student Loss (Cross-Entropy with ground truth).

  • [ ] Set Temperature: Experiment with $T$ between 2.0 and 5.0 for best results.

  • [ ] Evaluate: Check if the student maintains at least 90-95% of the teacher's performance.



Quantization

Quantization is the process of reducing the precision of a model's weights and activations to make it smaller, faster, and more energy-efficient. In the context of 2026 AI engineering, it is the primary bridge that allows trillion-parameter models to run on consumer hardware or edge devices.


1. Core Concept: From Floats to Integers

Most models are trained using FP32 (32-bit floating point) or BF16/FP16 (16-bit). Quantization maps these continuous, high-precision values to a discrete set of lower-precision values, usually INT8 (8-bit) or INT4 (4-bit).

  • The Analogy: If FP32 is a high-resolution 4K video, INT4 is a compressed 480p version. It’s significantly smaller, but if the compression (quantization) is done correctly, the "picture" (model intelligence) remains clear.


2. The Mathematics of Linear Quantization

The most common form is Affine Quantization, which uses two parameters to map values: Scale ($S$) and Zero-point ($Z$).

$$x_{float} = S \times (x_{quantized} - Z)$$

  • Scale ($S$): A positive floating-point number that defines the "step size."

  • Zero-point ($Z$): An integer that represents the value $0$ from the floating-point space in the quantized space.

  • Clipping: Any value outside the representable range (e.g., -128 to 127 for INT8) is "clipped" to the nearest boundary.
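The mapping above can be sketched in plain Python for a single list of weights. This is a simplified per-tensor scheme for illustration; real libraries typically use per-channel or per-group variants:

```python
# Minimal sketch of affine (asymmetric) INT8 quantization of one tensor,
# following x_float = S * (x_quantized - Z).
def affine_quantize(values, qmin=-128, qmax=127):
    lo, hi = min(values + [0.0]), max(values + [0.0])  # range must include 0.0
    scale = (hi - lo) / (qmax - qmin)                  # step size S
    zero_point = round(qmin - lo / scale)              # integer Z representing 0.0
    # Round to the nearest step and clip to the representable range
    q = [max(qmin, min(qmax, round(v / scale + zero_point))) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [scale * (qi - zero_point) for qi in q]

weights = [0.5, -1.2, 3.4, 0.0]
q, S, Z = affine_quantize(weights)
recovered = dequantize(q, S, Z)
print(q, S, Z)
print([round(w - r, 4) for w, r in zip(weights, recovered)])  # small rounding errors
```

Note that 0.0 is recovered exactly (it maps to the zero-point), which matters because padding and ReLU outputs produce many exact zeros.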


3. PTQ vs. QAT: When to Quantize?

| Feature | Post-Training Quantization (PTQ) | Quantization-Aware Training (QAT) |
| --- | --- | --- |
| Timing | Done after the model is fully trained. | Done during training or fine-tuning. |
| Complexity | Low; requires a small "calibration" dataset. | High; requires full training pipeline. |
| Accuracy | Risk of "quantization error" for < 6-bit. | Best accuracy; model learns to compensate. |
| Ideal For | Quick deployment of existing models. | Small models (<7B) where every bit counts. |

4. Types of Modern Quantization

A. Weight-Only Quantization

Only the model weights are quantized; activations stay in FP16/BF16.

  • Benefit: Massive reduction in VRAM (Memory) requirements.

  • Popular for: LLMs where memory is the bottleneck (e.g., running a 70B model on a single GPU).

B. Static vs. Dynamic Quantization

  • Dynamic: The Scale and Zero-point for activations are calculated on-the-fly for each batch. It’s slower but more accurate.

  • Static: Scales are pre-calculated using a calibration dataset. It’s faster during inference but requires a representative dataset to avoid accuracy drops.

C. LLM-Specific Formats (4-bit & Below)

Modern LLM frameworks use advanced algorithms to minimize accuracy loss:

  • GPTQ: A layer-wise quantization method that minimizes the error in output between the full-precision and quantized layer.

  • AWQ (Activation-aware Weight Quantization): Protects the "salient" weights (the ones most important for model performance) by keeping them at higher precision or scaling them differently.

  • NF4 (Normal Float 4): A specialized 4-bit format (used in QLoRA) that assumes weights follow a normal distribution, providing better accuracy than standard INT4.


5. Popular Frameworks & Libraries

In 2026, these are the standard tools used to quantize models:

  • bitsandbytes: The Hugging Face standard for 8-bit and 4-bit (NF4) loading. Used heavily for QLoRA fine-tuning.

  • AutoGPTQ / AutoAWQ: Specialized for creating highly optimized 4-bit models for inference.

  • GGUF (llama.cpp): The standard for CPU-based inference and Apple Silicon. It allows "partial" quantization (e.g., 5.5-bit).

  • TensorRT-LLM: NVIDIA’s framework for ultra-fast, production-grade INT8/FP8 quantization on H100/B200 GPUs.

  • Unsloth: A 2024-2026 favorite that uses custom kernels to make 4-bit training 2x faster than standard methods.
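As a sketch, loading a model in 4-bit NF4 with bitsandbytes through transformers looks roughly like this. The model ID is illustrative, and running it requires a GPU, the bitsandbytes package, and network access to download weights:

```python
# Sketch: 4-bit NF4 loading with bitsandbytes, the setup QLoRA builds on.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4, as used by QLoRA
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16, # higher-precision compute dtype
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",             # illustrative model ID
    quantization_config=bnb_config,
    device_map="auto",
)
```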


6. Summary Comparison

| Precision | Model Size (7B Model) | Accuracy Loss | Hardware |
| --- | --- | --- | --- |
| FP16 | ~14 GB | 0% (Baseline) | A100 / H100 |
| INT8 | ~7 GB | Very Low (<0.5%) | Most modern GPUs |
| INT4 | ~4 GB | Low (1-3%) | Consumer (RTX 3060+) |
| INT2 | ~2 GB | High (Significant) | Experimental / Edge |


Hugging Face Full Notes

Hugging Face has become the "GitHub of Machine Learning," providing a centralized platform for sharing models, datasets, and demo applications. Its ecosystem is divided into two primary parts: The Hub (where data and models live) and The Libraries (the tools used to build and train).


1. The Hugging Face Hub (The Platform)

The Hub is the collaborative heart of the ecosystem, hosting millions of open-source assets.

  • Models: Hosts pre-trained weights for LLMs (Llama, BERT), Vision models (ViT, Stable Diffusion), and Audio models (Whisper).

  • Datasets: A vast library of structured and unstructured data (Text, Image, Audio, Tabular) optimized for fast loading.

  • Spaces: A hosting service for machine learning demos. You can build a UI using Gradio or Streamlit and host it for free on Hugging Face’s infrastructure.

  • Model Cards & Dataset Cards: Documentation for every asset, detailing how it was trained, its limitations, and ethical considerations.


2. Core Libraries (The Toolkit)

Hugging Face provides specialized Python libraries that handle different stages of the ML lifecycle.

Transformers

The flagship library. It abstracts the complexity of implementing transformer-based architectures.

  • The pipeline() API: The easiest way to use a model for inference in one line of code (e.g., pipeline("sentiment-analysis")).

  • AutoClasses: Automatically loads the correct model architecture and configuration based on the model ID (e.g., AutoModelForCausalLM).

  • The Trainer API: A high-level training loop that handles boilerplate code like device placement (CPU/GPU) and evaluation.

Datasets

Designed to handle massive amounts of data without crashing your RAM.

  • Memory Mapping: Uses Apache Arrow to stream data directly from the disk, allowing you to work with datasets larger than your local memory.

  • One-line Loading: load_dataset("glue", "mrpc") handles the download and formatting automatically.

Tokenizers

The bridge between human text and machine-readable numbers.

  • Speed: Built in Rust for extreme performance.

  • Consistency: Ensures that the text you use for inference is processed exactly like the text used during the model’s original training.


3. Advanced Training & Alignment Libraries

As LLM development has advanced, Hugging Face released specialized libraries for fine-tuning and scaling.

| Library | Purpose | Key Feature |
| --- | --- | --- |
| PEFT | Efficiency | Implements LoRA and Adapter methods to fine-tune models on consumer hardware. |
| Accelerate | Scaling | Run the exact same training code on a single GPU or a massive 100-GPU cluster with 4 lines of code. |
| TRL | Alignment | Used for SFT, DPO, and RLHF to make models follow instructions better. |
| Evaluate | Metrics | Provides standardized scripts to calculate Accuracy, F1-Score, BLEU, and ROUGE. |

4. Standard Workflow (The "Hugging Face Way")

  1. Find: Search the Hub for a pre-trained model (e.g., meta-llama/Llama-3.1-8B).

  2. Load: Use AutoTokenizer and AutoModel to bring it into your environment.

  3. Process: Use the datasets library to map your raw data into token IDs.

  4. Train: Use the Trainer (with Accelerate for speed) to fine-tune the model.

  5. Push: Save your fine-tuned model back to the Hub with push_to_hub().


Quick Command Reference

Bash
# Install the essential stack
pip install transformers datasets accelerate tokenizers evaluate peft trl


Adapters & LoRA in Fine-Tuning

Adapters and LoRA (Low-Rank Adaptation) are two leading parameter-efficient fine-tuning techniques that allow large language models (LLMs) and neural networks to efficiently adapt to new tasks without retraining or modifying the entire model. Both are widely adopted in the era of massive pre-trained models.

What Are Adapters?

  • Adapters are small trainable neural modules inserted into each layer (or selected layers) of a pre-trained model.

  • During fine-tuning, only the adapters are updated, while the original model’s weights are kept frozen.

  • Typically, an adapter has a bottleneck architecture: it projects the output of a layer down to a lower dimension, applies a non-linearity, then projects it back up to the original size. This design introduces far fewer parameters than the original layer, making the process efficient.

  • The adapters can be swapped in and out, allowing for multi-task and continual learning where each task has its own adapter file, minimizing storage and maintenance overhead.

Key Benefits:

  • Significantly reduces the number of trainable parameters.

  • Prevents catastrophic forgetting (loss of previously learned knowledge).

  • Supports serving multiple tasks/domains from a single base model.
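The bottleneck design described above can be sketched as a small PyTorch module. Hidden and bottleneck sizes are illustrative; in practice one adapter sits inside each transformer block:

```python
# A bottleneck adapter: project down, non-linearity, project up, plus a
# residual connection so the frozen base model's behavior is preserved.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_dim=768, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # project down
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # project back up
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual keeps base behavior

adapter = Adapter()
params = sum(p.numel() for p in adapter.parameters())
print(params)  # 99136 trainable values vs. 590,592 for a full 768x768 layer
```

Because the residual path dominates at initialization, inserting fresh adapters barely perturbs the frozen model, which is part of why training them is stable.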

What Is LoRA?

  • LoRA (Low-Rank Adaptation) is a specialized adapter method that introduces a low-rank matrix decomposition into selected weight matrices of a neural network (often linear layers of transformers).

  • It freezes the original model weights and learns only the additional low-rank matrices during fine-tuning.

  • At inference, LoRA adapters can be efficiently merged with or applied to the base model without additional burden on memory or speed.

  • The main hyperparameter is the "rank," which controls adapter size and fine-tuning precision. Higher rank = more expressiveness, more trainable params.

Key Benefits:

  • Enables large models to be fine-tuned on standard GPUs, offering cost and resource savings.

  • Offers nearly full-model-adaptation quality while training only a small subset of weights.

  • Multiple LoRA adapters can be used with a single base model for varied specializations.

Comparison Table

| Aspect | Adapters | LoRA |
| --- | --- | --- |
| Layers | Small neural modules in transformer | Low-rank matrices in linear layers |
| Params Trained | Adapter modules only | Low-rank adapter matrices only |
| Base Model | Frozen | Frozen |
| Inference Cost | Slight increase (extra layers) | Negligible (modifies existing layers) |
| Multi-Task | Dynamic adapter swapping | Multiple LoRA files per task |
| Use Cases | Multi-task/continual/domain learning | Task-specific adaptations, resource-limited fine-tuning |
| Notable Tools | AdapterHub, PEFT | Hugging Face PEFT, vLLM, Keras LoRA APIs |

Practical Use Cases
  • Adapters: Ideal for scenarios requiring frequent switching between tasks, or where model versioning simplicity is crucial. Common in multi-domain services and continual learning deployments.

  • LoRA: Preferred for rapid, cost-effective fine-tuning on specific tasks, especially when compute is limited, or multiple model specializations are needed on a shared base.



Parameter-Efficient Fine-Tuning (PEFT) is an umbrella term for techniques that allow you to adapt a large pre-trained model to a specific task by updating only a tiny fraction of its parameters.

Instead of retraining billions of weights (Full Fine-Tuning), PEFT "freezes" the original model and either adds a few new parameters or updates a specific subset. This reduces memory usage by up to 90% and storage by 99%.


1. Major PEFT Categories

Techniques are generally grouped into three architectural strategies:

  • Additive Methods: Inserting new trainable "modules" or "layers" into the existing architecture (e.g., Adapters).

  • Selective Methods: Identifying and training only a specific subset of the existing parameters (e.g., BitFit, which only tunes bias terms).

  • Reparameterization-based Methods: Transforming the weight updates into a low-dimensional space using matrix factorization (e.g., LoRA).
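
As a minimal sketch of the selective idea, BitFit marks only bias parameters as trainable. The model below is mocked as a name-to-size dict (real frameworks expose the same pattern through named parameters and a trainable flag); the layer sizes are illustrative:

```python
# Mock model: parameter name -> number of values (stand-in for real tensors)
params = {
    "layer1.weight": 4096 * 4096,
    "layer1.bias": 4096,
    "layer2.weight": 4096 * 11008,
    "layer2.bias": 11008,
}

# BitFit-style selection: train only parameters whose name ends in "bias"
trainable = {name: n for name, n in params.items() if name.endswith("bias")}

total = sum(params.values())
tuned = sum(trainable.values())
print(f"training {tuned:,} of {total:,} params ({100 * tuned / total:.4f}%)")
```

Biases are a vanishingly small slice of the model, which is what makes BitFit the extreme end of the selective family.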


2. Key Techniques Explained

LoRA (Low-Rank Adaptation)

LoRA is currently the most popular PEFT technique. It assumes that the change in weights during fine-tuning has a "low intrinsic rank."

  • How it works: Instead of updating the original weight matrix $W$ of size $d \times k$, LoRA represents the update $\Delta W$ as the product of two much smaller matrices, $A$ and $B$.

    • $W_{updated} = W_{frozen} + (B \times A)$

    • Where $A \in \mathbb{R}^{r \times k}$ and $B \in \mathbb{R}^{d \times r}$, and $r$ (the rank) is very small (e.g., 4 or 8).

  • Benefit: Zero inference latency. Since $(B \times A)$ has the same dimensions as $W$, the matrices can be merged back into the base model after training.
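
The merge step can be checked with tiny matrices in pure Python (no framework; the sizes are toy values chosen only to show that $B \times A$ has exactly $W$'s shape):

```python
# Toy LoRA merge: W is d x k, B is d x r, A is r x k, so B @ A is d x k.
d, k, r = 3, 4, 1

W = [[1.0] * k for _ in range(d)]            # frozen base weights
B = [[2.0] for _ in range(d)]                # d x r
A = [[0.5, 0.0, 0.5, 0.0]]                   # r x k

def matmul(X, Y):
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

delta = matmul(B, A)                         # same d x k shape as W
W_merged = [[W[i][j] + delta[i][j] for j in range(k)] for i in range(d)]
print(W_merged[0])                           # first row of the merged matrix
```

After this addition the adapter matrices can be discarded: the merged model has the same architecture and size as the original, hence zero extra inference latency.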

QLoRA (Quantized LoRA)

An evolution of LoRA that allows you to fine-tune massive models on a single consumer GPU (like an RTX 4090).

  • How it works: It loads the base model in 4-bit NormalFloat (NF4) precision and uses a technique called "Double Quantization" to save even more memory. The LoRA adapters themselves are still trained in 16-bit to maintain accuracy.

  • Benefit: You can fine-tune a 70B parameter model on a single 48GB GPU.
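
The memory claim is simple arithmetic. The sketch below counts only weight storage, ignoring activations, KV caches, adapter optimizer state, and quantization overhead, so treat the numbers as lower bounds:

```python
def weight_memory_gb(n_params, bits):
    """Memory needed just to store the weights at a given precision."""
    return n_params * bits / 8 / 1024**3

n = 70e9  # 70B parameters
for bits, label in [(16, "fp16"), (8, "int8"), (4, "nf4")]:
    print(f"{label}: {weight_memory_gb(n, bits):.0f} GB")
```

At 16-bit precision the weights alone exceed 130 GB, while 4-bit storage drops below 48 GB, which is why QLoRA makes a single-GPU setup feasible for 70B models.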

Adapters

Adapters were the first widely used PEFT technique.

  • How it works: Small "bottleneck" layers are inserted between the existing layers of the Transformer (usually after the Attention or Feed-Forward blocks). During training, only these tiny blocks are updated.

  • Trade-off: Unlike LoRA, Adapters add extra layers to the model, which can slightly increase inference time (latency).
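
A bottleneck adapter projects the hidden size d down to a small width m and back up, so its cost per block is easy to count. The values d=768 and m=64 below are illustrative (BERT-base-scale), not prescriptive:

```python
def adapter_params(d, m):
    # down-projection (d*m weights + m biases) + up-projection (m*d weights + d biases)
    return (d * m + m) + (m * d + d)

d, m = 768, 64     # hidden size and bottleneck width (illustrative)
print(adapter_params(d, m))
```

Each such block is on the order of 0.1M parameters, but because the blocks sit in the forward path, every inference pass must execute them, which is the latency trade-off mentioned above.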

Prompt Tuning & Prefix Tuning

These techniques focus on the input or the hidden states rather than the model weights.

  • Prompt Tuning: Learnable "soft tokens" (vectors of numbers) are prepended to the user's input. These vectors are trained to "steer" the model toward the right task.

  • Prefix Tuning: Similar to prompt tuning, but it prepends learnable vectors to every layer's hidden states, giving the model more "reminders" of the task at each stage of processing.
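
Mechanically, prompt tuning just concatenates learned vectors in front of the token embeddings before they enter the model. A framework-free sketch (embedding dimension and counts are toy values):

```python
# Toy soft prompt: 2 learnable vectors prepended to 3 token embeddings (dim 4)
soft_prompt = [[0.1, 0.2, 0.3, 0.4],
               [0.5, 0.6, 0.7, 0.8]]          # trained; everything else frozen
token_embeddings = [[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 0.0]]     # from the frozen embedding table

model_input = soft_prompt + token_embeddings  # sequence length grows by 2
print(len(model_input))                       # 5 vectors now enter the model
```

Only the soft-prompt vectors receive gradient updates; the token embedding table and all transformer weights stay frozen.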


3. Comparison of PEFT Techniques

Technique     | Trainable Params | Inference Latency | Best Use Case
Full FT       | 100%             | None              | Drastic domain shifts (e.g., English to Legal)
LoRA          | 0.1% – 1%        | None (merged)     | Most general-purpose fine-tuning
QLoRA         | < 0.1%           | Minimal           | Fine-tuning large models on limited hardware
Adapters      | 1% – 3%          | Slight increase   | Multi-task learning (swapping adapters)
Prompt Tuning | < 0.01%          | None              | Simple task steering (e.g., sentiment)

4. Why use PEFT instead of Full Fine-Tuning?

  1. Memory Efficiency: Avoid "Out of Memory" (OOM) errors on standard GPUs.

  2. Storage Efficiency: Instead of saving a 140GB checkpoint for every task, you save a 100MB "adapter" file.

  3. No Catastrophic Forgetting: Because the original weights are frozen, the model doesn't lose its general knowledge.

  4. Modular Deployment: You can keep one "Base Model" in memory and hot-swap different LoRA adapters for different customers or tasks instantly.



Below is a clear, end-to-end, practical example showing how to go from a PDF document → instruction fine-tuning → preference alignment fine-tuning for a LLaMA-style model.

I’ll keep it production-oriented + interview-ready, and I’ll clearly separate the three phases you mentioned.


🔷 Overall Pipeline (Big Picture)

PDF Docs
   ↓
Text Extraction
   ↓
Non-instruction Dataset
   ↓
Instruction Fine-Tuning (SFT)
   ↓
Preference Alignment (DPO / RLHF-style)

1️⃣ STEP 1: Extract Text from PDF (Non-Instruction Data)

Install

pip install pypdf transformers datasets peft trl accelerate bitsandbytes

Extract text from PDF

from pypdf import PdfReader

def extract_text_from_pdf(pdf_path):
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += (page.extract_text() or "") + "\n"  # extract_text() can return None
    return text

raw_text = extract_text_from_pdf("document.pdf")

Convert raw text into non-instruction format

chunks = raw_text.split("\n\n")

dataset = [{"text": chunk.strip()} for chunk in chunks if len(chunk) > 50]

This is NOT instruction-tuned yet — just raw domain knowledge.
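
Splitting on blank lines can yield very uneven pieces. A hedged alternative is to bound each chunk by an approximate token budget; the sketch below approximates tokens with whitespace-split words rather than a real tokenizer:

```python
def chunk_text(text, max_words=400):
    """Greedily pack paragraphs into chunks of at most max_words words."""
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        n = len(para.split())
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

sample = "\n\n".join(["word " * 150] * 5)   # five ~150-word paragraphs
print([len(c.split()) for c in chunk_text(sample)])
```

In production you would measure length with the model's own tokenizer instead of word counts, but the packing logic stays the same.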


2️⃣ STEP 2: Convert to Instruction Fine-Tuning Dataset (SFT)

Instruction format (VERY IMPORTANT)

LLaMA-style models expect:

### Instruction:
### Input:
### Response:

Create instruction examples

def make_instruction_data(chunks):
    data = []
    for chunk in chunks:
        data.append({
            "instruction": "Explain the following content clearly",
            "input": chunk["text"],
            "output": chunk["text"]
        })
    return data

instruction_data = make_instruction_data(dataset)

Convert to training text

def format_prompt(example):
    return f"""### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}"""

3️⃣ STEP 3: Instruction Fine-Tune LLaMA (SFT)

We use LoRA, which is standard practice for real-world fine-tuning.


Load LLaMA Model (QLoRA)

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
import torch

model_name = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    device_map="auto"
)

Apply LoRA

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

Dataset & Trainer (SFT)

from datasets import Dataset
from trl import SFTTrainer

train_dataset = Dataset.from_list(instruction_data)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    formatting_func=lambda x: format_prompt(x),
    max_seq_length=2048
)

trainer.train()

✅ At this point → Instruction-tuned LLaMA on your PDF


4️⃣ STEP 4: Preference Alignment Fine-Tuning (DPO)

Now we align behavior, not knowledge.


Preference Dataset Format

preference_data = [
    {
        "prompt": "Explain transformer attention",
        "chosen": "A clear, structured explanation...",
        "rejected": "A vague or incorrect explanation..."
    }
]

Load model again (from SFT checkpoint)

from trl import DPOTrainer

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with a PEFT model, the frozen base weights serve as the reference
    tokenizer=tokenizer,
    train_dataset=Dataset.from_list(preference_data),
    beta=0.1
)

Run Preference Alignment

dpo_trainer.train()

This:

  • Encourages chosen responses

  • Penalizes rejected responses

  • No reward model needed (unlike RLHF)
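
The underlying objective is compact enough to compute by hand: DPO minimizes −log σ(β·(Δ_chosen − Δ_rejected)), where each Δ is the log-probability gap between the trained policy and the frozen reference. A toy sketch with made-up log-probabilities:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Loss for one preference pair: -log sigmoid(beta * margin)."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1 / (1 + math.exp(-beta * margin)))

# Policy favors the chosen answer more than the reference does (margin = +4):
good = dpo_loss(-10.0, -20.0, -12.0, -18.0)
# Policy drifted toward the rejected answer (margin = -4):
bad = dpo_loss(-20.0, -10.0, -18.0, -12.0)
print(round(good, 4), round(bad, 4))
```

The loss shrinks as the policy widens the chosen-over-rejected margin relative to the reference, which is exactly the "encourage chosen, penalize rejected" behavior listed above, with no reward model in sight.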


🧠 Conceptual Difference (INTERVIEW GOLD)

Phase                | What changes
Non-instruction      | Just domain text
Instruction SFT      | Model learns to follow tasks
Preference Alignment | Model learns how to respond

🎯 Interview One-Liner

“I fine-tune LLaMA by first converting PDFs into instruction-response pairs for SFT, then aligning model behavior using preference optimization like DPO.”


🔥 Production Tips

  • Use chunking (512–2048 tokens)

  • Always use LoRA / QLoRA

  • Prefer DPO over RLHF unless needed

  • Freeze base weights



Below is a side-by-side, production-grade walkthrough showing the same PDF → Instruction SFT → Preference Alignment pipeline using:

1️⃣ Unsloth
2️⃣ Axolotl
3️⃣ LLaMA-Factory

This is exactly how teams do it in real projects, and it is perfect for interviews.


🔷 COMMON STEP (For ALL 3)

PDF → Instruction Dataset (shared)

from pypdf import PdfReader

def extract_text(pdf):
    reader = PdfReader(pdf)
    return "\n".join((p.extract_text() or "") for p in reader.pages)  # extract_text() can return None

text = extract_text("doc.pdf")

chunks = [c.strip() for c in text.split("\n\n") if len(c) > 200]

data = [
    {
        "instruction": "Explain the following content clearly",
        "input": chunk,
        "output": chunk
    }
    for chunk in chunks
]

Save as:

[
  {
    "instruction": "...",
    "input": "...",
    "output": "..."
  }
]
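
A quick sketch of writing the dataset in both common shapes: a single JSON array (the alpaca-style file LLaMA-Factory reads) and JSONL with one record per line (the shape the Axolotl config below points at):

```python
import json

data = [{"instruction": "Explain X", "input": "raw chunk", "output": "raw chunk"}]

# JSON array (alpaca-style file)
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

# JSONL: one JSON object per line
with open("data.jsonl", "w", encoding="utf-8") as f:
    for row in data:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

The same `data` list feeds all three toolkits; only the on-disk format differs.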

1️⃣ USING UNSLOTH (Fastest, GPU-friendly)

👉 Best for local laptops / Colab / fast experiments


Install

pip install unsloth transformers trl datasets

Load LLaMA with Unsloth

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-2-7b-hf",
    max_seq_length=2048,
    load_in_4bit=True,
)

Format prompt

def format_prompt(example):
    return f"""### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}"""

Instruction Fine-Tuning (SFT)

from trl import SFTTrainer
from datasets import Dataset

dataset = Dataset.from_list(data)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    formatting_func=format_prompt,
    max_seq_length=2048,
)
trainer.train()

Preference Alignment (DPO)

from trl import DPOTrainer

preference_data = [
    {
        "prompt": "Explain transformers",
        "chosen": "Clear explanation...",
        "rejected": "Bad explanation..."
    }
]

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,  # the frozen base acts as the reference model for the PEFT policy
    tokenizer=tokenizer,
    train_dataset=Dataset.from_list(preference_data),
    beta=0.1,
)

dpo_trainer.train()

Unsloth advantage:
• 2–5× faster
• Fits 7B on consumer GPUs


2️⃣ USING AXOLOTL (Config-Driven, Production)

👉 Best for reproducible training & teams


Install

pip install axolotl

Dataset format (JSONL)

{"instruction": "...", "input": "...", "output": "..."}

Axolotl Config (config.yml)

base_model: meta-llama/Llama-2-7b-hf
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer

load_in_4bit: true
adapter: lora

datasets:
  - path: data.jsonl
    type: alpaca

lora_r: 8
lora_alpha: 16
lora_dropout: 0.05

sequence_len: 2048
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 2e-5

Run Instruction Fine-Tuning

accelerate launch -m axolotl.cli.train config.yml

Preference Alignment (DPO in Axolotl)

rl:
  rl_type: dpo
  beta: 0.1

Run again:

accelerate launch -m axolotl.cli.train config.yml

Axolotl advantage:
• YAML-based
• Easy scaling
• Used in startups & OSS LLMs


3️⃣ USING LLAMA-FACTORY (Most Complete Toolkit)

👉 Best for end-to-end LLM lifecycle


Install

git clone https://github.com/hiyouga/LLaMA-Factory
cd LLaMA-Factory
pip install -r requirements.txt

Dataset (Alpaca format)

{
  "instruction": "...",
  "input": "...",
  "output": "..."
}

Instruction Fine-Tune (CLI)

python src/train_bash.py \
  --stage sft \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --dataset data \
  --template llama \
  --finetuning_type lora \
  --output_dir output_sft \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 4 \
  --learning_rate 2e-5 \
  --num_train_epochs 3

Preference Alignment (DPO)

python src/train_bash.py \
  --stage dpo \
  --model_name_or_path output_sft \
  --dataset preference_data \
  --template llama \
  --finetuning_type lora \
  --output_dir output_dpo \
  --beta 0.1

LLaMA-Factory advantage:
• SFT + DPO + PPO
• UI + CLI
• Most enterprise-ready


🧠 Comparison Table (INTERVIEW GOLD)

Tool          | Best Use Case
Unsloth       | Fast local fine-tuning
Axolotl       | Reproducible training
LLaMA-Factory | Full LLM lifecycle

🎯 Interview One-Liner

“I fine-tune LLaMA from PDFs using instruction SFT with Unsloth or Axolotl, and align behavior using DPO via LLaMA-Factory or TRL.”


🔥 What Interviewers LOVE if you say

  • “I prefer DPO over PPO because it needs no separate reward model.”

  • “I use LoRA / QLoRA only, keeping the base weights frozen.”

  • “Instruction tuning teaches what to answer; preference tuning teaches how to answer.”

