Fine-tuning

From llmref.wiki
Fine-tuning — Updating a pre-trained model's weights on a smaller task-specific dataset to adapt its behavior.

Overview

Fine-tuning is the process of taking a pre-trained large language model and continuing its training on a smaller, task-specific or domain-specific dataset. Unlike training from scratch, which requires enormous computational resources and data, fine-tuning leverages the general knowledge already encoded in the foundation model's weights and selectively adjusts them to perform well on downstream tasks.

Fine-tuning is positioned between two other adaptation methods: in-context learning, which requires no weight updates and relies entirely on prompt examples, and prompt engineering, which modifies input instructions without changing model parameters. Fine-tuning introduces permanent changes to the model through gradient descent on task-specific data, making it suitable when in-context performance is insufficient and computational budget permits retraining.

The scope of fine-tuning varies widely in practice. Full fine-tuning updates all model parameters; parameter-efficient approaches update only a subset (such as adapter layers or low-rank decompositions). The size of the fine-tuning dataset can range from hundreds to millions of examples, though effective fine-tuning often occurs with datasets substantially smaller than those used for pre-training.

How it works

Fine-tuning follows the standard supervised learning pipeline: (1) initialize the model with pre-trained weights, (2) load the task-specific dataset, (3) forward-pass input examples through the model to compute loss against ground-truth labels, (4) backpropagate gradients, and (5) update weights using an optimizer (typically Adam or SGD) with a learning rate lower than pre-training rates.

Key technical considerations include:

  • Learning rate selection: Fine-tuning learning rates are typically 1–2 orders of magnitude smaller than pre-training rates to avoid catastrophic forgetting of foundational knowledge while still permitting adaptation.
  • Early stopping: Overfitting on small datasets is common; validation loss monitoring and early stopping prevent degradation on out-of-distribution data.
  • Tokenization compatibility: The fine-tuning dataset must be tokenized using the same tokenizer as the pre-trained model, or retraining the tokenizer is required.
  • Batch size and epochs: Smaller datasets often require careful tuning of batch size and epoch count to balance convergence and regularization.

The computational cost of fine-tuning scales with model size, dataset size, and sequence length, but remains orders of magnitude smaller than pre-training the same model.

Distinction from related terms

Term Distinction
In-context learning In-context learning provides examples in the prompt without updating weights; fine-tuning modifies weights permanently. In-context learning is zero-cost at inference; fine-tuning requires retraining but achieves stronger performance on narrow tasks.
Prompt engineering Prompt engineering modifies input text and system prompts only; fine-tuning modifies model parameters. Prompt engineering is reversible and requires no computation; fine-tuning is permanent per checkpoint and computationally expensive.
Retrieval-augmented generation RAG injects factual knowledge at inference time via external retrieval; fine-tuning encodes knowledge into parameters during training. RAG is suitable for current or domain-specific facts; fine-tuning is suitable for behavioral or stylistic adaptation.
Pre-training Pre-training trains on massive, unlabeled or weakly labeled corpora with general objectives (next-token prediction); fine-tuning trains on smaller, labeled task-specific data with supervised loss functions. Pre-training requires months and millions of GPU hours; fine-tuning requires hours to days.
Foundation model training Foundation model training produces a base checkpoint from scratch or from another foundation model; fine-tuning assumes an existing foundation model checkpoint. Foundation model training establishes knowledge cutoff; fine-tuning does not update it.

Examples

  • OpenAI's fine-tuning API: Users can fine-tune GPT-3.5 or GPT-4 models on custom datasets (hundreds to thousands of examples) for specific classification, generation, or style tasks. The resulting fine-tuned checkpoint is deployed as a private model variant.
  • Domain-specific BERT adaptation: SciBERT and BioBERT fine-tuned BERT on scientific abstracts and biomedical literature respectively, achieving higher semantic search precision on domain literature than the original BERT pre-trained on general text.
  • Instruction-following via supervised fine-tuning: Models like Llama 2 and Mistral were fine-tuned on curated instruction-response pairs to improve adherence to system prompts and user intent, a process that made them suitable for agent deployment despite smaller parameter counts than their pre-trained base models.

See also

References