Fine-tuning
Overview
Fine-tuning is the process of taking a pre-trained large language model and continuing its training on a smaller, task-specific or domain-specific dataset. Unlike training from scratch, which requires enormous computational resources and data, fine-tuning leverages the general knowledge already encoded in the foundation model's weights and selectively adjusts them to perform well on downstream tasks.
Fine-tuning is positioned between two other adaptation methods: in-context learning, which requires no weight updates and relies entirely on prompt examples, and prompt engineering, which modifies input instructions without changing model parameters. Fine-tuning introduces permanent changes to the model through gradient descent on task-specific data, making it suitable when in-context performance is insufficient and computational budget permits retraining.
The scope of fine-tuning varies widely in practice. Full fine-tuning updates all model parameters; parameter-efficient approaches update only a subset (such as adapter layers or low-rank decompositions). The size of the fine-tuning dataset can range from hundreds to millions of examples, though effective fine-tuning often occurs with datasets substantially smaller than those used for pre-training.
How it works
Fine-tuning follows the standard supervised learning pipeline: (1) initialize the model with pre-trained weights, (2) load the task-specific dataset, (3) forward-pass input examples through the model to compute loss against ground-truth labels, (4) backpropagate gradients, and (5) update weights using an optimizer (typically Adam or SGD) with a learning rate lower than pre-training rates.
Key technical considerations include:
- Learning rate selection: Fine-tuning learning rates are typically 1–2 orders of magnitude smaller than pre-training rates to avoid catastrophic forgetting of foundational knowledge while still permitting adaptation.
- Early stopping: Overfitting on small datasets is common; validation loss monitoring and early stopping prevent degradation on out-of-distribution data.
- Tokenization compatibility: The fine-tuning dataset must be tokenized using the same tokenizer as the pre-trained model, or retraining the tokenizer is required.
- Knowledge cutoff: Fine-tuning does not update the knowledge cutoff; it refines task-specific behavior rather than injecting new factual information. For factual updates, retrieval-augmented generation is more suitable.
- Batch size and epochs: Smaller datasets often require careful tuning of batch size and epoch count to balance convergence and regularization.
The computational cost of fine-tuning scales with model size, dataset size, and sequence length, but remains orders of magnitude smaller than pre-training the same model.
| Term | Distinction |
|---|---|
| In-context learning | In-context learning provides examples in the prompt without updating weights; fine-tuning modifies weights permanently. In-context learning is zero-cost at inference; fine-tuning requires retraining but achieves stronger performance on narrow tasks. |
| Prompt engineering | Prompt engineering modifies input text and system prompts only; fine-tuning modifies model parameters. Prompt engineering is reversible and requires no computation; fine-tuning is permanent per checkpoint and computationally expensive. |
| Retrieval-augmented generation | RAG injects factual knowledge at inference time via external retrieval; fine-tuning encodes knowledge into parameters during training. RAG is suitable for current or domain-specific facts; fine-tuning is suitable for behavioral or stylistic adaptation. |
| Pre-training | Pre-training trains on massive, unlabeled or weakly labeled corpora with general objectives (next-token prediction); fine-tuning trains on smaller, labeled task-specific data with supervised loss functions. Pre-training requires months and millions of GPU hours; fine-tuning requires hours to days. |
| Foundation model training | Foundation model training produces a base checkpoint from scratch or from another foundation model; fine-tuning assumes an existing foundation model checkpoint. Foundation model training establishes knowledge cutoff; fine-tuning does not update it. |
Examples
- OpenAI's fine-tuning API: Users can fine-tune GPT-3.5 or GPT-4 models on custom datasets (hundreds to thousands of examples) for specific classification, generation, or style tasks. The resulting fine-tuned checkpoint is deployed as a private model variant.
- Domain-specific BERT adaptation: SciBERT and BioBERT fine-tuned BERT on scientific abstracts and biomedical literature respectively, achieving higher semantic search precision on domain literature than the original BERT pre-trained on general text.
- Instruction-following via supervised fine-tuning: Models like Llama 2 and Mistral were fine-tuned on curated instruction-response pairs to improve adherence to system prompts and user intent, a process that made them suitable for agent deployment despite smaller parameter counts than their pre-trained base models.
See also
- Large language model
- Foundation model
- In-context learning
- Prompt engineering
- Retrieval-augmented generation
- Knowledge cutoff