PEFT / LoRA

From llmref.wiki
PEFT / LoRA — Trainable parameter reduction technique applying low-rank decomposition to model weights during fine-tuning, minimizing compute and memory overhead.

Overview

Parameter-Efficient Fine-Tuning (PEFT) encompasses methods that adapt pre-trained models to downstream tasks by training only a small fraction of the total model parameters, rather than updating all weights. Low-Rank Adaptation (LoRA) is the most widely deployed PEFT approach, introduced by Hu et al.[1] It works by injecting trainable low-rank decomposition matrices into a foundation model's weight layers, leaving the original pretrained weights frozen.

The motivation for PEFT arises from the prohibitive cost of updating large language models during fine-tuning. A model with billions of parameters requires GPU memory and compute proportional to its size for gradients and optimizer state. LoRA reduces the trainable parameter count by several orders of magnitude (Hu et al.[1] report up to 10,000× fewer trainable parameters for GPT-3 175B) while achieving downstream task performance comparable to full fine-tuning, making LLM adaptation feasible on modest hardware. Because the low-rank update can be merged into the base weights, LoRA adds no inference latency; it does not by itself make the deployed model faster.

PEFT methods are agnostic to the underlying model architecture and can be applied to transformer-based models, diffusion models, and other neural architectures. Beyond LoRA, the PEFT landscape includes adapter layers, prefix tuning, and prompt-based methods, though LoRA has become the de facto standard in production systems due to its simplicity, efficiency, and composability across model variants.

How it works

LoRA operates by factorizing the weight update into a product of two low-rank matrices. For a weight matrix W₀ ∈ ℝ^(d_out × d_in) in the pretrained model, the adapted weight used in the forward pass is:

W = W₀ + ΔW = W₀ + BA

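where B ∈ ℝ^(d_out × r) and A ∈ ℝ^(r × d_in) are trainable matrices and r (the rank) satisfies r ≪ min(d_out, d_in). Typically, r ranges from 8 to 64, even for models with billions of parameters. Only A and B are updated during training; W₀ remains frozen. In the original formulation[1], A is initialized from a random Gaussian and B to zero, so ΔW = BA is zero at the start of training.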
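
The forward pass and the parameter saving can be illustrated with a minimal NumPy sketch. The layer sizes, the random seed, the function name lora_forward, and the 0.01 standard deviation for A are assumptions made for illustration only; the zero initialization of B follows Hu et al.[1]

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r = 64, 48, 4                   # toy sizes; real layers are far larger

W0 = rng.normal(size=(d_out, d_in))          # frozen pretrained weight (random stand-in)
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small random init (assumed scale)
B = np.zeros((d_out, r))                     # trainable, zero init so BA = 0 at the start


def lora_forward(x, W0, A, B):
    """Compute (W0 + BA) x without forming the full d_out x d_in matrix BA."""
    return W0 @ x + B @ (A @ x)


x = rng.normal(size=(d_in,))
y_base = W0 @ x
y_lora = lora_forward(x, W0, A, B)

full_params = W0.size            # parameters updated by full fine-tuning of this layer
lora_params = A.size + B.size    # parameters updated by LoRA: r * (d_in + d_out)

print(np.allclose(y_base, y_lora))   # True: the adapter is a no-op before training
print(full_params, lora_params)      # 3072 448
```

Even at these toy sizes the adapter holds roughly a seventh of the layer's parameters; the ratio shrinks further as d_out and d_in grow while r stays fixed.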

During inference, the low-rank update can be merged with the base weights (W = W₀ + BA), incurring no additional latency overhead compared to the original model. Alternatively, the matrices can remain separate, allowing efficient switching between task-specific LoRA modules without storing duplicate base model weights—a critical advantage for multi-task deployment.
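
A small sketch may make the two deployment options concrete. It assumes random matrices as stand-ins for trained adapters, and the names adapters, merge, and forward_unmerged are hypothetical helpers rather than any library's API.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, r = 32, 24, 4

W0 = rng.normal(size=(d_out, d_in))   # one shared, frozen base weight

# Two hypothetical task adapters stored as (B, A) pairs; random values stand in
# for trained ones.
adapters = {
    "task_a": (rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in))),
    "task_b": (rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in))),
}


def merge(W0, B, A):
    """Fold the adapter into the base weight: W = W0 + BA."""
    return W0 + B @ A


def forward_unmerged(x, W0, B, A):
    """Keep the adapter separate so it can be swapped per request."""
    return W0 @ x + B @ (A @ x)


x = rng.normal(size=(d_in,))
B_a, A_a = adapters["task_a"]

W_merged = merge(W0, B_a, A_a)
y_merged = W_merged @ x                        # same shape and cost as the base layer
y_unmerged = forward_unmerged(x, W0, B_a, A_a)

# Switching tasks only changes which small (B, A) pair is used; W0 is untouched.
y_task_b = forward_unmerged(x, W0, *adapters["task_b"])

adapter_size = B_a.size + A_a.size             # values stored per extra task
print(np.allclose(y_merged, y_unmerged))       # True
print(W0.size, adapter_size)                   # 768 224
```

Merging gives a single dense matrix with the base model's inference cost, while keeping the pairs separate trades a small extra matrix product for the ability to serve several tasks from one copy of W₀.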

Training proceeds via standard backpropagation through the decomposed update matrices. Gradients with respect to activations still flow through both the frozen W₀ and the product BA, but weight gradients and optimizer state are kept only for A and B. This reduces training memory, chiefly for the optimizer state (Adam maintains two moment estimates per trainable parameter), and enables fine-tuning on a single GPU where full fine-tuning would be infeasible.
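
The effect on Adam's optimizer state can be estimated with some back-of-the-envelope arithmetic. Every number below is a hypothetical assumption rather than a measurement: a 7-billion-parameter model with 32 layers and hidden size 4096, rank 8 applied to two square projections per layer, and moments stored in 32-bit floats. The helper adam_state_bytes is likewise illustrative.

```python
def adam_state_bytes(trainable_params: int, bytes_per_value: int = 4) -> int:
    """Adam keeps two moment estimates per trainable parameter."""
    return 2 * trainable_params * bytes_per_value


# Hypothetical model and adapter configuration (assumptions, not measurements).
total_params = 7_000_000_000
n_layers, d_model, r = 32, 4096, 8
adapted_per_layer = 2            # e.g. two square d_model x d_model projections

# Each adapted matrix adds r * (d_in + d_out) trainable parameters.
lora_params = n_layers * adapted_per_layer * r * (d_model + d_model)

full_state = adam_state_bytes(total_params)
lora_state = adam_state_bytes(lora_params)

print(lora_params)                      # 4194304
print(f"{full_state / 1e9:.1f} GB")     # 56.0 GB
print(f"{lora_state / 1e6:.1f} MB")     # 33.6 MB
```

The estimate covers optimizer state only; the frozen weights and the activations still have to fit in memory during training.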

LoRA is typically applied to the projection matrices (query, key, value, and output) in transformer attention layers; Hu et al.[1] adapted mainly the query and value projections. Extending it to more weight matrices increases the trainable parameter count, and the size of the resulting benefit depends on the task. Hyperparameters include the rank r, the scaling factor α (the update BA is multiplied by α/r), the learning rate, and regularization such as weight decay; r and α are usually selected by validation on task-specific data.
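
The role of α and the cost of adapting more matrices can be sketched as follows. In the original LoRA formulation[1] the update BA is multiplied by α/r; the toy sizes, the helper names scaled_delta and adapter_params, and the assumption that every target is a square d_model × d_model projection are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, r = 16, 4                   # toy sizes, chosen only for illustration

B = rng.normal(size=(d_model, r))
A = rng.normal(size=(r, d_model))


def scaled_delta(B, A, alpha):
    """Return the scaled low-rank update (alpha / r) * BA."""
    rank = A.shape[0]
    return (alpha / rank) * (B @ A)


def adapter_params(d_model, r, targets, n_layers):
    """Trainable parameters, assuming each target is a square d_model x d_model matrix."""
    return n_layers * len(targets) * r * (d_model + d_model)


delta_8 = scaled_delta(B, A, alpha=8)
delta_16 = scaled_delta(B, A, alpha=16)
print(np.allclose(delta_16, 2 * delta_8))   # True: doubling alpha doubles the update

qv_only = adapter_params(d_model, r, ("q", "v"), n_layers=2)
all_attn = adapter_params(d_model, r, ("q", "k", "v", "o"), n_layers=2)
print(qv_only, all_attn)                    # 512 1024
```

Under these assumptions, adapting all four attention projections doubles the trainable parameters relative to query and value alone, which is the cost that has to be weighed against any accuracy gain.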

Distinction from related terms

| Term | Distinction |
| --- | --- |
| Full fine-tuning | Updates all model parameters; requires gradient and optimizer-state storage proportional to the total parameter count. LoRA trains only low-rank updates (typically 0.01–1% of parameters), sharply reducing that storage. The forward and backward passes still run through the full model, so compute savings are smaller than memory savings. |
| In-context learning | Uses fixed model weights and the context window to adapt behavior; no training required. LoRA requires training data and a training phase, but produces persistent weight adaptations usable across many inputs. |
| Adapter | Inserts trainable modules (often small MLPs) between frozen transformer layers, which adds computation at inference time. LoRA instead applies a low-rank factorization directly to existing weight matrices and can be merged into them, typically giving simpler deployment at comparable parameter counts. |
| Prompt engineering | Modifies input text to guide model behavior; no learning required. LoRA learns task-specific weight updates from examples, generally outperforming prompts on specialized tasks but requiring labeled training data. |
| Quantization | Reduces the precision of weights (e.g., FP32 → INT8), usually without retraining. The two techniques are orthogonal: quantization compresses the static weights, while LoRA minimizes the number of trainable parameters. They can be combined (QLoRA) for further savings. |

Examples

Hugging Face's PEFT library provides LoRA as its most commonly used adapter type for open-source models; it has been applied to LLaMA, Mistral, and other architectures for tasks including text classification, summarization, and instruction following. A typical workflow trains a rank-8 LoRA adapter on a single consumer GPU (24 GB VRAM) with a 10k-example instruction dataset in roughly 2–4 hours, depending on model size and sequence length, whereas full fine-tuning of the same model would generally require multiple GPUs.

OpenAI's fine-tuning API for GPT-3.5 and GPT-4 exposes a simplified interface where users upload training datasets and the platform handles the training configuration automatically. OpenAI has not published the underlying method, so whether it relies on LoRA or another PEFT-like mechanism is not publicly confirmed. Published examples include adapting models for domain-specific Q&A and coding tasks.

Research on constitutional AI and RLHF workflows frequently employs LoRA to train reward models and policy adapters on top of frozen base models, reducing the cost of iterative human and automated evaluation cycles. Some studies report that LoRA-adapted models trained on as few as 1k examples match or exceed full fine-tuning on standard benchmarks, although results vary by task and model.

References

  1. Hu, Edward et al. "LoRA: Low-Rank Adaptation of Large Language Models." arXiv:2106.09685, 2021.