Quantization (model)
Overview
Quantization is a model compression technique that reduces the bit-width of weights, activations, and sometimes gradients in neural networks. Instead of storing parameters in full-precision floating-point format (typically FP32, 32-bit IEEE 754), quantization converts them to lower-precision representations such as FP16 (16-bit), INT8 (8-bit integer), INT4 (4-bit), or even lower.[1] This reduction in numerical precision directly decreases model size, memory bandwidth requirements, and computational cost during inference, making it a critical technique for deploying large language models on resource-constrained hardware.
Quantization predates large language models: it was studied for smaller neural networks well before them, and it gained renewed importance as parameter counts grew into the billions and memory and energy costs became a barrier to practical deployment. Quantization trades a usually small amount of model accuracy for substantial gains in speed and efficiency, a tradeoff that empirical research has found favorable for many applications.[2]
Quantization can be applied at different stages: post-training quantization (applied after model training completes) or quantization-aware training (performed during fine-tuning to adapt weights to lower precision). The choice depends on accuracy requirements, available computational resources, and the target hardware platform.
How it works
Quantization operates through a mapping function that scales continuous floating-point values to discrete integer bins. For symmetric quantization, the mapping is:
- <math>q = \text{round}\left(\frac{x}{s}\right)</math>
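where x is the original value, s is a scale factor (often determined per-layer or per-channel), and q is the quantized integer, which is then clipped to the representable range (for example, [-127, 127] for INT8). Dequantization recovers an approximation <math>\hat{x} = s \cdot q</math> of the original value. During inference, the quantized weights are either dequantized back to floating-point before computation or, in integer-only execution, used directly so that the entire inference pipeline operates in integer arithmetic.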
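The mapping above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names, the max-absolute-value rule for choosing s, and the sample weights are all assumptions made for the example.

```python
import numpy as np

def quantize_symmetric(x, num_bits=8):
    """Map a float array to signed integers using a single scale factor."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for 8 bits
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0  # assumed rule: s = max|x| / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q.astype(np.int8), scale                 # int8 storage assumes num_bits <= 8

def dequantize(q, scale):
    """Recover an approximation of the original values: x_hat = s * q."""
    return q.astype(np.float32) * scale

weights = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)
q, scale = quantize_symmetric(weights)
recovered = dequantize(q, scale)
max_error = float(np.max(np.abs(weights - recovered)))
print(q, scale, max_error)
```

For values inside the clipping range, the round-trip error of this scheme is bounded by about half the scale factor, which is why a smaller s (a tighter range) gives finer resolution.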
Post-training quantization analyzes the distribution of weights and activations after training completes and selects appropriate scale factors and clipping ranges, typically using calibration on a small representative dataset. This approach is fast and requires no retraining, but it may suffer greater accuracy loss than quantization-aware training, particularly at low bit-widths.
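As a rough sketch of the calibration step, the snippet below derives a symmetric scale factor from a handful of activation batches. The percentile-based clipping rule, the function name `calibrate_scale`, and the synthetic data are assumptions chosen for illustration; real toolchains offer several calibration strategies.

```python
import numpy as np

def calibrate_scale(calibration_batches, num_bits=8, percentile=99.9):
    """Choose a clipping value and symmetric scale from observed magnitudes."""
    qmax = 2 ** (num_bits - 1) - 1
    magnitudes = np.abs(np.concatenate([b.ravel() for b in calibration_batches]))
    clip_value = float(np.percentile(magnitudes, percentile))
    return clip_value / qmax, clip_value

rng = np.random.default_rng(0)
# Synthetic stand-in for activations collected on a small calibration set.
batches = [rng.normal(0.0, 1.0, size=(32, 64)).astype(np.float32) for _ in range(8)]
batches[0][0, 0] = 50.0   # a single outlier

scale_clipped, clip_value = calibrate_scale(batches, percentile=99.9)
scale_minmax, max_value = calibrate_scale(batches, percentile=100.0)
print(clip_value, max_value)
```

With `percentile=100.0` the single outlier sets the whole range, so most values share only a few integer levels. Clipping at a lower percentile sacrifices the outlier in exchange for finer resolution everywhere else.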
Quantization-aware training simulates quantization during training or fine-tuning, allowing the optimizer to learn weights that remain robust under lower precision. The quantization operation is inserted into the forward pass. Because rounding has zero gradient almost everywhere, the backward pass typically uses a straight-through estimator that treats rounding as the identity, so gradients flow through the simulated quantization and the model learns to be quantization-tolerant.[1]
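The idea can be illustrated with a toy linear model trained in NumPy with manually computed gradients. Everything here is an assumption made for the sketch: the 4-bit grid, the fixed scale, the learning rate, and the synthetic regression data. Deep learning frameworks implement the same pattern with automatic differentiation.

```python
import numpy as np

QMAX = 7  # signed 4-bit grid: integers in [-7, 7]

def fake_quantize(w, scale):
    """Quantize and immediately dequantize, so values stay in floating point."""
    return np.clip(np.round(w / scale), -QMAX, QMAX) * scale

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))
w_true = np.array([0.8, -0.3, 0.5, -0.9])
y = X @ w_true

w = np.zeros(4)        # latent full-precision weights updated by the optimizer
scale = 1.0 / QMAX     # fixed scale, assumed for illustration
lr = 0.05

initial_loss = float(np.mean((X @ fake_quantize(w, scale) - y) ** 2))
for step in range(200):
    w_q = fake_quantize(w, scale)            # forward pass sees quantized weights
    error = X @ w_q - y
    grad_w_q = X.T @ error / len(X)          # gradient with respect to w_q
    # Straight-through estimator: treat round() as the identity and pass the
    # gradient only where w lies inside the clipping range.
    inside = np.abs(w / scale) <= QMAX
    w -= lr * grad_w_q * inside

final_loss = float(np.mean((X @ fake_quantize(w, scale) - y) ** 2))
print(initial_loss, final_loss)
```

The optimizer updates the latent full-precision weights, but the loss is always measured with their quantized counterparts, so the weights settle near grid points that work well after quantization.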
Common quantization schemes include:
- Uniform quantization: bins are evenly spaced across the range [min, max].
- Non-uniform quantization: bin widths are adjusted based on the distribution of values (e.g., log-scale for exponential distributions).
- Per-layer quantization: a single scale factor is used for all weights in a layer.
- Per-channel quantization: separate scale factors for each output channel, improving accuracy at modest computational cost.
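The difference between per-layer and per-channel scale factors can be seen on a synthetic weight matrix whose output channels have very different magnitudes. This is a sketch under assumed conditions: the matrix shape, the channel magnitudes, and the max-absolute-value scale rule are illustrative choices.

```python
import numpy as np

def quantize_dequantize(w, scale, qmax=127):
    """Symmetric INT8 round trip; scale may be a scalar or broadcastable array."""
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
# Four output channels (rows) whose magnitudes differ by orders of magnitude.
channel_magnitudes = np.array([[0.01], [0.1], [1.0], [10.0]])
W = rng.normal(size=(4, 256)) * channel_magnitudes

per_layer_scale = np.max(np.abs(W)) / 127                            # one scalar
per_channel_scale = np.max(np.abs(W), axis=1, keepdims=True) / 127   # shape (4, 1)

# Mean squared reconstruction error for each output channel.
err_layer = np.mean((W - quantize_dequantize(W, per_layer_scale)) ** 2, axis=1)
err_channel = np.mean((W - quantize_dequantize(W, per_channel_scale)) ** 2, axis=1)
print(err_layer)
print(err_channel)
```

With a single per-layer scale, the largest channel dictates the step size and the small channels collapse toward zero. Per-channel scales cost one extra scale value per output channel but give each row its own resolution.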
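Widely used post-training methods for large language models include GPTQ and AWQ (Activation-aware Weight Quantization). GPTQ quantizes weights layer by layer and uses approximate second-order information to adjust the remaining weights so that each layer's output error stays small. AWQ uses activation statistics to identify the small fraction of weight channels that matter most for output quality and protects them by rescaling before quantization, rather than by keeping them at higher precision.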
| Term | Distinction |
|---|---|
| Quantization vs. Pruning | Pruning removes parameters entirely; quantization reduces their precision. Pruning changes sparsity; quantization preserves density but uses fewer bits per parameter. |
| Quantization vs. Prompt Caching | Prompt caching stores the intermediate attention state (key and value tensors) computed for a repeated prompt prefix so that it does not have to be recomputed; quantization reduces the bit-width of the computation itself. They address different efficiency bottlenecks and are complementary. |
| Quantization vs. Fine-tuning | Fine-tuning retrains a model on task-specific data to improve accuracy. Quantization reduces precision to improve efficiency. Quantization-aware training combines both. |
| Quantization vs. Distillation | Distillation transfers knowledge from a large model to a smaller one, reducing the number of parameters; quantization reduces precision while keeping the same model. Distillation requires a teacher model and additional training, and the two techniques can be combined. |
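| Symmetric vs. Asymmetric Quantization | Symmetric quantization uses a range centered on zero; asymmetric quantization adds a zero-point offset so that the range can cover any interval [min, max]. Asymmetric quantization often better matches skewed distributions, such as activations after a ReLU. |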
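The last row of the table can be made concrete with a small NumPy comparison on non-negative, ReLU-like data. This is an illustrative sketch: the function names, the min/max range rule, and the synthetic activations are assumptions, and real frameworks differ in how they pick ranges and integer types.

```python
import numpy as np

def quantize_asymmetric(x, num_bits=8):
    """Affine quantization to unsigned integers using a zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so that zero is exactly representable.
    x_min = min(float(np.min(x)), 0.0)
    x_max = max(float(np.max(x)), 0.0)
    scale = (x_max - x_min) / (qmax - qmin) or 1.0
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_asymmetric(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

def symmetric_roundtrip(x, num_bits=8):
    """Symmetric quantize-dequantize with a range centered on zero."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = float(np.max(np.abs(x))) / qmax or 1.0
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
activations = np.maximum(rng.normal(size=10_000), 0.0).astype(np.float32)  # ReLU-like

q, scale, zero_point = quantize_asymmetric(activations)
restored = dequantize_asymmetric(q, scale, zero_point)
mse_asym = float(np.mean((activations - restored) ** 2))
mse_sym = float(np.mean((activations - symmetric_roundtrip(activations)) ** 2))
print(zero_point, mse_asym, mse_sym)
```

Because the data is never negative, the symmetric scheme spends half of its integer levels on values that do not occur, while the asymmetric scheme spreads all 256 levels over the occupied range.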
Examples
- GPTQ (Generative Pretrained Transformer Quantization)
- GPTQ is a post-training quantization method that performs one-shot quantization on a large language model using a small calibration dataset. It quantizes models as large as OPT-175B to 3 or 4 bits per weight with small accuracy loss, sharply reducing the GPU memory needed for inference.[3] The technique is widely used in open-source model serving frameworks.
- Llama 2 INT8 Quantization
- Meta's Llama 2 models are often quantized to INT8 for inference on memory-constrained hardware. This reduces weight storage by about 75% relative to FP32 (about 50% relative to FP16) while maintaining acceptable performance on standard benchmarks and real-world tasks.
- AWQ (Activation-Aware Weight Quantization)
- AWQ is a post-training, weight-only quantization method that uses activation statistics to find the small fraction of weight channels most important for output quality. It protects those channels by scaling them before quantization, so all weights can be stored at the same low precision. Models quantized to INT4 with AWQ are reported to show near-FP16 performance on downstream tasks, enabling efficient inference on hardware with limited memory.
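The size reductions quoted in these examples follow from simple arithmetic on bits per parameter. The sketch below assumes a 7-billion-parameter model purely for illustration and ignores the small overhead of scale factors and zero points; the helper name `weight_size_gb` is hypothetical.

```python
BITS_PER_PARAM = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}

def weight_size_gb(num_params, bits):
    """Weight storage in gigabytes (10**9 bytes), ignoring quantization metadata."""
    return num_params * bits / 8 / 1e9

num_params = 7_000_000_000   # assumed 7B-parameter model, for illustration only
sizes = {name: weight_size_gb(num_params, bits) for name, bits in BITS_PER_PARAM.items()}

for name, size in sizes.items():
    reduction = 1 - size / sizes["FP32"]
    print(f"{name}: {size:.1f} GB ({reduction:.0%} smaller than FP32)")
```

Under these assumptions the weights take 28 GB in FP32, 14 GB in FP16, 7 GB in INT8, and 3.5 GB in INT4, so a 75% reduction corresponds to INT8 measured against an FP32 baseline.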
See also
- LLM Optimization — broader strategies for reducing model compute and memory
- Inference infrastructure — systems and hardware for deploying quantized models
- Foundation model — the types of models typically subject to quantization
- Fine-tuning — preparation and adaptation techniques that can be combined with quantization
- Latency vs throughput (LLM) — performance metrics affected by quantization
References
- Jacob et al. "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference." CVPR 2018.
- Hubara et al. "Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations." arXiv:1609.07061, 2016.
- Frantar et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023.