Attention mechanism
Overview
The attention mechanism is a computational operation that computes content-based weights over sequence positions, allowing neural networks to prioritize relevant information dynamically, regardless of how far apart the positions are. First introduced for neural machine translation, attention lets models move beyond strictly sequential processing by establishing direct connections between any two positions in an input or output sequence.[1]
The mechanism operates by computing query, key, and value representations from input tokens, then using the query-key dot product to generate alignment scores that are normalized into probability weights. These weights are applied to the value representations to produce context-aware outputs that integrate information from multiple sequence positions. This operation became foundational to transformer-based systems, including large language models (LLMs), because it models long-range dependencies that sequential architectures struggle to capture.
Attention mechanisms differ from recurrent architectures in that they do not process the input one step at a time. Information at position *i* can directly influence the output at position *j* through a single weighted computation, rather than having to propagate through every intermediate step. This property helps models make effective use of long context windows and scale to longer documents.
How it works
Attention operates in three primary steps:
1. Projection: Input tokens are transformed into three representations:
- Query (Q): Represents what the model is looking for
- Key (K): Represents what each position offers
- Value (V): Represents the actual content to aggregate
2. Scoring: For each query position, a similarity score is computed against all key positions using the scaled dot product: score(Q, K) = QK^T / √d_k, where d_k is the key dimension.[2]
3. Normalization and aggregation: Scores are normalized using softmax to produce weights in [0,1] that sum to 1 across key positions. The weights are then applied to the values: Attention(Q, K, V) = softmax(QK^T / √d_k)V.
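The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function names and the toy Q, K, and V matrices are made up for the example, and the projection step is assumed to have already produced Q, K, and V.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v (lists of lists).

    Returns (outputs, weights), where outputs is n_q x d_v
    and weights is n_q x n_k.
    """
    d_k = len(K[0])
    d_v = len(V[0])
    outputs, all_weights = [], []
    for q in Q:
        # Scoring: dot product of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Normalization: softmax turns the scores into weights that sum to 1.
        weights = softmax(scores)
        # Aggregation: weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy inputs (illustrative only): 2 queries, 3 key/value positions, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = scaled_dot_product_attention(Q, K, V)
```

In this toy case the first query has the same dot product with the first and third keys, so those two positions receive equal weight and the second position receives less.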
In multi-head attention, this operation is repeated in parallel with different learned projections, allowing the model to attend to different semantic subspaces simultaneously. The outputs are concatenated and projected to produce the final attention output.
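A rough sketch of multi-head attention, under simplifying assumptions, might look like the following. The helper names (`matmul`, `random_matrix`, `multi_head_attention`) and the dimensions are hypothetical, and the projection matrices are filled with random numbers purely as stand-ins for parameters that would be learned in a real model.

```python
import math
import random

def matmul(A, B):
    """Multiply an (n x m) matrix by an (m x p) matrix, both lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    """Stand-in for a learned projection matrix (random here, trained in practice)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_o):
    """X is n x d_model. heads is a list of (W_q, W_k, W_v) tuples, each matrix
    d_model x d_head. W_o is the (num_heads * d_head) x d_model output projection."""
    # Each head applies its own projections, then runs attention independently.
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the per-head outputs along the feature dimension.
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(X))]
    # Final projection back to d_model.
    return matmul(concat, W_o)

# Illustrative sizes: 4 tokens, d_model = 8, 2 heads of width 4.
rng = random.Random(0)
n, d_model, num_heads, d_head = 4, 8, 2, 4
X = random_matrix(n, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(num_heads)]
W_o = random_matrix(num_heads * d_head, d_model, rng)
Y = multi_head_attention(X, heads, W_o)  # shape: n x d_model
```

The heads here run in a Python loop for readability; practical implementations usually batch them into a single tensor operation.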
The attention weights themselves are not stored parameters; they are recomputed for every input. What is learned through backpropagation are the query, key, and value projection matrices, which evolve during training to capture task-relevant alignment patterns. This learned, input-dependent weighting is what distinguishes attention from fixed pooling or concatenation strategies.
| Term | Distinction |
|---|---|
| Embeddings | Embeddings are static input representations; attention computes dynamic, context-dependent weights *over* embeddings. Embeddings convert tokens to vectors; attention determines which vectors to prioritize. |
| Context window | The context window defines the maximum sequence length the model can process; the attention mechanism is the operation that makes effective use of that window by weighting positions within it. |
| In-context learning | In-context learning describes the model's ability to adapt behavior based on examples in the input; attention is the architectural mechanism that enables the model to reference and weight those examples. |
| Self-attention vs. cross-attention | Self-attention computes weights over positions *within* a single sequence. Cross-attention computes weights from one sequence (decoder) over another (encoder), used in sequence-to-sequence models. |
| Prompt caching | Prompt caching stores pre-computed intermediate attention state (typically the key and value tensors) for repeated prompt prefixes to reduce latency. The attention mechanism is the underlying operation whose intermediate results are cached. |
| Knowledge graphs | Knowledge graphs are structured representations of facts; attention mechanisms are computational operations that may be applied *over* knowledge graph embeddings or used to retrieve from them. |
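The self-attention versus cross-attention distinction in the table comes down to where the queries and keys originate. The sketch below is illustrative only: the `attention_weights` helper and the toy encoder and decoder states are invented for the example, and the learned projections are omitted so that the raw states serve directly as queries and keys.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Return the (n_queries x n_keys) matrix of scaled dot-product weights."""
    d_k = len(keys[0])
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys])
        for q in queries
    ]

# Toy hidden states (illustrative values, projections omitted).
decoder_states = [[0.2, 0.1], [0.9, 0.4]]               # 2 target positions
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # 3 source positions

# Self-attention: queries and keys come from the same sequence.
self_weights = attention_weights(decoder_states, decoder_states)    # 2 x 2

# Cross-attention: queries come from the decoder, keys from the encoder.
cross_weights = attention_weights(decoder_states, encoder_states)   # 2 x 3
```

The same function serves both cases; only the source of the keys (and, in a full implementation, the values) changes.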
Examples
- Transformer language models (GPT, Claude, LLaMA): These models use stacked multi-head attention layers across the entire architecture. Each layer's attention computes which previous tokens are most relevant for predicting the next token, enabling the model to maintain coherence over thousands of tokens within the context window.
- Machine translation (seq2seq with attention): In encoder-decoder architectures, the decoder uses cross-attention to align with source-language tokens when generating translations.[1] For example, when translating "the cat sat," the decoder's attention weights while producing "le chat" peak over the positions corresponding to "the" and "cat." Because this alignment is computed from content rather than position, it also holds when word order differs between languages.
- Retrieval-augmented generation (RAG): Retrieved documents or passages are placed in the model's context, where attention weights their tokens against the query so the model can selectively integrate relevant information. The retrieval step itself typically ranks candidates with a separate similarity measure, such as embedding similarity. This weighting is foundational to RAG systems and contextual retrieval pipelines.
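For the language-model example above, restricting each position to previous tokens is commonly achieved with a causal mask. The following is a minimal sketch of that idea, assuming the common approach of setting future-position scores to negative infinity before the softmax; the function name and toy vectors are hypothetical.

```python
import math

def causal_attention_weights(Q, K):
    """Scaled dot-product weights where position i may only attend to j <= i."""
    d_k = len(K[0])
    rows = []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if j > i:
                # Future position: masked out so it receives zero weight.
                scores.append(float("-inf"))
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k))
        # Softmax over the row; exp(-inf) evaluates to 0.0.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

# Toy token representations (illustrative values), used as both queries and keys.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = causal_attention_weights(X, X)
# The first position can only attend to itself, so weights[0] is [1.0, 0.0, 0.0].
```

The resulting weight matrix is lower-triangular: every entry above the diagonal is zero, and each row still sums to 1.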
See also
- Large language model: Model class in which attention is a core computational primitive
- Transformer architecture: Model family built primarily around attention layers, interleaved with feed-forward layers
- Context window: Sequence length scope within which attention operates
- In-context learning: Capability enabled by attention's ability to weight examples in the input
- Fine-tuning: Process that updates model parameters, including attention projection weights, for task-specific behavior
References
- [1] Bahdanau, D., Cho, K., & Bengio, Y. "Neural Machine Translation by Jointly Learning to Align and Translate." ICLR 2015. https://arxiv.org/abs/1409.0473
- [2] Vaswani, A., Shazeer, N., Parmar, N., et al. "Attention Is All You Need." NeurIPS 2017. https://arxiv.org/abs/1706.03762