Attention mechanism

Attention mechanism — Learned, content-based weighting mechanism that scores relevance between sequence positions, enabling models to selectively focus on distant context.

Overview

An attention mechanism is a computational operation that assigns content-based weights to sequence positions, allowing a neural network to prioritize relevant information dynamically, regardless of how far apart the positions are. First introduced for machine translation, attention lets models move beyond strictly sequential processing by establishing direct connections between any two positions in an input or output sequence.^[1]

In its now-standard form, the mechanism computes query, key, and value representations from input tokens, then uses the query-key dot product to generate alignment scores that are normalized into probability weights. These weights are applied to the value representations to produce context-aware outputs that integrate information from multiple sequence positions. This design became foundational to transformer-based systems, including large language models (LLMs), because it models long-range dependencies that sequential architectures struggle to capture.

Attention mechanisms differ from recurrent architectures in that they do not process the input one step at a time. Information at position *i* can directly influence the output at position *j* through a single weighted computation, rather than having to propagate through every intermediate step. This property matters for making effective use of a long context window and for scaling models to longer documents.

How it works

Attention operates in three primary steps:

1. Projection: Input tokens are transformed into three representations:

Query (Q): Represents what the model is looking for
Key (K): Represents what each position offers
Value (V): Represents the actual content to aggregate

2. Scoring: For each query position, a similarity score is computed against all key positions using the scaled dot product: score(Q, K) = QK^T / √d_k, where d_k is the key dimension.^[2]

3. Normalization and aggregation: Scores are normalized using softmax to produce non-negative weights that sum to 1 for each query position. The weights are then applied to the values: Attention(Q, K, V) = softmax(QK^T / √d_k)V.
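The scoring and aggregation steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes Q, K, and V have already been produced by the projection step, stores them as lists of row vectors, and uses hand-picked toy values. The function names `softmax` and `attention` are chosen here for illustration.

```python
import math

def softmax(scores):
    """Normalize scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors.

    Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v.
    Returns (outputs, weights).
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Step 2: score the query against every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Step 3a: normalize the scores into weights.
        w = softmax(scores)
        # Step 3b: weighted sum of the value vectors.
        out = [sum(w_j * v[c] for w_j, v in zip(w, V)) for c in range(len(V[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy input (made up for illustration): the query is most similar to the first key.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
outputs, weights = attention(Q, K, V)
```

Here the first key receives the largest weight and the third the smallest, so the output vector is pulled toward the first value row.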

In multi-head attention, this operation is repeated in parallel with different learned projections, allowing the model to attend to different semantic subspaces simultaneously. The outputs are concatenated and projected to produce the final attention output.
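A rough sketch of the multi-head arrangement follows, again in plain Python. The sizes (`d_model = 4`, two heads), the random weight initialization, and helper names such as `matmul` and `random_matrix` are assumptions made for illustration; in a trained model the projection matrices would be learned rather than sampled.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    """Stand-in for a learned weight matrix (hypothetical initialization)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) triples, one per head."""
    head_outputs = []
    for W_q, W_k, W_v in heads:
        # Each head projects the same input with its own matrices.
        head_outputs.append(attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)))
    # Concatenate the per-head outputs along the feature axis, position by position.
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(X))]
    # Final output projection back to the model dimension.
    return matmul(concat, W_o)

rng = random.Random(0)
d_model, n_heads = 4, 2
d_head = d_model // n_heads
X = random_matrix(3, d_model, rng)  # three positions, d_model features each
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]
W_o = random_matrix(n_heads * d_head, d_model, rng)
Y = multi_head_attention(X, heads, W_o)
```

Splitting `d_model` evenly across heads, as done here, keeps the concatenated width equal to the model dimension; this is a common convention rather than a requirement.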

The attention weights themselves are not stored parameters: they are recomputed for every input. What is learned through backpropagation are the query, key, and value projection matrices, which evolve to capture task-relevant alignment patterns. This learned, input-dependent weighting is what distinguishes attention from fixed pooling or concatenation strategies.

Distinction from related terms

Term	Distinction
Embeddings	Embeddings are static input representations; attention computes dynamic, context-dependent weights over embeddings. Embeddings convert tokens to vectors; attention determines which vectors to prioritize.
Context window	The context window defines the maximum sequence length the model can process; the attention mechanism is the technique that makes effective use of that window by weighting positions.
In-context learning	In-context learning describes the model's ability to adapt behavior based on examples in the input; attention is the architectural mechanism that enables the model to reference and weight those examples.
Self-attention vs. cross-attention	Self-attention computes weights over positions within a single sequence. Cross-attention computes weights from one sequence (decoder) over another (encoder), used in sequence-to-sequence models.
Prompt caching	Prompt caching stores pre-computed attention state for repeated prompt prefixes, typically the key and value tensors, to reduce latency. The attention mechanism is the underlying operation whose intermediate results are cached.
Knowledge graphs	Knowledge graphs are structured representations of facts; attention mechanisms are computational operations that may be applied over knowledge graph embeddings or used to retrieve from them.

Examples

Transformer language models (GPT, Claude, LLaMA): These models use stacked multi-head attention layers across the entire architecture. Each layer's attention computes which previous tokens are most relevant for predicting the next token, enabling the model to maintain coherence over thousands of tokens within the context window.
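Restricting each position to previous tokens is commonly done with a causal mask: scores for future positions are set to negative infinity before the softmax, so their weights become zero. The sketch below illustrates this under stated assumptions. The function name `causal_attention_weights` and the toy vectors are hypothetical, and the same vectors are used as both queries and keys for brevity, whereas a real model would apply separate learned projections.

```python
import math

def causal_attention_weights(Q, K):
    """Attention weights where position i may only attend to positions j <= i."""
    d_k = len(K[0])
    all_weights = []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if j <= i:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k))
            else:
                scores.append(float("-inf"))  # masked: token j lies in the future
        peak = max(scores)  # always finite, because position i itself is unmasked
        exps = [math.exp(s - peak) for s in scores]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        all_weights.append([e / total for e in exps])
    return all_weights

# Toy token vectors (made up for illustration).
X = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
W = causal_attention_weights(X, X)
```

The first row of `W` places all of its weight on position 0, since no earlier tokens exist, and every entry above the diagonal is zero.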

Machine translation (seq2seq with attention): In encoder-decoder architectures, the decoder uses cross-attention to align with source-language tokens when generating translations.^[1] For example, when translating "the cat sat," the decoder's attention weights for "le chat" peak over positions corresponding to "cat" and "the," enabling accurate reordering across languages.
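The alignment idea can be illustrated with a toy cross-attention computation in which queries come from the target side and keys from the source side. The vectors below are hand-picked so that the alignment is obvious; they are not taken from a trained model, and the function name `cross_attention_weights` is hypothetical. The sketch uses scaled dot-product scoring, whereas the original translation model in [1] used an additive scoring function.

```python
import math

def cross_attention_weights(decoder_queries, encoder_keys):
    """One row of weights per target position, one column per source position."""
    d_k = len(encoder_keys[0])
    rows = []
    for q in decoder_queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in encoder_keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

source_tokens = ["the", "cat", "sat"]
target_tokens = ["le", "chat"]

# Hand-picked toy vectors: each target query points along the key of its source word.
encoder_keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
decoder_queries = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0]]

weights = cross_attention_weights(decoder_queries, encoder_keys)

# For each target token, pick the source token with the largest weight.
aligned = [source_tokens[max(range(len(w)), key=w.__getitem__)] for w in weights]
```

Unlike self-attention, the weight matrix here is rectangular (target positions by source positions), which is what allows the two sequences to differ in length and word order.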

Retrieval-augmented generation (RAG): Retrieved documents or passages are supplied to the model as additional context, and attention weights them by computing similarities between the model's queries and the tokens of each candidate passage, allowing the model to integrate relevant information selectively. This is foundational to RAG systems and contextual retrieval pipelines.

References

[1] Bahdanau, D., Cho, K., & Bengio, Y. "Neural Machine Translation by Jointly Learning to Align and Translate." ICLR 2015. https://arxiv.org/abs/1409.0473
[2] Vaswani, A., Shazeer, N., Parmar, N., et al. "Attention Is All You Need." NeurIPS 2017. https://arxiv.org/abs/1706.03762
