Transformer architecture
Overview
The transformer architecture is a deep learning model introduced by Vaswani et al. in 2017 that replaces recurrence and convolutions entirely with self-attention mechanisms.[1] It forms the structural foundation of modern large language models including GPT, BERT, and Claude-series systems. The architecture's core innovation—the self-attention mechanism—allows the model to dynamically weight relationships between all positions in an input sequence simultaneously, rather than processing tokens sequentially as in recurrent architectures.
Transformers were originally proposed as an encoder-decoder structure, though decoder-only variants are more common in contemporary LLMs. Each component stacks multiple identical layers, where each layer combines multi-head self-attention with a position-wise feedforward network. Token embeddings are augmented with positional encodings to preserve sequence order, since attention on its own is insensitive to the order of its inputs. Because every position in a sequence can be processed in parallel during training, the design makes effective use of parallel hardware and permits training on substantially larger datasets than recurrent architectures could practically accommodate.
The transformer's parallelizable nature—computing attention over all sequence positions at once—made it feasible to train models at unprecedented scale. This scalability directly enabled the emergence of foundation models and underpinned subsequent advances in in-context learning, instruction tuning, and reinforcement learning from human feedback.
How it works
The transformer processes input sequences through the following principal mechanisms:
Tokenization and Embeddings: Input text is first converted to tokens via tokenization, then mapped to high-dimensional vectors (embeddings). Positional encodings—typically sinusoidal functions or learned vectors—are added to preserve sequence position information.
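The embedding and positional-encoding step can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the toy vocabulary size, the embedding width, the random embedding table, and the token ids are all hypothetical, and the function assumes an even `d_model`. It uses the sinusoidal scheme; learned positional vectors would simply be another lookup table.

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings.

    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)                # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles)                # odd dimensions use cosine
    return encoding

# Hypothetical toy setup: a 10-token vocabulary and 8-dimensional embeddings.
rng = np.random.default_rng(0)
vocab_size, d_model = 10, 8
embedding_table = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([3, 1, 4, 1])                    # stand-in for tokenizer output
x = embedding_table[token_ids] + sinusoidal_positions(len(token_ids), d_model)
print(x.shape)  # (4, 8)
```

Token id 1 appears twice in the toy input, yet its two rows in `x` differ because each receives a different positional encoding.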
Self-Attention: At the heart of each transformer layer, the self-attention mechanism computes three derived representations for each token: query (Q), key (K), and value (V). Attention scores are computed as scaled dot-products between queries and keys, normalized via softmax, then used to weight values:
<math>\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V</math>
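In its unmasked form, this lets each token attend to any earlier or later position, with weights that depend on the learned query and key projections. Multi-head attention runs this mechanism in parallel across multiple representation subspaces, then concatenates the per-head outputs and projects them back to the model dimension, enabling the model to attend to different types of relationships simultaneously.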
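The attention formula and its multi-head extension can be sketched in NumPy as below. The weight matrices are random stand-ins for learned parameters, the sizes are arbitrary, and the helper names (`attention`, `multi_head_attention`, `split_heads`) are illustrative rather than taken from any library. The sketch assumes `d_model` is divisible by `num_heads`.

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores)
    return weights @ v, weights

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention over x of shape (seq_len, d_model) using num_heads heads."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    heads, _ = attention(q, k, v)                     # (num_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o                               # final output projection

# Hypothetical sizes and random weights standing in for learned parameters.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
_, weights = attention(x @ w_q, x @ w_k, x @ w_v)     # single-head weights for inspection
print(out.shape)      # (5, 16)
print(weights.shape)  # (5, 5)
```

Each row of `weights` is a probability distribution over the sequence positions, which is what the softmax in the formula guarantees.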
Feedforward Networks: Following attention, each layer contains a position-wise feedforward network—two fully connected layers with a non-linear activation (typically ReLU or GELU)—applied identically to each sequence position.
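A position-wise feedforward network might look like the following NumPy sketch. The hidden width of four times the model width is a common choice rather than a requirement, the GELU here is the tanh approximation, and the random weights are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, w1, b1, w2, b2):
    """Two fully connected layers applied independently to every row (position) of x."""
    hidden = gelu(x @ w1 + b1)
    return hidden @ w2 + b2

# Hypothetical sizes; d_ff = 4 * d_model is an assumed, commonly used ratio.
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
d_ff = 4 * d_model
x = rng.normal(size=(seq_len, d_model))
w1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)

y = feed_forward(x, w1, b1, w2, b2)
row_2_alone = feed_forward(x[2:3], w1, b1, w2, b2)[0]
print(y.shape)                         # (4, 8)
print(np.allclose(y[2], row_2_alone))  # True
```

The last line illustrates what "position-wise" means: running the network on one position in isolation gives the same result as running it on the whole sequence, because no information flows between positions in this sublayer.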
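Layer Normalization and Residual Connections: Each sublayer is wrapped in a residual (skip) connection and paired with layer normalization, which stabilizes training and gradient flow through deep stacks. The original design normalizes the sum of the sublayer's input and output; many later models instead normalize the input before the sublayer (pre-norm).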
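The way normalization and residual connections wrap a sublayer can be sketched as follows. Two orderings are shown: the one from the original paper, LayerNorm(x + Sublayer(x)), and the pre-norm variant x + Sublayer(LayerNorm(x)) used by many later models. The `sublayer` here is a hypothetical stand-in (a single matrix multiply) for attention or the feedforward network, and the function names are illustrative.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each position's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def post_ln_sublayer(x, sublayer, gamma, beta):
    """Original ordering: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x), gamma, beta)

def pre_ln_sublayer(x, sublayer, gamma, beta):
    """Pre-norm ordering: x + Sublayer(LayerNorm(x))."""
    return x + sublayer(layer_norm(x, gamma, beta))

# Hypothetical stand-in sublayer and parameters.
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))
w = rng.normal(size=(d_model, d_model)) * 0.1
gamma, beta = np.ones(d_model), np.zeros(d_model)

def sublayer(t):
    return t @ w

post = post_ln_sublayer(x, sublayer, gamma, beta)
pre = pre_ln_sublayer(x, sublayer, gamma, beta)
print(post.shape, pre.shape)  # (4, 8) (4, 8)
```

In the pre-norm form the input `x` reaches the output through an unnormalized identity path, which is one reason it is often reported to train more stably in very deep stacks.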
Decoder Variants: In decoder-only transformers (standard for modern LLMs), self-attention is causally masked: each token can attend only to positions at or before itself. This enforces left-to-right generation and lets every position serve as a next-token prediction target in a single training pass. Instruction-tuned and RLHF-trained models inherit the same masking, since they also generate responses one token at a time.
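Causal masking can be sketched by setting the scores for future positions to negative infinity before the softmax, so that they receive zero weight. In this simplified illustration the learned query, key, and value projections are omitted and the input is used directly for all three; the function name and sizes are hypothetical.

```python
import numpy as np

def causal_attention(q, k, v):
    """Scaled dot-product attention where position i sees only positions <= i."""
    seq_len, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)
    # True above the diagonal marks future positions that must be hidden.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                          # exp(-inf) becomes exactly 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                           # hypothetical token representations
out, weights = causal_attention(x, x, x)

print(np.round(weights, 2))                           # lower-triangular weight matrix
```

A useful consequence is that changing a later token cannot alter the outputs at earlier positions, which is what allows generation to proceed one token at a time.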
The context window length determines the maximum sequence length the model can process; attention computation scales quadratically with sequence length, creating both computational and memory constraints that determine practical window sizes.
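The quadratic growth can be made concrete with a little arithmetic. The sketch below counts the memory needed to hold one layer's full attention score matrices, assuming a naive implementation that materializes every score, 16 heads, and 2 bytes per value; all three are illustrative assumptions, not properties of any particular model.

```python
def attention_score_bytes(seq_len: int, num_heads: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to store one layer's (seq_len x seq_len) scores for every head."""
    return num_heads * seq_len * seq_len * bytes_per_value

for n in (1_024, 2_048, 4_096):
    mib = attention_score_bytes(n, num_heads=16) / 2**20
    print(f"{n:>5} tokens: {mib:,.0f} MiB")
# 1024 tokens: 32 MiB
# 2048 tokens: 128 MiB
# 4096 tokens: 512 MiB
```

Doubling the sequence length quadruples the figure, which is the quadratic scaling described above.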
| Comparison | Distinction |
|---|---|
| Transformer vs. Recurrent Neural Network (RNN/LSTM) | Transformers process entire sequences in parallel via attention; RNNs process tokens sequentially, which makes them slower to train on parallel hardware, although their fixed-size hidden state is more memory-efficient for very long sequences. Transformers now dominate LLM design. |
| Transformer vs. Convolutional Neural Network (CNN) | Transformers use self-attention with global receptive fields from the first layer; CNNs use local convolution kernels requiring many layers to achieve global context. Transformers are now preferred for language; CNNs remain widely used in vision, alongside transformer-based vision models. |
| Self-attention vs. Cross-attention | Self-attention compares tokens within the same sequence to each other; cross-attention (in encoder-decoder models) allows the decoder to attend to encoder outputs. Most RAG systems place retrieved documents in the prompt, where ordinary self-attention processes them; some architectures instead attend to retrieved content through cross-attention. |
| Transformer (architecture) vs. Transformer (model) | The transformer architecture is the underlying computational design; a "transformer model" is a specific instantiation (e.g., GPT-3) trained on specific data. Foundation models are large transformers trained on broad corpora. |
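| Decoder-only vs. Encoder-Decoder | Decoder-only transformers (GPT-style) use causally masked self-attention for generation; encoder-decoder models (T5-style) pair a bidirectional encoder with a causally masked decoder that cross-attends to the encoder outputs. Encoder-only models such as BERT use bidirectional self-attention and are not designed for generation. Modern LLMs predominantly use the decoder-only design. |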
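The self-attention versus cross-attention distinction in the table above comes down to where the queries, keys, and values originate. A minimal NumPy sketch of cross-attention, with hypothetical sizes and random matrices standing in for learned projections:

```python
import numpy as np

def cross_attention(decoder_states, encoder_outputs, w_q, w_k, w_v):
    """Queries come from the decoder; keys and values come from the encoder."""
    q = decoder_states @ w_q
    k = encoder_outputs @ w_k
    v = encoder_outputs @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Hypothetical example: 7 encoded source tokens, 3 decoder positions.
rng = np.random.default_rng(0)
d_model = 8
encoder_outputs = rng.normal(size=(7, d_model))
decoder_states = rng.normal(size=(3, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))

context, weights = cross_attention(decoder_states, encoder_outputs, w_q, w_k, w_v)
print(context.shape)  # (3, 8)
print(weights.shape)  # (3, 7)
```

The weight matrix is rectangular, one row per decoder position and one column per encoder position, whereas self-attention over a single sequence always produces a square matrix.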
Examples
- GPT Series (OpenAI): Decoder-only transformer models trained on broad internet text. GPT-3 (175 billion parameters) and GPT-4 exemplify scaling transformer architectures for in-context learning and instruction-following. GPT-3 alternates dense and locally banded sparse attention patterns across its layers; OpenAI has not published architectural details for GPT-4, so claims about mixture-of-experts variants in that model remain unconfirmed.
- BERT (Google): Encoder-only transformer trained via masked language modeling on bidirectional context. Established transformers as effective feature extractors for semantic search and downstream fine-tuning, though BERT's non-causal attention makes it unsuitable for generation without modification.
- T5 (Google): Encoder-decoder transformer treating all NLP tasks as text-to-text (sequence-to-sequence) problems. Demonstrates how the full transformer architecture, with cross-attention from the decoder to the encoder outputs, supports diverse downstream tasks, including use as the generator in retrieval-augmented generation pipelines.
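As a rough sanity check on the GPT-3 parameter count quoted above, the weights in the transformer layers can be estimated from the layer count and model width. The sketch assumes four attention projection matrices and a feedforward hidden width of four times the model width, and it ignores embeddings, biases, and normalization parameters. The values 96 layers and a model width of 12,288 are the figures reported in the GPT-3 paper, not in this article, and the function name is illustrative.

```python
def approx_layer_params(num_layers: int, d_model: int, ffn_multiplier: int = 4) -> int:
    """Approximate weight count in the attention and feedforward sublayers."""
    attention = 4 * d_model * d_model                        # W_Q, W_K, W_V, W_O
    feed_forward = 2 * d_model * (ffn_multiplier * d_model)  # two dense layers
    return num_layers * (attention + feed_forward)

total = approx_layer_params(num_layers=96, d_model=12288)
print(f"{total / 1e9:.1f}B")  # 173.9B
```

The estimate lands close to the quoted 175 billion; most of the remainder is accounted for by the embedding tables, which this approximation leaves out.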
See also
- Large language model — The primary application domain for transformer architectures
- Foundation model — Large-scale transformers trained on broad corpora
- Context window — Length constraint directly tied to transformer's quadratic attention complexity
- Embeddings — Token and positional representations input to transformer layers
- In-context learning — Capability emergent from transformer scale and architecture
- Instruction tuning — Training method typically applied to decoder-only transformers
- Retrieval-augmented generation — System architecture that supplies retrieved documents to a transformer, usually in the prompt and in some designs via cross-attention
- RLHF — Post-training method commonly applied to transformer-based LLMs
References
- Vaswani, A., Shazeer, N., Parmar, N., et al. "Attention Is All You Need." NIPS 2017. https://arxiv.org/abs/1706.03762