Embedding model
Overview
An embedding model is a type of foundation model trained to map textual input into a continuous vector space where semantic similarity is reflected in vector proximity. These models produce dense representations, commonly a few hundred to a few thousand dimensions (384, 768, 1536, and 3072 are typical sizes), that encode meaning in a way suitable for downstream retrieval, clustering, and comparison tasks.
Unlike large language models that generate text, embedding models are purpose-built for encoding: they consume text and output a single vector per input sequence. This design makes them efficient for retrieval-augmented generation workflows, where encoded documents and queries must be compared at scale using vector database systems.
The training process typically involves contrastive learning objectives, where the model learns to place semantically similar text pairs close together in vector space and dissimilar pairs far apart. This enables semantic search applications without requiring keyword overlap. Many recent embedding models are also instruction-tuned, so a task description supplied with the input can steer the embedding toward a particular retrieval task, and further fine-tuning can adapt them to a specific domain.
Embedding models form the retrieval component of RAG pipelines, where they encode both queries and document chunks to enable semantic matching before prompt construction. Their performance directly influences retrieval precision and recall in downstream applications.
How it works
Embedding models operate in two stages: encoding the input into contextual token representations, then pooling those representations into a single vector.
First, input text is tokenized using a tokenization scheme (typically subword tokenization). The tokens are passed through a transformer encoder backbone, which applies self-attention across token positions. The transformer produces contextual representations for each token.
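As a rough illustration of the subword tokenization step, the sketch below applies greedy longest-match splitting in the style of WordPiece. The vocabulary `TOY_VOCAB`, the `##` continuation prefix, and the `[CLS]`/`[SEP]`/`[UNK]` markers are assumptions chosen for the example; real models ship their own learned vocabularies and may use other schemes such as BPE.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab):
    """Lowercase, split on whitespace, and wrap subword pieces in special tokens."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    tokens.append("[SEP]")
    return tokens


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"embed", "##ding", "##s", "model", "vector",
             "[UNK]", "[CLS]", "[SEP]"}

tokens = tokenize("Embedding models", TOY_VOCAB)
print(tokens)  # ['[CLS]', 'embed', '##ding', 'model', '##s', '[SEP]']
```

Each resulting token would then be mapped to an integer ID and fed to the transformer encoder, which is omitted here.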
Second, the model aggregates token-level representations into a single vector. Common aggregation strategies include: mean pooling over all tokens, using the representation of a special token (e.g., [CLS]), or attention-weighted pooling. The aggregated representation is optionally normalized to unit length (L2 normalization), placing all vectors on a unit hypersphere.
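The aggregation step can be sketched in a few lines. The example below assumes the encoder has already produced one vector per token and that an attention mask marks padding positions with 0; the function names and the tiny 3-dimensional vectors are illustrative, not taken from any particular library.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            for i, x in enumerate(vec):
                total[i] += x
            count += 1
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [t / count for t in total]


def cls_pool(token_vectors):
    """Use the first token's vector, assuming position 0 holds [CLS]."""
    return list(token_vectors[0])


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


# Toy encoder output: three real tokens followed by one padding position.
token_vectors = [
    [1.0, 0.0, 2.0],  # [CLS]
    [3.0, 4.0, 0.0],
    [2.0, 2.0, 4.0],
    [9.0, 9.0, 9.0],  # padding, excluded by the mask
]
attention_mask = [1, 1, 1, 0]

pooled = mean_pool(token_vectors, attention_mask)  # [2.0, 2.0, 2.0]
embedding = l2_normalize(pooled)                   # unit-length vector
```

Masking matters in batched inference: without it, padding vectors would pull the mean away from the actual content.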
The output vector can then be compared to other vectors using a similarity or distance metric. Cosine similarity is the standard choice; for L2-normalized vectors it equals the dot product. Euclidean distance and other metrics are also supported by most vector database systems.
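A minimal sketch of vector comparison follows, using hand-picked 2-dimensional unit vectors in place of real embeddings. The `top_k` helper is a hypothetical brute-force search, not how a vector database indexes data. For unit-length vectors, squared Euclidean distance equals 2 − 2·cos, so both metrics produce the same ranking.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query, documents, k=2):
    """Brute-force nearest neighbours by cosine similarity (illustrative only)."""
    scored = [(cosine_similarity(query, vec), doc_id)
              for doc_id, vec in documents.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy unit-length "embeddings"; real vectors have hundreds of dimensions.
documents = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.6, 0.8],
    "doc_c": [0.0, 1.0],
}
query = [0.8, 0.6]

print(top_k(query, documents, k=2))  # ['doc_b', 'doc_a']

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b), up to rounding.
cos_qb = cosine_similarity(query, documents["doc_b"])
dist_qb = euclidean_distance(query, documents["doc_b"])
print(abs(dist_qb ** 2 - (2 - 2 * cos_qb)) < 1e-9)  # True
```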
Training uses contrastive objectives such as InfoNCE loss, where a query vector is pulled toward its positive document vectors and pushed away from negatives. Instruction tuning further improves generalization by fine-tuning on annotated datasets where queries and relevant passages are paired with natural language instructions describing the task (e.g., "Retrieve documents relevant to this question").
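To make the contrastive objective concrete, here is a small sketch of InfoNCE with in-batch negatives: for query i, the loss is −log( exp(s(qᵢ, dᵢ)/τ) / Σⱼ exp(s(qᵢ, dⱼ)/τ) ), where s is the dot product. The temperature τ = 0.05 and the use of other examples in the batch as negatives are assumptions for illustration; actual training recipes vary and operate on learned, high-dimensional vectors with gradient updates, which are omitted here.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(queries, positives, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives.

    queries[i] is paired with positives[i]; every positives[j] with j != i
    acts as a negative for query i. Vectors are assumed to be unit-length.
    """
    losses = []
    for i, q in enumerate(queries):
        logits = [dot(q, d) / temperature for d in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each positive matches its query
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each positive matches the other query

loss_aligned = info_nce_loss(queries, aligned)
loss_swapped = info_nce_loss(queries, swapped)
print(loss_aligned < loss_swapped)  # True
```

The loss is near zero when each query is closest to its own positive and grows large when a negative scores higher, which is the signal that drives similar pairs together during training.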
Chunking strategies determine how long-form documents are segmented before embedding, affecting both retrieval coverage and computational cost. Contextual retrieval techniques enhance embedding quality by including surrounding context during encoding.
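One common chunking approach is a fixed-size sliding window with overlap, sketched below. Splitting on whitespace, the default sizes, and the `add_context` helper (which simply prefixes a document title as a crude stand-in for contextual retrieval) are all assumptions for illustration; production systems often chunk by tokens, sentences, or document structure instead.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into word windows of chunk_size that share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def add_context(chunks, doc_title):
    """Hypothetical helper: prefix each chunk with document-level context."""
    return [f"{doc_title}: {chunk}" for chunk in chunks]


text = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(text, chunk_size=4, overlap=1)
print(chunks)  # ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Smaller chunks and larger overlaps improve the chance that a relevant passage is captured whole, at the cost of more vectors to embed, store, and search.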
| Term | Distinction |
|---|---|
| Large language model | LLMs generate sequences of tokens autoregressively; embedding models produce a single fixed-length vector per input. LLMs are used for generation; embedding models are used for retrieval and comparison. |
| Reranker | Embedding models perform first-pass retrieval by vector similarity; rerankers score and order an already-retrieved candidate set using a more computationally expensive comparison. They are often used sequentially in RAG pipelines. |
| BM25 | BM25 is a lexical retrieval method that scores documents by term frequency, inverse document frequency, and document length; embedding models perform semantic retrieval by learned vector similarity. BM25 needs no training, only a few tunable constants (typically k1 and b); embedding models require pre-trained weights. |
| Semantic search | Semantic search is a retrieval task or capability; an embedding model is the technical component that enables semantic search by encoding text into comparable vectors. |
| Embeddings | "Embeddings" is the general term for vector representations of any data; an "embedding model" is the specific neural network that produces those embeddings. |
Examples
OpenAI text-embedding-3-large (2024) produces 3072-dimensional vectors by default, and the API can return shortened vectors through a `dimensions` parameter. OpenAI has not published the full training recipe. The model is widely used in RAG systems to embed both short queries and longer document chunks.
Sentence-BERT (SBERT) is an open-source embedding model family that fine-tunes BERT-style encoders in a siamese setup on sentence-pair datasets, using classification, regression, triplet, or contrastive objectives depending on the variant. Commonly used variants output 384- or 768-dimensional vectors and are widely applied in semantic search and clustering.
Cohere Embed-English-v3.0 produces 1024-dimensional dense vectors and takes an input type parameter that distinguishes search queries from documents. It outputs dense vectors only, so hybrid search approaches pair it with a separate lexical retriever such as BM25 and merge the two result sets.
See also
- Embeddings — the general vector representations produced by embedding models
- Retrieval-augmented generation — the primary application pattern for embedding models
- Vector database — systems that store and retrieve embeddings at scale
- Semantic search — the retrieval task enabled by embedding model similarity
- Chunking strategy — preprocessing decisions that affect embedding quality
- Hybrid search — combining dense embedding retrieval with sparse lexical methods
- Reranker — post-retrieval ranking to refine embedding-based candidate sets