Embeddings
Overview
In the context of language models, embeddings are dense, fixed-length numeric vectors that encode the semantic content of text. Texts with similar meaning tend to map to vectors that are close in vector space (typically measured by cosine similarity or Euclidean distance), even when their wording differs entirely. This property makes embeddings a core building block for semantic search, retrieval-augmented generation (RAG), and clustering.
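The two closeness measures named above can be computed directly. The sketch below uses plain Python and three hypothetical 3-dimensional vectors standing in for real embeddings (actual models output hundreds or thousands of dimensions); the function names are illustrative, not any library's API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between a and b: smaller means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical toy vectors standing in for real embeddings.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# By both measures, "cat" is closer to "kitten" than to "invoice".
print(cosine_similarity(cat, kitten), cosine_similarity(cat, invoice))
print(euclidean_distance(cat, kitten), euclidean_distance(cat, invoice))
```

When vectors are normalized to unit length (some embedding APIs do this by default), cosine similarity reduces to a dot product, and ranking by Euclidean distance gives the same order as ranking by cosine similarity, because ‖a − b‖² = 2 − 2·cos(a, b).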
Text embeddings are produced by an embedding model: a neural network that maps input text to a vector. Embedding models are trained on large corpora with objectives that pull semantically related texts together in vector space and push unrelated ones apart (e.g., contrastive learning on question-answer pairs).
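As an illustration of what a contrastive objective looks like, here is a minimal sketch of an InfoNCE-style loss with in-batch negatives, one common formulation; the exact objective differs from model to model. It assumes each query vector `query_vecs[i]` is paired with the passage vector `passage_vecs[i]`, treats every other passage in the batch as a negative, and uses a hypothetical `temperature` of 0.05. It only computes the loss value; a real training loop would backpropagate it through the encoder.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm

def in_batch_contrastive_loss(query_vecs, passage_vecs, temperature=0.05):
    """InfoNCE-style loss: query i should score highest against its own passage i.

    All other passages in the batch serve as negatives. Lower is better.
    """
    n = len(query_vecs)
    total = 0.0
    for i in range(n):
        # Temperature-scaled similarity of query i to every passage in the batch.
        logits = [cosine_similarity(query_vecs[i], p) / temperature for p in passage_vecs]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability assigned to the matching (diagonal) passage.
        total += log_sum_exp - logits[i]
    return total / n
```

With correctly aligned pairs the loss approaches zero; if the pairing is shuffled so each query's nearest passage is a negative, the loss becomes large. Minimizing it is what pushes related texts together and unrelated ones apart.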
Common embedding dimensions range from 384 to 3072 depending on the model; larger dimensions generally encode richer distinctions but require more storage and compute.
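The storage side of that trade-off is simple arithmetic: raw size = number of vectors × dimensions × bytes per value. The sketch below assumes float32 storage (4 bytes per value) and a hypothetical corpus of one million chunks; it ignores ANN index structures, metadata, and replication, which add to the total.

```python
def index_size_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw size of a flat embedding matrix; the default of 4 bytes assumes float32."""
    return num_vectors * dim * bytes_per_value

# Hypothetical corpus of one million chunks, at three dimensions from the range above.
for dim in (384, 1536, 3072):
    gigabytes = index_size_bytes(1_000_000, dim) / 1e9
    print(f"{dim:>5} dims x 1,000,000 vectors: {gigabytes:.2f} GB")
```

Under these assumptions, 1536 dimensions come to about 6.14 GB and 3072 dimensions to about 12.29 GB of raw vectors; storing values at half precision (2 bytes) would halve each figure.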
LLM-era usage vs. pre-2022 usage
| Aspect | Pre-2022 usage | LLM-era usage |
|---|---|---|
| Dominant models | Word2Vec, GloVe, BERT | OpenAI text-embedding-3, Cohere Embed, open models (BGE, E5) |
| Granularity | Word or sentence level | Chunk / passage level for RAG |
| Primary use | Semantic similarity, classification | Dense retrieval, vector stores, RAG pipelines |
| Deployment | Offline NLP pipelines | Online vector databases and vector-search extensions (Pinecone, Weaviate, pgvector for PostgreSQL) |
In RAG pipelines, the same embedding model embeds each document chunk at index time and each query at retrieval time; a nearest-neighbor search over the index (usually approximate, for speed at scale) then returns the chunks most semantically similar to the query.
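The index-time / query-time split can be sketched end to end with an exhaustive (exact) nearest-neighbor scan. To stay self-contained, the sketch replaces the embedding model with a hypothetical `toy_embed` that counts words over a fixed vocabulary; a real embedding model produces dense vectors that capture meaning rather than word overlap, and production vector stores typically use approximate indexes instead of scanning every vector.

```python
import math
from collections import Counter

def toy_embed(text, vocab):
    """Hypothetical stand-in for an embedding model: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]

def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

chunks = [
    "pgvector adds vector similarity search to postgres",
    "matryoshka embeddings can be truncated to fewer dimensions",
    "rag pipelines retrieve chunks before generation",
]
vocab = sorted({w for c in chunks for w in c.lower().split()})

# Index time: embed every chunk once and keep (vector, text) pairs.
index = [(toy_embed(c, vocab), c) for c in chunks]

def retrieve(query, k=2):
    """Query time: embed the query the same way, then return the k most similar chunk texts."""
    q = toy_embed(query, vocab)
    ranked = sorted(index, key=lambda item: cosine_similarity(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

print(retrieve("how do rag pipelines retrieve chunks", k=1))
```

Note that `retrieve` returns chunk text, not vectors: the vector is only the lookup key, and the retrieved text is what gets passed on to the generative model.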
Embeddings are not model knowledge
A common misconception is that embeddings represent what a language model knows. In a typical RAG system, the generative model and the embedding model are separate systems. Embeddings encode semantic relatedness, not factual content: the facts come from the indexed documents, whose text is retrieved and handed to the generative model.
Key properties
- Dimensionality: the length of the vector; higher dimensions typically capture more nuance.
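- Matryoshka embeddings: a training technique that concentrates the most important information in the leading dimensions, so a vector can be truncated to a shorter prefix (and re-normalized) with gradual rather than abrupt quality loss.
- Context length limit: embedding models have a maximum input length (e.g., 512–8192 tokens); depending on the model or API, longer inputs are either truncated or rejected with an error, so long documents are usually split into chunks before embedding.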
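Two of these properties translate into small preprocessing steps, sketched below under stated assumptions. `truncate_and_normalize` keeps the first `dim` components and rescales to unit length; this is only meaningful for models trained Matryoshka-style, and truncating an ordinary embedding this way can hurt quality badly. `split_into_windows` is a hypothetical helper that splits an already-tokenized input into windows no longer than the model's limit, with optional overlap; real token IDs come from the model's own tokenizer, and `max_tokens` and `overlap` are illustrative parameters.

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components (Matryoshka-style prefix) and rescale to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in prefix]

def split_into_windows(tokens, max_tokens, overlap=0):
    """Split a token list into windows of at most `max_tokens`, overlapping by `overlap`."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("require max_tokens > 0 and 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_tokens])
        # Stop once a window reaches the end, so no trailing window is pure overlap.
        if start + max_tokens >= len(tokens):
            break
    return windows
```

For example, `truncate_and_normalize([3.0, 4.0, 12.0], 2)` keeps `[3.0, 4.0]` and rescales it to `[0.6, 0.8]`, and splitting ten tokens with `max_tokens=4, overlap=1` yields three windows, each of which fits under the limit and can be embedded separately.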