BM25

From llmref.wiki
BM25 — Probabilistic bag-of-words retrieval function that ranks documents using term frequency, inverse document frequency, and document length normalization.

Overview

BM25 (Best Matching 25) is a language-independent probabilistic information retrieval function designed to rank documents by relevance to a given query. It operates as a sparse retrieval method, treating documents as unordered collections of terms (bag-of-words) without considering word order or semantic relationships. BM25 extends earlier term-frequency–inverse-document-frequency (TF-IDF) models by incorporating document length normalization and tunable saturation parameters, producing more stable ranking scores across documents of varying lengths.

The function is widely used in information retrieval systems, retrieval-augmented generation (RAG) pipelines, and hybrid search architectures as a baseline retrieval mechanism. Unlike dense embedding-based methods, BM25 requires no learned model parameters or vector database, making it computationally efficient and deterministic. It remains the de facto standard sparse retrieval baseline against which neural and semantic methods are compared.

BM25 is particularly valuable in chunking and reranking workflows within LLM applications, where retrieved passages must balance relevance scoring with computational cost. Its effectiveness depends on query and document preprocessing (stemming, stop-word removal) and two hyperparameters: k₁ (saturation parameter) and b (length normalization coefficient).

How it works

BM25 computes a relevance score for each document given a query using the formula:

<math>\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgdl}})}</math>

where:

  • <math>Q = \{q_1, q_2, \ldots, q_n\}</math> is the query term set
  • <math>f(q_i, D)</math> is the frequency of query term <math>q_i</math> in document <math>D</math>
  • <math>|D|</math> is the length of document <math>D</math> in tokens
  • <math>\text{avgdl}</math> is the average document length in the collection
  • <math>\text{IDF}(q_i) = \log\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}\right)</math> is the inverse document frequency, with <math>N</math> being total documents and <math>n(q_i)</math> the count of documents containing <math>q_i</math>
  • <math>k_1</math> (typically 1.2–2.0) controls term-frequency saturation
  • <math>b</math> (typically 0.75) controls the degree of length normalization

The IDF component penalizes common terms appearing in many documents, while the normalized TF component prevents long documents from dominating rankings solely due to term repetition. The saturation effect of <math>k_1</math> ensures that additional occurrences of a query term yield diminishing relevance gains.

Implementation in practice typically involves: 1. Tokenizing and preprocessing (lowercasing, stemming, stop-word removal) 2. Building an inverted index mapping terms to document IDs and frequencies 3. Computing IDF statistics over the entire collection 4. Scoring each retrieved document at query time 5. Sorting documents by descending BM25 score

Common implementations include Elasticsearch, Lucene, Solr, and specialized retrieval libraries. In RAG systems, BM25 often serves as the initial sparse retrieval stage before reranking with neural methods.

Distinction from related terms

Term Distinction
Semantic embeddings BM25 is a sparse, keyword-matching method requiring no learned representations; embeddings are dense learned vectors capturing semantic similarity independent of exact term overlap. BM25 excels at exact-match retrieval; embeddings excel at paraphrase and synonym matching. BM25 is deterministic; embeddings depend on model training.
Semantic search Semantic search leverages learned representations (embeddings or neural models) to match meaning; BM25 operates on surface-level term statistics. Semantic search may retrieve documents without query terms but with related concepts; BM25 only ranks documents containing query terms.
Hybrid search Hybrid search combines BM25 sparse retrieval with dense embedding retrieval, then fuses or re-ranks results. BM25 is one component of hybrid search, not the whole system. Hybrid approaches mitigate weaknesses of both methods independently.
Reranking BM25 produces initial candidate rankings; rerankers (often neural models) refine those rankings by scoring candidates using learned representations. BM25 is a retrieval-time ranker; rerankers are post-retrieval refinement stages.
TF-IDF TF-IDF is the predecessor without saturation or length normalization. BM25 improves on TF-IDF by preventing document length bias and using sublinear term-frequency scaling, producing more stable scores across document collections of variable length.

Examples

  • Elasticsearch sparse retrieval: Elasticsearch implements BM25 as the default scoring function for full-text search queries. When a user searches "machine learning algorithms," the engine scores all indexed documents using BM25, ranking those with high term frequencies for "machine," "learning," and "algorithms" while suppressing results from documents mentioning only one term or very common terms appearing in most documents.
  • RAG pipeline initialization: A RAG system answering questions about software documentation uses BM25 to retrieve candidate passages from a knowledge base. After tokenizing the query "How do I configure authentication?", BM25 ranks passages by keyword relevance before a neural reranker re-scores top-k candidates. This two-stage approach reduces latency while maintaining retrieval quality.
  • Baseline comparison in academic evaluation: Papers proposing neural retrieval methods (e.g., dense retrieval models, cross-encoders) typically compare against BM25 as the established sparse baseline. Studies show BM25 achieves 0.2–0.4 MRR (mean reciprocal rank) on standard benchmarks like MS MARCO, establishing the minimum performance threshold for new methods to claim improvement.

See also

References