HyDE

From llmref.wiki
HyDE — Prompt engineering technique that generates hypothetical document answers to improve RAG query retrieval quality.

Overview

Hypothetical Document Embeddings (HyDE) is a prompt engineering method that enhances retrieval-augmented generation systems by generating plausible document responses before performing actual retrieval. Rather than directly embedding a user query, HyDE instructs a language model to produce a hypothetical answer matching the semantic properties of relevant documents. These generated hypothetical documents are then embedded and used to retrieve genuine documents from a vector database, which can yield higher retrieval precision and recall than direct query embedding.

The technique addresses a fundamental mismatch in traditional semantic search systems: user queries and relevant documents often express similar concepts through different linguistic patterns and vocabulary. A question like "How does photosynthesis occur in plants?" may not embed close to a technical document titled "Chlorophyll-mediated light reactions," even though they address the same content. By generating an intermediate hypothetical document that bridges this gap, HyDE improves the likelihood that a search in the embedding space identifies relevant sources.
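The mismatch can be illustrated with a deliberately crude toy. The sketch below uses bag-of-words counts as a stand-in for a real dense encoder, and the hypothetical passage is hand-written rather than model output; both are assumptions made only to keep the example self-contained.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Toy 'embedding': lowercase bag-of-words counts (stand-in for a dense encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = "How does photosynthesis occur in plants?"
document = (
    "Chlorophyll-mediated light reactions convert light energy "
    "into chemical energy in chloroplasts."
)
# Hand-written stand-in for what a language model might generate (assumption).
hypothetical = (
    "Photosynthesis in plants occurs in chloroplasts, where chlorophyll absorbs light "
    "and light reactions convert light energy into chemical energy."
)

query_sim = cosine(bow(query), bow(document))
hyde_sim = cosine(bow(hypothetical), bow(document))
print(f"query vs document:        {query_sim:.3f}")
print(f"hypothetical vs document: {hyde_sim:.3f}")
```

In this toy the hypothetical passage scores higher against the document than the raw question does, because it shares the document's vocabulary. A real dense encoder captures more than word overlap, so the size of the gap will differ in practice.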

HyDE adds one extra language model generation call and one embedding call per query, which keeps it computationally practical for many production systems, although the generation step adds latency. The hypothetical documents need not be factually accurate or cite real sources; their purpose is solely to occupy semantic space near genuine relevant documents. This distinguishes HyDE from grounding approaches that focus on source attribution and factual consistency.

How it works

HyDE follows a three-stage pipeline:

  1. Hypothetical document generation: A language model is prompted with the user query and instructed to generate a plausible answer or document excerpt that would satisfy the query. The prompt typically frames this as "Write a document that would answer the following question" without requiring the model to verify claims or provide citations.
  2. Embedding and retrieval: The generated hypothetical document is converted to a dense embedding vector using the same embedding model used for the document corpus. This vector is then used to search a vector database for semantically similar documents, typically by nearest-neighbor search; some systems combine this with BM25 term-based ranking in a hybrid search.
  3. Re-ranking and use: Retrieved documents are optionally re-ranked and passed to the language model as context for generating a final answer. Some systems apply LLM-as-judge methods or explicit re-ranking steps at this stage.
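
The three stages above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `embed` is a toy bag-of-words stand-in for a dense encoder, `fake_llm` is a hypothetical placeholder for a real model call, and the in-memory list stands in for a vector database.

```python
import math
import re
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy embedding (bag-of-words counts); a real system would call a dense encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_retrieve(
    query: str,
    corpus: list[str],
    generate: Callable[[str], str],
    top_k: int = 2,
) -> list[str]:
    # Stage 1: ask the model for a hypothetical document, not a verified answer.
    prompt = f"Write a document that would answer the following question:\n{query}"
    hypothetical = generate(prompt)
    # Stage 2: embed the hypothetical document with the same encoder as the corpus
    # and rank corpus documents by similarity to it.
    hyp_vec = embed(hypothetical)
    ranked = sorted(corpus, key=lambda doc: cosine(hyp_vec, embed(doc)), reverse=True)
    # Stage 3 (re-ranking and answer generation) would consume this list; omitted here.
    return ranked[:top_k]


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; returns a canned passage."""
    return "Voyager 1 was launched by NASA in 1977 as part of the Voyager program."


corpus = [
    "NASA launched the Voyager 1 space probe on September 5, 1977.",
    "Walking shoes with cushioned insoles and arch support.",
    "Attention mechanisms for time series forecasting.",
]
results = hyde_retrieve("What year did the Voyager 1 probe launch?", corpus, fake_llm, top_k=1)
print(results)
```

Passing the generator in as a callable keeps the retrieval logic independent of any particular model provider, which also makes the pipeline easy to test with a canned response.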

The quality of HyDE retrieval depends on the instruction quality in the generation prompt and the alignment between the embedding model and the semantic properties of the document corpus. Variants include multi-hypothesis generation, where several hypothetical documents are generated and their embeddings are averaged or used for ensemble retrieval, improving robustness against single-hypothesis bias.
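
A minimal sketch of the averaging variant is shown below, under the same toy assumptions: bag-of-words vectors stand in for dense embeddings, and the hypotheses are hand-written rather than sampled from a model. With real dense vectors, the mean would usually be taken element-wise over fixed-length arrays.

```python
import math
import re
from collections import Counter
from typing import Mapping


def embed(text: str) -> Counter:
    """Toy embedding (bag-of-words counts); stand-in for a dense encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def mean_vector(vectors: list[Counter]) -> dict[str, float]:
    """Average several sparse vectors into a single query vector."""
    if not vectors:
        raise ValueError("need at least one vector to average")
    total: Counter = Counter()
    for vec in vectors:
        total.update(vec)  # adds counts term by term
    return {term: count / len(vectors) for term, count in total.items()}


def cosine(a: Mapping[str, float], b: Mapping[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hand-written stand-ins for several sampled hypothetical documents (assumption).
hypotheses = [
    "This paper presents a neural network architecture for predicting future values "
    "in temporal sequences.",
    "We evaluate recurrent models for time series forecasting.",
    "Attention mechanisms improve forecasting of temporal sequences.",
]
corpus = [
    "Recurrent neural network models for time series forecasting and temporal prediction.",
    "Cushioned walking shoes with breathable mesh uppers.",
]

query_vec = mean_vector([embed(h) for h in hypotheses])
best = max(corpus, key=lambda doc: cosine(query_vec, embed(doc)))
print(best)
```

Averaging dilutes terms that appear in only one hypothesis and reinforces terms shared across them, which is one way to read the robustness claim above.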

Distinction from related terms

| Term | Distinction |
| --- | --- |
| RAG | RAG is a broader architectural pattern for augmenting language models with external documents. HyDE is a specific technique for improving the retrieval component of RAG systems by using hypothetical documents as retrieval queries. |
| Semantic search | Semantic search directly embeds user queries for retrieval. HyDE first transforms queries into hypothetical documents before embedding, introducing an intermediate generation step intended to improve embedding alignment. |
| Chain-of-thought | Chain-of-thought prompting decomposes reasoning steps for a single task. HyDE generates hypothetical documents specifically to improve retrieval quality, not to improve the reasoning process itself. |
| Re-ranking | Re-ranking reorders already-retrieved documents using a separate scoring model. HyDE affects which documents are retrieved in the first place, operating earlier in the pipeline. |
| Query Fan-Out | Query Fan-Out expands a single query into multiple search queries. HyDE generates a single hypothetical document intended to match the semantic profile of relevant documents, rather than generating alternative queries. |

Examples

  • Open-domain question answering: For a query "What year did the Voyager 1 probe launch?", HyDE might generate a hypothetical document: "Voyager 1 was launched by NASA in 1977 as part of the Voyager program..." This generated text embeds closer to authoritative astronomy and space exploration documents than the original question alone, improving retrieval of relevant Wikipedia articles or NASA technical reports.
  • E-commerce product search: For a query "comfortable shoes for walking long distances," HyDE generates a hypothetical product description: "These walking shoes feature cushioned insoles, breathable mesh uppers, and arch support suitable for extended wear..." This embedding retrieves actual product pages with similar descriptions more effectively than keyword-based matching alone.
  • Academic paper retrieval: For a research question "machine learning methods for time series forecasting," HyDE generates: "This paper presents a neural network architecture for predicting future values in temporal sequences using attention mechanisms..." This hypothetical abstract embeds near genuine papers on recurrent neural networks and temporal prediction, improving recall over academic corpora.
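
The examples above differ mainly in the kind of document the model is asked to imitate. One way to express that, as a sketch only, is a small set of per-domain prompt templates. The template wording and the names `HYDE_TEMPLATES` and `build_hyde_prompt` are hypothetical and not taken from any particular library.

```python
# Hypothetical templates, one per example domain; wording is illustrative only.
HYDE_TEMPLATES = {
    "qa": "Write a short passage that would answer the following question:\n{query}",
    "product": "Write a product description for an item matching this search:\n{query}",
    "paper": "Write an abstract for a paper addressing this research topic:\n{query}",
}


def build_hyde_prompt(domain: str, query: str) -> str:
    """Fill in the template for the given domain; raises ValueError for unknown domains."""
    try:
        template = HYDE_TEMPLATES[domain]
    except KeyError:
        raise ValueError(f"unknown domain: {domain!r}") from None
    return template.format(query=query)


prompt = build_hyde_prompt("product", "comfortable shoes for walking long distances")
print(prompt)
```

Matching the template to the corpus genre is meant to nudge the generated text toward the style of the documents being searched, such as product pages or abstracts.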
