Principle:Deepset ai Haystack Embedding Based Retrieval
Metadata
| Field | Value |
|---|---|
| Principle Name | Embedding-Based Retrieval |
| Domains | Information_Retrieval, NLP |
| Related Implementation | Deepset_ai_Haystack_InMemoryEmbeddingRetriever |
| Source Reference | haystack/components/retrievers/in_memory/embedding_retriever.py:L13-238 |
| Repository | Deepset_ai_Haystack |
Overview
Embedding-based retrieval finds documents whose vector representations are most similar to a query embedding, enabling semantic search beyond exact keyword matching. By comparing dense vectors in a shared embedding space, this approach can identify relevant documents even when they use different terminology than the query.
Description
Embedding-based retrieval (also called dense retrieval or semantic retrieval) is a two-phase process:
- Indexing phase: Each document in the corpus is embedded into a dense vector using a document embedder (such as SentenceTransformersDocumentEmbedder). These vectors are stored alongside the documents in a document store.
- Query phase: The user's query is embedded using a text embedder (such as SentenceTransformersTextEmbedder) that uses the same model. The resulting query vector is compared against all stored document vectors, and the documents with the highest similarity scores are returned.
The retriever component itself does not perform any embedding. It receives a pre-computed query embedding vector and delegates the similarity computation to the document store. The document store computes similarity scores between the query vector and all stored document vectors, then returns the top-k most similar documents.
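The division of labor described above can be sketched with plain Python stand-ins. This is a minimal illustration, not Haystack's implementation: `ToyDocument`, `ToyDocumentStore`, and `ToyEmbeddingRetriever` are hypothetical names, the two-dimensional vectors are made up, and dot product is assumed as the similarity function.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyDocument:
    # Hypothetical stand-in for a document with a pre-computed embedding.
    content: str
    embedding: List[float]
    score: Optional[float] = None


class ToyDocumentStore:
    """Hypothetical store that owns the vectors and the similarity computation."""

    def __init__(self):
        self._docs = []

    def write_documents(self, docs):
        self._docs.extend(docs)

    def embedding_retrieval(self, query_embedding, top_k=10):
        # Score every stored document against the query (exact search).
        scored = []
        for doc in self._docs:
            score = sum(q * d for q, d in zip(query_embedding, doc.embedding))
            scored.append(ToyDocument(doc.content, doc.embedding, score))
        scored.sort(key=lambda d: d.score, reverse=True)
        return scored[:top_k]


class ToyEmbeddingRetriever:
    """Hypothetical retriever: it never embeds anything itself."""

    def __init__(self, document_store, top_k=10):
        self.document_store = document_store
        self.top_k = top_k

    def run(self, query_embedding, top_k=None):
        # Delegate scoring and ranking to the document store.
        docs = self.document_store.embedding_retrieval(
            query_embedding=query_embedding,
            top_k=self.top_k if top_k is None else top_k,
        )
        return {"documents": docs}


store = ToyDocumentStore()
store.write_documents([
    ToyDocument("doc about cars", [0.9, 0.1]),
    ToyDocument("doc about cooking", [0.1, 0.9]),
])
retriever = ToyEmbeddingRetriever(store, top_k=1)
result = retriever.run(query_embedding=[1.0, 0.0])
```

Keeping the retriever this thin means the same retriever interface can sit in front of stores that score vectors in different ways.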
Key advantages of embedding-based retrieval over keyword-based methods:
- Semantic understanding: Captures meaning beyond exact word matches. A query for "automobile" can retrieve documents about "cars" even if the word "automobile" never appears.
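- Vocabulary mismatch tolerance: Handles synonyms and paraphrases without explicit synonym lists and, when the embedding model is multilingual, queries written in a different language than the documents.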
- Contextual similarity: The same word in different contexts produces different embeddings, enabling disambiguation.
Key limitations:
- Computational cost: Requires a neural model to embed documents (offline) and queries (online). Query-time embedding adds latency compared to BM25.
- Model dependency: Retrieval quality depends heavily on the embedding model's training data and domain fit.
- No exact match guarantee: May fail to retrieve documents that share exact rare terms with the query, where BM25 would succeed.
Theoretical Basis
Dense Retrieval with Bi-Encoders
Embedding-based retrieval uses the bi-encoder architecture, where documents and queries are independently encoded into a shared vector space. The key insight is that semantic similarity in natural language can be approximated by geometric proximity in vector space.
Given a query embedding q and a document embedding d, relevance is scored by:
- Dot product: score = q . d
- Cosine similarity: score = (q . d) / (||q|| * ||d||)
When embeddings are L2-normalized (unit length), dot product and cosine similarity are equivalent.
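The two scoring functions and their equivalence under L2 normalization can be checked with a few lines of plain Python. The vectors below are arbitrary illustrative values, and the helper names (`dot`, `l2_norm`, `cosine`, `normalize`) are chosen for this sketch only.

```python
import math


def dot(q, d):
    # Dot product: sum of element-wise products.
    return sum(qi * di for qi, di in zip(q, d))


def l2_norm(v):
    # Euclidean length of a vector.
    return math.sqrt(dot(v, v))


def cosine(q, d):
    # Cosine similarity: dot product divided by the product of the norms.
    return dot(q, d) / (l2_norm(q) * l2_norm(d))


def normalize(v):
    # Scale a vector to unit length.
    n = l2_norm(v)
    return [x / n for x in v]


q = [3.0, 4.0]
d = [4.0, 3.0]

raw_dot = dot(q, d)                         # depends on vector lengths
cos = cosine(q, d)                          # depends only on the angle
unit_dot = dot(normalize(q), normalize(d))  # matches cos up to rounding
```

On unnormalized vectors the dot product also rewards longer vectors, so the two scores can rank documents differently. After normalization they agree.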
Overcoming Vocabulary Mismatch
The fundamental limitation of keyword-based retrieval (BM25, TF-IDF) is the vocabulary mismatch problem: a query about "canines" will not match a document about "dogs" unless explicit synonym handling is added. Dense retrieval overcomes this because the embedding model learns during training that "canines" and "dogs" have similar meanings and therefore places their embeddings close together in vector space.
Approximate Nearest Neighbor Search
For large document collections, exact nearest neighbor search over all document vectors becomes computationally expensive. Production systems typically use approximate nearest neighbor (ANN) algorithms (such as HNSW, IVF, or product quantization) to trade a small amount of accuracy for orders-of-magnitude speedups. The InMemoryDocumentStore performs exact search, which is suitable for small to medium collections.
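Exact search of the kind described above amounts to scoring every stored vector and keeping the best k. The following sketch assumes dot-product scoring and uses a hypothetical helper `exact_top_k` with made-up vectors. It is not how InMemoryDocumentStore is implemented internally.

```python
import heapq


def exact_top_k(query, doc_vectors, k):
    """Return (score, index) pairs for the k highest dot-product scores."""
    scored = (
        (sum(q * x for q, x in zip(query, vec)), idx)
        for idx, vec in enumerate(doc_vectors)
    )
    # Every vector is scored, so the cost grows linearly with corpus size.
    return heapq.nlargest(k, scored)


doc_vectors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.7, 0.7],
    [-1.0, 0.0],
]
top = exact_top_k([1.0, 0.0], doc_vectors, k=2)
top_indices = [idx for _, idx in top]
```

ANN indexes avoid this full scan by visiting only a subset of the vectors, which is where both the speedup and the small loss in accuracy come from.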
Hybrid Retrieval
In practice, embedding-based retrieval is often combined with BM25 in a hybrid retrieval architecture. BM25 handles exact keyword matches well, while embeddings handle semantic similarity. The results from both retrievers are merged (typically using reciprocal rank fusion or a learned merger) and optionally reranked by a cross-encoder.
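Reciprocal rank fusion can be sketched in a few lines. Each document receives 1 / (k + rank) from every ranking it appears in, and the contributions are summed. The function name, the document IDs, and the two input rankings are illustrative assumptions. The constant k = 60 is the value commonly used for this method, not something taken from the source file.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Higher positions (smaller rank) contribute more to the fused score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical outputs of a keyword retriever and an embedding retriever.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
embedding_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
```

Because the fusion uses only rank positions, it does not require BM25 scores and similarity scores to be on a comparable scale.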
Usage
Embedding-based retrieval is used in query pipelines where semantic search is needed. A typical pipeline consists of:
- A text embedder that converts the user query into a vector.
- An embedding retriever that finds the most similar documents in the document store.
- Optionally, a ranker that reranks the top results for higher precision.
from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
# Indexing
doc_store = InMemoryDocumentStore()
doc_embedder = SentenceTransformersDocumentEmbedder()
doc_embedder.warm_up()
docs = [
    Document(content="Python is a popular programming language"),
    Document(content="Haystack enables building NLP applications"),
]
embedded_docs = doc_embedder.run(docs)["documents"]
doc_store.write_documents(embedded_docs)
# Querying
query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder())
query_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store=doc_store))
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
result = query_pipeline.run({"text_embedder": {"text": "NLP framework"}})
for doc in result["retriever"]["documents"]:
    print(doc.content, doc.score)
Related Pages
- Implementation: Deepset_ai_Haystack_InMemoryEmbeddingRetriever -- The concrete Haystack component that implements this principle.
- Related Principle: Deepset_ai_Haystack_Document_Embedding -- The indexing-side process that produces document embeddings.
- Related Principle: Deepset_ai_Haystack_Query_Text_Embedding -- The query-side process that produces the query embedding.
- Related Principle: Deepset_ai_Haystack_BM25_Keyword_Retrieval -- Keyword-based retrieval, the sparse counterpart to embedding-based retrieval.
- Related Principle: Deepset_ai_Haystack_Cross_Encoder_Reranking -- Reranking technique often applied on top of embedding retrieval results.