Principle:Deepset ai Haystack Embedding Based Retrieval

From Leeroopedia

Metadata

Field | Value
Principle Name | Embedding-Based Retrieval
Domains | Information_Retrieval, NLP
Related Implementation | Deepset_ai_Haystack_InMemoryEmbeddingRetriever
Source Reference | haystack/components/retrievers/in_memory/embedding_retriever.py:L13-238
Repository | Deepset_ai_Haystack

Overview

Embedding-based retrieval finds documents whose vector representations are most similar to a query embedding, enabling semantic search beyond exact keyword matching. By comparing dense vectors in a shared embedding space, this approach can identify relevant documents even when they use different terminology than the query.

Description

Embedding-based retrieval (also called dense retrieval or semantic retrieval) is a two-phase process:

  1. Indexing phase: Each document in the corpus is embedded into a dense vector using a document embedder (such as SentenceTransformersDocumentEmbedder). These vectors are stored alongside the documents in a document store.
  2. Query phase: The user's query is embedded using a text embedder (such as SentenceTransformersTextEmbedder). It must use the same model as the document embedder so that query and document vectors lie in the same embedding space. The resulting query vector is compared against all stored document vectors, and the documents with the highest similarity scores are returned.

The retriever component itself does not perform any embedding. It receives a pre-computed query embedding vector and delegates the similarity computation to the document store. The document store computes similarity scores between the query vector and all stored document vectors, then returns the top-k most similar documents.
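The scoring step described above can be sketched in plain Python. This is a minimal illustration rather than Haystack's actual implementation: the helper `retrieve_top_k`, the three-dimensional vectors, and the document labels are all hypothetical, and dot product is assumed as the similarity function.

```python
from typing import List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_top_k(
    query_embedding: List[float],
    stored: List[Tuple[str, List[float]]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the best top_k."""
    scored = [(content, dot(query_embedding, emb)) for content, emb in stored]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings; real models emit hundreds of dimensions.
stored_docs = [
    ("doc about cars", [0.9, 0.1, 0.0]),
    ("doc about cooking", [0.0, 0.2, 0.9]),
    ("doc about trucks", [0.8, 0.3, 0.1]),
]
# The query arrives already embedded, as it does for the retriever.
query = [1.0, 0.0, 0.0]

results = retrieve_top_k(query, stored_docs, top_k=2)
for content, score in results:
    print(content, round(score, 3))
```

Note that the function never sees the query text, only its vector, which mirrors the split between embedder and retriever.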

Key advantages of embedding-based retrieval over keyword-based methods:

  • Semantic understanding: Captures meaning beyond exact word matches. A query for "automobile" can retrieve documents about "cars" even if the word "automobile" never appears.
  • Vocabulary mismatch tolerance: Handles synonyms and paraphrases naturally and, when a multilingual embedding model is used, queries in a different language than the documents.
  • Contextual similarity: The same word in different contexts produces different embeddings, enabling disambiguation.

Key limitations:

  • Computational cost: Requires a neural model to embed documents (offline) and queries (online). Query-time embedding adds latency compared to BM25.
  • Model dependency: Retrieval quality depends heavily on the embedding model's training data and domain fit.
  • No exact match guarantee: May fail to retrieve documents that share exact rare terms with the query, where BM25 would succeed.

Theoretical Basis

Dense Retrieval with Bi-Encoders

Embedding-based retrieval uses the bi-encoder architecture, where documents and queries are independently encoded into a shared vector space. The key insight is that semantic similarity in natural language can be approximated by geometric proximity in vector space.

Given a query embedding q and a document embedding d, relevance is scored by:

  • Dot product: score = q · d
  • Cosine similarity: score = (q · d) / (‖q‖ ‖d‖)

When embeddings are L2-normalized (unit length), dot product and cosine similarity are equivalent.
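A small numeric check makes the equivalence concrete. The vectors below are arbitrary illustrative values, not output from any embedding model.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_norm(v: List[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(dot(v, v))


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Dot product divided by the product of the two vector lengths."""
    return dot(a, b) / (l2_norm(a) * l2_norm(b))


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = l2_norm(v)
    return [x / norm for x in v]


q = [3.0, 4.0]  # illustrative query vector, length 5
d = [4.0, 3.0]  # illustrative document vector, length 5

raw_dot = dot(q, d)
cos = cosine_similarity(q, d)
normalized_dot = dot(l2_normalize(q), l2_normalize(d))

print("raw dot product:", raw_dot)
print("cosine similarity:", cos)
print("dot product after normalization:", normalized_dot)
```

On unnormalized vectors the raw dot product also rewards vector length, so the two scores differ. After normalization only the angle between the vectors matters and the scores coincide.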

Overcoming Vocabulary Mismatch

The fundamental limitation of keyword-based retrieval (BM25, TF-IDF) is the vocabulary mismatch problem: a query about "canines" will not match a document about "dogs" unless explicit synonym handling is added. Dense retrieval overcomes this because the embedding model learns that "canines" and "dogs" have similar meanings during training, placing their embeddings close together in vector space.

Approximate Nearest Neighbor Search

For large document collections, exact nearest neighbor search over all document vectors becomes computationally expensive. Production systems typically use approximate nearest neighbor (ANN) algorithms (such as HNSW, IVF, or product quantization) to trade a small amount of accuracy for orders-of-magnitude speedups. The InMemoryDocumentStore performs exact search, which is suitable for small to medium collections.
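To illustrate the trade-off, here is a toy sketch in the spirit of an inverted-file (IVF) index: vectors are grouped into buckets by their closest centroid, and a query scores only the buckets it probes. Everything here is an assumption made for illustration. The centroids are hand-picked rather than learned with k-means, the names `build_index`, `ann_search`, and `n_probe` are hypothetical, and no production library works exactly this way.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def nearest_centroids(v: Vector, centroids: List[Vector], n: int) -> List[int]:
    """Indices of the n centroids with the highest dot product to v."""
    order = sorted(range(len(centroids)), key=lambda i: dot(v, centroids[i]), reverse=True)
    return order[:n]


def build_index(vectors: Dict[str, Vector], centroids: List[Vector]) -> Dict[int, List[str]]:
    """Assign each document id to the bucket of its closest centroid."""
    buckets: Dict[int, List[str]] = {i: [] for i in range(len(centroids))}
    for doc_id, vec in vectors.items():
        buckets[nearest_centroids(vec, centroids, 1)[0]].append(doc_id)
    return buckets


def ann_search(
    query: Vector,
    vectors: Dict[str, Vector],
    centroids: List[Vector],
    buckets: Dict[int, List[str]],
    top_k: int = 1,
    n_probe: int = 1,
) -> List[Tuple[str, float]]:
    """Score only the documents in the n_probe buckets closest to the query."""
    candidates = [
        doc_id
        for bucket in nearest_centroids(query, centroids, n_probe)
        for doc_id in buckets[bucket]
    ]
    scored = [(doc_id, dot(query, vectors[doc_id])) for doc_id in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical 2-dimensional vectors and hand-picked centroids.
vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.1, 0.9]}
centroids = [[1.0, 0.0], [0.0, 1.0]]

buckets = build_index(vectors, centroids)
hits = ann_search([0.8, 0.2], vectors, centroids, buckets, top_k=2, n_probe=1)
print(buckets)
print(hits)
```

With `n_probe=1` only half of the stored vectors are scored. Raising `n_probe` to the number of buckets degenerates into exact search, which is the accuracy-versus-speed dial that ANN indexes expose.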

Hybrid Retrieval

In practice, embedding-based retrieval is often combined with BM25 in a hybrid retrieval architecture. BM25 handles exact keyword matches well, while embeddings handle semantic similarity. The results from both retrievers are merged (typically using reciprocal rank fusion or a learned merger) and optionally reranked by a cross-encoder.
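Reciprocal rank fusion can be sketched in a few lines. The function below follows the commonly cited formulation, in which each ranked list contributes 1 / (k + rank) per document with k = 60. It is an illustration only: the function name and document ids are hypothetical, and a given framework's joiner may weight or scale the scores differently.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse ranked lists of document ids into one list, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1, so the top document of a list adds 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings from a keyword retriever and an embedding retriever.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
embedding_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Because only ranks are used, the fusion does not need BM25 scores and similarity scores to be on comparable scales. Documents ranked well by both retrievers rise to the top.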

Usage

Embedding-based retrieval is used in query pipelines where semantic search is needed. A typical pipeline consists of:

  1. A text embedder that converts the user query into a vector.
  2. An embedding retriever that finds the most similar documents in the document store.
  3. Optionally, a ranker that reranks the top results for higher precision.

from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Indexing: embed the documents and write them, with their vectors, to the store
doc_store = InMemoryDocumentStore()
doc_embedder = SentenceTransformersDocumentEmbedder()
doc_embedder.warm_up()
docs = [
    Document(content="Python is a popular programming language"),
    Document(content="Haystack enables building NLP applications"),
]
embedded_docs = doc_embedder.run(docs)["documents"]
doc_store.write_documents(embedded_docs)

# Querying: embed the query text, then retrieve documents by vector similarity
query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder())
query_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store=doc_store))
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

result = query_pipeline.run({"text_embedder": {"text": "NLP framework"}})
for doc in result["retriever"]["documents"]:
    print(doc.content, doc.score)

Related Pages

Implemented By
