Principle: Deepset ai Haystack Document Joining
Overview
Document Joining is the principle of merging multiple streams of ranked documents into a single unified list. This is essential in hybrid retrieval architectures where multiple retrievers (e.g., keyword-based and semantic) each produce their own ranked list of documents. Document Joining applies configurable fusion strategies to combine these lists, handling duplicate documents and producing a coherent final ranking.
Description
In modern information retrieval systems, it is common to employ multiple retrieval strategies simultaneously. For example, a pipeline might use both BM25 keyword search and dense vector embedding search to find relevant documents. Each retriever produces its own ranked list, and these lists must be combined intelligently before passing results to downstream components like rerankers or answer generators.
Document Joining supports several fusion strategies:
Concatenation
The simplest strategy: all document lists are concatenated into one. When the same document appears in multiple lists, only the copy with the highest score is kept. This approach is fast but does not consider the relative ranking positions across lists.
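The strategy above can be sketched in a few lines of pure Python (an illustrative sketch, not Haystack's actual implementation; documents are represented here as simple `(doc_id, score)` pairs):

```python
# Sketch of the "concatenate" strategy: pool all lists, and when a document
# appears more than once, keep only the copy with the highest score.

def concatenate(document_lists):
    best = {}
    for documents in document_lists:
        for doc_id, score in documents:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    # Sort the merged pool by score, descending.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

bm25 = [("a", 12.4), ("b", 9.1)]     # raw BM25 scores
dense = [("b", 0.87), ("c", 0.55)]   # cosine similarities
print(concatenate([bm25, dense]))
# → [('a', 12.4), ('b', 9.1), ('c', 0.55)]
```

Note how the BM25 scores dominate simply because they live on a larger scale; this is exactly the cross-list comparability problem that the rank-based strategies below address.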
Merge
Documents are merged by calculating a weighted sum of their scores across lists. If a document appears in multiple retriever outputs, its final score is the sum of its scores weighted by the importance assigned to each retriever. This allows tuning the influence of each retrieval method.
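A minimal sketch of this weighted-sum merge (again illustrative, not Haystack's source; it assumes the per-list scores are already on comparable scales):

```python
# Sketch of the "merge" strategy: final score = sum of per-list scores,
# each scaled by that list's weight. A document missing from a list
# simply contributes nothing for that list.

def merge(document_lists, weights):
    scores = {}
    for documents, weight in zip(document_lists, weights):
        for doc_id, score in documents:
            scores[doc_id] = scores.get(doc_id, 0.0) + weight * score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Weight the semantic retriever (0.7) more heavily than keyword search (0.3).
keyword = [("a", 0.9), ("b", 0.5)]
semantic = [("b", 0.8), ("c", 0.6)]
print(merge([keyword, semantic], [0.3, 0.7]))
```

Here document "b" wins (0.3 × 0.5 + 0.7 × 0.8 = 0.71) because it appears in both lists, illustrating how the weights tune each retriever's influence.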
Reciprocal Rank Fusion (RRF)
A rank-based fusion method that assigns scores based on document positions rather than raw scores. The score formula is:
score(d) = sum( w_i * N / (k + rank_i(d)) )
Where:
- w_i is the weight for the i-th document list
- N is the number of document lists
- k is a constant (set to 61 in the implementation: 60 as in the original paper, plus 1 to account for 0-based ranks)
- rank_i(d) is the rank of document d in the i-th list
Scores are normalized by dividing by N / k, so the maximum possible score (achieved by a document ranked first in every list, assuming the weights sum to 1) equals 1. RRF is robust because it does not depend on the scale or distribution of the original scores.
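Putting the formula and the normalization together, a pure-Python sketch of RRF might look like this (documents are represented by ids only; the constant k = 61 and the N / k normalization follow the description above, not Haystack's actual source):

```python
# Sketch of Reciprocal Rank Fusion: score by rank position, not raw score.

def reciprocal_rank_fusion(document_lists, weights=None, k=61):
    n = len(document_lists)
    # Default to equal weights that sum to 1.
    weights = weights or [1.0 / n] * n
    scores = {}
    for documents, weight in zip(document_lists, weights):
        for rank, doc_id in enumerate(documents):  # rank is 0-based
            scores[doc_id] = scores.get(doc_id, 0.0) + weight * n / (k + rank)
    # Normalize by N / k so a document ranked first in every list scores 1.0.
    for doc_id in scores:
        scores[doc_id] /= n / k
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# "a" is ranked first by both retrievers, so it fuses to the maximum score.
print(reciprocal_rank_fusion([["a", "b"], ["a", "c"]]))
```

Only the rank positions enter the computation, which is why RRF is indifferent to whether the inputs were BM25 scores or cosine similarities.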
Distribution-Based Rank Fusion (DBRF)
A score normalization approach that standardizes scores within each retriever's output using the score distribution (mean and standard deviation). Each document's score is normalized to a common scale using a 3-sigma range before concatenation. This accounts for different score distributions across retrievers.
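A simplified sketch of this idea (not Haystack's implementation; it normalizes each list with a 3-sigma window around that list's mean, then concatenates, keeping the highest normalized score for duplicates):

```python
from statistics import mean, pstdev

# Sketch of Distribution-Based Rank Fusion: map each list's scores into
# [0, 1] using mean ± 3 standard deviations, then concatenate.

def distribution_based_fusion(document_lists):
    joined = {}
    for documents in document_lists:
        scores = [score for _, score in documents]
        mu, sigma = mean(scores), pstdev(scores)
        lo, hi = mu - 3 * sigma, mu + 3 * sigma
        for doc_id, score in documents:
            # Guard against a degenerate list where all scores are equal.
            norm = (score - lo) / (hi - lo) if hi > lo else 0.5
            if doc_id not in joined or norm > joined[doc_id]:
                joined[doc_id] = norm
    return sorted(joined.items(), key=lambda item: item[1], reverse=True)

# BM25 scores (~4-12) and cosine similarities (~0.3-0.9) become comparable.
print(distribution_based_fusion([[("a", 12.0), ("b", 4.0)],
                                 [("b", 0.9), ("c", 0.3)]]))
```

After normalization, the top BM25 hit and the top embedding hit receive the same normalized score, which is precisely what makes the concatenation step fair across retrievers.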
Usage
Document Joining is used in hybrid retrieval pipelines where results from multiple retrievers must be combined. It sits at the convergence point where multiple retriever branches feed into a single downstream path.
[BM25Retriever] -------\
+--> [DocumentJoiner] --> [Reranker / Reader]
[EmbeddingRetriever] --/
Theoretical Basis
Reciprocal Rank Fusion was introduced by Cormack, Clarke, and Buettcher (2009). The key insight is that rank positions are more comparable across different retrieval methods than raw scores, since different retrievers use different scoring functions with different scales and distributions. The constant k (typically 60) dampens the impact of high rankings so that being ranked first is not disproportionately more valuable than being ranked second.
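The dampening effect of k is easy to verify numerically (a quick sketch; ranks here are 1-based, as in the original 1 / (k + rank) formulation):

```python
# With k = 0, rank 1 is worth twice rank 2; with k = 60, the gap
# shrinks to under 2%, so a single first-place vote cannot dominate.

def rrf_term(rank, k):
    # Contribution of a document at the given 1-based rank.
    return 1.0 / (k + rank)

for k in (0, 60):
    ratio = rrf_term(1, k) / rrf_term(2, k)
    print(f"k={k}: rank 1 scores {ratio:.2f}x rank 2")
# → k=0: rank 1 scores 2.00x rank 2
# → k=60: rank 1 scores 1.02x rank 2
```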
Distribution-Based Score Fusion normalizes scores using statistical properties (mean, standard deviation) of each retriever's score distribution, mapping all scores to a common [0, 1] range using a 3-sigma normalization. This approach is particularly useful when retrievers produce scores with very different magnitudes.
The weighted merge strategy is based on simple linear combination, a standard technique in ensemble methods where the contribution of each model is scaled by a user-defined weight.
Related Pages
- Deepset_ai_Haystack_DocumentJoiner - Implementation of Document Joining in Haystack
- Deepset_ai_Haystack_Document_Splitting - Splitting documents before retrieval
- Deepset_ai_Haystack_Metadata_Based_Routing - Routing documents based on metadata