Principle: Deepset ai Haystack Document Joining
Overview
Document Joining is the principle of merging multiple streams of ranked documents into a single unified list. This is essential in hybrid retrieval architectures where multiple retrievers (e.g., keyword-based and semantic) each produce their own ranked list of documents. Document Joining applies configurable fusion strategies to combine these lists, handling duplicate documents and producing a coherent final ranking.
Description
In modern information retrieval systems, it is common to employ multiple retrieval strategies simultaneously. For example, a pipeline might use both BM25 keyword search and dense vector embedding search to find relevant documents. Each retriever produces its own ranked list, and these lists must be combined intelligently before passing results to downstream components like rerankers or answer generators.
Document Joining supports several fusion strategies:
Concatenation
The simplest strategy: all document lists are concatenated into one. When the same document appears in multiple lists, only the copy with the highest score is kept. This approach is fast but does not consider the relative ranking positions across lists.
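The strategy above can be sketched in a few lines of pure Python (an illustrative sketch, not Haystack's actual implementation; documents are represented here as simple `(doc_id, score)` pairs):

```python
# Sketch of the "concatenate" strategy: pool all lists, and when a document
# appears more than once, keep only the copy with the highest score.

def concatenate(document_lists):
    best = {}
    for documents in document_lists:
        for doc_id, score in documents:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    # Sort the merged pool by score, descending.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

bm25 = [("a", 12.4), ("b", 9.1)]     # raw BM25 scores
dense = [("b", 0.87), ("c", 0.55)]   # cosine similarities
print(concatenate([bm25, dense]))
# → [('a', 12.4), ('b', 9.1), ('c', 0.55)]
```

Note how the BM25 scores dominate simply because they live on a larger scale; this is exactly the cross-list comparability problem that the rank-based strategies below address.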
Merge
Documents are merged by calculating a weighted sum of their scores across lists. If a document appears in multiple retriever outputs, its final score is the sum of its scores weighted by the importance assigned to each retriever. This allows tuning the influence of each retrieval method.
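A minimal sketch of this weighted-sum merge (again illustrative, not Haystack's source; it assumes the per-list scores are already on comparable scales):

```python
# Sketch of the "merge" strategy: final score = sum of per-list scores,
# each scaled by that list's weight. A document missing from a list
# simply contributes nothing for that list.

def merge(document_lists, weights):
    scores = {}
    for documents, weight in zip(document_lists, weights):
        for doc_id, score in documents:
            scores[doc_id] = scores.get(doc_id, 0.0) + weight * score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Weight the semantic retriever (0.7) more heavily than keyword search (0.3).
keyword = [("a", 0.9), ("b", 0.5)]
semantic = [("b", 0.8), ("c", 0.6)]
print(merge([keyword, semantic], [0.3, 0.7]))
```

Here document "b" wins (0.3 × 0.5 + 0.7 × 0.8 = 0.71) because it appears in both lists, illustrating how the weights tune each retriever's influence.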
Reciprocal Rank Fusion (RRF)
A rank-based fusion method that assigns scores based on document positions rather than raw scores. The score formula is:
score(d) = sum( w_i * N / (k + rank_i(d)) )
Where:
- w_i is the weight for the i-th document list
- N is the number of document lists
- k is a constant (set to 61 in the implementation: 60 as in the original paper, plus 1 to account for 0-based ranks)
- rank_i(d) is the rank of document d in the i-th list
Scores are normalized by dividing by N / k, so the maximum possible score (achieved by a document ranked first in every list, assuming the weights sum to 1) equals 1. RRF is robust because it does not depend on the scale or distribution of the original scores.
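Putting the formula and the normalization together, a pure-Python sketch of RRF might look like this (documents are represented by ids only; the constant k = 61 and the N / k normalization follow the description above, not Haystack's actual source):

```python
# Sketch of Reciprocal Rank Fusion: score by rank position, not raw score.

def reciprocal_rank_fusion(document_lists, weights=None, k=61):
    n = len(document_lists)
    # Default to equal weights that sum to 1.
    weights = weights or [1.0 / n] * n
    scores = {}
    for documents, weight in zip(document_lists, weights):
        for rank, doc_id in enumerate(documents):  # rank is 0-based
            scores[doc_id] = scores.get(doc_id, 0.0) + weight * n / (k + rank)
    # Normalize by N / k so a document ranked first in every list scores 1.0.
    for doc_id in scores:
        scores[doc_id] /= n / k
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# "a" is ranked first by both retrievers, so it fuses to the maximum score.
print(reciprocal_rank_fusion([["a", "b"], ["a", "c"]]))
```

Only the rank positions enter the computation, which is why RRF is indifferent to whether the inputs were BM25 scores or cosine similarities.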
Distribution-Based Rank Fusion (DBRF)
A score normalization approach that standardizes scores within each retriever's output using the score distribution (mean and standard deviation). Each document's score is normalized to a common scale using a 3-sigma range before concatenation. This accounts for different score distributions across retrievers.
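A simplified sketch of this idea (not Haystack's implementation; it normalizes each list with a 3-sigma window around that list's mean, then concatenates, keeping the highest normalized score for duplicates):

```python
from statistics import mean, pstdev

# Sketch of Distribution-Based Rank Fusion: map each list's scores into
# [0, 1] using mean ± 3 standard deviations, then concatenate.

def distribution_based_fusion(document_lists):
    joined = {}
    for documents in document_lists:
        scores = [score for _, score in documents]
        mu, sigma = mean(scores), pstdev(scores)
        lo, hi = mu - 3 * sigma, mu + 3 * sigma
        for doc_id, score in documents:
            # Guard against a degenerate list where all scores are equal.
            norm = (score - lo) / (hi - lo) if hi > lo else 0.5
            if doc_id not in joined or norm > joined[doc_id]:
                joined[doc_id] = norm
    return sorted(joined.items(), key=lambda item: item[1], reverse=True)

# BM25 scores (~4-12) and cosine similarities (~0.3-0.9) become comparable.
print(distribution_based_fusion([[("a", 12.0), ("b", 4.0)],
                                 [("b", 0.9), ("c", 0.3)]]))
```

After normalization, the top BM25 hit and the top embedding hit receive the same normalized score, which is precisely what makes the concatenation step fair across retrievers.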
Usage
Document Joining is used in hybrid retrieval pipelines where results from multiple retrievers must be combined. It sits at the convergence point where multiple retriever branches feed into a single downstream path.
[BM25Retriever] -------\
+--> [DocumentJoiner] --> [Reranker / Reader]
[EmbeddingRetriever] --/
Theoretical Basis
Reciprocal Rank Fusion was introduced by Cormack, Clarke, and Buettcher (2009). The key insight is that rank positions are more comparable across different retrieval methods than raw scores, since different retrievers use different scoring functions with different scales and distributions. The constant k (typically 60) dampens the impact of high rankings so that being ranked first is not disproportionately more valuable than being ranked second.
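The dampening effect of k is easy to verify numerically (a quick sketch; ranks here are 1-based, as in the original 1 / (k + rank) formulation):

```python
# With k = 0, rank 1 is worth twice rank 2; with k = 60, the gap
# shrinks to under 2%, so a single first-place vote cannot dominate.

def rrf_term(rank, k):
    # Contribution of a document at the given 1-based rank.
    return 1.0 / (k + rank)

for k in (0, 60):
    ratio = rrf_term(1, k) / rrf_term(2, k)
    print(f"k={k}: rank 1 scores {ratio:.2f}x rank 2")
# → k=0: rank 1 scores 2.00x rank 2
# → k=60: rank 1 scores 1.02x rank 2
```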
Distribution-Based Score Fusion normalizes scores using statistical properties (mean, standard deviation) of each retriever's score distribution, mapping all scores to a common [0, 1] range using a 3-sigma normalization. This approach is particularly useful when retrievers produce scores with very different magnitudes.
The weighted merge strategy is based on simple linear combination, a standard technique in ensemble methods where the contribution of each model is scaled by a user-defined weight.
Related Pages
- Deepset_ai_Haystack_DocumentJoiner - Implementation of Document Joining in Haystack
- Deepset_ai_Haystack_Document_Splitting - Splitting documents before retrieval
- Deepset_ai_Haystack_Metadata_Based_Routing - Routing documents based on metadata