Principle:FlagOpen FlagEmbedding Dense Retrieval

Sources	Paper: Dense Passage Retrieval, Paper: BGE Embeddings
Domains	Information_Retrieval, NLP

Overview

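A retrieval method that encodes queries and corpus passages into dense vectors and uses nearest neighbor search over a FAISS index to find the most relevant passages. In FlagEmbedding's evaluation code the search is an exact inner product search, not an approximate one.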

Description

Dense retrieval works in three stages:

Corpus encoding: The entire corpus is encoded into dense embedding vectors using the embedding model. These embeddings are stored as a NumPy array and can be cached to disk (as doc.npy) to avoid re-encoding across evaluation runs.
Query encoding: All queries are encoded into dense embedding vectors using the same model (with optional query-specific instructions).
FAISS index search: A FAISS index is built from the corpus embeddings and searched for the top-k nearest neighbors for each query using inner product similarity.
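
The three stages can be sketched as follows. This is a minimal illustration under stated assumptions, not FlagEmbedding's code: build_vocab and toy_encode are hypothetical stand-ins for a real embedding model (a normalized bag-of-words vector), and a brute-force NumPy scan stands in for the FAISS index so the sketch has no extra dependencies. The scan computes the same inner product scores that an exact inner product index would.

```python
import numpy as np

def build_vocab(texts):
    """Hypothetical helper: map every distinct token to a vector position."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def toy_encode(texts, vocab):
    """Hypothetical stand-in for an embedding model: L2-normalized bag-of-words."""
    vecs = np.zeros((len(texts), len(vocab)), dtype=np.float32)
    for row, text in enumerate(texts):
        for token in text.lower().split():
            if token in vocab:
                vecs[row, vocab[token]] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-12)

def search(corpus_emb, query_emb, k):
    """Exact top-k inner product search by scanning every corpus vector."""
    scores = query_emb @ corpus_emb.T             # shape: (num_queries, num_docs)
    k = min(k, corpus_emb.shape[0])
    top = np.argsort(-scores, axis=1)[:, :k]      # indices of the k highest scores
    return np.take_along_axis(scores, top, axis=1), top

corpus = [
    "faiss builds vector indexes for similarity search",
    "bananas are rich in potassium",
    "dense retrieval encodes passages into vectors",
]
queries = ["dense retrieval encodes passages"]
vocab = build_vocab(corpus + queries)

corpus_emb = toy_encode(corpus, vocab)    # stage 1: corpus encoding
query_emb = toy_encode(queries, vocab)    # stage 2: query encoding
top_scores, top_ids = search(corpus_emb, query_emb, k=2)  # stage 3: search
```

Here top_ids[0][0] is the index of the third passage, the only one that shares tokens with the query. A real embedding model would also score passages that are semantically related but share no words with the query.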

FlagEmbedding's EvalDenseRetriever implements this pattern with several practical enhancements:

Corpus embedding caching: If corpus_embd_save_dir is provided, corpus embeddings are saved to doc.npy and loaded on subsequent runs (controlled by the overwrite flag).
M3 dict-format handling: The M3 embedding model returns embeddings in dictionary format ({"dense_vecs": ...}). The retriever transparently extracts the dense_vecs key when this format is detected.
Multi-GPU encoding: The underlying embedder supports multi-GPU inference via a process pool, distributing corpus and query encoding across available devices.
Title-text concatenation: If a document contains a "title" field, it is prepended to the text as "{title} {text}" before encoding.
Identical ID filtering: When ignore_identical_ids=True, results where the document ID matches the query ID are excluded (used by some benchmarks, but not MIRACL).
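
Three of these enhancements (title-text concatenation, dict-format handling, and embedding caching) can be sketched with small helpers. The function names format_doc, extract_dense, and load_or_encode, and the fake_m3_encode stub, are assumptions made for illustration; they are not the names used inside EvalDenseRetriever.

```python
import os
import tempfile
import numpy as np

def format_doc(doc):
    """Prepend the title to the text when the document has a "title" field."""
    if "title" in doc:
        return f"{doc['title']} {doc['text']}"
    return doc["text"]

def extract_dense(output):
    """Return dense vectors from either a plain array or an M3-style dict."""
    if isinstance(output, dict):
        return output["dense_vecs"]
    return output

def load_or_encode(texts, encode_fn, save_dir=None, overwrite=False):
    """Reuse cached corpus embeddings from doc.npy unless overwrite is set."""
    path = os.path.join(save_dir, "doc.npy") if save_dir else None
    if path and os.path.exists(path) and not overwrite:
        return np.load(path)
    emb = np.asarray(extract_dense(encode_fn(texts)))
    if path:
        os.makedirs(save_dir, exist_ok=True)
        np.save(path, emb)
    return emb

calls = []  # records how many times the (fake) encoder actually runs

def fake_m3_encode(texts):
    """Hypothetical encoder stub that returns the dict format described above."""
    calls.append(len(texts))
    return {"dense_vecs": np.ones((len(texts), 4), dtype=np.float32)}

docs = [
    {"title": "FAISS", "text": "A library for similarity search."},
    {"text": "No title here."},
]
texts = [format_doc(d) for d in docs]

with tempfile.TemporaryDirectory() as cache_dir:
    first = load_or_encode(texts, fake_m3_encode, save_dir=cache_dir)
    second = load_or_encode(texts, fake_m3_encode, save_dir=cache_dir)  # cache hit
```

The second call loads doc.npy instead of encoding again, so the encoder stub runs only once.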

Results are returned as a nested dictionary mapping query IDs to dictionaries of document IDs and their similarity scores: {qid: {docid: score}}.
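
A minimal sketch of how search output might be turned into this nested dictionary, including the identical ID filter. The function build_results and its arguments are hypothetical; the -1 check reflects the fact that FAISS pads results with index -1 when fewer than k neighbors are available.

```python
def build_results(query_ids, doc_ids, top_scores, top_indices,
                  ignore_identical_ids=False):
    """Map each query ID to a {docid: score} dict of its retrieved documents."""
    results = {}
    for qid, scores, indices in zip(query_ids, top_scores, top_indices):
        hits = {}
        for score, idx in zip(scores, indices):
            if idx < 0:
                continue  # padding entry: no neighbor at this rank
            docid = doc_ids[idx]
            if ignore_identical_ids and docid == qid:
                continue  # drop the document that is the query itself
            hits[docid] = float(score)
        results[qid] = hits
    return results

doc_ids = ["d1", "d2", "d3"]
query_ids = ["q1", "d2"]            # the second query shares its ID with a document
top_scores = [[0.9, 0.4], [1.0, 0.7]]
top_indices = [[0, 2], [1, 0]]

results = build_results(query_ids, doc_ids, top_scores, top_indices,
                        ignore_identical_ids=True)
# results == {"q1": {"d1": 0.9, "d3": 0.4}, "d2": {"d1": 0.7}}
```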

Usage

Use dense retrieval as the first stage of a two-stage retrieval pipeline (retrieve, then rerank). The dense retriever produces the initial candidate set of search_top_k documents (typically 1000), which a reranker can then optionally narrow down to rerank_top_k (typically 100).
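
The two-stage flow can be sketched as below. This is an illustration only: retrieve_then_rerank and toy_rerank are hypothetical names, and the reranker is a lookup table standing in for a real cross-encoder.

```python
def retrieve_then_rerank(first_stage, rerank_fn, search_top_k=1000, rerank_top_k=100):
    """Keep the top search_top_k first-stage hits, rescore them, keep rerank_top_k."""
    final = {}
    for qid, hits in first_stage.items():
        candidates = sorted(hits, key=hits.get, reverse=True)[:search_top_k]
        rescored = {docid: rerank_fn(qid, docid) for docid in candidates}
        keep = sorted(rescored, key=rescored.get, reverse=True)[:rerank_top_k]
        final[qid] = {docid: rescored[docid] for docid in keep}
    return final

# First-stage output in the {qid: {docid: score}} format.
first_stage = {"q1": {"d1": 0.9, "d2": 0.8, "d3": 0.1}}

# Hypothetical reranker scores; a real reranker would score the query-document text.
rerank_table = {("q1", "d1"): 0.2, ("q1", "d2"): 0.7, ("q1", "d3"): 0.95}

def toy_rerank(qid, docid):
    return rerank_table[(qid, docid)]

final = retrieve_then_rerank(first_stage, toy_rerank, search_top_k=2, rerank_top_k=1)
# final == {"q1": {"d2": 0.7}}
```

Note that d3 never reaches the reranker because it falls outside the first-stage cutoff, even though the reranker would have scored it highest. The reranker can only reorder what the dense retriever returns, which is why search_top_k is usually set generously.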

Theoretical Basis

Dense retrieval leverages learned dense representations to capture semantic similarity beyond lexical overlap. The key theoretical components are:

Nearest neighbor search via FAISS: FlagEmbedding uses IndexFlatIP, which performs exact (not approximate) inner product search. For L2-normalized embeddings, inner product equals cosine similarity.
Complexity: Adding the corpus to the index is O(n * d), where n is the corpus size and d is the embedding dimension, because a flat index simply stores the vectors. Search is O(n * d) per query, because IndexFlatIP scans every corpus vector (brute force). Each query returns its k highest-scoring documents.
Bi-encoder architecture: Queries and documents are encoded independently, enabling the corpus to be pre-encoded once and reused across multiple query sets. This is in contrast to cross-encoders (used in the reranking stage) which jointly encode query-document pairs.
Dense Passage Retrieval (DPR): The foundational work by Karpukhin et al. (2020) demonstrated that dense representations trained on question-answer pairs can outperform BM25 for open-domain question answering. BGE embeddings (Xiao et al., 2023) extend this with improved training techniques and instruction-aware encoding.
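
The point above that inner product equals cosine similarity for normalized embeddings can be checked numerically. The vectors below are arbitrary examples chosen for illustration.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity: inner product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

q = [3.0, 4.0]
d = [1.0, 2.0]

raw_ip = dot(q, d)                                    # depends on vector length
cos = cosine(q, d)                                    # length-independent
normalized_ip = dot(l2_normalize(q), l2_normalize(d)) # matches cos up to rounding
```

The raw inner product (11.0 here) differs from the cosine, while the inner product of the normalized vectors agrees with it. This is why an inner product index ranks by cosine similarity only when the embeddings are normalized before indexing.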

Related Pages

Implementation:FlagOpen_FlagEmbedding_EvalDenseRetriever_Call
