Principle:FlagOpen FlagEmbedding Dense Retrieval
| Sources | Paper: Dense Passage Retrieval, Paper: BGE Embeddings |
|---|---|
| Domains | Information_Retrieval, NLP |
Overview
A retrieval method that encodes queries and corpus passages into dense vectors and uses nearest neighbor search over those vectors to find the most relevant passages.
Description
Dense retrieval works in three stages:
- Corpus encoding: The entire corpus is encoded into dense embedding vectors using the embedding model. These embeddings are stored as a NumPy array and can be cached to disk (as doc.npy) to avoid re-encoding across evaluation runs.
- Query encoding: All queries are encoded into dense embedding vectors using the same model (with optional query-specific instructions).
- FAISS index search: A FAISS index is built from the corpus embeddings and searched for the top-k nearest neighbors for each query using inner product similarity.
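The three stages above can be sketched with NumPy alone. This is a minimal illustration, not FlagEmbedding's implementation: `toy_encode` is a hypothetical hashed bag-of-words stand-in for a real embedding model, and `search_inner_product` is a brute-force matrix product that yields the same ranking an exact inner-product index would.

```python
import numpy as np

def toy_encode(texts, dim=64):
    """Hypothetical stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vecs = np.zeros((len(texts), dim), dtype=np.float32)
    for row, text in enumerate(texts):
        for token in text.lower().split():
            # Deterministic per-token bucket; a real model produces learned features.
            bucket = sum(ord(ch) for ch in token) % dim
            vecs[row, bucket] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-12)

def search_inner_product(corpus_emb, query_emb, k):
    """Exact top-k search by inner product (brute-force scan over the corpus)."""
    scores = query_emb @ corpus_emb.T            # shape: (num_queries, num_docs)
    k = min(k, corpus_emb.shape[0])
    top_idx = np.argsort(-scores, axis=1)[:, :k]
    top_scores = np.take_along_axis(scores, top_idx, axis=1)
    return top_scores, top_idx

corpus = ["dense retrieval with embeddings", "cooking pasta at home", "faiss index search"]
queries = ["embeddings for retrieval"]

corpus_emb = toy_encode(corpus)      # stage 1: corpus encoding
query_emb = toy_encode(queries)      # stage 2: query encoding
scores, indices = search_inner_product(corpus_emb, query_emb, k=2)  # stage 3: search
```

With this toy encoder, the first corpus passage ranks highest for the query because it shares two tokens with it, while the other two passages share none.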
FlagEmbedding's EvalDenseRetriever implements this pattern with several practical enhancements:
- Corpus embedding caching: If corpus_embd_save_dir is provided, corpus embeddings are saved to doc.npy and loaded on subsequent runs (controlled by the overwrite flag).
- M3 dict-format handling: The M3 embedding model returns embeddings in dictionary format ({"dense_vecs": ...}). The retriever transparently extracts the dense_vecs key when this format is detected.
- Multi-GPU encoding: The underlying embedder supports multi-GPU inference via a process pool, distributing corpus and query encoding across available devices.
- Title-text concatenation: If a document contains a "title" field, it is prepended to the text as "{title} {text}" before encoding.
- Identical ID filtering: When ignore_identical_ids=True, results where the document ID matches the query ID are excluded (used by some benchmarks, but not MIRACL).
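Three of these enhancements (title-text concatenation, dict-format handling, and embedding caching) are simple enough to sketch. The helper names below (`build_passage_text`, `extract_dense`, `load_or_encode_corpus`) and the `fake_m3_encode` function are hypothetical and chosen for illustration; they are not the names used inside EvalDenseRetriever.

```python
import os
import tempfile
import numpy as np

def build_passage_text(doc):
    """Prepend the title to the text when a title field is present."""
    title = doc.get("title")
    return f"{title} {doc['text']}" if title else doc["text"]

def extract_dense(embeddings):
    """Return the dense matrix, unwrapping a {"dense_vecs": ...} dict if needed."""
    if isinstance(embeddings, dict):
        return embeddings["dense_vecs"]
    return embeddings

def load_or_encode_corpus(texts, encode_fn, save_dir=None, overwrite=False):
    """Reuse cached corpus embeddings from doc.npy unless overwrite is set."""
    path = os.path.join(save_dir, "doc.npy") if save_dir else None
    if path and os.path.exists(path) and not overwrite:
        return np.load(path)
    emb = np.asarray(extract_dense(encode_fn(texts)), dtype=np.float32)
    if path:
        os.makedirs(save_dir, exist_ok=True)
        np.save(path, emb)
    return emb

calls = []

def fake_m3_encode(texts):
    # Hypothetical encoder that mimics the dictionary output format.
    calls.append(len(texts))
    return {"dense_vecs": np.ones((len(texts), 4), dtype=np.float32)}

docs = [
    {"title": "FAISS", "text": "A similarity search library."},
    {"text": "No title here."},
]
texts = [build_passage_text(d) for d in docs]

with tempfile.TemporaryDirectory() as cache_dir:
    first = load_or_encode_corpus(texts, fake_m3_encode, save_dir=cache_dir)
    second = load_or_encode_corpus(texts, fake_m3_encode, save_dir=cache_dir)
```

In this sketch the second call reads doc.npy from the cache directory, so the encoder runs only once.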
Results are returned as a nested dictionary mapping query IDs to dictionaries of document IDs and their similarity scores: {qid: {docid: score}}.
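As a rough illustration of this output format, the sketch below converts per-query score and index arrays (the shape a FAISS search returns) into the nested dictionary, with optional identical-ID filtering. The function name `build_results` and the sample IDs are assumptions made for the example.

```python
def build_results(query_ids, doc_ids, scores, indices, ignore_identical_ids=False):
    """Map each query ID to a {docid: score} dictionary of its retrieved documents."""
    results = {}
    for row, qid in enumerate(query_ids):
        hits = {}
        for score, idx in zip(scores[row], indices[row]):
            if idx < 0:
                continue  # FAISS pads with -1 when fewer than k neighbors exist
            docid = doc_ids[idx]
            if ignore_identical_ids and docid == qid:
                continue  # drop the hit whose document ID equals the query ID
            hits[docid] = float(score)
        results[qid] = hits
    return results

query_ids = ["q1", "d2"]           # the second query shares its ID with a document
doc_ids = ["d1", "d2", "d3"]
scores = [[0.9, 0.5], [0.8, 0.7]]
indices = [[0, 2], [1, 0]]

results = build_results(query_ids, doc_ids, scores, indices, ignore_identical_ids=True)
# results == {"q1": {"d1": 0.9, "d3": 0.5}, "d2": {"d1": 0.7}}
```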
Usage
Dense retrieval serves as the first stage of a two-stage retrieval pipeline (retrieve, then rerank). The dense retriever produces an initial candidate set of search_top_k documents (typically 1000), which a reranker can then optionally refine down to rerank_top_k (typically 100).
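The hand-off between the two stages might look like the following sketch. It assumes the dense results use the {qid: {docid: score}} format; `retrieve_then_rerank` and `toy_rerank_score` are hypothetical names, and the toy scores stand in for a real reranker model.

```python
def retrieve_then_rerank(dense_results, rerank_score_fn, search_top_k=1000, rerank_top_k=100):
    """Keep the top search_top_k dense hits, rescore them, keep the top rerank_top_k."""
    final = {}
    for qid, doc_scores in dense_results.items():
        candidates = sorted(doc_scores, key=doc_scores.get, reverse=True)[:search_top_k]
        rescored = {docid: rerank_score_fn(qid, docid) for docid in candidates}
        keep = sorted(rescored, key=rescored.get, reverse=True)[:rerank_top_k]
        final[qid] = {docid: rescored[docid] for docid in keep}
    return final

dense_results = {"q1": {"d1": 0.9, "d2": 0.8, "d3": 0.1}}
toy_scores = {"d1": 0.2, "d2": 0.95, "d3": 0.99}

def toy_rerank_score(qid, docid):
    # Hypothetical reranker: looks up a fixed relevance score.
    return toy_scores[docid]

reranked = retrieve_then_rerank(dense_results, toy_rerank_score, search_top_k=2, rerank_top_k=1)
# reranked == {"q1": {"d2": 0.95}}
```

In this example d3 would receive the highest reranker score but never reaches the reranker, because it falls outside the first-stage cutoff. The first stage therefore bounds the recall of the whole pipeline, which is one reason search_top_k is usually set much larger than rerank_top_k.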
Theoretical Basis
Dense retrieval leverages learned dense representations to capture semantic similarity beyond lexical overlap. The key theoretical components are:
- Nearest neighbor search via FAISS: FAISS also supports approximate nearest neighbor (ANN) indexes, but FlagEmbedding uses IndexFlatIP, which performs exact inner product search, not approximate search. For normalized embeddings, inner product equals cosine similarity.
- Complexity: Building the index is O(n * d), where n is the corpus size and d is the embedding dimension. Search is O(n * d) per query for exact search with IndexFlatIP (brute-force scan). The search returns the top k results per query.
- Bi-encoder architecture: Queries and documents are encoded independently, enabling the corpus to be pre-encoded once and reused across multiple query sets. This is in contrast to cross-encoders (used in the reranking stage) which jointly encode query-document pairs.
- Dense Passage Retrieval (DPR): The foundational work by Karpukhin et al. (2020) demonstrated that dense representations trained on question-answer pairs can outperform BM25 for open-domain question answering. BGE embeddings (Xiao et al., 2023) extend this with improved training techniques and instruction-aware encoding.
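The claim that inner product equals cosine similarity for normalized embeddings can be checked numerically. The snippet below is a small sanity check on random vectors, not part of FlagEmbedding; the vector size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=8)   # stand-in for a query embedding
d = rng.normal(size=8)   # stand-in for a document embedding

# Cosine similarity of the raw vectors.
cosine = float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

# Inner product after scaling each vector to unit length.
q_unit = q / np.linalg.norm(q)
d_unit = d / np.linalg.norm(d)
inner = float(q_unit @ d_unit)

same = bool(np.isclose(cosine, inner))
```

Because the two quantities agree up to floating-point error, an inner-product index over unit-length embeddings ranks documents exactly as cosine similarity would.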