Implementation:FlagOpen FlagEmbedding EvalDenseRetriever Call
| Type | API Doc |
|---|---|
| Source | FlagEmbedding/abc/evaluation/searcher.py: L71-157 |
| Import | from FlagEmbedding.abc.evaluation.searcher import EvalDenseRetriever |
Constructor
class EvalDenseRetriever(EvalRetriever):
    def __init__(self, embedder: AbsEmbedder, search_top_k: int = 1000, overwrite: bool = False):
| Parameter | Type | Default | Description |
|---|---|---|---|
| embedder | AbsEmbedder | required | The embedding model instance used to encode corpus and queries |
| search_top_k | int | 1000 | Number of top results to retrieve per query |
| overwrite | bool | False | Whether to overwrite cached corpus embeddings |
__call__ Signature
def __call__(
    self,
    corpus: Dict[str, Dict[str, Any]],
    queries: Dict[str, str],
    corpus_embd_save_dir: Optional[str] = None,
    ignore_identical_ids: bool = False,
    **kwargs,
) -> Dict[str, Dict[str, float]]:
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| corpus | Dict[str, Dict[str, Any]] | required | Corpus of documents. Structure: {docid: {"text": str}} or {docid: {"title": str, "text": str}} |
| queries | Dict[str, str] | required | Queries to search for. Structure: {qid: query_text} |
| corpus_embd_save_dir | Optional[str] | None | Directory to save/load corpus embeddings as doc.npy. If None, embeddings are not cached. |
| ignore_identical_ids | bool | False | If True, excludes results where doc ID equals query ID |
| **kwargs | | | Additional arguments passed to the embedder's encode methods |
Returns
| Type | Description |
|---|---|
| Dict[str, Dict[str, float]] | Top-k search results. Structure: {qid: {docid: score}}. Higher scores indicate more relevant documents. |
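Because the return value is a plain nested dictionary, ranking the hits for one query takes a single sort. The sketch below uses a hypothetical helper, rank_docs, and made-up scores; neither is part of FlagEmbedding.

```python
from typing import Dict, List, Tuple


def rank_docs(
    results: Dict[str, Dict[str, float]], qid: str, n: int = 10
) -> List[Tuple[str, float]]:
    """Return the n highest-scoring (docid, score) pairs for one query.

    Hypothetical helper for illustration only.
    """
    hits = results.get(qid, {})
    # Higher scores indicate more relevant documents, so sort descending.
    return sorted(hits.items(), key=lambda item: item[1], reverse=True)[:n]


# Made-up scores in the {qid: {docid: score}} structure described above.
example_results = {"q-0": {"doc-5": 0.7, "doc-0": 0.9, "doc-2": 0.4}}
top_two = rank_docs(example_results, "q-0", n=2)
```

An unknown query ID yields an empty list here rather than raising, which is a design choice of this sketch.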
Description
The __call__ method performs the full dense retrieval pipeline. The execution flow is:
Step 1: Extract texts
Corpus documents and queries are extracted into parallel lists of IDs and texts. If a corpus document has a "title" field, it is prepended to the text:
corpus_texts.append(
    doc["text"] if "title" not in doc
    else f"{doc['title']} {doc['text']}".strip()
)
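For context, the extraction step can be sketched as two small functions. The names extract_corpus and extract_queries are hypothetical wrappers introduced here; only the text-joining expression comes from the snippet above.

```python
from typing import Any, Dict, List, Tuple


def extract_corpus(corpus: Dict[str, Dict[str, Any]]) -> Tuple[List[str], List[str]]:
    """Split a corpus dict into parallel lists of IDs and texts (hypothetical helper)."""
    corpus_ids, corpus_texts = [], []
    for docid, doc in corpus.items():
        corpus_ids.append(docid)
        # Prepend the title when present; strip() removes the stray space
        # left behind by an empty title or empty text.
        corpus_texts.append(
            doc["text"] if "title" not in doc
            else f"{doc['title']} {doc['text']}".strip()
        )
    return corpus_ids, corpus_texts


def extract_queries(queries: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Split a queries dict into parallel lists of IDs and texts (hypothetical helper)."""
    return list(queries.keys()), list(queries.values())


corpus = {
    "doc-0": {"text": "This is a document."},
    "doc-1": {"title": "Intro", "text": "Dense retrieval."},
    "doc-2": {"title": "", "text": "No title text."},
}
ids, texts = extract_corpus(corpus)
q_ids, q_texts = extract_queries({"q-0": "This is a query."})
```

The position of each ID in its list matters: the search step returns integer positions, which are later mapped back to IDs through these lists.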
Step 2: Encode corpus
Corpus embeddings are loaded from cache or encoded fresh, then optionally saved:
- Load from cache: If corpus_embd_save_dir is set, doc.npy exists in it, and overwrite is False, embeddings are loaded from disk.
- Encode fresh: Otherwise, self.embedder.encode_corpus(corpus_texts) is called.
- Save to cache: If corpus_embd_save_dir is set and embeddings were freshly encoded, they are saved to doc.npy.
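The load/encode/save decision can be sketched as follows. This is an illustration under stated assumptions, not the library's code: get_corpus_embeddings and its encode_fn, load_fn, and save_fn parameters are hypothetical, the directory creation is an assumption, and the demo swaps in JSON files so it runs without NumPy or a model. With real embeddings, np.load and np.save would be the natural load_fn and save_fn.

```python
import json
import os
import tempfile
from typing import Any, Callable, List, Optional


def get_corpus_embeddings(
    corpus_texts: List[str],
    encode_fn: Callable[[List[str]], Any],
    load_fn: Callable[[str], Any],
    save_fn: Callable[[str, Any], None],
    corpus_embd_save_dir: Optional[str] = None,
    overwrite: bool = False,
) -> Any:
    """Hypothetical helper mirroring the three cache rules listed above."""
    cache_path = None
    if corpus_embd_save_dir is not None:
        cache_path = os.path.join(corpus_embd_save_dir, "doc.npy")

    # Load from cache: directory set, file present, overwrite disabled.
    if cache_path is not None and os.path.exists(cache_path) and not overwrite:
        return load_fn(cache_path)

    # Encode fresh in every other case.
    embeddings = encode_fn(corpus_texts)

    # Save to cache only when a directory was given.
    if cache_path is not None:
        os.makedirs(corpus_embd_save_dir, exist_ok=True)  # assumption
        save_fn(cache_path, embeddings)
    return embeddings


# Demo with stand-ins: a fake encoder and JSON-based load/save.
encode_calls = []


def fake_encode(texts):
    encode_calls.append(len(texts))
    return [[float(len(t))] for t in texts]


def json_load(path):
    with open(path) as f:
        return json.load(f)


def json_save(path, emb):
    with open(path, "w") as f:
        json.dump(emb, f)


with tempfile.TemporaryDirectory() as tmp:
    texts = ["ab", "cde"]
    first = get_corpus_embeddings(texts, fake_encode, json_load, json_save, tmp)
    second = get_corpus_embeddings(texts, fake_encode, json_load, json_save, tmp)
    third = get_corpus_embeddings(texts, fake_encode, json_load, json_save, tmp, overwrite=True)
```

In the demo the second call hits the cache, so the encoder runs only for the first and third calls.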
Step 3: Encode queries
queries_emb = self.embedder.encode_queries(queries_texts, **kwargs)
Step 4: Handle M3 dict format
If the embeddings are returned in dictionary format (as from M3Embedder), the dense vectors are extracted:
if isinstance(corpus_emb, dict):
    corpus_emb = corpus_emb["dense_vecs"]
if isinstance(queries_emb, dict):
    queries_emb = queries_emb["dense_vecs"]
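The same check can be factored into one function that serves both corpus and query embeddings. The helper name to_dense and the "other_output" key are hypothetical; only the "dense_vecs" key comes from the snippet above.

```python
from typing import Any


def to_dense(embeddings: Any) -> Any:
    """Return the dense vectors whether or not the embedder wrapped them in a dict.

    Hypothetical helper for illustration only.
    """
    if isinstance(embeddings, dict):
        return embeddings["dense_vecs"]
    return embeddings


# A dict-style output (hypothetical extra key) and a plain output.
dict_style = {"dense_vecs": [[0.1, 0.2]], "other_output": None}
plain_style = [[0.3, 0.4]]

dense_from_dict = to_dense(dict_style)
dense_from_plain = to_dense(plain_style)
```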
Step 5: Build FAISS index and search
After freeing GPU memory with gc.collect() and torch.cuda.empty_cache(), a FAISS index is built and searched:
faiss_index = index(corpus_embeddings=corpus_emb)
all_scores, all_indices = search(
    query_embeddings=queries_emb,
    faiss_index=faiss_index,
    k=self.search_top_k
)
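To show the shape of all_scores and all_indices without requiring FAISS, here is a brute-force stand-in written in plain Python. It is not the library's index or search helper. It assumes inner-product similarity, and it assumes that when fewer than k documents exist the missing slots are padded with index -1 (the score padding of -inf is likewise an assumption). The function name brute_force_search is hypothetical.

```python
from typing import List, Tuple


def brute_force_search(
    query_embeddings: List[List[float]],
    corpus_embeddings: List[List[float]],
    k: int,
) -> Tuple[List[List[float]], List[List[int]]]:
    """Exhaustive top-k search by inner product (stand-in for a FAISS index)."""
    all_scores, all_indices = [], []
    for q in query_embeddings:
        scored = [
            (sum(a * b for a, b in zip(q, d)), i)
            for i, d in enumerate(corpus_embeddings)
        ]
        # Highest inner product first; ties broken by lower corpus position.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        top = scored[:k]
        pad = k - len(top)
        # Pad to exactly k entries per query so every row has the same length.
        all_scores.append([s for s, _ in top] + [float("-inf")] * pad)
        all_indices.append([i for _, i in top] + [-1] * pad)
    return all_scores, all_indices


corpus_emb = [[1.0, 0.0], [0.0, 1.0]]
queries_emb = [[0.5, 2.0]]
scores, indices = brute_force_search(queries_emb, corpus_emb, k=3)
```

Each row holds exactly k entries, one row per query, and the indices are positions in the corpus list rather than document IDs.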
Step 6: Build results dictionary
The raw FAISS results (score arrays and index arrays) are converted into the output dictionary format, filtering out invalid indices (-1) and optionally identical IDs:
results = {}
for idx, (scores, indices) in enumerate(zip(all_scores, all_indices)):
    results[queries_ids[idx]] = {}
    for score, indice in zip(scores, indices):
        if indice != -1:
            if ignore_identical_ids and corpus_ids[indice] == queries_ids[idx]:
                continue
            results[queries_ids[idx]][corpus_ids[indice]] = float(score)
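A small worked example shows both filters at work. The loop is the one above, wrapped in a hypothetical function build_results; the IDs, scores, and indices are made up.

```python
from typing import Dict, List


def build_results(
    queries_ids: List[str],
    corpus_ids: List[str],
    all_scores: List[List[float]],
    all_indices: List[List[int]],
    ignore_identical_ids: bool = False,
) -> Dict[str, Dict[str, float]]:
    """Map positional search output back to {qid: {docid: score}} (hypothetical wrapper)."""
    results = {}
    for idx, (scores, indices) in enumerate(zip(all_scores, all_indices)):
        results[queries_ids[idx]] = {}
        for score, indice in zip(scores, indices):
            if indice != -1:  # -1 marks an empty slot, not a real document
                if ignore_identical_ids and corpus_ids[indice] == queries_ids[idx]:
                    continue  # skip a document that shares the query's ID
                results[queries_ids[idx]][corpus_ids[indice]] = float(score)
    return results


# Query "a" shares its ID with corpus document "a".
queries_ids = ["a", "q-1"]
corpus_ids = ["a", "b"]
all_scores = [[0.9, 0.4, 0.0], [0.8, 0.3, 0.0]]
all_indices = [[0, 1, -1], [1, 0, -1]]

kept = build_results(queries_ids, corpus_ids, all_scores, all_indices)
filtered = build_results(queries_ids, corpus_ids, all_scores, all_indices, ignore_identical_ids=True)
```

Since filtering happens after the search, a query can end up with fewer than search_top_k results when ignore_identical_ids removes a hit.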
Input / Output
Input:
- Corpus dictionary: {"doc-0": {"text": "This is a document."}}
- Queries dictionary: {"q-0": "This is a query."}
- Optional: path for embedding cache, identical ID filter flag
Output:
- Search results dictionary: {"q-0": {"doc-0": 0.9, "doc-5": 0.7, ...}}
- Contains up to search_top_k entries per query, sorted by descending score