Implementation:FlagOpen FlagEmbedding EvalDenseRetriever Call

From Leeroopedia


Type: API Doc
Source: FlagEmbedding/abc/evaluation/searcher.py: L71-157
Import: from FlagEmbedding.abc.evaluation.searcher import EvalDenseRetriever

Constructor

class EvalDenseRetriever(EvalRetriever):
    def __init__(self, embedder: AbsEmbedder, search_top_k: int = 1000, overwrite: bool = False):

| Parameter | Type | Default | Description |
|---|---|---|---|
| embedder | AbsEmbedder | required | The embedding model instance used to encode corpus and queries |
| search_top_k | int | 1000 | Number of top results to retrieve per query |
| overwrite | bool | False | Whether to overwrite cached corpus embeddings |

__call__ Signature

def __call__(
    self,
    corpus: Dict[str, Dict[str, Any]],
    queries: Dict[str, str],
    corpus_embd_save_dir: Optional[str] = None,
    ignore_identical_ids: bool = False,
    **kwargs,
) -> Dict[str, Dict[str, float]]:

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| corpus | Dict[str, Dict[str, Any]] | required | Corpus of documents. Structure: {docid: {"text": str}} or {docid: {"title": str, "text": str}} |
| queries | Dict[str, str] | required | Queries to search for. Structure: {qid: query_text} |
| corpus_embd_save_dir | Optional[str] | None | Directory to save/load corpus embeddings as doc.npy. If None, embeddings are not cached. |
| ignore_identical_ids | bool | False | If True, excludes results where doc ID equals query ID |
| **kwargs | | | Additional arguments passed to the embedder's encode methods |

Returns

| Type | Description |
|---|---|
| Dict[str, Dict[str, float]] | Top-k search results. Structure: {qid: {docid: score}}. Higher scores indicate more relevant documents. |

Description

The __call__ method runs the full dense retrieval pipeline in six steps:

Step 1: Extract texts

Corpus documents and queries are unpacked into parallel lists of IDs and texts. If a corpus document has a "title" field, the title is prepended to the text, separated by a single space:

corpus_texts.append(
    doc["text"] if "title" not in doc
    else f"{doc['title']} {doc['text']}".strip()
)
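To make the extraction step concrete, here is a minimal sketch that wraps the expression above in a loop. The helper name extract_corpus_texts and the sample corpus are hypothetical and are not part of FlagEmbedding.

```python
from typing import Any, Dict, List, Tuple


def extract_corpus_texts(
    corpus: Dict[str, Dict[str, Any]],
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: return parallel lists of document IDs and texts."""
    corpus_ids: List[str] = []
    corpus_texts: List[str] = []
    for docid, doc in corpus.items():
        corpus_ids.append(docid)
        # Same rule as in the snippet above: prepend the title when present.
        corpus_texts.append(
            doc["text"] if "title" not in doc
            else f"{doc['title']} {doc['text']}".strip()
        )
    return corpus_ids, corpus_texts


# Illustrative data only.
corpus = {
    "doc-0": {"text": "This is a document."},
    "doc-1": {"title": "FAISS", "text": "A library for similarity search."},
}
corpus_ids, corpus_texts = extract_corpus_texts(corpus)
```

Because the two lists are built in the same loop, position i in corpus_texts always corresponds to corpus_ids[i], which is what lets the later steps map FAISS row indices back to document IDs.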

Step 2: Encode corpus

Corpus embeddings are either loaded from a cache or encoded fresh, and freshly encoded embeddings may then be written back to the cache:

  • Load from cache: If corpus_embd_save_dir is set, doc.npy exists in that directory, and overwrite is False, embeddings are loaded from disk.
  • Encode fresh: Otherwise, self.embedder.encode_corpus(corpus_texts) is called.
  • Save to cache: If corpus_embd_save_dir is set and embeddings were freshly encoded, they are saved to doc.npy.
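The caching behavior described above can be sketched as follows. This is an illustration under assumptions, not the library's code: load_or_encode and fake_encode are hypothetical names, and fake_encode stands in for the embedder's encode_corpus method.

```python
import os
import tempfile
from typing import Callable, List, Optional, Sequence

import numpy as np


def load_or_encode(
    texts: List[str],
    encode_fn: Callable[[List[str]], Sequence[Sequence[float]]],
    save_dir: Optional[str] = None,
    overwrite: bool = False,
) -> np.ndarray:
    """Hypothetical helper mirroring the load / encode / save rules above."""
    cache_path = None if save_dir is None else os.path.join(save_dir, "doc.npy")
    # Load from cache when a cache file exists and overwrite is off.
    if cache_path is not None and os.path.exists(cache_path) and not overwrite:
        return np.load(cache_path)
    # Otherwise encode fresh.
    embeddings = np.asarray(encode_fn(texts), dtype=np.float32)
    # Save freshly encoded embeddings when a cache directory was given.
    if cache_path is not None:
        os.makedirs(save_dir, exist_ok=True)
        np.save(cache_path, embeddings)
    return embeddings


calls: List[int] = []


def fake_encode(texts: List[str]) -> List[List[float]]:
    """Stand-in encoder (assumption): records each call, returns toy vectors."""
    calls.append(len(texts))
    return [[float(len(t)), 1.0] for t in texts]


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_encode(["a", "bb"], fake_encode, save_dir=tmp)   # encodes
    second = load_or_encode(["a", "bb"], fake_encode, save_dir=tmp)  # cache hit
    third = load_or_encode(["a", "bb"], fake_encode, save_dir=tmp, overwrite=True)
```

In this sketch the encoder runs only on the first call and on the call with overwrite=True; the second call reads doc.npy from disk. Note that the cache is keyed only by directory, so reusing a directory with a different corpus or model would return stale embeddings unless overwrite is set.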

Step 3: Encode queries

queries_emb = self.embedder.encode_queries(queries_texts, **kwargs)

Step 4: Handle M3 dict format

If the embeddings are returned in dictionary format (as from M3Embedder), the dense vectors are extracted:

if isinstance(corpus_emb, dict):
    corpus_emb = corpus_emb["dense_vecs"]
if isinstance(queries_emb, dict):
    queries_emb = queries_emb["dense_vecs"]

Step 5: Build FAISS index and search

After releasing unreferenced Python objects with gc.collect() and cached GPU memory with torch.cuda.empty_cache(), a FAISS index is built and searched:

faiss_index = index(corpus_embeddings=corpus_emb)
all_scores, all_indices = search(
    query_embeddings=queries_emb,
    faiss_index=faiss_index,
    k=self.search_top_k
)
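The internals of the index and search helpers are not shown on this page. As a rough mental model only, the sketch below uses plain NumPy in place of FAISS and assumes exact inner-product scoring, which this page does not confirm. The function brute_force_search is hypothetical; it pads missing results with index -1, matching the invalid-index convention handled in Step 6.

```python
from typing import Tuple

import numpy as np


def brute_force_search(
    query_emb: np.ndarray, corpus_emb: np.ndarray, k: int
) -> Tuple[np.ndarray, np.ndarray]:
    """Hypothetical NumPy stand-in for an exact inner-product top-k search."""
    scores = query_emb @ corpus_emb.T  # shape: (num_queries, num_docs)
    num_queries, num_docs = scores.shape
    top = min(k, num_docs)
    # Sort each row by descending score and keep the first `top` columns.
    order = np.argsort(-scores, axis=1, kind="stable")[:, :top]
    top_scores = np.take_along_axis(scores, order, axis=1)
    # Pad to width k: -1 marks "no result" when k exceeds the corpus size.
    all_scores = np.full((num_queries, k), -np.inf, dtype=np.float32)
    all_indices = np.full((num_queries, k), -1, dtype=np.int64)
    all_scores[:, :top] = top_scores
    all_indices[:, :top] = order
    return all_scores, all_indices


# Toy embeddings, illustrative only.
corpus_emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype=np.float32)
queries_emb = np.array([[1.0, 0.0]], dtype=np.float32)
all_scores, all_indices = brute_force_search(queries_emb, corpus_emb, k=4)
```

With three documents and k=4, the last column of all_indices is -1, which is why the result-building step has to skip invalid indices.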

Step 6: Build results dictionary

The raw FAISS results (score arrays and index arrays) are converted into the output dictionary format, filtering out invalid indices (-1) and optionally identical IDs:

results = {}
for idx, (scores, indices) in enumerate(zip(all_scores, all_indices)):
    results[queries_ids[idx]] = {}
    for score, indice in zip(scores, indices):
        if indice != -1:
            if ignore_identical_ids and corpus_ids[indice] == queries_ids[idx]:
                continue
            results[queries_ids[idx]][corpus_ids[indice]] = float(score)
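The effect of the two filters is easiest to see on toy data. The sketch below repackages the loop above as a standalone function; build_results and the sample arrays are hypothetical and use plain lists rather than real FAISS output.

```python
from typing import Dict, List


def build_results(
    queries_ids: List[str],
    corpus_ids: List[str],
    all_scores: List[List[float]],
    all_indices: List[List[int]],
    ignore_identical_ids: bool = False,
) -> Dict[str, Dict[str, float]]:
    """Hypothetical helper: convert score/index rows into {qid: {docid: score}}."""
    results: Dict[str, Dict[str, float]] = {}
    for qid, scores, indices in zip(queries_ids, all_scores, all_indices):
        hits: Dict[str, float] = {}
        for score, doc_idx in zip(scores, indices):
            if doc_idx == -1:  # padding entry: fewer than k results available
                continue
            if ignore_identical_ids and corpus_ids[doc_idx] == qid:
                continue  # skip the document that shares the query's ID
            hits[corpus_ids[doc_idx]] = float(score)
        results[qid] = hits
    return results


# Toy data: the query ID "a" also exists as a document ID.
corpus_ids = ["a", "b", "c"]
queries_ids = ["a"]
all_scores = [[1.0, 0.6, 0.0, float("-inf")]]
all_indices = [[0, 2, 1, -1]]

kept = build_results(queries_ids, corpus_ids, all_scores, all_indices)
filtered = build_results(
    queries_ids, corpus_ids, all_scores, all_indices, ignore_identical_ids=True
)
```

Here kept retains the self-match "a" while filtered drops it, and the padded -1 entry is absent from both. When identical IDs are filtered, a query can end up with fewer than search_top_k results.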

Input / Output

Input:

  • Corpus dictionary: {"doc-0": {"text": "This is a document."}}
  • Queries dictionary: {"q-0": "This is a query."}
  • Optional: corpus_embd_save_dir (directory for the embedding cache) and ignore_identical_ids (identical-ID filter flag)

Output:

  • Search results dictionary: {"q-0": {"doc-0": 0.9, "doc-5": 0.7, ...}}
  • Contains up to search_top_k entries per query, sorted by descending score
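Downstream code often needs a ranked list of document IDs rather than the score dictionary. A minimal sketch, assuming the output structure shown above: the helper top_n is hypothetical, and it re-sorts by score explicitly instead of relying on dictionary insertion order.

```python
from typing import Dict, List


def top_n(results: Dict[str, Dict[str, float]], n: int) -> Dict[str, List[str]]:
    """Hypothetical helper: keep the n highest-scoring document IDs per query."""
    ranked: Dict[str, List[str]] = {}
    for qid, hits in results.items():
        # Sort (docid, score) pairs by score, highest first.
        ordered = sorted(hits.items(), key=lambda kv: kv[1], reverse=True)
        ranked[qid] = [docid for docid, _ in ordered[:n]]
    return ranked


# Illustrative results in the documented {qid: {docid: score}} format.
results = {"q-0": {"doc-0": 0.9, "doc-5": 0.7, "doc-2": 0.4}}
ranked = top_n(results, 2)
```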
