Principle:Deepset ai Haystack Retrieval MAP Evaluation
Overview
Mean Average Precision (MAP) evaluates retrieval quality in two steps: for each query, it averages the precision measured at the rank of each relevant document, and it then averages those per-query scores across all queries. Unlike MRR, which considers only the first relevant document, MAP rewards systems that rank all relevant documents highly.
Domains
- Evaluation
- Information_Retrieval
Theoretical Foundation
MAP is defined as the mean of the Average Precision (AP) scores across all queries:
MAP = (1/Q) * sum(AP_i) for i = 1..Q
AP = (1/R) * sum(P(k) * rel(k)) for k = 1..n
Where:
- Q is the total number of queries.
- AP_i is the Average Precision for query i.
- R is the number of relevant documents for the query.
- P(k) is the precision at position k in the ranked list.
- rel(k) is a binary indicator (1 if the document at rank k is relevant, 0 otherwise).
- n is the total number of retrieved documents.
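The two formulas can be sketched directly in Python. This is a minimal illustration rather than any library's implementation: the function names are hypothetical, each query is assumed to be a list of binary relevance flags in rank order plus its value of R, and a query with no relevant documents is assumed to score 0.

```python
from typing import Sequence


def average_precision(relevance: Sequence[int], num_relevant: int) -> float:
    """AP = (1/R) * sum(P(k) * rel(k)) for k = 1..n."""
    if num_relevant <= 0:
        return 0.0  # assumption: no relevant documents means a score of 0
    hits = 0
    total = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k  # P(k), counted only where rel(k) = 1
    return total / num_relevant


def mean_average_precision(queries: Sequence[tuple[Sequence[int], int]]) -> float:
    """MAP = (1/Q) * sum(AP_i) for i = 1..Q."""
    if not queries:
        return 0.0  # assumption: an empty query set scores 0
    return sum(average_precision(rel, r) for rel, r in queries) / len(queries)


# One query with flags [1, 0, 1] and R = 2, plus a second with [0, 1] and R = 1.
ap_example = average_precision([1, 0, 1], num_relevant=2)
map_example = mean_average_precision([([1, 0, 1], 2), ([0, 1], 1)])
print(round(ap_example, 3), round(map_example, 3))  # 0.833 0.667
```

Passing R separately from the relevance flags keeps the normalizer tied to the ground truth rather than to what happened to be retrieved, so a relevant document that is never retrieved still lowers the score.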
Worked Example
Consider a query with 2 relevant documents. The retrieved list has 3 documents:
- Rank 1: relevant (precision = 1/1 = 1.0)
- Rank 2: not relevant
- Rank 3: relevant (precision = 2/3 ≈ 0.667)
AP = (1/2) * (1.0 + 0.667) ≈ 0.833
Key Properties
- Considers full ranking: Unlike MRR, MAP accounts for the positions of all relevant documents, not just the first.
- Score range: MAP scores range from 0.0 to 1.0. A score of 1.0 means that, for every query, all relevant documents are retrieved and ranked above every non-relevant document.
- Precision-oriented: MAP incorporates precision at each relevant position, penalizing systems that intersperse irrelevant results among relevant ones.
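To make the ranking sensitivity concrete, the sketch below scores three orderings of the same four retrieved documents, two of which are relevant. The helper name and the sample rankings are illustrative assumptions, not part of any library.

```python
def average_precision(relevance, num_relevant):
    """Average of P(k) over the ranks k that hold a relevant document."""
    hits, total = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / num_relevant if num_relevant else 0.0


# Same documents, different orderings (1 = relevant, 0 = not relevant).
rankings = {
    "relevant first": [1, 1, 0, 0],
    "interleaved": [1, 0, 1, 0],
    "relevant last": [0, 0, 1, 1],
}
scores = {name: average_precision(rel, num_relevant=2) for name, rel in rankings.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
# relevant first: 1.000
# interleaved: 0.833
# relevant last: 0.417
```

All three orderings retrieve both relevant documents, so only the positions differ; the score falls as irrelevant results move ahead of relevant ones.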
When to Use MAP
MAP is well suited to scenarios where retrieving every relevant document, and ranking it highly, matters:
- Multi-document question answering: Where multiple passages contribute to the answer.
- Comprehensive retrieval: Where missing a relevant document impacts downstream quality.
- RAG evaluation: Where the quality of the full context window depends on ranking all relevant documents highly.
Limitations
- Assumes binary relevance (a document is either relevant or not); it does not handle graded relevance scores.
- Requires knowledge of all relevant documents (the ground truth set must be complete).
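- Weights every query equally, regardless of how many relevant documents it has, so a query with a single relevant document can move the mean as much as a query with many.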
Relationship to Implementation
In the Haystack framework, this principle is realized by the DocumentMAPEvaluator component, which:
- Accepts lists of ground truth documents and retrieved documents per query.
- Computes Average Precision for each query based on content matching.
- Returns both individual per-query AP scores and the aggregated MAP score.
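The three behaviours listed above can be illustrated with a plain-Python stand-in. This is not the Haystack source code: the `Doc` class, the `evaluate_map` function, and the `score` / `individual_scores` result keys are hypothetical names chosen for this sketch. Normalizing by the size of the ground truth set follows the definition of R given earlier, and the component's own normalization may differ.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a document object; only the text is used."""
    content: str


def evaluate_map(ground_truth_documents, retrieved_documents):
    """Return per-query AP scores and their mean, matching documents by content."""
    if len(ground_truth_documents) != len(retrieved_documents):
        raise ValueError("Expected one ground truth list per retrieved list.")
    individual_scores = []
    for truth, retrieved in zip(ground_truth_documents, retrieved_documents):
        relevant_contents = {doc.content for doc in truth}
        seen = set()  # avoid counting a duplicated retrieved document twice
        hits, total = 0, 0.0
        for rank, doc in enumerate(retrieved, start=1):
            if doc.content in relevant_contents and doc.content not in seen:
                seen.add(doc.content)
                hits += 1
                total += hits / rank
        # Assumption: normalize by R, the number of ground truth documents.
        ap = total / len(relevant_contents) if relevant_contents else 0.0
        individual_scores.append(ap)
    score = sum(individual_scores) / len(individual_scores) if individual_scores else 0.0
    return {"score": score, "individual_scores": individual_scores}


result = evaluate_map(
    ground_truth_documents=[[Doc("France")], [Doc("9th century"), Doc("9th")]],
    retrieved_documents=[
        [Doc("France")],
        [Doc("9th century"), Doc("10th century"), Doc("9th")],
    ],
)
print(result["individual_scores"], result["score"])
```

Matching on content means two documents with identical text count as the same document, while any difference in whitespace or casing makes them distinct, so ground truth and retrieved text should come from the same preprocessing.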
Related Principles
- Retrieval MRR Evaluation -- focuses only on the first relevant document's rank.
- Retrieval Recall Evaluation -- measures the proportion of relevant documents found, regardless of rank.
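The difference between these metrics shows up when two rankings share the same first hit and the same set of retrieved relevant documents. The sketch below computes single-query values (reciprocal rank, recall, and average precision) with hypothetical helper functions; the corpus-level metrics are the means of these values over all queries.

```python
def reciprocal_rank(relevance):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for k, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / k
    return 0.0


def recall(relevance, num_relevant):
    """Fraction of the relevant documents that appear anywhere in the list."""
    return sum(relevance) / num_relevant if num_relevant else 0.0


def average_precision(relevance, num_relevant):
    """Average of P(k) over the ranks k that hold a relevant document."""
    hits, total = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / num_relevant if num_relevant else 0.0


# Two rankings for a query with 2 relevant documents (1 = relevant).
ranking_a = [1, 1, 0, 0, 0]
ranking_b = [1, 0, 0, 0, 1]

for name, ranking in (("A", ranking_a), ("B", ranking_b)):
    print(
        name,
        f"RR={reciprocal_rank(ranking):.2f}",
        f"recall={recall(ranking, 2):.2f}",
        f"AP={average_precision(ranking, 2):.2f}",
    )
# A RR=1.00 recall=1.00 AP=1.00
# B RR=1.00 recall=1.00 AP=0.70
```

Reciprocal rank and recall are identical for both rankings, while average precision drops for ranking B because its second relevant document sits at rank 5.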
References
- Pinecone: Offline Evaluation Metrics
- Manning, C. D., Raghavan, P., & Schütze, H. (2008). "Introduction to Information Retrieval." Cambridge University Press.