Principle: deepset-ai Haystack Context Relevance Evaluation
Overview
Context relevance evaluation measures whether retrieved context documents are relevant to the input question. It assesses the quality of the retrieval step in a RAG pipeline by determining if the context provided to the generator actually contains information useful for answering the question.
Domains
- Evaluation
- NLP
Theoretical Foundation
In RAG systems, the retriever produces context documents that are fed to the generator. If the retrieved context is irrelevant to the question, the generator cannot produce a good answer regardless of its capability. Context relevance evaluation measures this upstream quality.
LLM-as-Judge Approach
Context relevance evaluation uses an LLM judge to:
- Analyze the context: Extract sentences from the provided context that are relevant to answering the question.
- Binary scoring: Assign a binary score per context:
  - 1 if the context contains any relevant statements.
  - 0 if the context contains no relevant statements.
Context Relevance Score = (number of relevant contexts) / (total number of contexts)
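The per-context scoring and the ratio above can be sketched in a few lines of plain Python (function and variable names here are illustrative, not Haystack API):

```python
def score_contexts(statements_per_context):
    """Binary per-context scores from the judge's extracted statements.

    statements_per_context: one list of extracted relevant statements
    per retrieved context; an empty list means nothing relevant was found.
    """
    # 1 if the judge extracted at least one relevant statement, else 0
    scores = [1 if statements else 0 for statements in statements_per_context]
    # Context Relevance Score = relevant contexts / total contexts
    relevance = sum(scores) / len(scores)
    return scores, relevance
```

For example, if a judge finds relevant statements in two of three retrieved contexts, the query's context relevance is 2/3.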
Relevant Statement Extraction
The LLM is instructed to extract only sentences from the context that are "absolutely relevant and required" to answer the question. This strict criterion ensures that only genuinely useful information is counted.
If no relevant sentences can be found, or if the question cannot be answered from the given context, the LLM returns an empty list.
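A minimal sketch of how a judge's structured reply might be parsed into a statement list, with the empty list serving as the "no relevant statements / not answerable" signal. The JSON field name and the fallback behavior are assumptions for illustration, not the exact format Haystack uses internally:

```python
import json

def parse_judge_output(raw: str) -> list[str]:
    """Parse a judge's JSON reply into a list of relevant statements.

    An empty list means the context contains no relevant statements,
    or the question cannot be answered from the given context.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Hypothetical fallback: treat malformed judge output as
        # "no relevant statements" rather than crashing the evaluation.
        return []
    statements = data.get("relevant_statements", [])
    return [s for s in statements if isinstance(s, str) and s.strip()]
```

Treating malformed output as an empty extraction is one possible design choice; an alternative is to retry the judge call or flag the sample for review.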
Scoring Model
- Per-context score: Binary (1 if any relevant statement exists, 0 otherwise).
- Per-query score: The proportion of relevant contexts for that query.
- Aggregate score: Mean of all per-query scores.
This approach identifies which contexts contribute useful information and which are noise or off-topic.
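The two aggregation levels above can be sketched as follows (a plain-Python illustration, not Haystack's internal code):

```python
def aggregate_context_relevance(per_context_scores_per_query):
    """Roll binary per-context scores up into per-query and aggregate scores.

    per_context_scores_per_query: one list of 0/1 context scores per query.
    """
    # Per-query score: proportion of relevant contexts for that query
    per_query = [sum(scores) / len(scores) for scores in per_context_scores_per_query]
    # Aggregate score: mean of all per-query scores
    aggregate = sum(per_query) / len(per_query)
    return per_query, aggregate
```

For instance, one query with one of two relevant contexts (0.5) and another with both contexts relevant (1.0) yields an aggregate score of 0.75.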
When to Use Context Relevance Evaluation
- Retriever quality assessment: To evaluate whether the retriever is fetching relevant documents.
- Pipeline debugging: To isolate whether poor answer quality stems from retrieval or generation.
- Document store evaluation: To assess if the knowledge base contains relevant information for expected queries.
- Retrieval strategy comparison: To compare different retrieval methods (BM25 vs. embedding-based) on context quality.
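The pipeline-debugging use case can be sketched as a crude triage rule: a low context relevance score implicates retrieval, while a low faithfulness score with good contexts implicates generation. The function, its threshold, and the labels are hypothetical illustrations, not part of any Haystack API:

```python
def diagnose_rag_failure(context_relevance: float, faithfulness: float,
                         threshold: float = 0.5) -> str:
    """Crude triage of a RAG pipeline from two evaluation scores.

    The 0.5 threshold is an arbitrary example; calibrate it per dataset.
    """
    if context_relevance < threshold:
        # The generator never saw useful context: fix retrieval first
        return "retrieval"
    if faithfulness < threshold:
        # Contexts were relevant but the answer drifted from them
        return "generation"
    return "ok"
```

In practice one would inspect per-query scores rather than a single aggregate, but the ordering of checks (retrieval before generation) reflects the upstream/downstream relationship described above.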
Limitations
- LLM judge reliability: The evaluation depends on the quality and consistency of the LLM judge.
- Binary granularity: Contexts are scored as fully relevant (1) or fully irrelevant (0), with no partial relevance.
- Cost: Requires LLM API calls for each evaluation, incurring latency and expense.
- Non-deterministic: Results may vary across runs due to LLM stochasticity.
- Context length sensitivity: Very long contexts may be harder for the LLM to evaluate accurately.
Relationship to Implementation
In the Haystack framework, this principle is realized by the ContextRelevanceEvaluator component, which:
- Extends LLMEvaluator with context-relevance-specific prompting.
- Uses OpenAI (via OpenAIChatGenerator) by default, but accepts any compatible ChatGenerator.
- Returns the extracted relevant statements, per-context binary scores, and the aggregate relevance score.
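A minimal usage sketch, assuming Haystack 2.x (the haystack-ai package) is installed and the OPENAI_API_KEY environment variable is set; output field names may vary across Haystack versions:

```python
from haystack.components.evaluators import ContextRelevanceEvaluator

evaluator = ContextRelevanceEvaluator()

# One question with two retrieved contexts; only the first is relevant.
result = evaluator.run(
    questions=["Who created the Python language?"],
    contexts=[[
        "Python was created by Guido van Rossum.",
        "The Eiffel Tower is located in Paris.",
    ]],
)

print(result["score"])              # aggregate relevance score
print(result["individual_scores"])  # per-query scores
print(result["results"])            # judge output per question/context
```

Because the judge is an LLM behind an API, repeated runs on the same inputs may return slightly different extractions and scores.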
External Dependencies
This is a wrapper component that relies on an external LLM API (by default, OpenAI API via the OpenAIChatGenerator). An API key and network access are required.
Related Principles
- Faithfulness Evaluation -- evaluates whether the answer is grounded in the context (downstream quality).
- Retrieval Recall Evaluation -- evaluates whether relevant documents were retrieved (set-based metric without LLM judgment).
References
- Es, S. et al. (2023). "RAGAS: Automated Evaluation of Retrieval Augmented Generation."