Principle: deepset-ai Haystack Context Relevance Evaluation
Overview
Context relevance evaluation measures whether retrieved context documents are relevant to the input question. It assesses the quality of the retrieval step in a RAG pipeline by determining if the context provided to the generator actually contains information useful for answering the question.
Domains
- Evaluation
- NLP
Theoretical Foundation
In RAG systems, the retriever produces context documents that are fed to the generator. If the retrieved context is irrelevant to the question, the generator cannot produce a good answer regardless of its capability. Context relevance evaluation measures this upstream quality.
LLM-as-Judge Approach
Context relevance evaluation uses an LLM judge to:
- Analyze the context: Extract sentences from the provided context that are relevant to answering the question.
- Binary scoring: Assign a binary score per context:
  - 1 if the context contains any relevant statements.
  - 0 if the context contains no relevant statements.
Context Relevance Score = (number of relevant contexts) / (total number of contexts)
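The per-context scoring and the ratio above can be sketched in a few lines of plain Python (function and variable names here are illustrative, not Haystack API):

```python
def score_contexts(statements_per_context):
    """Binary per-context scores from the judge's extracted statements.

    statements_per_context: one list of extracted relevant statements
    per retrieved context; an empty list means nothing relevant was found.
    """
    # 1 if the judge extracted at least one relevant statement, else 0
    scores = [1 if statements else 0 for statements in statements_per_context]
    # Context Relevance Score = relevant contexts / total contexts
    relevance = sum(scores) / len(scores)
    return scores, relevance
```

For example, if a judge finds relevant statements in two of three retrieved contexts, the query's context relevance is 2/3.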
Relevant Statement Extraction
The LLM is instructed to extract only sentences from the context that are "absolutely relevant and required" to answer the question. This strict criterion ensures that only genuinely useful information is counted.
If no relevant sentences can be found, or if the question cannot be answered from the given context, the LLM returns an empty list.
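A minimal sketch of how a judge's structured reply might be parsed into a statement list, with the empty list serving as the "no relevant statements / not answerable" signal. The JSON field name and the fallback behavior are assumptions for illustration, not the exact format Haystack uses internally:

```python
import json

def parse_judge_output(raw: str) -> list[str]:
    """Parse a judge's JSON reply into a list of relevant statements.

    An empty list means the context contains no relevant statements,
    or the question cannot be answered from the given context.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Hypothetical fallback: treat malformed judge output as
        # "no relevant statements" rather than crashing the evaluation.
        return []
    statements = data.get("relevant_statements", [])
    return [s for s in statements if isinstance(s, str) and s.strip()]
```

Treating malformed output as an empty extraction is one possible design choice; an alternative is to retry the judge call or flag the sample for review.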
Scoring Model
- Per-context score: Binary (1 if any relevant statement exists, 0 otherwise).
- Per-query score: The proportion of relevant contexts for that query.
- Aggregate score: Mean of all per-query scores.
This approach identifies which contexts contribute useful information and which are noise or off-topic.
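The two aggregation levels above can be sketched as follows (a plain-Python illustration, not Haystack's internal code):

```python
def aggregate_context_relevance(per_context_scores_per_query):
    """Roll binary per-context scores up into per-query and aggregate scores.

    per_context_scores_per_query: one list of 0/1 context scores per query.
    """
    # Per-query score: proportion of relevant contexts for that query
    per_query = [sum(scores) / len(scores) for scores in per_context_scores_per_query]
    # Aggregate score: mean of all per-query scores
    aggregate = sum(per_query) / len(per_query)
    return per_query, aggregate
```

For instance, one query with one of two relevant contexts (0.5) and another with both contexts relevant (1.0) yields an aggregate score of 0.75.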
When to Use Context Relevance Evaluation
- Retriever quality assessment: To evaluate whether the retriever is fetching relevant documents.
- Pipeline debugging: To isolate whether poor answer quality stems from retrieval or generation.
- Document store evaluation: To assess if the knowledge base contains relevant information for expected queries.
- Retrieval strategy comparison: To compare different retrieval methods (BM25 vs. embedding-based) on context quality.
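The pipeline-debugging use case can be sketched as a crude triage rule: a low context relevance score implicates retrieval, while a low faithfulness score with good contexts implicates generation. The function, its threshold, and the labels are hypothetical illustrations, not part of any Haystack API:

```python
def diagnose_rag_failure(context_relevance: float, faithfulness: float,
                         threshold: float = 0.5) -> str:
    """Crude triage of a RAG pipeline from two evaluation scores.

    The 0.5 threshold is an arbitrary example; calibrate it per dataset.
    """
    if context_relevance < threshold:
        # The generator never saw useful context: fix retrieval first
        return "retrieval"
    if faithfulness < threshold:
        # Contexts were relevant but the answer drifted from them
        return "generation"
    return "ok"
```

In practice one would inspect per-query scores rather than a single aggregate, but the ordering of checks (retrieval before generation) reflects the upstream/downstream relationship described above.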
Limitations
- LLM judge reliability: The evaluation depends on the quality and consistency of the LLM judge.
- Binary granularity: Contexts are scored as fully relevant (1) or fully irrelevant (0), with no partial relevance.
- Cost: Requires LLM API calls for each evaluation, incurring latency and expense.
- Non-deterministic: Results may vary across runs due to LLM stochasticity.
- Context length sensitivity: Very long contexts may be harder for the LLM to evaluate accurately.
Relationship to Implementation
In the Haystack framework, this principle is realized by the ContextRelevanceEvaluator component, which:
- Extends LLMEvaluator with context-relevance-specific prompting.
- Uses OpenAI (via OpenAIChatGenerator) by default, but accepts any compatible ChatGenerator.
- Returns the extracted relevant statements, per-context binary scores, and the aggregate relevance score.
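A minimal usage sketch, assuming Haystack 2.x (the haystack-ai package) is installed and the OPENAI_API_KEY environment variable is set; output field names may vary across Haystack versions:

```python
from haystack.components.evaluators import ContextRelevanceEvaluator

evaluator = ContextRelevanceEvaluator()

# One question with two retrieved contexts; only the first is relevant.
result = evaluator.run(
    questions=["Who created the Python language?"],
    contexts=[[
        "Python was created by Guido van Rossum.",
        "The Eiffel Tower is located in Paris.",
    ]],
)

print(result["score"])              # aggregate relevance score
print(result["individual_scores"])  # per-query scores
print(result["results"])            # judge output per question/context
```

Because the judge is an LLM behind an API, repeated runs on the same inputs may return slightly different extractions and scores.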
External Dependencies
This is a wrapper component that relies on an external LLM API (by default, OpenAI API via the OpenAIChatGenerator). An API key and network access are required.
Related Principles
- Faithfulness Evaluation -- evaluates whether the answer is grounded in the context (downstream quality).
- Retrieval Recall Evaluation -- evaluates whether relevant documents were retrieved (set-based metric without LLM judgment).
References
- Es, S. et al. (2023). "RAGAS: Automated Evaluation of Retrieval Augmented Generation."