Implementation:Deepset ai Haystack ContextRelevanceEvaluator
Overview
ContextRelevanceEvaluator is a Haystack evaluator component that checks whether provided contexts are relevant to the input questions. It uses an LLM to extract relevant statements from each context, producing a binary relevance score per context and an aggregate score across all queries.
Implements Principle
Principle:Deepset_ai_Haystack_Context_Relevance_Evaluation
Source Location
haystack/components/evaluators/context_relevance.py (Lines 42-188)
Import
from haystack.components.evaluators import ContextRelevanceEvaluator
Component Registration
ContextRelevanceEvaluator is decorated with @component and extends LLMEvaluator, making it a standard Haystack pipeline component with LLM-backed evaluation capabilities.
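For illustration, a minimal sketch of registering the evaluator in a pipeline; the component name "context_relevance" and the sample inputs are placeholders chosen for this sketch, not part of the library:
from haystack import Pipeline
from haystack.components.evaluators import ContextRelevanceEvaluator

pipeline = Pipeline()
# "context_relevance" is an arbitrary component name chosen for this sketch.
pipeline.add_component("context_relevance", ContextRelevanceEvaluator())

result = pipeline.run(
    {
        "context_relevance": {
            "questions": ["What is the capital of Italy?"],
            "contexts": [["Rome is the capital of Italy."]],
        }
    }
)
print(result["context_relevance"]["score"])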
External Dependencies (Wrapper)
This component wraps an external LLM service:
- Default: OpenAI API via OpenAIChatGenerator (requires the OPENAI_API_KEY environment variable; a minimal setup is sketched below).
- Custom: Any ChatGenerator instance can be provided. The LLM must be configured to return JSON output.
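A minimal sketch of the default setup, assuming the key is supplied via the environment (the key value below is a placeholder); the custom-generator path is shown in the Using a Custom ChatGenerator section:
import os
from haystack.components.evaluators import ContextRelevanceEvaluator

# The evaluator builds an OpenAIChatGenerator internally, which reads the
# API key from the OPENAI_API_KEY environment variable.
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder
evaluator = ContextRelevanceEvaluator()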
API
Constructor
def __init__(
self,
examples: list[dict[str, Any]] | None = None,
progress_bar: bool = True,
raise_on_failure: bool = True,
chat_generator: ChatGenerator | None = None,
):
Parameters:
- examples (list[dict] | None, default: None) -- Optional few-shot examples for the LLM judge. If not provided, default examples are used. Each example must have "inputs" (with keys "questions" and "contexts") and "outputs" (with key "relevant_statements"); a sketch follows this list.
- progress_bar (bool, default: True) -- Whether to show a progress bar during evaluation.
- raise_on_failure (bool, default: True) -- Whether to raise an exception if the API call fails.
- chat_generator (ChatGenerator | None, default: None) -- A ChatGenerator instance representing the LLM. Must be configured for JSON output. If None, uses OpenAIChatGenerator.
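A sketch of supplying custom few-shot examples in the format described above; the question, context, and statement strings are invented for illustration:
from haystack.components.evaluators import ContextRelevanceEvaluator

custom_examples = [
    {
        "inputs": {
            "questions": "What is the boiling point of water?",
            "contexts": ["Water boils at 100 degrees Celsius at sea level. Ice melts at 0 degrees Celsius."],
        },
        "outputs": {"relevant_statements": ["Water boils at 100 degrees Celsius at sea level."]},
    },
]

evaluator = ContextRelevanceEvaluator(examples=custom_examples, progress_bar=False)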
run()
def run(
self,
questions: list[str],
contexts: list[list[str]]
) -> dict[str, Any]:
Parameters:
- questions (list[str]) -- A list of questions.
- contexts (list[list[str]]) -- A list of lists of contexts. Each list of contexts corresponds to one question.
Returns: A dictionary with the following keys:
- score (float) -- Mean context relevance score over all queries.
- individual_scores (list[int]) -- A list of binary scores (0 or 1) for each query context.
- results (list[dict]) -- A list of dictionaries, each containing:
  - relevant_statements (list[str]) -- Extracted relevant statements from the context.
  - score (float) -- Binary score (1.0 if any relevant statements exist, 0.0 otherwise).
to_dict() / from_dict()
Serialization and deserialization methods for pipeline export and import. Handles chat generator serialization.
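A round-trip sketch, assuming both methods follow the standard Haystack component serialization pattern:
from haystack.components.evaluators import ContextRelevanceEvaluator

evaluator = ContextRelevanceEvaluator()
data = evaluator.to_dict()                        # plain dict, includes the chat generator config
restored = ContextRelevanceEvaluator.from_dict(data)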
Internal Architecture
ContextRelevanceEvaluator extends LLMEvaluator and configures it with:
- Instructions: A prompt directing the LLM to extract only sentences that are "absolutely relevant and required" to answer the question.
- Input specification: [("questions", list[str]), ("contexts", list[list[str]])]
- Output specification: ["relevant_statements"]
- Examples: Three default few-shot examples covering relevant, irrelevant, and directly relevant contexts.
Scoring Logic
After the base LLMEvaluator.run() processes each input:
- For each result, if relevant_statements is non-empty, set score = 1; otherwise score = 0.
- If a result is None (API failure), set relevant_statements to an empty list and the score to NaN.
- Compute the overall score as the mean of all per-query scores.
- Populate individual_scores from each result's score.
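The same aggregation can be written as a standalone sketch (illustrative, not the library's code); per_query_results stands for the list produced by the base evaluator, with None marking a failed call:
from statistics import mean

def aggregate(per_query_results: list[dict | None]) -> dict:
    """Score each query 1 if any relevant statements were extracted, 0 if none, NaN on failure."""
    results = []
    individual_scores = []
    for res in per_query_results:
        if res is None:  # the API call for this query failed
            results.append({"relevant_statements": [], "score": float("nan")})
            individual_scores.append(float("nan"))
        else:
            score = 1 if res["relevant_statements"] else 0
            results.append({"relevant_statements": res["relevant_statements"], "score": score})
            individual_scores.append(score)
    return {
        "results": results,
        "individual_scores": individual_scores,
        "score": mean(individual_scores),  # NaN propagates if any query failed
    }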
Default Few-Shot Examples
# Example 1: Context is relevant
{
"inputs": {
"questions": "What is the capital of Germany?",
"contexts": ["Berlin is the capital of Germany. Berlin and was founded in 1244."],
},
"outputs": {"relevant_statements": ["Berlin is the capital of Germany."]},
}
# Example 2: Context is NOT relevant
{
"inputs": {
"questions": "What is the capital of France?",
"contexts": [
"Berlin is the capital of Germany and was founded in 1244.",
"Europe is a continent with 44 countries.",
"Madrid is the capital of Spain.",
],
},
"outputs": {"relevant_statements": []},
}
# Example 3: Context is relevant
{
"inputs": {
"questions": "What is the capital of Italy?",
"contexts": ["Rome is the capital of Italy."],
},
"outputs": {"relevant_statements": ["Rome is the capital of Italy."]},
}
Usage Example
from haystack.components.evaluators import ContextRelevanceEvaluator
evaluator = ContextRelevanceEvaluator()
questions = [
"Who created the Python language?",
"Why does Java need a JVM?",
"Is C++ better than Python?",
]
contexts = [
[
"Python, created by Guido van Rossum in the late 1980s, is a high-level "
"general-purpose programming language."
],
[
"Java is a high-level, class-based, object-oriented programming language. "
"The JVM has two primary functions: to allow Java programs to run on any device "
"or operating system, and to manage and optimize program memory."
],
[
"C++ is a general-purpose programming language created by Bjarne Stroustrup "
"as an extension of the C programming language."
],
]
result = evaluator.run(questions=questions, contexts=contexts)
print(result["score"])
# 0.67
print(result["individual_scores"])
# [1, 1, 0]
print(result["results"])
# [{'relevant_statements': ['Python, created by Guido van Rossum...'], 'score': 1.0},
# {'relevant_statements': ['The JVM has two primary functions...'], 'score': 1.0},
# {'relevant_statements': [], 'score': 0.0}]
Using a Custom ChatGenerator
from haystack.components.evaluators import ContextRelevanceEvaluator
from haystack.components.generators.chat import OpenAIChatGenerator
custom_llm = OpenAIChatGenerator(
model="gpt-4o",
generation_kwargs={"response_format": {"type": "json_object"}},
)
evaluator = ContextRelevanceEvaluator(chat_generator=custom_llm)
Important Notes
- JSON mode required: The chat generator must be configured to return JSON. For OpenAI, pass {"response_format": {"type": "json_object"}} in generation_kwargs.
- API key required: By default uses OpenAI; set the OPENAI_API_KEY environment variable.
- Binary scoring: Unlike FaithfulnessEvaluator, which provides proportional scores, ContextRelevanceEvaluator uses binary scoring (relevant or not).
- Non-deterministic: Results may vary between runs due to LLM stochasticity.
- NaN handling: Failed API calls produce NaN scores for affected queries; a guard sketch follows this list.
- Note on inputs: Unlike document-based evaluators, this component takes string contexts rather than Document objects.
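A brief sketch of guarding against NaN entries when post-processing, continuing from the result dictionary returned by evaluator.run() in the usage example above:
import math

# result is the dictionary returned by evaluator.run() in the usage example above.
valid_scores = [s for s in result["individual_scores"] if not math.isnan(s)]
if valid_scores:
    print(f"{sum(valid_scores)} of {len(valid_scores)} contexts were judged relevant")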
Dependencies
- haystack core library
- statistics -- for mean computation
- External LLM API (OpenAI by default)