Implementation:Deepset ai Haystack ContextRelevanceEvaluator
Overview
ContextRelevanceEvaluator is a Haystack evaluator component that checks whether provided contexts are relevant to the input questions. It uses an LLM to extract relevant statements from each context, producing a binary relevance score per context and an aggregate score across all queries.
Implements Principle
Principle:Deepset_ai_Haystack_Context_Relevance_Evaluation
Source Location
haystack/components/evaluators/context_relevance.py (Lines 42-188)
Import
from haystack.components.evaluators import ContextRelevanceEvaluator
Component Registration
ContextRelevanceEvaluator is decorated with @component and extends LLMEvaluator, making it a standard Haystack pipeline component with LLM-backed evaluation capabilities.
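For illustration, a minimal sketch of registering the evaluator in a pipeline; the component name "context_relevance" and the sample inputs are placeholders chosen for this sketch, not part of the library:
from haystack import Pipeline
from haystack.components.evaluators import ContextRelevanceEvaluator

pipeline = Pipeline()
# "context_relevance" is an arbitrary component name chosen for this sketch.
pipeline.add_component("context_relevance", ContextRelevanceEvaluator())

result = pipeline.run(
    {
        "context_relevance": {
            "questions": ["What is the capital of Italy?"],
            "contexts": [["Rome is the capital of Italy."]],
        }
    }
)
print(result["context_relevance"]["score"])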
External Dependencies (Wrapper)
This component wraps an external LLM service:
- Default: OpenAI API via OpenAIChatGenerator (requires the OPENAI_API_KEY environment variable; a minimal setup is sketched below).
- Custom: Any ChatGenerator instance can be provided. The LLM must be configured to return JSON output.
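A minimal sketch of the default setup, assuming the key is supplied via the environment (the key value below is a placeholder); the custom-generator path is shown in the Using a Custom ChatGenerator section:
import os
from haystack.components.evaluators import ContextRelevanceEvaluator

# The evaluator builds an OpenAIChatGenerator internally, which reads the
# API key from the OPENAI_API_KEY environment variable.
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder
evaluator = ContextRelevanceEvaluator()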
API
Constructor
def __init__(
self,
examples: list[dict[str, Any]] | None = None,
progress_bar: bool = True,
raise_on_failure: bool = True,
chat_generator: ChatGenerator | None = None,
):
Parameters:
- examples (list[dict] | None, default: None) -- Optional few-shot examples for the LLM judge. If not provided, default examples are used. Each example must have "inputs" (with keys "questions" and "contexts") and "outputs" (with key "relevant_statements"); a sketch follows this list.
- progress_bar (bool, default: True) -- Whether to show a progress bar during evaluation.
- raise_on_failure (bool, default: True) -- Whether to raise an exception if the API call fails.
- chat_generator (ChatGenerator | None, default: None) -- A ChatGenerator instance representing the LLM. Must be configured for JSON output. If None, uses OpenAIChatGenerator.
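A sketch of supplying custom few-shot examples in the format described above; the question, context, and statement strings are invented for illustration:
from haystack.components.evaluators import ContextRelevanceEvaluator

custom_examples = [
    {
        "inputs": {
            "questions": "What is the boiling point of water?",
            "contexts": ["Water boils at 100 degrees Celsius at sea level. Ice melts at 0 degrees Celsius."],
        },
        "outputs": {"relevant_statements": ["Water boils at 100 degrees Celsius at sea level."]},
    },
]

evaluator = ContextRelevanceEvaluator(examples=custom_examples, progress_bar=False)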
run()
def run(
self,
questions: list[str],
contexts: list[list[str]]
) -> dict[str, Any]:
Parameters:
- questions (list[str]) -- A list of questions.
- contexts (list[list[str]]) -- A list of lists of contexts. Each list of contexts corresponds to one question.
Returns: A dictionary with the following keys:
- score (float) -- Mean context relevance score over all queries.
- individual_scores (list[int]) -- A list of binary scores (0 or 1) for each query context.
- results (list[dict]) -- A list of dictionaries, each containing:
  - relevant_statements (list[str]) -- Extracted relevant statements from the context.
  - score (float) -- Binary score (1.0 if any relevant statements exist, 0.0 otherwise).
to_dict() / from_dict()
Serialization and deserialization methods for pipeline export and import. Handles chat generator serialization.
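A round-trip sketch, assuming both methods follow the standard Haystack component serialization pattern:
from haystack.components.evaluators import ContextRelevanceEvaluator

evaluator = ContextRelevanceEvaluator()
data = evaluator.to_dict()                        # plain dict, includes the chat generator config
restored = ContextRelevanceEvaluator.from_dict(data)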
Internal Architecture
ContextRelevanceEvaluator extends LLMEvaluator and configures it with:
- Instructions: A prompt directing the LLM to extract only sentences that are "absolutely relevant and required" to answer the question.
- Input specification: [("questions", list[str]), ("contexts", list[list[str]])]
- Output specification: ["relevant_statements"]
- Examples: Three default few-shot examples covering relevant, irrelevant, and directly relevant contexts.
Scoring Logic
After the base LLMEvaluator.run() processes each input:
- For each result, if relevant_statements is non-empty, set score = 1; otherwise score = 0.
- If a result is None (API failure), set relevant_statements to an empty list and the score to NaN.
- Compute the overall score as the mean of all per-query scores.
- Populate individual_scores from each result's score.
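The same aggregation can be written as a standalone sketch (illustrative, not the library's code); per_query_results stands for the list produced by the base evaluator, with None marking a failed call:
from statistics import mean

def aggregate(per_query_results: list[dict | None]) -> dict:
    """Score each query 1 if any relevant statements were extracted, 0 if none, NaN on failure."""
    results = []
    individual_scores = []
    for res in per_query_results:
        if res is None:  # the API call for this query failed
            results.append({"relevant_statements": [], "score": float("nan")})
            individual_scores.append(float("nan"))
        else:
            score = 1 if res["relevant_statements"] else 0
            results.append({"relevant_statements": res["relevant_statements"], "score": score})
            individual_scores.append(score)
    return {
        "results": results,
        "individual_scores": individual_scores,
        "score": mean(individual_scores),  # NaN propagates if any query failed
    }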
Default Few-Shot Examples
# Example 1: Context is relevant
{
"inputs": {
"questions": "What is the capital of Germany?",
"contexts": ["Berlin is the capital of Germany. Berlin and was founded in 1244."],
},
"outputs": {"relevant_statements": ["Berlin is the capital of Germany."]},
}
# Example 2: Context is NOT relevant
{
"inputs": {
"questions": "What is the capital of France?",
"contexts": [
"Berlin is the capital of Germany and was founded in 1244.",
"Europe is a continent with 44 countries.",
"Madrid is the capital of Spain.",
],
},
"outputs": {"relevant_statements": []},
}
# Example 3: Context is relevant
{
"inputs": {
"questions": "What is the capital of Italy?",
"contexts": ["Rome is the capital of Italy."],
},
"outputs": {"relevant_statements": ["Rome is the capital of Italy."]},
}
Usage Example
from haystack.components.evaluators import ContextRelevanceEvaluator
evaluator = ContextRelevanceEvaluator()
questions = [
"Who created the Python language?",
"Why does Java need a JVM?",
"Is C++ better than Python?",
]
contexts = [
[
"Python, created by Guido van Rossum in the late 1980s, is a high-level "
"general-purpose programming language."
],
[
"Java is a high-level, class-based, object-oriented programming language. "
"The JVM has two primary functions: to allow Java programs to run on any device "
"or operating system, and to manage and optimize program memory."
],
[
"C++ is a general-purpose programming language created by Bjarne Stroustrup "
"as an extension of the C programming language."
],
]
result = evaluator.run(questions=questions, contexts=contexts)
print(result["score"])
# 0.67
print(result["individual_scores"])
# [1, 1, 0]
print(result["results"])
# [{'relevant_statements': ['Python, created by Guido van Rossum...'], 'score': 1.0},
# {'relevant_statements': ['The JVM has two primary functions...'], 'score': 1.0},
# {'relevant_statements': [], 'score': 0.0}]
Using a Custom ChatGenerator
from haystack.components.evaluators import ContextRelevanceEvaluator
from haystack.components.generators.chat import OpenAIChatGenerator
custom_llm = OpenAIChatGenerator(
model="gpt-4o",
generation_kwargs={"response_format": {"type": "json_object"}},
)
evaluator = ContextRelevanceEvaluator(chat_generator=custom_llm)
Important Notes
- JSON mode required: The chat generator must be configured to return JSON. For OpenAI, pass {"response_format": {"type": "json_object"}} in generation_kwargs.
- API key required: By default uses OpenAI; set the OPENAI_API_KEY environment variable.
- Binary scoring: Unlike FaithfulnessEvaluator, which provides proportional scores, ContextRelevanceEvaluator uses binary scoring (relevant or not).
- Non-deterministic: Results may vary between runs due to LLM stochasticity.
- NaN handling: Failed API calls produce NaN scores for affected queries; a guard sketch follows this list.
- Note on inputs: Unlike document-based evaluators, this component takes string contexts rather than Document objects.
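A brief sketch of guarding against NaN entries when post-processing, continuing from the result dictionary returned by evaluator.run() in the usage example above:
import math

# result is the dictionary returned by evaluator.run() in the usage example above.
valid_scores = [s for s in result["individual_scores"] if not math.isnan(s)]
if valid_scores:
    print(f"{sum(valid_scores)} of {len(valid_scores)} contexts were judged relevant")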
Dependencies
- haystack core library
- statistics -- for mean computation
- External LLM API (OpenAI by default)