Implementation:Deepset ai Haystack FaithfulnessEvaluator
Overview
FaithfulnessEvaluator is a Haystack evaluator component that checks whether a generated answer can be inferred from the provided contexts. It uses an LLM to decompose answers into statements and verify each against the context, producing a faithfulness score from 0.0 to 1.0.
Implements Principle
Principle:Deepset_ai_Haystack_Faithfulness_Evaluation
Source Location
haystack/components/evaluators/faithfulness.py (Lines 51-182)
Import
from haystack.components.evaluators import FaithfulnessEvaluator
Component Registration
FaithfulnessEvaluator is decorated with @component and extends LLMEvaluator, making it a standard Haystack pipeline component with LLM-backed evaluation capabilities.
External Dependencies (Wrapper)
This component wraps an external LLM service:
- Default: OpenAI API via OpenAIChatGenerator (requires the OPENAI_API_KEY environment variable).
- Custom: Any ChatGenerator instance can be provided. The LLM must be configured to return JSON output.
API
Constructor
def __init__(
self,
examples: list[dict[str, Any]] | None = None,
progress_bar: bool = True,
raise_on_failure: bool = True,
chat_generator: ChatGenerator | None = None,
):
Parameters:
- examples (list[dict] | None, default: None) -- Optional few-shot examples for the LLM judge. If not provided, default examples are used. Each example must have "inputs" (with keys "questions", "contexts", "predicted_answers") and "outputs" (with keys "statements", "statement_scores").
- progress_bar (bool, default: True) -- Whether to show a progress bar during evaluation.
- raise_on_failure (bool, default: True) -- Whether to raise an exception if the API call fails.
- chat_generator (ChatGenerator | None, default: None) -- A ChatGenerator instance representing the LLM. Must be configured for JSON output. If None, an OpenAIChatGenerator is used.
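The examples parameter accepts custom few-shot examples following the documented schema. A minimal sketch (the question, context, and answer below are made up for illustration):

```python
# One custom few-shot example, structured per the documented schema:
# "inputs" holds questions/contexts/predicted_answers, "outputs" holds
# the extracted statements and their binary scores.
custom_examples = [
    {
        "inputs": {
            "questions": "What is the tallest mountain on Earth?",
            "contexts": ["Mount Everest is the tallest mountain on Earth."],
            "predicted_answers": "Mount Everest.",
        },
        "outputs": {
            "statements": ["Mount Everest is the tallest mountain on Earth."],
            "statement_scores": [1],
        },
    }
]

# Instantiating the evaluator requires an LLM backend (OPENAI_API_KEY by default):
# from haystack.components.evaluators import FaithfulnessEvaluator
# evaluator = FaithfulnessEvaluator(examples=custom_examples, progress_bar=False)
```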
run()
def run(
self,
questions: list[str],
contexts: list[list[str]],
predicted_answers: list[str]
) -> dict[str, Any]:
Parameters:
- questions (list[str]) -- A list of questions.
- contexts (list[list[str]]) -- A nested list of context strings, one list per question.
- predicted_answers (list[str]) -- A list of generated answers to evaluate.
Returns: A dictionary with the following keys:
- score (float) -- Mean faithfulness score over all answers.
- individual_scores (list[float]) -- A list of per-answer faithfulness scores.
- results (list[dict]) -- A list of dictionaries, one per answer, each containing:
  - statements (list[str]) -- The extracted statements from the answer.
  - statement_scores (list[int]) -- Binary scores (0 or 1) for each statement.
  - score (float) -- Mean score for this answer's statements.
to_dict() / from_dict()
Serialization and deserialization methods for pipeline export and import. Handles chat generator serialization.
Internal Architecture
FaithfulnessEvaluator extends LLMEvaluator and configures it with:
- Instructions: A system prompt that directs the LLM to extract statements and score each for faithfulness.
- Input specification: [("questions", list[str]), ("contexts", list[list[str]]), ("predicted_answers", list[str])]
- Output specification: ["statements", "statement_scores"]
- Examples: Three default few-shot examples covering fully faithful, unfaithful, and partially faithful scenarios.
Scoring Logic
After the base LLMEvaluator.run() processes each input:
- For each result, compute the mean of statement_scores as the per-answer score.
- If a result is None (API failure), set statements and scores to empty lists and the score to NaN.
- If the statements list is empty, set the score to 0.
- Compute the overall score as the mean of all per-answer scores.
Default Few-Shot Examples
# Example 1: All statements faithful
{
"inputs": {
"questions": "What is the capital of Germany and when was it founded?",
"contexts": ["Berlin is the capital of Germany and was founded in 1244."],
"predicted_answers": "The capital of Germany, Berlin, was founded in the 13th century.",
},
"outputs": {
"statements": ["Berlin is the capital of Germany.", "Berlin was founded in 1244."],
"statement_scores": [1, 1],
},
}
# Example 2: No statements faithful (context mismatch)
{
"inputs": {
"questions": "What is the capital of France?",
"contexts": ["Berlin is the capital of Germany."],
"predicted_answers": "Paris",
},
"outputs": {
"statements": ["Paris is the capital of France."],
"statement_scores": [0],
},
}
# Example 3: Partial faithfulness
{
"inputs": {
"questions": "What is the capital of Italy?",
"contexts": ["Rome is the capital of Italy."],
"predicted_answers": "Rome is the capital of Italy with more than 4 million inhabitants.",
},
"outputs": {
"statements": ["Rome is the capital of Italy.", "Rome has more than 4 million inhabitants."],
"statement_scores": [1, 0],
},
}
Usage Example
from haystack.components.evaluators import FaithfulnessEvaluator
evaluator = FaithfulnessEvaluator()
questions = ["Who created the Python language?"]
contexts = [
[(
"Python, created by Guido van Rossum in the late 1980s, is a high-level "
"general-purpose programming language. Its design philosophy emphasizes code "
"readability, and its language constructs aim to help programmers write clear, "
"logical code for both small and large-scale software projects."
)],
]
predicted_answers = [
"Python is a high-level general-purpose programming language that was created by George Lucas."
]
result = evaluator.run(
questions=questions,
contexts=contexts,
predicted_answers=predicted_answers,
)
print(result["individual_scores"])
# [0.5]
print(result["score"])
# 0.5
print(result["results"])
# [{'statements': ['Python is a high-level general-purpose programming language.',
# 'Python was created by George Lucas.'], 'statement_scores': [1, 0], 'score': 0.5}]
Using a Custom ChatGenerator
from haystack.components.evaluators import FaithfulnessEvaluator
from haystack.components.generators.chat import OpenAIChatGenerator
custom_llm = OpenAIChatGenerator(
model="gpt-4o",
generation_kwargs={"response_format": {"type": "json_object"}},
)
evaluator = FaithfulnessEvaluator(chat_generator=custom_llm)
Important Notes
- JSON mode required: The chat generator must be configured to return JSON. For OpenAI, pass {"response_format": {"type": "json_object"}} in generation_kwargs.
- API key required: By default uses OpenAI; set the OPENAI_API_KEY environment variable.
- Non-deterministic: Results may vary between runs due to LLM stochasticity.
- NaN handling: Failed API calls produce NaN scores for affected answers.
- Progress bar: Enabled by default; can be disabled for batch or headless execution.
Dependencies
- haystack core library
- numpy -- for mean computation
- External LLM API (OpenAI by default)