Implementation:Deepset ai Haystack FaithfulnessEvaluator
Overview
FaithfulnessEvaluator is a Haystack evaluator component that checks whether a generated answer can be inferred from the provided contexts. It uses an LLM to decompose answers into statements and verify each against the context, producing a faithfulness score from 0.0 to 1.0.
Implements Principle
Principle:Deepset_ai_Haystack_Faithfulness_Evaluation
Source Location
haystack/components/evaluators/faithfulness.py (Lines 51-182)
Import
from haystack.components.evaluators import FaithfulnessEvaluator
Component Registration
FaithfulnessEvaluator is decorated with @component and extends LLMEvaluator, making it a standard Haystack pipeline component with LLM-backed evaluation capabilities.
External Dependencies (Wrapper)
This component wraps an external LLM service:
- Default: OpenAI API via OpenAIChatGenerator (requires the OPENAI_API_KEY environment variable).
- Custom: Any ChatGenerator instance can be provided. The LLM must be configured to return JSON output.
API
Constructor
def __init__(
self,
examples: list[dict[str, Any]] | None = None,
progress_bar: bool = True,
raise_on_failure: bool = True,
chat_generator: ChatGenerator | None = None,
):
Parameters:
- examples (list[dict] | None, default: None) -- Optional few-shot examples for the LLM judge. If not provided, default examples are used. Each example must have "inputs" (with keys "questions", "contexts", "predicted_answers") and "outputs" (with keys "statements", "statement_scores").
- progress_bar (bool, default: True) -- Whether to show a progress bar during evaluation.
- raise_on_failure (bool, default: True) -- Whether to raise an exception if the API call fails.
- chat_generator (ChatGenerator | None, default: None) -- A ChatGenerator instance representing the LLM. Must be configured for JSON output. If None, an OpenAIChatGenerator is used.
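The examples parameter accepts custom few-shot examples following the documented schema. A minimal sketch (the question, context, and answer below are made up for illustration):

```python
# One custom few-shot example, structured per the documented schema:
# "inputs" holds questions/contexts/predicted_answers, "outputs" holds
# the extracted statements and their binary scores.
custom_examples = [
    {
        "inputs": {
            "questions": "What is the tallest mountain on Earth?",
            "contexts": ["Mount Everest is the tallest mountain on Earth."],
            "predicted_answers": "Mount Everest.",
        },
        "outputs": {
            "statements": ["Mount Everest is the tallest mountain on Earth."],
            "statement_scores": [1],
        },
    }
]

# Instantiating the evaluator requires an LLM backend (OPENAI_API_KEY by default):
# from haystack.components.evaluators import FaithfulnessEvaluator
# evaluator = FaithfulnessEvaluator(examples=custom_examples, progress_bar=False)
```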
run()
def run(
self,
questions: list[str],
contexts: list[list[str]],
predicted_answers: list[str]
) -> dict[str, Any]:
Parameters:
- questions (list[str]) -- A list of questions.
- contexts (list[list[str]]) -- A nested list of context strings, one list per question.
- predicted_answers (list[str]) -- A list of generated answers to evaluate.
Returns: A dictionary with the following keys:
- score (float) -- Mean faithfulness score over all answers.
- individual_scores (list[float]) -- A list of per-answer faithfulness scores.
- results (list[dict]) -- A list of dictionaries, one per answer, each containing:
  - statements (list[str]) -- The extracted statements from the answer.
  - statement_scores (list[int]) -- Binary scores (0 or 1) for each statement.
  - score (float) -- Mean score for this answer's statements.
to_dict() / from_dict()
Serialization and deserialization methods for pipeline export and import. Handles chat generator serialization.
Internal Architecture
FaithfulnessEvaluator extends LLMEvaluator and configures it with:
- Instructions: A system prompt that directs the LLM to extract statements and score each for faithfulness.
- Input specification: [("questions", list[str]), ("contexts", list[list[str]]), ("predicted_answers", list[str])]
- Output specification: ["statements", "statement_scores"]
- Examples: Three default few-shot examples covering fully faithful, unfaithful, and partially faithful scenarios.
Scoring Logic
After the base LLMEvaluator.run() processes each input:
- For each result, compute the mean of statement_scores as the per-answer score.
- If a result is None (API failure), set statements and scores to empty lists and the score to NaN.
- If the statements list is empty, set the score to 0.
- Compute the overall score as the mean of all per-answer scores.
Default Few-Shot Examples
# Example 1: All statements faithful
{
"inputs": {
"questions": "What is the capital of Germany and when was it founded?",
"contexts": ["Berlin is the capital of Germany and was founded in 1244."],
"predicted_answers": "The capital of Germany, Berlin, was founded in the 13th century.",
},
"outputs": {
"statements": ["Berlin is the capital of Germany.", "Berlin was founded in 1244."],
"statement_scores": [1, 1],
},
}
# Example 2: No statements faithful (context mismatch)
{
"inputs": {
"questions": "What is the capital of France?",
"contexts": ["Berlin is the capital of Germany."],
"predicted_answers": "Paris",
},
"outputs": {
"statements": ["Paris is the capital of France."],
"statement_scores": [0],
},
}
# Example 3: Partial faithfulness
{
"inputs": {
"questions": "What is the capital of Italy?",
"contexts": ["Rome is the capital of Italy."],
"predicted_answers": "Rome is the capital of Italy with more than 4 million inhabitants.",
},
"outputs": {
"statements": ["Rome is the capital of Italy.", "Rome has more than 4 million inhabitants."],
"statement_scores": [1, 0],
},
}
Usage Example
from haystack.components.evaluators import FaithfulnessEvaluator
evaluator = FaithfulnessEvaluator()
questions = ["Who created the Python language?"]
contexts = [
[(
"Python, created by Guido van Rossum in the late 1980s, is a high-level "
"general-purpose programming language. Its design philosophy emphasizes code "
"readability, and its language constructs aim to help programmers write clear, "
"logical code for both small and large-scale software projects."
)],
]
predicted_answers = [
"Python is a high-level general-purpose programming language that was created by George Lucas."
]
result = evaluator.run(
questions=questions,
contexts=contexts,
predicted_answers=predicted_answers,
)
print(result["individual_scores"])
# [0.5]
print(result["score"])
# 0.5
print(result["results"])
# [{'statements': ['Python is a high-level general-purpose programming language.',
# 'Python was created by George Lucas.'], 'statement_scores': [1, 0], 'score': 0.5}]
Using a Custom ChatGenerator
from haystack.components.evaluators import FaithfulnessEvaluator
from haystack.components.generators.chat import OpenAIChatGenerator
custom_llm = OpenAIChatGenerator(
model="gpt-4o",
generation_kwargs={"response_format": {"type": "json_object"}},
)
evaluator = FaithfulnessEvaluator(chat_generator=custom_llm)
Important Notes
- JSON mode required: The chat generator must be configured to return JSON. For OpenAI, pass {"response_format": {"type": "json_object"}} in generation_kwargs.
- API key required: By default uses OpenAI; set the OPENAI_API_KEY environment variable.
- Non-deterministic: Results may vary between runs due to LLM stochasticity.
- NaN handling: Failed API calls produce NaN scores for affected answers.
- Progress bar: Enabled by default; can be disabled for batch or headless execution.
Dependencies
- haystack core library
- numpy -- for mean computation
- External LLM API (OpenAI by default)