Principle: deepset.ai Haystack Faithfulness Evaluation
Overview
Faithfulness evaluation assesses whether generated answers are grounded in the provided context, thereby detecting hallucinations. It measures the proportion of claims in a generated answer that can be supported by the given context documents.
Domains
- Evaluation
- NLP
Theoretical Foundation
In Retrieval-Augmented Generation (RAG) systems, a language model generates answers based on retrieved context. However, LLMs may produce statements that are not supported by the context -- a phenomenon known as hallucination. Faithfulness evaluation quantifies this problem.
LLM-as-Judge Approach
Faithfulness evaluation uses an LLM as a judge to:
- Extract statements: Decompose the generated answer into individual atomic statements.
- Verify each statement: Determine whether each statement can be inferred from the provided context.
- Score: Assign a binary score (1 = faithful, 0 = not faithful) to each statement.
Faithfulness = (number of faithful statements) / (total number of statements)
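The formula above can be sketched directly in Python. Note that the handling of an answer with no extractable statements is an assumption here, not something the source prescribes:

```python
def faithfulness(statement_scores: list[int]) -> float:
    """Fraction of an answer's statements judged faithful.

    statement_scores holds one binary verdict per statement
    (1 = supported by the context, 0 = not supported).
    """
    if not statement_scores:
        return 0.0  # empty-answer convention; an assumption, adjust as needed
    return sum(statement_scores) / len(statement_scores)
```

For example, an answer decomposed into three statements of which two are supported scores 2/3.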
Statement Extraction
The LLM first breaks a generated answer into discrete, verifiable claims. For example:
- Answer: "Berlin, the capital of Germany, was founded in 1244."
- Statements: ["Berlin is the capital of Germany.", "Berlin was founded in 1244."]
Each statement is then independently assessed against the context.
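A minimal extraction prompt for the first step might look like the template below. The wording is purely illustrative and is not Haystack's actual prompt:

```python
# Hypothetical prompt template for the statement-extraction step.
EXTRACTION_PROMPT = (
    "Break the following answer into individual, self-contained statements.\n"
    "Each statement must make exactly one verifiable claim.\n\n"
    "Answer: {answer}\n"
    "Statements (one per line):"
)

prompt = EXTRACTION_PROMPT.format(
    answer="Berlin, the capital of Germany, was founded in 1244."
)
```

The judge LLM receiving this prompt would be expected to return the two atomic statements shown above, each of which is then checked against the context in a separate verification step.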
Scoring Model
For each statement, the LLM determines:
- Score 1: The statement can be inferred from the provided context.
- Score 0: The statement cannot be inferred from the provided context (hallucination or unsupported claim).
The per-answer faithfulness score is the mean of all statement scores. The aggregate score is the mean across all answers.
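The two-level averaging described above (statement scores to per-answer scores, per-answer scores to an aggregate) can be sketched as:

```python
from statistics import mean

def aggregate_faithfulness(
    per_answer_statement_scores: list[list[int]],
) -> tuple[list[float], float]:
    """Return (per-answer scores, aggregate score).

    Each inner list holds the binary verdicts for one answer's statements;
    the aggregate is the mean of the per-answer means.
    """
    per_answer = [mean(scores) for scores in per_answer_statement_scores]
    return per_answer, mean(per_answer)

individual, overall = aggregate_faithfulness([[1, 0], [1, 1, 1]])
# individual == [0.5, 1.0]; overall == 0.75
```

Note that the aggregate is an unweighted mean of per-answer scores, so an answer with two statements counts as much as one with ten.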
Few-Shot Prompting
The evaluation uses few-shot examples to guide the LLM judge. Default examples cover:
- Fully faithful answers (all statements supported).
- Completely unfaithful answers (no statements supported by context).
- Partially faithful answers (some statements supported, others not).
Custom examples can be provided to adapt to specific domains or evaluation criteria.
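As an illustration of what a custom partially faithful example might look like, the sketch below uses an inputs/outputs dictionary shape; the exact key names expected by a given evaluator version are an assumption and should be verified against its documentation:

```python
# Hypothetical few-shot example for a faithfulness judge: a partially
# faithful answer where one statement is supported and one is not.
custom_example = {
    "inputs": {
        "questions": "Who created the Python language?",
        "contexts": ["Python was created by Guido van Rossum."],
        "predicted_answers": (
            "Python was created by Guido van Rossum and released in 1985."
        ),
    },
    "outputs": {
        "statements": [
            "Python was created by Guido van Rossum.",
            "Python was released in 1985.",
        ],
        "statement_scores": [1, 0],  # second claim is unsupported by the context
    },
}
```

Providing a few such domain-specific examples steers the judge toward the desired statement granularity and scoring strictness.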
When to Use Faithfulness Evaluation
- RAG pipeline quality assessment: To measure how well the LLM grounds its answers in retrieved context.
- Hallucination detection: To identify and quantify fabricated claims in generated outputs.
- Model comparison: To compare different LLMs or prompt strategies on grounding quality.
Limitations
- LLM judge reliability: The evaluation is only as good as the LLM judge. Different LLMs may produce different faithfulness assessments.
- Statement granularity: The decomposition into statements depends on the LLM's interpretation, which may vary.
- Context completeness: If the context is incomplete, genuinely correct statements may be scored as unfaithful.
- Cost: Requires LLM API calls for each evaluation, incurring latency and cost.
- Non-deterministic: Results may vary across runs due to LLM stochasticity.
Relationship to Implementation
In the Haystack framework, this principle is realized by the FaithfulnessEvaluator component, which:
- Extends LLMEvaluator with faithfulness-specific prompting and scoring.
- Uses OpenAI (via OpenAIChatGenerator) by default, but accepts any compatible ChatGenerator.
- Returns per-statement scores, per-answer scores, and the aggregate faithfulness score.
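A usage sketch of the component follows. It requires the haystack-ai package and an OpenAI API key, so it is not runnable as-is; the input and output field names reflect the Haystack 2.x documentation but should be checked against the installed version:

```python
from haystack.components.evaluators import FaithfulnessEvaluator

evaluator = FaithfulnessEvaluator()  # uses OpenAI by default; needs OPENAI_API_KEY

result = evaluator.run(
    questions=["Who created the Python language?"],
    contexts=[["Python was created by Guido van Rossum."]],
    predicted_answers=["Guido van Rossum created Python in 1985."],
)
# result["individual_scores"] -> per-answer faithfulness scores
# result["score"]             -> aggregate faithfulness across all answers
```

Note that contexts is a list of lists: each question carries its own list of retrieved context strings.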
External Dependencies
This is a wrapper component that relies on an external LLM API (by default, OpenAI API via the OpenAIChatGenerator). An API key and network access are required.
Related Principles
- Context Relevance Evaluation -- evaluates whether the retrieved context is relevant to the question (upstream quality).
- Semantic Answer Similarity Evaluation -- evaluates answer quality via embedding similarity rather than grounding.
References
- Es, S. et al. (2023). "RAGAS: Automated Evaluation of Retrieval Augmented Generation."
- Zheng, L. et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena."