Principle: deepset.ai Haystack Faithfulness Evaluation
Overview
Faithfulness evaluation assesses whether generated answers are grounded in the provided context, thereby detecting hallucinations. It measures the proportion of claims in a generated answer that can be supported by the given context documents.
Domains
- Evaluation
- NLP
Theoretical Foundation
In Retrieval-Augmented Generation (RAG) systems, a language model generates answers based on retrieved context. However, LLMs may produce statements that are not supported by the context -- a phenomenon known as hallucination. Faithfulness evaluation quantifies this problem.
LLM-as-Judge Approach
Faithfulness evaluation uses an LLM as a judge to:
- Extract statements: Decompose the generated answer into individual atomic statements.
- Verify each statement: Determine whether each statement can be inferred from the provided context.
- Score: Assign a binary score (1 = faithful, 0 = not faithful) to each statement.
Faithfulness = (number of faithful statements) / (total number of statements)
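The formula above can be sketched directly in Python. Note that the handling of an answer with no extractable statements is an assumption here, not something the source prescribes:

```python
def faithfulness(statement_scores: list[int]) -> float:
    """Fraction of an answer's statements judged faithful.

    statement_scores holds one binary verdict per statement
    (1 = supported by the context, 0 = not supported).
    """
    if not statement_scores:
        return 0.0  # empty-answer convention; an assumption, adjust as needed
    return sum(statement_scores) / len(statement_scores)
```

For example, an answer decomposed into three statements of which two are supported scores 2/3.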
Statement Extraction
The LLM first breaks a generated answer into discrete, verifiable claims. For example:
- Answer: "Berlin, the capital of Germany, was founded in 1244."
- Statements: ["Berlin is the capital of Germany.", "Berlin was founded in 1244."]
Each statement is then independently assessed against the context.
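A minimal extraction prompt for the first step might look like the template below. The wording is purely illustrative and is not Haystack's actual prompt:

```python
# Hypothetical prompt template for the statement-extraction step.
EXTRACTION_PROMPT = (
    "Break the following answer into individual, self-contained statements.\n"
    "Each statement must make exactly one verifiable claim.\n\n"
    "Answer: {answer}\n"
    "Statements (one per line):"
)

prompt = EXTRACTION_PROMPT.format(
    answer="Berlin, the capital of Germany, was founded in 1244."
)
```

The judge LLM receiving this prompt would be expected to return the two atomic statements shown above, each of which is then checked against the context in a separate verification step.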
Scoring Model
For each statement, the LLM determines:
- Score 1: The statement can be inferred from the provided context.
- Score 0: The statement cannot be inferred from the provided context (hallucination or unsupported claim).
The per-answer faithfulness score is the mean of all statement scores. The aggregate score is the mean across all answers.
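The two-level averaging described above (statement scores to per-answer scores, per-answer scores to an aggregate) can be sketched as:

```python
from statistics import mean

def aggregate_faithfulness(
    per_answer_statement_scores: list[list[int]],
) -> tuple[list[float], float]:
    """Return (per-answer scores, aggregate score).

    Each inner list holds the binary verdicts for one answer's statements;
    the aggregate is the mean of the per-answer means.
    """
    per_answer = [mean(scores) for scores in per_answer_statement_scores]
    return per_answer, mean(per_answer)

individual, overall = aggregate_faithfulness([[1, 0], [1, 1, 1]])
# individual == [0.5, 1.0]; overall == 0.75
```

Note that the aggregate is an unweighted mean of per-answer scores, so an answer with two statements counts as much as one with ten.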
Few-Shot Prompting
The evaluation uses few-shot examples to guide the LLM judge. Default examples cover:
- Fully faithful answers (all statements supported).
- Completely unfaithful answers (no statements supported by context).
- Partially faithful answers (some statements supported, others not).
Custom examples can be provided to adapt to specific domains or evaluation criteria.
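As an illustration of what a custom partially faithful example might look like, the sketch below uses an inputs/outputs dictionary shape; the exact key names expected by a given evaluator version are an assumption and should be verified against its documentation:

```python
# Hypothetical few-shot example for a faithfulness judge: a partially
# faithful answer where one statement is supported and one is not.
custom_example = {
    "inputs": {
        "questions": "Who created the Python language?",
        "contexts": ["Python was created by Guido van Rossum."],
        "predicted_answers": (
            "Python was created by Guido van Rossum and released in 1985."
        ),
    },
    "outputs": {
        "statements": [
            "Python was created by Guido van Rossum.",
            "Python was released in 1985.",
        ],
        "statement_scores": [1, 0],  # second claim is unsupported by the context
    },
}
```

Providing a few such domain-specific examples steers the judge toward the desired statement granularity and scoring strictness.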
When to Use Faithfulness Evaluation
- RAG pipeline quality assessment: To measure how well the LLM grounds its answers in retrieved context.
- Hallucination detection: To identify and quantify fabricated claims in generated outputs.
- Model comparison: To compare different LLMs or prompt strategies on grounding quality.
Limitations
- LLM judge reliability: The evaluation is only as good as the LLM judge. Different LLMs may produce different faithfulness assessments.
- Statement granularity: The decomposition into statements depends on the LLM's interpretation, which may vary.
- Context completeness: If the context is incomplete, genuinely correct statements may be scored as unfaithful.
- Cost: Requires LLM API calls for each evaluation, incurring latency and cost.
- Non-deterministic: Results may vary across runs due to LLM stochasticity.
Relationship to Implementation
In the Haystack framework, this principle is realized by the FaithfulnessEvaluator component, which:
- Extends LLMEvaluator with faithfulness-specific prompting and scoring.
- Uses OpenAI (via OpenAIChatGenerator) by default, but accepts any compatible ChatGenerator.
- Returns per-statement scores, per-answer scores, and the aggregate faithfulness score.
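A usage sketch of the component follows. It requires the haystack-ai package and an OpenAI API key, so it is not runnable as-is; the input and output field names reflect the Haystack 2.x documentation but should be checked against the installed version:

```python
from haystack.components.evaluators import FaithfulnessEvaluator

evaluator = FaithfulnessEvaluator()  # uses OpenAI by default; needs OPENAI_API_KEY

result = evaluator.run(
    questions=["Who created the Python language?"],
    contexts=[["Python was created by Guido van Rossum."]],
    predicted_answers=["Guido van Rossum created Python in 1985."],
)
# result["individual_scores"] -> per-answer faithfulness scores
# result["score"]             -> aggregate faithfulness across all answers
```

Note that contexts is a list of lists: each question carries its own list of retrieved context strings.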
External Dependencies
This is a wrapper component that relies on an external LLM API (by default, OpenAI API via the OpenAIChatGenerator). An API key and network access are required.
Related Principles
- Context Relevance Evaluation -- evaluates whether the retrieved context is relevant to the question (upstream quality).
- Semantic Answer Similarity Evaluation -- evaluates answer quality via embedding similarity rather than grounding.
References
- Es, S. et al. (2023). "RAGAS: Automated Evaluation of Retrieval Augmented Generation."
- Zheng, L. et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena."