
Principle:deepset-ai Haystack Faithfulness Evaluation

From Leeroopedia

Overview

Faithfulness evaluation assesses whether generated answers are grounded in the provided context, thereby detecting hallucinations. It measures the proportion of claims in a generated answer that are supported by the given context documents.

Domains

  • Evaluation
  • NLP

Theoretical Foundation

In Retrieval-Augmented Generation (RAG) systems, a language model generates answers based on retrieved context. However, LLMs may produce statements that are not supported by the context -- a phenomenon known as hallucination. Faithfulness evaluation quantifies this problem.

LLM-as-Judge Approach

Faithfulness evaluation uses an LLM as a judge to:

  1. Extract statements: Decompose the generated answer into individual atomic statements.
  2. Verify each statement: Determine whether each statement can be inferred from the provided context.
  3. Score: Assign a binary score (1 = faithful, 0 = not faithful) to each statement.
Faithfulness = (number of faithful statements) / (total number of statements)
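The three steps above can be sketched in a few lines. The statements and binary verdicts below are hard-coded stand-ins for what the LLM judge would return; only the step-3 arithmetic is real:

```python
# Minimal sketch of the extract-verify-score loop. The statements and the
# judge's verdicts are hard-coded stand-ins for real LLM outputs.
def faithfulness_score(statement_scores: list[int]) -> float:
    """Fraction of statements the judge marked as supported by the context."""
    return sum(statement_scores) / len(statement_scores)

# Steps 1 (extraction) and 2 (verification) would be performed by the LLM judge.
statements = [
    "Berlin is the capital of Germany.",   # supported by the context -> 1
    "Berlin was founded in 1244.",         # not supported -> 0
]
statement_scores = [1, 0]

# Step 3: aggregate the verdicts into the per-answer score.
score = faithfulness_score(statement_scores)  # 1 / 2 = 0.5
```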

Statement Extraction

The LLM first breaks a generated answer into discrete, verifiable claims. For example:

  • Answer: "Berlin, the capital of Germany, was founded in 1244."
  • Statements: ["Berlin is the capital of Germany.", "Berlin was founded in 1244."]

Each statement is then independently assessed against the context.
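One way to picture the verification step is as a prompt sent to the judge once per statement. The prompt text below is purely illustrative; it is not Haystack's actual template:

```python
# Illustrative only: a hand-written verification prompt, not the template
# that Haystack's FaithfulnessEvaluator actually uses.
def build_verification_prompt(context: str, statement: str) -> str:
    return (
        "Decide whether the statement can be inferred from the context.\n"
        "Answer 1 if it can and 0 if it cannot.\n\n"
        f"Context: {context}\n"
        f"Statement: {statement}\n"
        "Score:"
    )

prompt = build_verification_prompt(
    "Berlin is the capital and largest city of Germany.",
    "Berlin was founded in 1244.",
)
```

The judge's single-token reply (1 or 0) becomes that statement's score.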

Scoring Model

For each statement, the LLM determines:

  • Score 1: The statement can be inferred from the provided context.
  • Score 0: The statement cannot be inferred from the provided context (hallucination or unsupported claim).

The per-answer faithfulness score is the mean of all statement scores. The aggregate score is the mean across all answers.
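This two-level averaging can be written out directly; the statement scores below are hypothetical judge outputs for two answers:

```python
# Two-level averaging: mean of statement scores per answer, then mean of
# per-answer scores for the aggregate. Scores below are hypothetical.
def mean(xs: list[float]) -> float:
    return sum(xs) / len(xs)

answers_statement_scores = [
    [1, 0],     # answer 1: one supported claim, one unsupported
    [1, 1, 1],  # answer 2: fully faithful
]
per_answer = [mean(s) for s in answers_statement_scores]  # [0.5, 1.0]
aggregate = mean(per_answer)                              # 0.75
```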

Few-Shot Prompting

The evaluation uses few-shot examples to guide the LLM judge. Default examples cover:

  • Fully faithful answers (all statements supported).
  • Completely unfaithful answers (no statements supported by context).
  • Partially faithful answers (some statements supported, others not).

Custom examples can be provided to adapt to specific domains or evaluation criteria.
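A custom example might look like the sketch below. The `inputs`/`outputs` layout and the `statements`/`statement_scores` keys follow recent Haystack 2.x releases, but the exact schema is an assumption and may differ in your version:

```python
# Hedged sketch of a custom few-shot example in the shape Haystack's
# FaithfulnessEvaluator expects. Key names are assumed from Haystack 2.x
# and may differ across versions; check your installed release.
custom_examples = [
    {
        "inputs": {
            "questions": "Who created Python?",
            "contexts": ["Python was created by Guido van Rossum."],
            "predicted_answers": "Python was created by Guido van Rossum in 1989.",
        },
        "outputs": {
            "statements": [
                "Python was created by Guido van Rossum.",
                "Python was created in 1989.",
            ],
            "statement_scores": [1, 0],  # second claim is not in the context
        },
    },
]
```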

When to Use Faithfulness Evaluation

  • RAG pipeline quality assessment: To measure how well the LLM grounds its answers in retrieved context.
  • Hallucination detection: To identify and quantify fabricated claims in generated outputs.
  • Model comparison: To compare different LLMs or prompt strategies on grounding quality.

Limitations

  • LLM judge reliability: The evaluation is only as good as the LLM judge. Different LLMs may produce different faithfulness assessments.
  • Statement granularity: The decomposition into statements depends on the LLM's interpretation, which may vary.
  • Context completeness: If the context is incomplete, genuinely correct statements may be scored as unfaithful.
  • Cost: Requires LLM API calls for each evaluation, incurring latency and cost.
  • Non-deterministic: Results may vary across runs due to LLM stochasticity.

Relationship to Implementation

In the Haystack framework, this principle is realized by the FaithfulnessEvaluator component, which:

  • Extends LLMEvaluator with faithfulness-specific prompting and scoring.
  • Uses OpenAI (via OpenAIChatGenerator) by default, but accepts any compatible ChatGenerator.
  • Returns per-statement scores, per-answer scores, and the aggregate faithfulness score.
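A minimal usage sketch, assuming `haystack-ai` is installed and `OPENAI_API_KEY` is set; both are guarded here so the snippet degrades gracefully when they are absent:

```python
# Hedged sketch of invoking Haystack's FaithfulnessEvaluator. The import and
# the API call are guarded: without haystack-ai or an API key, result stays None.
import os

questions = ["What is the capital of Germany?"]
contexts = [["Berlin is the capital and largest city of Germany."]]
predicted_answers = ["Berlin, the capital of Germany, was founded in 1244."]

result = None
try:
    from haystack.components.evaluators import FaithfulnessEvaluator

    if os.environ.get("OPENAI_API_KEY"):
        evaluator = FaithfulnessEvaluator()  # defaults to an OpenAI judge
        result = evaluator.run(
            questions=questions,
            contexts=contexts,
            predicted_answers=predicted_answers,
        )
        print(result["score"])  # aggregate faithfulness across all answers
except ImportError:
    pass  # haystack-ai not installed; the call above is illustrative only
```

Note that `contexts` is a list of lists: each question carries its own list of retrieved context strings.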

External Dependencies

FaithfulnessEvaluator is a wrapper component that relies on an external LLM API (by default, the OpenAI API via OpenAIChatGenerator). An API key and network access are required.

Related Principles

  • Context Relevance Evaluation -- evaluates whether the retrieved context is relevant to the question (upstream quality).
  • Semantic Answer Similarity Evaluation -- evaluates answer quality via embedding similarity rather than grounding.

