Implementation:Vibrantlabsai Ragas Faithfulness

From Leeroopedia
Knowledge Sources
Domains: Evaluation, Metrics
Last Updated: 2026-02-12 00:00 GMT

Overview

Faithfulness measures the factual consistency of an LLM-generated answer against the retrieved contexts by decomposing the answer into individual statements and verifying each one through Natural Language Inference (NLI).

Description

This metric evaluates whether an LLM's response is grounded in and supported by the retrieved contexts. It uses a two-stage pipeline:

Stage 1 - Statement Generation: The answer is decomposed into atomic, self-contained statements using the StatementGeneratorPrompt. This prompt instructs the LLM to break complex sentences into simple statements that can each be understood on their own, with every pronoun replaced by the entity it refers to. For example, "He was a physicist who developed relativity" would be split into two statements that each name the person explicitly: one saying the person was a physicist, the other saying the person developed relativity.

Stage 2 - NLI Verification: Each generated statement is evaluated against the concatenated retrieved contexts using the NLIStatementPrompt. For each statement, the LLM returns a binary verdict (1 = can be inferred from context, 0 = cannot be inferred) along with a reason for the classification.

The final faithfulness score is computed as the ratio of faithful statements (verdict = 1) to the total number of statements. A score of 1.0 means every statement in the answer is supported by the contexts. A score of 0.0 means none of the statements could be verified. If no statements are generated, the score returns NaN.
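
The scoring rule described above can be sketched in a few lines. This is a minimal illustration, not the library's code: the function name faithfulness_score and the plain list of integer verdicts are assumptions made for the example.

```python
import math
from typing import Sequence


def faithfulness_score(verdicts: Sequence[int]) -> float:
    """Return the fraction of statements with verdict 1.

    Hypothetical helper: `verdicts` holds one binary NLI verdict per
    generated statement (1 = supported by the contexts, 0 = not supported).
    """
    if not verdicts:
        # No statements were generated, so the ratio is undefined.
        return math.nan
    faithful = sum(1 for v in verdicts if v == 1)
    return faithful / len(verdicts)


# Illustrative verdicts for an answer split into three statements,
# two of which can be inferred from the contexts.
print(faithfulness_score([1, 1, 0]))
print(faithfulness_score([]))
```

Returning NaN instead of 0.0 for the empty case keeps "nothing to verify" distinguishable from "nothing was supported" when scores are aggregated later.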

The module also includes FaithfulnesswithHHEM, a variant that replaces the LLM-based NLI step with Vectara's hallucination_evaluation_model (HHEM), a dedicated sequence classification model distributed through Hugging Face. This variant processes statement-context pairs in batches of configurable size to avoid out-of-memory errors.

Usage

Use this metric to detect hallucinations in RAG pipelines. It is most valuable where factual accuracy is critical, such as medical QA, legal document analysis, or enterprise knowledge systems. Use the standard Faithfulness metric for fully LLM-based evaluation, or FaithfulnesswithHHEM for a classifier-based approach that does not need an LLM for the NLI step. Both variants still need an LLM for statement generation.

Code Reference

Source Location

Signature

@dataclass
class Faithfulness(MetricWithLLM, SingleTurnMetric):
    name: str = "faithfulness"
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {
            MetricType.SINGLE_TURN: {
                "user_input",
                "response",
                "retrieved_contexts",
            }
        }
    )
    output_type: t.Optional[MetricOutputType] = MetricOutputType.CONTINUOUS
    nli_statements_prompt: PydanticPrompt = field(default_factory=NLIStatementPrompt)
    statement_generator_prompt: PydanticPrompt = field(
        default_factory=StatementGeneratorPrompt
    )
    max_retries: int = 1

@dataclass
class FaithfulnesswithHHEM(Faithfulness):
    name: str = "faithfulness_with_hhem"
    device: str = "cpu"
    batch_size: int = 10

Import

from ragas.metrics import Faithfulness
from ragas.metrics import FaithfulnesswithHHEM

I/O Contract

Inputs

Name | Type | Required | Description
user_input | str | Yes | The user's question or query
response | str | Yes | The LLM-generated answer to evaluate for faithfulness
retrieved_contexts | List[str] | Yes | The list of retrieved context strings used as the grounding source

Configuration (FaithfulnesswithHHEM)

Name | Type | Default | Description
device | str | "cpu" | The device to run the HHEM model on (e.g., "cpu", "cuda")
batch_size | int | 10 | Number of statement-context pairs to process per batch

Outputs

Name | Type | Description
score | float | A continuous score between 0 and 1 representing the fraction of answer statements that are faithful to the retrieved contexts. Returns NaN if no statements are generated.

Key Components

Statement Generation

Class | Description
StatementGeneratorInput | Pydantic model with question and answer fields
StatementGeneratorOutput | Pydantic model containing a list of generated statement strings
StatementGeneratorPrompt | PydanticPrompt that decomposes an answer into atomic statements; includes one few-shot example about Albert Einstein

NLI Verification

Class | Description
NLIStatementInput | Pydantic model with a context string and list of statements to verify
StatementFaithfulnessAnswer | Pydantic model for a single verdict with statement, reason, and binary verdict (0 or 1)
NLIStatementOutput | Pydantic model wrapping a list of StatementFaithfulnessAnswer items
NLIStatementPrompt | PydanticPrompt that judges faithfulness of statements against a context; includes two few-shot examples (student scenario and photosynthesis/Einstein mismatch)
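
To make the data shapes concrete, the sketch below mirrors the NLI input and per-statement verdict using standard-library dataclasses instead of the actual Pydantic models. The class names NLIInput and StatementVerdict, the helper build_nli_input, and the newline separator used to join the contexts are all assumptions made for illustration; only the field names (context, statements, statement, reason, verdict) come from the descriptions above.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class NLIInput:
    """Hypothetical stand-in for NLIStatementInput."""
    context: str            # all retrieved contexts joined into one string
    statements: List[str]   # atomic statements to verify


@dataclass
class StatementVerdict:
    """Hypothetical stand-in for StatementFaithfulnessAnswer."""
    statement: str
    reason: str
    verdict: int  # 1 = can be inferred from the context, 0 = cannot


def build_nli_input(retrieved_contexts: Sequence[str],
                    statements: Sequence[str]) -> NLIInput:
    # Assumption: contexts are concatenated with a newline separator.
    return NLIInput(context="\n".join(retrieved_contexts),
                    statements=list(statements))


nli_input = build_nli_input(
    ["John is enrolled in Data Structures.", "John studies at a university."],
    ["John is taking Data Structures."],
)
example_verdict = StatementVerdict(
    statement=nli_input.statements[0],
    reason="The context states that John is enrolled in Data Structures.",
    verdict=1,
)
```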

HHEM Variant

FaithfulnesswithHHEM extends Faithfulness by overriding the _ascore method. Instead of using the LLM-based NLI prompt, it:

  1. Creates (premise, statement) pairs where the premise is the concatenated retrieved contexts
  2. Processes pairs in batches using _create_batch to avoid memory issues
  3. Uses the Vectara hallucination_evaluation_model to predict binary faithfulness scores
  4. Returns the mean of the per-pair scores collected across all batches
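
The four steps above can be sketched without loading the real model. In this illustration the classifier is replaced by a toy word-overlap function, so everything here (create_pairs, create_batches, hhem_style_score, toy_predict, the newline join, and the 0.5 threshold) is a hypothetical stand-in, not the library's or Vectara's API.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Pair = Tuple[str, str]  # (premise, statement)


def create_pairs(contexts: Sequence[str], statements: Sequence[str]) -> List[Pair]:
    # Step 1: the premise is the concatenated retrieved contexts
    # (joining with a newline is an assumption).
    premise = "\n".join(contexts)
    return [(premise, statement) for statement in statements]


def create_batches(pairs: Sequence[Pair], batch_size: int) -> Iterator[List[Pair]]:
    # Step 2: yield fixed-size slices so only one batch is scored at a time.
    for start in range(0, len(pairs), batch_size):
        yield list(pairs[start:start + batch_size])


def hhem_style_score(contexts: Sequence[str],
                     statements: Sequence[str],
                     predict: Callable[[List[Pair]], List[float]],
                     batch_size: int = 10) -> float:
    pairs = create_pairs(contexts, statements)
    if not pairs:
        return float("nan")
    scores: List[int] = []
    for batch in create_batches(pairs, batch_size):
        # Step 3: turn each predicted probability into a binary score
        # (thresholding at 0.5 is an assumption).
        scores.extend(1 if p >= 0.5 else 0 for p in predict(batch))
    # Step 4: average over every pair, regardless of batch boundaries.
    return sum(scores) / len(scores)


def toy_predict(batch: List[Pair]) -> List[float]:
    # Toy stand-in for the classifier: 1.0 when every word of the
    # statement also appears in the premise, otherwise 0.0.
    results = []
    for premise, statement in batch:
        premise_words = set(premise.lower().split())
        supported = all(w in premise_words for w in statement.lower().split())
        results.append(1.0 if supported else 0.0)
    return results


score = hhem_style_score(
    ["plants convert light energy"],
    ["plants convert light", "plants eat rocks"],
    predict=toy_predict,
    batch_size=1,
)
```

Averaging over pairs rather than over per-batch means keeps the result independent of batch_size, which matters when the last batch is smaller than the others.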

Usage Examples

Basic Usage

from ragas.metrics import Faithfulness
from ragas.dataset_schema import SingleTurnSample

metric = Faithfulness()
# metric.llm = your_llm_instance

sample = SingleTurnSample(
    user_input="What courses is John taking?",
    response="John is taking Data Structures, Algorithms, and Artificial Intelligence.",
    retrieved_contexts=[
        "John is enrolled in Data Structures, Algorithms, and Database Management this semester."
    ]
)

# score = await metric.single_turn_ascore(sample)
# "Artificial Intelligence" is not in the context, so faithfulness will be less than 1.0

Using FaithfulnesswithHHEM

from ragas.metrics import FaithfulnesswithHHEM
from ragas.dataset_schema import SingleTurnSample

# Requires: pip install transformers
metric = FaithfulnesswithHHEM(device="cpu", batch_size=10)
# metric.llm = your_llm_instance  # Still needed for statement generation

sample = SingleTurnSample(
    user_input="What is photosynthesis?",
    response="Photosynthesis converts light energy into chemical energy in plants.",
    retrieved_contexts=[
        "Photosynthesis is a process used by plants to convert light energy into chemical energy."
    ]
)

# score = await metric.single_turn_ascore(sample)

Using the Pre-instantiated Default

from ragas.metrics._faithfulness import faithfulness

# The module provides a pre-instantiated default:
# faithfulness = Faithfulness()
