Implementation:Vibrantlabsai Ragas Faithfulness

Knowledge Sources	Vibrantlabsai_Ragas
Domains	Evaluation, Metrics
Last Updated	2026-02-12 00:00 GMT

Overview

Faithfulness measures the factual consistency of an LLM-generated answer against the retrieved contexts by decomposing the answer into individual statements and verifying each one through Natural Language Inference (NLI).

Description

This metric evaluates whether an LLM's response is grounded in and supported by the retrieved contexts. It uses a two-stage pipeline:

Stage 1 - Statement Generation: The answer is decomposed into atomic, self-contained statements using the StatementGeneratorPrompt. This prompt instructs the LLM to break down complex sentences into simple, fully understandable statements without pronouns. For example, "He was a physicist who developed relativity" would be split into two standalone statements, each of which names the person explicitly instead of using "He".

Stage 2 - NLI Verification: Each generated statement is evaluated against the concatenated retrieved contexts using the NLIStatementPrompt. For each statement, the LLM returns a binary verdict (1 = can be inferred from context, 0 = cannot be inferred) along with a reason for the classification.
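The two stages can be sketched as the toy pipeline below. It only shows the data flow: the answer becomes a list of statements, the contexts are joined into one premise, and each statement gets a (verdict, reason) pair. Both stage functions are hypothetical stand-ins. Sentence splitting replaces the StatementGeneratorPrompt, and a naive word-overlap check replaces the LLM judgment made through NLIStatementPrompt. Neither reflects how Ragas actually decides.

```python
import re
from typing import List, Tuple


# HYPOTHETICAL stand-in for Stage 1. The real metric asks an LLM (via
# StatementGeneratorPrompt) for atomic statements; this toy version only
# splits the answer on sentence-ending punctuation.
def split_into_statements(answer: str) -> List[str]:
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [part for part in parts if part]


# HYPOTHETICAL stand-in for Stage 2. The real metric asks an LLM (via
# NLIStatementPrompt) for a verdict and a reason; this toy version returns 1
# only if every word of the statement also occurs in the context.
def judge_statement(context: str, statement: str) -> Tuple[int, str]:
    context_words = set(re.findall(r"[a-z]+", context.lower()))
    missing = [w for w in re.findall(r"[a-z]+", statement.lower()) if w not in context_words]
    if missing:
        return 0, "not found in context: " + ", ".join(missing)
    return 1, "every word appears in the context"


def verify(answer: str, contexts: List[str]) -> List[Tuple[str, int, str]]:
    # All retrieved contexts are joined into a single premise, and each
    # statement is then judged against that premise independently.
    context = "\n".join(contexts)
    results = []
    for statement in split_into_statements(answer):
        verdict, reason = judge_statement(context, statement)
        results.append((statement, verdict, reason))
    return results


contexts = ["Paris is the capital of France.", "France is in Europe."]
for statement, verdict, reason in verify(
    "Paris is the capital of France. Paris has ten million cats.", contexts
):
    print(verdict, statement, "|", reason)
```

Here the first statement receives verdict 1 and the second receives verdict 0, because its extra words are absent from the context.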

The final faithfulness score is computed as the ratio of faithful statements (verdict = 1) to the total number of statements. A score of 1.0 means every statement in the answer is supported by the contexts. A score of 0.0 means none of the statements could be verified. If no statements are generated, the score is NaN.
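A minimal sketch of the ratio described above, including the empty case, assuming the verdicts have already been collected as a list of 0/1 integers. The function name is illustrative and not part of the Ragas API.

```python
import math
from typing import Sequence


def faithfulness_score(verdicts: Sequence[int]) -> float:
    """Fraction of statements judged faithful (verdict == 1); NaN if there are none."""
    if not verdicts:
        # No statements were generated, so the ratio is undefined.
        return math.nan
    faithful = sum(1 for v in verdicts if v == 1)
    return faithful / len(verdicts)


print(faithfulness_score([1, 1, 0]))     # 2 of 3 statements supported
print(faithfulness_score([1, 1, 1, 1]))  # fully grounded answer
print(faithfulness_score([]))            # nan
```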

The module also includes FaithfulnesswithHHEM, a variant that replaces the LLM-based NLI step with Vectara's hallucination_evaluation_model (HHEM), a dedicated sequence classification model from Hugging Face. Statement generation still uses the LLM. This variant processes statement-context pairs in configurable batches to avoid out-of-memory errors.

Usage

Use this metric to detect hallucinations in RAG pipelines, particularly in applications where factual accuracy is critical, such as medical QA, legal document analysis, or enterprise knowledge systems. Use the standard Faithfulness metric for fully LLM-based evaluation. Use FaithfulnesswithHHEM when you want the NLI step handled by a dedicated classifier instead of an LLM call.

Code Reference

Source Location

Repository: Vibrantlabsai_Ragas
File: src/ragas/metrics/_faithfulness.py

Signature

@dataclass
class Faithfulness(MetricWithLLM, SingleTurnMetric):
    name: str = "faithfulness"
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {
            MetricType.SINGLE_TURN: {
                "user_input",
                "response",
                "retrieved_contexts",
            }
        }
    )
    output_type: t.Optional[MetricOutputType] = MetricOutputType.CONTINUOUS
    nli_statements_prompt: PydanticPrompt = field(default_factory=NLIStatementPrompt)
    statement_generator_prompt: PydanticPrompt = field(
        default_factory=StatementGeneratorPrompt
    )
    max_retries: int = 1

@dataclass
class FaithfulnesswithHHEM(Faithfulness):
    name: str = "faithfulness_with_hhem"
    device: str = "cpu"
    batch_size: int = 10

Import

from ragas.metrics import Faithfulness
from ragas.metrics import FaithfulnesswithHHEM

I/O Contract

Inputs

Name	Type	Required	Description
user_input	str	Yes	The user's question or query
response	str	Yes	The LLM-generated answer to evaluate for faithfulness
retrieved_contexts	List[str]	Yes	The list of retrieved context strings used as the grounding source
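The three required inputs match the SINGLE_TURN entry of _required_columns in the signature. The sketch below is a hypothetical pre-flight check on a plain dict row. It is not how Ragas validates samples internally. It only makes the contract concrete before you spend LLM calls.

```python
from typing import Any, Dict, Set

# Mirrors the MetricType.SINGLE_TURN entry of Faithfulness._required_columns.
REQUIRED_SINGLE_TURN: Set[str] = {"user_input", "response", "retrieved_contexts"}


def missing_columns(row: Dict[str, Any]) -> Set[str]:
    """HYPOTHETICAL helper: required fields that are absent or None."""
    return {name for name in REQUIRED_SINGLE_TURN if row.get(name) is None}


row = {
    "user_input": "What courses is John taking?",
    "response": "John is taking Data Structures and Algorithms.",
    # "retrieved_contexts" deliberately omitted
}
print(sorted(missing_columns(row)))  # ['retrieved_contexts']
```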

Configuration (FaithfulnesswithHHEM)

Name	Type	Default	Description
device	str	"cpu"	The device to run the HHEM model on (e.g., "cpu", "cuda")
batch_size	int	10	Number of statement-context pairs to process per batch

Outputs

Name	Type	Description
score	float	A continuous score between 0 and 1 representing the fraction of answer statements that are faithful to the retrieved contexts. Returns NaN if no statements are generated.

Key Components

Statement Generation

Class	Description
StatementGeneratorInput	Pydantic model with question and answer fields
StatementGeneratorOutput	Pydantic model containing a list of generated statement strings
StatementGeneratorPrompt	PydanticPrompt that decomposes an answer into atomic statements; includes one few-shot example about Albert Einstein
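The data shapes in the table can be approximated with standard-library dataclasses, as a rough sketch. Ragas uses Pydantic models; the "Sketch" classes and the parse_statements helper here are hypothetical. The field name `statements` on the output model is an assumption. The example reply is invented to show what a pronoun-free decomposition looks like.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class StatementGeneratorInputSketch:
    # Mirrors the question and answer fields of StatementGeneratorInput.
    question: str
    answer: str


@dataclass
class StatementGeneratorOutputSketch:
    # ASSUMPTION: the list of statement strings lives in a field named "statements".
    statements: List[str] = field(default_factory=list)


def parse_statements(raw_json: str) -> StatementGeneratorOutputSketch:
    """HYPOTHETICAL helper: validate a JSON reply of the form {"statements": [...]}."""
    data = json.loads(raw_json)
    statements = data.get("statements") if isinstance(data, dict) else None
    if not isinstance(statements, list) or not all(isinstance(s, str) for s in statements):
        raise ValueError("expected 'statements' to be a list of strings")
    return StatementGeneratorOutputSketch(statements=statements)


request = StatementGeneratorInputSketch(
    question="Who was Albert Einstein?",
    answer="He was a physicist who developed relativity.",
)
# Invented LLM reply: pronoun resolved, one fact per statement.
reply = '{"statements": ["Albert Einstein was a physicist.", "Albert Einstein developed the theory of relativity."]}'
print(parse_statements(reply).statements)
```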

NLI Verification

Class	Description
NLIStatementInput	Pydantic model with a context string and list of statements to verify
StatementFaithfulnessAnswer	Pydantic model for a single verdict with statement, reason, and binary verdict (0 or 1)
NLIStatementOutput	Pydantic model wrapping a list of StatementFaithfulnessAnswer items
NLIStatementPrompt	PydanticPrompt that judges faithfulness of statements against a context; includes two few-shot examples (student scenario and photosynthesis/Einstein mismatch)
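The NLI output models can be sketched the same way. The statement, reason, and verdict fields come from the table above. The name of the list field on the wrapper (`statements`) is an assumption, and the classes are hypothetical dataclass stand-ins for the Pydantic models. The verdict check enforces the binary contract, and verdicts() produces the list that the final ratio is computed from.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StatementVerdictSketch:
    # Stand-in for StatementFaithfulnessAnswer.
    statement: str
    reason: str
    verdict: int

    def __post_init__(self) -> None:
        # Enforce the binary contract: 1 = inferable from context, 0 = not.
        if self.verdict not in (0, 1):
            raise ValueError(f"verdict must be 0 or 1, got {self.verdict!r}")


@dataclass
class NLIOutputSketch:
    # Stand-in for NLIStatementOutput; the field name is an ASSUMPTION.
    statements: List[StatementVerdictSketch]

    def verdicts(self) -> List[int]:
        return [item.verdict for item in self.statements]


raw = [
    {"statement": "John is taking Data Structures.", "reason": "Listed in the context.", "verdict": 1},
    {"statement": "John is taking Artificial Intelligence.", "reason": "Not mentioned in the context.", "verdict": 0},
]
output = NLIOutputSketch([StatementVerdictSketch(**item) for item in raw])
print(output.verdicts())  # [1, 0]
```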

HHEM Variant

FaithfulnesswithHHEM extends Faithfulness by overriding the _ascore method. Instead of using the LLM-based NLI prompt, it:

Creates (premise, statement) pairs where the premise is the concatenated retrieved contexts
Processes pairs in batches using _create_batch to avoid memory issues
Uses the Vectara hallucination_evaluation_model to predict a faithfulness score for each pair, rounded to a binary verdict
Returns the mean of these per-pair verdicts across all batches
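The HHEM flow above can be sketched without downloading the model. This version builds (premise, statement) pairs, slices them into batches of at most batch_size, rounds each predicted probability to 0 or 1, and averages over every pair. The classifier is a hypothetical callable passed in as `predict`. The toy substring-based predictor below only exercises the batching and aggregation. It is not HHEM.

```python
import math
from typing import Callable, Iterator, List, Sequence, Tuple

Pair = Tuple[str, str]  # (premise, statement)


def create_batches(pairs: Sequence[Pair], batch_size: int) -> Iterator[List[Pair]]:
    """Yield consecutive slices of at most batch_size pairs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(pairs), batch_size):
        yield list(pairs[start:start + batch_size])


def hhem_style_score(
    contexts: List[str],
    statements: List[str],
    predict: Callable[[List[Pair]], Sequence[float]],
    batch_size: int = 10,
) -> float:
    if not statements:
        return math.nan
    premise = "\n".join(contexts)  # concatenated retrieved contexts
    pairs = [(premise, s) for s in statements]
    scores: List[float] = []
    for batch in create_batches(pairs, batch_size):
        # Round each per-pair probability to a 0/1 verdict.
        scores.extend(float(round(p)) for p in predict(batch))
    # Average over every pair; with a short final batch this differs from
    # averaging per-batch means.
    return sum(scores) / len(scores)


# HYPOTHETICAL classifier: 0.9 if the statement text occurs in the premise.
def toy_predict(batch: List[Pair]) -> List[float]:
    return [0.9 if s.lower().rstrip(".") in p.lower() else 0.1 for p, s in batch]


print(hhem_style_score(
    ["Plants convert light energy into chemical energy."],
    ["Plants convert light energy.", "Plants eat rocks.", "Chemical energy"],
    toy_predict,
    batch_size=2,
))
```

With batch_size=2 the three pairs go through the classifier as one batch of two and one batch of one. Only the largest batch, not the total number of statements, bounds peak memory.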

Usage Examples

Basic Usage

from ragas.metrics import Faithfulness
from ragas.dataset_schema import SingleTurnSample

metric = Faithfulness()
# metric.llm = your_llm_instance

sample = SingleTurnSample(
    user_input="What courses is John taking?",
    response="John is taking Data Structures, Algorithms, and Artificial Intelligence.",
    retrieved_contexts=[
        "John is enrolled in Data Structures, Algorithms, and Database Management this semester."
    ]
)

# score = await metric.single_turn_ascore(sample)  # call from inside an async function
# "Artificial Intelligence" is not in the context, so the score should be below 1.0

Using FaithfulnesswithHHEM

from ragas.metrics import FaithfulnesswithHHEM
from ragas.dataset_schema import SingleTurnSample

# Requires: pip install transformers torch
metric = FaithfulnesswithHHEM(device="cpu", batch_size=10)
# metric.llm = your_llm_instance  # Still needed for statement generation

sample = SingleTurnSample(
    user_input="What is photosynthesis?",
    response="Photosynthesis converts light energy into chemical energy in plants.",
    retrieved_contexts=[
        "Photosynthesis is a process used by plants to convert light energy into chemical energy."
    ]
)

# score = await metric.single_turn_ascore(sample)

Using the Pre-instantiated Default

from ragas.metrics._faithfulness import faithfulness

# The module provides a pre-instantiated default:
# faithfulness = Faithfulness()

Related Pages

Environment:Vibrantlabsai_Ragas_Python_3_9_Core_Environment
