Implementation:Vibrantlabsai Ragas Faithfulness
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Metrics |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
Faithfulness measures the factual consistency of an LLM-generated answer against the retrieved contexts by decomposing the answer into individual statements and verifying each one through Natural Language Inference (NLI).
Description
This metric evaluates whether an LLM's response is grounded in and supported by the retrieved contexts. It uses a two-stage pipeline:
Stage 1 - Statement Generation: The answer is decomposed into atomic, self-contained statements using the StatementGeneratorPrompt. This prompt instructs the LLM to break down complex sentences into simple, fully understandable statements without pronouns. For example, "He was a physicist who developed relativity" would be split into two independent statements, each naming the person explicitly in place of "He".
Stage 2 - NLI Verification: Each generated statement is evaluated against the concatenated retrieved contexts using the NLIStatementPrompt. For each statement, the LLM returns a binary verdict (1 = can be inferred from context, 0 = cannot be inferred) along with a reason for the classification.
The final faithfulness score is the ratio of faithful statements (verdict = 1) to the total number of statements. A score of 1.0 means every statement in the answer is supported by the contexts. A score of 0.0 means none of the statements could be verified. If no statements are generated, the score is NaN.
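The ratio described above can be sketched in a few lines. This is a minimal illustration, not the library's code: the function name faithfulness_score is hypothetical, and it assumes the verdicts have already been collected as a list of 0/1 integers.

```python
import math
from typing import Sequence


def faithfulness_score(verdicts: Sequence[int]) -> float:
    """Fraction of statements judged as supported (verdict == 1).

    Hypothetical helper: assumes one 0/1 verdict per generated statement.
    """
    if not verdicts:
        # No statements were generated, so the ratio is undefined.
        return math.nan
    supported = sum(1 for v in verdicts if v == 1)
    return supported / len(verdicts)


# Three statements, two of them supported by the contexts: two thirds.
example = faithfulness_score([1, 1, 0])
```

Returning NaN rather than 0.0 for the empty case keeps "nothing to verify" distinguishable from "nothing was supported".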
The module also includes FaithfulnesswithHHEM, a variant that replaces the LLM-based NLI step with Vectara's hallucination_evaluation_model (HHEM), a dedicated sequence-classification model distributed through Hugging Face. This variant processes statement-context pairs in configurable batches to avoid out-of-memory issues.
Usage
Use this metric to detect hallucinations in RAG pipelines. It is essential for applications where factual accuracy is critical, such as medical QA, legal document analysis, or enterprise knowledge systems. Use the standard Faithfulness metric for fully LLM-based evaluation, or FaithfulnesswithHHEM to run the NLI step on a dedicated classifier instead of an external LLM. An LLM is still needed for statement generation in both cases.
Code Reference
Source Location
- Repository: Vibrantlabsai_Ragas
- File: src/ragas/metrics/_faithfulness.py
Signature
@dataclass
class Faithfulness(MetricWithLLM, SingleTurnMetric):
    name: str = "faithfulness"
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {
            MetricType.SINGLE_TURN: {
                "user_input",
                "response",
                "retrieved_contexts",
            }
        }
    )
    output_type: t.Optional[MetricOutputType] = MetricOutputType.CONTINUOUS
    nli_statements_prompt: PydanticPrompt = field(default_factory=NLIStatementPrompt)
    statement_generator_prompt: PydanticPrompt = field(
        default_factory=StatementGeneratorPrompt
    )
    max_retries: int = 1

@dataclass
class FaithfulnesswithHHEM(Faithfulness):
    name: str = "faithfulness_with_hhem"
    device: str = "cpu"
    batch_size: int = 10
Import
from ragas.metrics import Faithfulness
from ragas.metrics import FaithfulnesswithHHEM
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| user_input | str | Yes | The user's question or query |
| response | str | Yes | The LLM-generated answer to evaluate for faithfulness |
| retrieved_contexts | List[str] | Yes | The list of retrieved context strings used as the grounding source |
Configuration (FaithfulnesswithHHEM)
| Name | Type | Default | Description |
|---|---|---|---|
| device | str | "cpu" | The device to run the HHEM model on (e.g., "cpu", "cuda") |
| batch_size | int | 10 | Number of statement-context pairs to process per batch |
Outputs
| Name | Type | Description |
|---|---|---|
| score | float | A continuous score between 0 and 1 representing the fraction of answer statements that are faithful to the retrieved contexts. Returns NaN if no statements are generated. |
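Because a single sample can yield NaN, averaging scores over a dataset needs an explicit policy for those samples. The sketch below drops them before averaging; this is one possible reporting choice, not something the metric prescribes, and mean_ignoring_nan is a hypothetical helper name.

```python
import math
from typing import Iterable


def mean_ignoring_nan(scores: Iterable[float]) -> float:
    """Average per-sample scores, skipping samples whose score is NaN.

    Hypothetical helper: dropping NaN is an assumption about how to report
    results, not behavior taken from the library.
    """
    valid = [s for s in scores if not math.isnan(s)]
    if not valid:
        # Every sample was undefined, so the aggregate is undefined too.
        return math.nan
    return sum(valid) / len(valid)


# Two scored samples and one sample with no generated statements.
dataset_score = mean_ignoring_nan([1.0, 0.5, math.nan])
```

When reporting such an aggregate, it may be worth stating how many samples were skipped, since a high average over few valid samples can mislead.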
Key Components
Statement Generation
| Class | Description |
|---|---|
| StatementGeneratorInput | Pydantic model with question and answer fields |
| StatementGeneratorOutput | Pydantic model containing a list of generated statement strings |
| StatementGeneratorPrompt | PydanticPrompt that decomposes an answer into atomic statements; includes one few-shot example about Albert Einstein |
NLI Verification
| Class | Description |
|---|---|
| NLIStatementInput | Pydantic model with a context string and list of statements to verify |
| StatementFaithfulnessAnswer | Pydantic model for a single verdict with statement, reason, and binary verdict (0 or 1) |
| NLIStatementOutput | Pydantic model wrapping a list of StatementFaithfulnessAnswer items |
| NLIStatementPrompt | PydanticPrompt that judges faithfulness of statements against a context; includes two few-shot examples (student scenario and photosynthesis/Einstein mismatch) |
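The shape of the verification output can be illustrated with plain dataclasses. These are stand-ins, not the Pydantic models from the module: the class names StatementVerdict and VerdictList are hypothetical, and the field names are assumed from the descriptions in the table above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StatementVerdict:
    """Stand-in for a single verdict: statement, reason, binary verdict."""

    statement: str
    reason: str
    verdict: int  # 1 = can be inferred from the context, 0 = cannot

    def __post_init__(self) -> None:
        # Reject anything that is not a binary verdict.
        if self.verdict not in (0, 1):
            raise ValueError(f"verdict must be 0 or 1, got {self.verdict!r}")


@dataclass
class VerdictList:
    """Stand-in for the wrapper that holds one verdict per statement."""

    statements: List[StatementVerdict] = field(default_factory=list)

    def verdicts(self) -> List[int]:
        # Flatten to the 0/1 values that the final ratio is computed from.
        return [item.verdict for item in self.statements]


output = VerdictList(
    statements=[
        StatementVerdict("John is taking Algorithms.", "Listed in the context.", 1),
        StatementVerdict("John is taking Artificial Intelligence.", "Not mentioned.", 0),
    ]
)
```

Keeping the reason next to each verdict makes it possible to audit why a statement was rejected, which a bare score cannot show.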
HHEM Variant
FaithfulnesswithHHEM extends Faithfulness by overriding the _ascore method. Instead of using the LLM-based NLI prompt, it:
- Creates (premise, statement) pairs where the premise is the concatenated retrieved contexts
- Processes pairs in batches using _create_batch to avoid memory issues
- Uses the Vectara hallucination_evaluation_model to predict binary faithfulness scores
- Returns the mean of the per-statement scores collected across all batches
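The pairing and batching steps listed above can be sketched without loading the model. Everything here is illustrative: create_pairs, create_batches, and score_pairs are hypothetical names, the newline separator and the 0.5 threshold are assumptions, and toy_predict is a stub standing in for the HHEM classifier.

```python
import math
from typing import Callable, Iterator, List, Sequence, Tuple

Pair = Tuple[str, str]  # (premise, statement)


def create_pairs(contexts: Sequence[str], statements: Sequence[str]) -> List[Pair]:
    # Assumption: contexts are joined with newlines to form a single premise.
    premise = "\n".join(contexts)
    return [(premise, statement) for statement in statements]


def create_batches(pairs: Sequence[Pair], batch_size: int) -> Iterator[Sequence[Pair]]:
    # Yield consecutive slices of at most batch_size pairs.
    for start in range(0, len(pairs), batch_size):
        yield pairs[start:start + batch_size]


def score_pairs(
    pairs: Sequence[Pair],
    batch_size: int,
    predict: Callable[[Sequence[Pair]], List[float]],
) -> float:
    scores: List[float] = []
    for batch in create_batches(pairs, batch_size):
        # Assumption: probabilities are binarized at 0.5 before averaging.
        scores.extend(1.0 if p >= 0.5 else 0.0 for p in predict(batch))
    return sum(scores) / len(scores) if scores else math.nan


def toy_predict(batch: Sequence[Pair]) -> List[float]:
    # Stub classifier: "supported" if the statement appears verbatim in the premise.
    return [0.9 if stmt.lower() in premise.lower() else 0.1 for premise, stmt in batch]


pairs = create_pairs(
    ["Plants convert light energy.", "Einstein was a physicist."],
    ["plants convert light energy", "einstein was a chemist", "einstein was a physicist"],
)
score = score_pairs(pairs, batch_size=2, predict=toy_predict)
```

Only one batch is passed to the classifier at a time, so peak memory depends on batch_size rather than on the total number of statements.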
Usage Examples
Basic Usage
from ragas.metrics import Faithfulness
from ragas.dataset_schema import SingleTurnSample
metric = Faithfulness()
# metric.llm = your_llm_instance
sample = SingleTurnSample(
    user_input="What courses is John taking?",
    response="John is taking Data Structures, Algorithms, and Artificial Intelligence.",
    retrieved_contexts=[
        "John is enrolled in Data Structures, Algorithms, and Database Management this semester."
    ],
)
# score = await metric.single_turn_ascore(sample)
# "Artificial Intelligence" is not in the context, so faithfulness will be less than 1.0
Using FaithfulnesswithHHEM
from ragas.metrics import FaithfulnesswithHHEM
from ragas.dataset_schema import SingleTurnSample
# Requires: pip install transformers
metric = FaithfulnesswithHHEM(device="cpu", batch_size=10)
# metric.llm = your_llm_instance  # Still needed for statement generation
sample = SingleTurnSample(
    user_input="What is photosynthesis?",
    response="Photosynthesis converts light energy into chemical energy in plants.",
    retrieved_contexts=[
        "Photosynthesis is a process used by plants to convert light energy into chemical energy."
    ],
)
# score = await metric.single_turn_ascore(sample)
Using the Pre-instantiated Default
from ragas.metrics._faithfulness import faithfulness
# The module provides a pre-instantiated default:
# faithfulness = Faithfulness()