Implementation:Vibrantlabsai Ragas FactualCorrectness

Knowledge Sources	Vibrantlabsai_Ragas
Domains	LLM Evaluation, RAG Metrics, Factual Verification
Last Updated	2026-02-12 00:00 GMT

Overview

FactualCorrectness is a metric that evaluates the factual accuracy of LLM-generated responses by decomposing them into atomic claims and verifying each claim against a reference text using Natural Language Inference (NLI).

Description

FactualCorrectness is an LLM-powered metric that measures how factually accurate a model's response is compared to a reference (ground truth) text. It combines two key NLP techniques: claim decomposition (breaking text into atomic, independently verifiable statements) and NLI-based verification (determining whether a reference text entails or contradicts each claim).

The evaluation pipeline works in two stages:

Stage 1 -- Claim Decomposition: The response text is broken down into a list of atomic claims using the ClaimDecompositionPrompt. The granularity is controlled by two parameters: atomicity (low or high -- how finely to split claims) and coverage (low or high -- how many aspects to capture). The module pre-defines four combinations through the DecompositionType enum, each with its own set of few-shot examples demonstrating the expected decomposition behavior.

Stage 2 -- NLI Verification: Each decomposed claim is verified against the reference text using the NLIStatementPrompt (imported from the faithfulness module). The verification produces a boolean array indicating which claims are supported by the reference.
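
The two stages can be pictured with a toy, LLM-free sketch. Everything in it is a stand-in for illustration: toy_decompose splits on sentence boundaries instead of calling ClaimDecompositionPrompt, and toy_verify uses a crude word-overlap test instead of NLIStatementPrompt. Both function names are hypothetical and not part of Ragas. Only the data flow (text to claims to one boolean per claim) mirrors the description above.

```python
import re

def toy_decompose(text: str) -> list[str]:
    """Stand-in for Stage 1: treat each sentence as one claim.
    The real metric asks an LLM to produce atomic claims instead."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def toy_verify(premise: str, claims: list[str]) -> list[bool]:
    """Stand-in for Stage 2: a claim counts as supported when every
    word in it also appears in the premise. Real NLI is far richer."""
    premise_words = set(re.findall(r"\w+", premise.lower()))
    return [
        set(re.findall(r"\w+", claim.lower())) <= premise_words
        for claim in claims
    ]

reference = "Einstein was a physicist. Einstein developed relativity."
response = "Einstein was a physicist. Einstein was a chef."

claims = toy_decompose(response)          # Stage 1: response -> claims
verdicts = toy_verify(reference, claims)  # Stage 2: one verdict per claim
print(list(zip(claims, verdicts)))
```

According to the signature below, the real verify_claims returns the verdicts as an np.ndarray rather than a Python list; the list here only keeps the sketch dependency-free.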

The final score is computed using one of three modes:

precision: What fraction of the response's claims are supported by the reference (TP / (TP + FP)).
recall: What fraction of the reference's claims are captured by the response (TP / (TP + FN)).
f1: The F-beta score combining precision and recall, with a configurable beta parameter (default 1.0 for standard F1; beta > 1 weights recall more heavily).

In precision mode only one direction is needed: the response's claims are verified against the reference. In recall and f1 modes the metric also decomposes the reference and verifies its claims against the response, and it runs the two directions concurrently using asyncio.gather.
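
A minimal sketch of that control flow, assuming a hypothetical coroutine decompose_and_verify that stands in for the metric's LLM-backed method (here it only does a substring check, and verdicts_for is likewise an illustrative name). It shows the branching and the use of asyncio.gather, not the library's actual implementation.

```python
import asyncio

async def decompose_and_verify(premise: str, text: str) -> list[bool]:
    """Hypothetical stand-in: each sentence of `text` is a claim, and a
    claim is 'supported' if it appears verbatim in `premise`."""
    await asyncio.sleep(0)  # placeholder for the LLM round trips
    claims = [s.strip() for s in text.split(".") if s.strip()]
    return [claim in premise for claim in claims]

async def verdicts_for(mode: str, reference: str, response: str):
    if mode == "precision":
        # One direction only: response claims checked against the reference.
        forward = await decompose_and_verify(reference, response)
        return forward, []
    # recall / f1: both directions, awaited concurrently.
    forward, backward = await asyncio.gather(
        decompose_and_verify(reference, response),  # yields TP and FP
        decompose_and_verify(response, reference),  # yields FN
    )
    return forward, backward

reference = "Paris is in France. Paris has the Louvre"
response = "Paris is in France. Paris is in Spain"
forward, backward = asyncio.run(verdicts_for("f1", reference, response))
print(forward, backward)
```

Because the two directions are independent, awaiting them together lets the second set of LLM calls overlap with the first rather than wait for it.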

Usage

Import FactualCorrectness when you need to evaluate whether an LLM's response contains factually accurate information relative to a known reference answer. This metric is appropriate for question-answering systems, summarization tasks, and any RAG pipeline where faithfulness to source material matters. The metric requires an LLM to be set and expects response and reference fields in the evaluation sample.

Code Reference

Source Location

Repository: Vibrantlabsai_Ragas
File: src/ragas/metrics/_factual_correctness.py

Signature

@dataclass
class FactualCorrectness(MetricWithLLM, SingleTurnMetric):
    name: str = "factual_correctness"
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {MetricType.SINGLE_TURN: {"response", "reference"}}
    )
    output_type: t.Optional[MetricOutputType] = MetricOutputType.CONTINUOUS
    mode: t.Literal["precision", "recall", "f1"] = "f1"
    beta: float = 1.0
    atomicity: t.Literal["low", "high"] = "low"
    coverage: t.Literal["low", "high"] = "low"
    claim_decomposition_prompt: PydanticPrompt = field(
        default_factory=ClaimDecompositionPrompt
    )
    nli_prompt: PydanticPrompt = field(default_factory=NLIStatementPrompt)
    language: str = "english"

    async def decompose_claims(
        self, response: str, callbacks: Callbacks
    ) -> t.List[str]: ...

    async def verify_claims(
        self, premise: str, hypothesis_list: t.List[str], callbacks: Callbacks
    ) -> np.ndarray: ...

    async def _single_turn_ascore(
        self, sample: SingleTurnSample, callbacks: Callbacks
    ) -> float: ...

    async def decompose_and_verify_claims(
        self, reference: str, response: str, callbacks: Callbacks
    ) -> np.ndarray: ...

Import

from ragas.metrics import FactualCorrectness

I/O Contract

Inputs

Name	Type	Required	Description
response	str	Yes	The LLM-generated response text to evaluate
reference	str	Yes	The ground truth reference text to verify against

Configuration Parameters

Name	Type	Default	Description
mode	Literal["precision", "recall", "f1"]	"f1"	Scoring mode; "precision" verifies only the response's claims against the reference, while "recall" and "f1" also verify the reference's claims against the response
beta	float	1.0	Beta parameter for F-beta score; >1 favors recall, <1 favors precision
atomicity	Literal["low", "high"]	"low"	Granularity of claim decomposition; "high" produces more atomic claims
coverage	Literal["low", "high"]	"low"	Breadth of claim decomposition; "high" captures more aspects of the text
claim_decomposition_prompt	PydanticPrompt	ClaimDecompositionPrompt()	The prompt used for claim decomposition (customizable)
nli_prompt	PydanticPrompt	NLIStatementPrompt()	The prompt used for NLI verification (customizable)
language	str	"english"	Language of the evaluation content

Outputs

Name	Type	Description
score	float	A continuous score between 0.0 and 1.0 representing factual correctness (rounded to 2 decimal places)

Usage Examples

Basic Usage (F1 Mode)

from ragas.metrics import FactualCorrectness
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import llm_factory

llm = llm_factory("gpt-4o-mini")
metric = FactualCorrectness()
metric.llm = llm

sample = SingleTurnSample(
    response="Albert Einstein was a German theoretical physicist who developed the theory of relativity.",
    reference="Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity and received the 1921 Nobel Prize in Physics.",
)

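# Top-level await works in notebooks; in a script, call this inside an async function run with asyncio.run().
score = await metric.single_turn_ascore(sample)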
print(f"Factual Correctness (F1): {score}")

Precision-Only Mode

metric = FactualCorrectness(mode="precision")
metric.llm = llm

score = await metric.single_turn_ascore(sample)
print(f"Factual Correctness (Precision): {score}")

High Atomicity and High Coverage

metric = FactualCorrectness(
    atomicity="high",
    coverage="high",
    mode="f1",
    beta=1.0,
)
metric.llm = llm

score = await metric.single_turn_ascore(sample)
print(f"Factual Correctness (High Atomicity/Coverage): {score}")

Decomposition Types

The module defines four decomposition strategies through the DecompositionType enum, each with two few-shot examples:

Decomposition Type	Atomicity	Coverage	Behavior
LOW_ATOMICITY_LOW_COVERAGE	Low	Low	Produces fewer, broader claims that may omit some details
LOW_ATOMICITY_HIGH_COVERAGE	Low	High	Produces fewer claims but tries to capture all information in each
HIGH_ATOMICITY_LOW_COVERAGE	High	Low	Produces many fine-grained claims but may skip some aspects
HIGH_ATOMICITY_HIGH_COVERAGE	High	High	Produces the most granular and comprehensive claim decomposition

For example, given: "Charles Babbage was a French mathematician, philosopher, and food critic."

LOW_ATOMICITY_LOW_COVERAGE: ["Charles Babbage was a mathematician and philosopher."]
HIGH_ATOMICITY_HIGH_COVERAGE: ["Charles Babbage was a mathematician.", "Charles Babbage was a philosopher.", "Charles Babbage was a food critic.", "Charles Babbage was French."]
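
One way to picture how the atomicity and coverage settings select a decomposition type is the lookup below. It is a sketch under stated assumptions: the enum is redeclared locally with the member names from the table above, the lowercase string values are assumed rather than taken from the source, and select_decomposition is a hypothetical helper.

```python
from enum import Enum

class DecompositionType(Enum):
    # Member names follow the table above; the string values are an assumption.
    LOW_ATOMICITY_LOW_COVERAGE = "low_atomicity_low_coverage"
    LOW_ATOMICITY_HIGH_COVERAGE = "low_atomicity_high_coverage"
    HIGH_ATOMICITY_LOW_COVERAGE = "high_atomicity_low_coverage"
    HIGH_ATOMICITY_HIGH_COVERAGE = "high_atomicity_high_coverage"

def select_decomposition(atomicity: str, coverage: str) -> DecompositionType:
    """Hypothetical helper: build the enum value from the two settings.
    Raises ValueError for anything other than "low" or "high"."""
    return DecompositionType(f"{atomicity}_atomicity_{coverage}_coverage")

chosen = select_decomposition("high", "low")
print(chosen.name)
```

The selected type would then determine which set of few-shot examples accompanies the decomposition prompt.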

Scoring Algorithm

The scoring follows standard information retrieval metrics:

# TP = claims in response verified by reference
# FP = claims in response NOT verified by reference
# FN = claims in reference NOT covered by response (only for recall/f1)

if mode == "precision":
    score = tp / (tp + fp + 1e-8)
elif mode == "recall":
    score = tp / (tp + fn + 1e-8)
else:  # f1
    score = fbeta_score(tp, fp, fn, beta)

The fbeta_score utility from ragas.metrics.utils computes the weighted harmonic mean of precision and recall.
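
As a worked illustration, the sketch below derives TP, FP, and FN from two lists of verdicts and applies the three modes. The fbeta helper is the textbook F-beta formula, (1 + beta^2) * P * R / (beta^2 * P + R), written out here for clarity; it is an assumption that this matches ragas.metrics.utils.fbeta_score in every edge case. The function names score and fbeta are illustrative, and the final rounding reflects the two-decimal output noted in the I/O contract.

```python
def fbeta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """Textbook F-beta; returns 0.0 when precision and recall are both zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def score(response_verdicts, reference_verdicts, mode="f1", beta=1.0) -> float:
    tp = sum(response_verdicts)                               # response claims supported
    fp = len(response_verdicts) - tp                          # response claims unsupported
    fn = len(reference_verdicts) - sum(reference_verdicts)    # reference claims missed
    if mode == "precision":
        value = tp / (tp + fp + 1e-8)
    elif mode == "recall":
        value = tp / (tp + fn + 1e-8)
    else:
        value = fbeta(tp, fp, fn, beta)
    return round(value, 2)

# Illustrative verdicts: 3 response claims (2 supported), 4 reference claims (2 covered).
response_verdicts = [True, True, False]
reference_verdicts = [True, False, False, True]

print(score(response_verdicts, reference_verdicts, "precision"))  # 0.67
print(score(response_verdicts, reference_verdicts, "recall"))     # 0.5
print(score(response_verdicts, reference_verdicts, "f1"))         # 0.57
```

With these counts (TP = 2, FP = 1, FN = 2), raising beta to 2.0 pulls the F-score toward the lower recall value and yields 0.53.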

Related Pages

PydanticPrompt - Base prompt class for ClaimDecompositionPrompt and NLIStatementPrompt
NLIStatementPrompt - The Natural Language Inference prompt imported from the faithfulness module
MetricWithLLM - Mixin providing LLM integration for metrics
SingleTurnMetric - Base class for single-turn evaluation metrics
fbeta_score - Utility function for computing F-beta scores
Vibrantlabsai_Ragas_ContextPrecision - Another Ragas evaluation metric for retrieval quality
