Implementation:Vibrantlabsai Ragas FactualCorrectness

Knowledge Sources
Domains: LLM Evaluation, RAG Metrics, Factual Verification
Last Updated: 2026-02-12 00:00 GMT

Overview

FactualCorrectness is a metric that evaluates the factual accuracy of LLM-generated responses by decomposing them into atomic claims and verifying each claim against a reference text using Natural Language Inference (NLI).

Description

FactualCorrectness is an LLM-powered metric that measures how factually accurate a model's response is compared to a reference (ground truth) text. It combines two key NLP techniques: claim decomposition (breaking text into atomic, independently verifiable statements) and NLI-based verification (determining whether a reference text entails or contradicts each claim).

The evaluation pipeline works in two stages:

Stage 1 -- Claim Decomposition: The response text is broken down into a list of atomic claims using the ClaimDecompositionPrompt. The granularity is controlled by two parameters: atomicity (low or high -- how finely to split claims) and coverage (low or high -- how many aspects to capture). The module pre-defines four combinations through the DecompositionType enum, each with its own set of few-shot examples demonstrating the expected decomposition behavior.

Stage 2 -- NLI Verification: Each decomposed claim is verified against the reference text using the NLIStatementPrompt (imported from the faithfulness module). The verification produces a boolean array indicating which claims are supported by the reference.
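
The data flow of the two stages can be sketched without an LLM. In the sketch below, toy_decompose and toy_verify are hypothetical stand-ins for the LLM-backed prompts (a sentence splitter and a substring check). They are not how Ragas implements either stage; they only show how text becomes a list of claims and then one boolean verdict per claim.

```python
from typing import List


def toy_decompose(text: str) -> List[str]:
    """Hypothetical stand-in for Stage 1: split text into sentence-level claims."""
    return [part.strip() + "." for part in text.split(".") if part.strip()]


def toy_verify(premise: str, claims: List[str]) -> List[bool]:
    """Hypothetical stand-in for Stage 2: a claim counts as supported when it
    appears verbatim (case-insensitive) in the premise. A real NLI step judges
    entailment, not substring overlap."""
    premise_lower = premise.lower()
    return [claim.rstrip(".").lower() in premise_lower for claim in claims]


reference = "Einstein was a physicist. Einstein developed the theory of relativity."
response = "Einstein was a physicist. Einstein was born in France."

claims = toy_decompose(response)          # Stage 1: response -> claims
verdicts = toy_verify(reference, claims)  # Stage 2: one verdict per claim
for claim, supported in zip(claims, verdicts):
    print(f"{supported!s:5} {claim}")
```

In the real metric both stand-ins are replaced by LLM calls, so verdicts depend on the model and prompt rather than on exact string overlap.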

The final score is computed using one of three modes:

  • precision: What fraction of the response's claims are supported by the reference (TP / (TP + FP)).
  • recall: What fraction of the reference's claims are captured by the response (TP / (TP + FN)).
  • f1: The F-beta score combining precision and recall, with a configurable beta parameter (default 1.0 for standard F1; beta > 1 weights recall more heavily).

For f1 and recall modes, the metric runs decomposition and verification in both directions (response-against-reference and reference-against-response) concurrently using asyncio.gather. For precision mode, only the response's claims need to be verified against the reference, so a single direction suffices.
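
The mode-dependent direction logic described above might look roughly like the following. This is a minimal sketch under stated assumptions: toy_decompose_and_verify and verify_directions are hypothetical names, and the verification is a substring check rather than an LLM call. Only asyncio.gather and asyncio.run are real library calls.

```python
import asyncio
from typing import List, Tuple


async def toy_decompose_and_verify(premise: str, text: str) -> List[bool]:
    """Hypothetical stand-in: split `text` into sentences and check each one
    against `premise` by substring match instead of calling an LLM."""
    await asyncio.sleep(0)  # yield control, as a real LLM call would
    claims = [c.strip().lower() for c in text.split(".") if c.strip()]
    return [claim in premise.lower() for claim in claims]


async def verify_directions(
    reference: str, response: str, mode: str
) -> Tuple[List[bool], List[bool]]:
    # Response claims checked against the reference: needed by every mode.
    forward = toy_decompose_and_verify(reference, response)
    if mode == "precision":
        return await forward, []
    # Reference claims checked against the response: needed for recall and f1.
    backward = toy_decompose_and_verify(response, reference)
    response_verdicts, reference_verdicts = await asyncio.gather(forward, backward)
    return response_verdicts, reference_verdicts


reference = "Einstein was a physicist. Einstein won the Nobel Prize."
response = "Einstein was a physicist. Einstein was French."
both = asyncio.run(verify_directions(reference, response, "f1"))
only_forward = asyncio.run(verify_directions(reference, response, "precision"))
print(both)
print(only_forward)
```

Running the two directions with asyncio.gather lets the two independent sets of LLM calls overlap instead of waiting on each other.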

Usage

Import FactualCorrectness when you need to evaluate whether an LLM's response contains factually accurate information relative to a known reference answer. This metric is appropriate for question-answering systems, summarization tasks, and any RAG pipeline where faithfulness to source material matters. The metric requires an LLM to be set and expects response and reference fields in the evaluation sample.

Code Reference

Source Location

Signature

@dataclass
class FactualCorrectness(MetricWithLLM, SingleTurnMetric):
    name: str = "factual_correctness"
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {MetricType.SINGLE_TURN: {"response", "reference"}}
    )
    output_type: t.Optional[MetricOutputType] = MetricOutputType.CONTINUOUS
    mode: t.Literal["precision", "recall", "f1"] = "f1"
    beta: float = 1.0
    atomicity: t.Literal["low", "high"] = "low"
    coverage: t.Literal["low", "high"] = "low"
    claim_decomposition_prompt: PydanticPrompt = field(
        default_factory=ClaimDecompositionPrompt
    )
    nli_prompt: PydanticPrompt = field(default_factory=NLIStatementPrompt)
    language: str = "english"

    async def decompose_claims(
        self, response: str, callbacks: Callbacks
    ) -> t.List[str]: ...

    async def verify_claims(
        self, premise: str, hypothesis_list: t.List[str], callbacks: Callbacks
    ) -> np.ndarray: ...

    async def _single_turn_ascore(
        self, sample: SingleTurnSample, callbacks: Callbacks
    ) -> float: ...

    async def decompose_and_verify_claims(
        self, reference: str, response: str, callbacks: Callbacks
    ) -> np.ndarray: ...

Import

from ragas.metrics import FactualCorrectness

I/O Contract

Inputs

Name | Type | Required | Description
response | str | Yes | The LLM-generated response text to evaluate
reference | str | Yes | The ground truth reference text to verify against

Configuration Parameters

Name | Type | Default | Description
mode | Literal["precision", "recall", "f1"] | "f1" | Evaluation mode controlling which directional verification is performed
beta | float | 1.0 | Beta parameter for F-beta score; >1 favors recall, <1 favors precision
atomicity | Literal["low", "high"] | "low" | Granularity of claim decomposition; "high" produces more atomic claims
coverage | Literal["low", "high"] | "low" | Breadth of claim decomposition; "high" captures more aspects of the text
claim_decomposition_prompt | PydanticPrompt | ClaimDecompositionPrompt() | The prompt used for claim decomposition (customizable)
nli_prompt | PydanticPrompt | NLIStatementPrompt() | The prompt used for NLI verification (customizable)
language | str | "english" | Language of the evaluation content

Outputs

Name | Type | Description
score | float | A continuous score between 0.0 and 1.0 representing factual correctness (rounded to 2 decimal places)

Usage Examples

Basic Usage (F1 Mode)

from ragas.metrics import FactualCorrectness
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import llm_factory

llm = llm_factory("gpt-4o-mini")
metric = FactualCorrectness()
metric.llm = llm

sample = SingleTurnSample(
    response="Albert Einstein was a German theoretical physicist who developed the theory of relativity.",
    reference="Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity and received the 1921 Nobel Prize in Physics.",
)

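# single_turn_ascore is a coroutine: top-level await works in a notebook or
# asyncio REPL; in a script, call it inside an async function run with asyncio.run().
score = await metric.single_turn_ascore(sample)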
print(f"Factual Correctness (F1): {score}")

Precision-Only Mode

# Reuses `llm` and `sample` from the basic usage example.
metric = FactualCorrectness(mode="precision")
metric.llm = llm

score = await metric.single_turn_ascore(sample)
print(f"Factual Correctness (Precision): {score}")

High Atomicity and High Coverage

# Reuses `llm` and `sample` from the basic usage example.
metric = FactualCorrectness(
    atomicity="high",
    coverage="high",
    mode="f1",
    beta=1.0,
)
metric.llm = llm

score = await metric.single_turn_ascore(sample)
print(f"Factual Correctness (High Atomicity/Coverage): {score}")

Decomposition Types

The module defines four decomposition strategies through the DecompositionType enum, each with two few-shot examples:

Decomposition Type | Atomicity | Coverage | Behavior
LOW_ATOMICITY_LOW_COVERAGE | Low | Low | Produces fewer, broader claims that may omit some details
LOW_ATOMICITY_HIGH_COVERAGE | Low | High | Produces fewer, broader claims that together try to capture all information in the text
HIGH_ATOMICITY_LOW_COVERAGE | High | Low | Produces many fine-grained claims but may skip some aspects
HIGH_ATOMICITY_HIGH_COVERAGE | High | High | Produces the most granular and comprehensive claim decomposition

For example, given: "Charles Babbage was a French mathematician, philosopher, and food critic."

  • LOW_ATOMICITY_LOW_COVERAGE: ["Charles Babbage was a mathematician and philosopher."]
  • HIGH_ATOMICITY_HIGH_COVERAGE: ["Charles Babbage was a mathematician.", "Charles Babbage was a philosopher.", "Charles Babbage was a food critic.", "Charles Babbage was French."]
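
One way the atomicity and coverage settings could be resolved to a single enum member is sketched below. The member names come from the table above; the string values and the select_decomposition helper are assumptions made for illustration and may not match the library's internals.

```python
from enum import Enum


class DecompositionType(Enum):
    # Member names follow the table above; the string values are assumed.
    LOW_ATOMICITY_LOW_COVERAGE = "low_atomicity_low_coverage"
    LOW_ATOMICITY_HIGH_COVERAGE = "low_atomicity_high_coverage"
    HIGH_ATOMICITY_LOW_COVERAGE = "high_atomicity_low_coverage"
    HIGH_ATOMICITY_HIGH_COVERAGE = "high_atomicity_high_coverage"


def select_decomposition(atomicity: str, coverage: str) -> DecompositionType:
    """Hypothetical helper: resolve the two settings to one enum member."""
    for name, value in (("atomicity", atomicity), ("coverage", coverage)):
        if value not in ("low", "high"):
            raise ValueError(f"{name} must be 'low' or 'high', got {value!r}")
    # Look the member up by its (assumed) string value.
    return DecompositionType(f"{atomicity}_atomicity_{coverage}_coverage")


chosen = select_decomposition("high", "low")
print(chosen.name)
```

Keying the few-shot examples on one enum member keeps each of the four combinations paired with examples written for exactly that granularity.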

Scoring Algorithm

The scoring follows standard information retrieval metrics:

# TP = claims in response verified by reference
# FP = claims in response NOT verified by reference
# FN = claims in reference NOT covered by response (only for recall/f1)

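if mode == "precision":
    # The 1e-8 term avoids division by zero when tp + fp is 0
    score = tp / (tp + fp + 1e-8)
elif mode == "recall":
    # Same guard for the case where tp + fn is 0
    score = tp / (tp + fn + 1e-8)
else:  # f1
    score = fbeta_score(tp, fp, fn, beta)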

The fbeta_score utility from ragas.metrics.utils computes the weighted harmonic mean of precision and recall.
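
For reference, the weighted harmonic mean can be re-derived from raw counts as follows. fbeta_sketch is an illustrative function written for this page, not the code of ragas.metrics.utils.fbeta_score, whose handling of zero counts may differ.

```python
def fbeta_sketch(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta from raw counts; an illustrative re-derivation, not library code."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    beta_sq = beta ** 2
    denominator = beta_sq * precision + recall
    if denominator == 0.0:
        return 0.0  # no supported claims in either direction
    return (1 + beta_sq) * precision * recall / denominator


# 3 supported response claims, 1 unsupported, 2 reference claims missed:
# precision = 3/4, recall = 3/5.
tp, fp, fn = 3, 1, 2
f1 = fbeta_sketch(tp, fp, fn)            # balances precision and recall
f2 = fbeta_sketch(tp, fp, fn, beta=2.0)  # weights recall more heavily
print(f1, f2)
```

Because recall is lower than precision in this example, raising beta above 1 pulls the score down toward recall.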
