Implementation:Vibrantlabsai Ragas FactualCorrectness
| Knowledge Sources | |
|---|---|
| Domains | LLM Evaluation, RAG Metrics, Factual Verification |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
FactualCorrectness is a metric that evaluates the factual accuracy of LLM-generated responses by decomposing them into atomic claims and verifying each claim against a reference text using Natural Language Inference (NLI).
Description
FactualCorrectness is an LLM-powered metric that measures how factually accurate a model's response is compared to a reference (ground truth) text. It combines two key NLP techniques: claim decomposition (breaking text into atomic, independently verifiable statements) and NLI-based verification (determining whether a reference text entails or contradicts each claim).
The evaluation pipeline works in two stages:
Stage 1 -- Claim Decomposition: The response text is broken down into a list of atomic claims using the ClaimDecompositionPrompt. The granularity is controlled by two parameters: atomicity (low or high -- how finely to split claims) and coverage (low or high -- how many aspects to capture). The module pre-defines four combinations through the DecompositionType enum, each with its own set of few-shot examples demonstrating the expected decomposition behavior.
Stage 2 -- NLI Verification: Each decomposed claim is checked against a premise text using the NLIStatementPrompt (imported from the faithfulness module). When the response's claims are being scored, the premise is the reference. Verification returns a boolean array with one entry per claim, marking which claims the premise supports.
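The two stages can be illustrated with a minimal sketch that needs no LLM. Both functions here are toy stand-ins and are assumptions made for illustration: the real metric prompts an LLM for decomposition and for entailment judgments, whereas this sketch splits on sentence boundaries and uses word overlap.

```python
from typing import List


def decompose_claims(text: str) -> List[str]:
    """Toy stand-in for Stage 1: treat each sentence as one claim."""
    return [part.strip() + "." for part in text.split(".") if part.strip()]


def verify_claims(premise: str, claims: List[str]) -> List[bool]:
    """Toy stand-in for Stage 2: a claim counts as supported when every word
    in it also appears in the premise. A real NLI step judges entailment."""
    premise_words = set(premise.lower().replace(".", " ").split())
    return [
        set(claim.lower().replace(".", " ").split()) <= premise_words
        for claim in claims
    ]


response = "Einstein was a physicist. Einstein was born in France."
reference = "Einstein was a German physicist. Einstein was born in Ulm."

claims = decompose_claims(response)         # Stage 1: two claims
verdicts = verify_claims(reference, claims)  # Stage 2: one verdict per claim

for claim, supported in zip(claims, verdicts):
    print(f"{supported!s:5}  {claim}")
```

The first claim is marked supported and the second is not, because "France" never appears in the reference.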
The final score is computed using one of three modes:
- precision: the fraction of the response's claims that the reference supports, TP / (TP + FP).
- recall: the fraction of the reference's claims that the response covers, TP / (TP + FN).
- f1: the F-beta score combining precision and recall, with a configurable beta parameter (the default of 1.0 gives standard F1; beta > 1 weights recall more heavily, beta < 1 weights precision more heavily).
For recall and f1 modes, the metric runs decomposition and verification in both directions concurrently using asyncio.gather: the response's claims are checked against the reference, and the reference's claims are checked against the response. Precision mode needs only the first direction.
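The direction logic can be sketched with asyncio.gather as follows. This is an illustration under stated assumptions, not the module's code: decompose_and_verify is a hypothetical stand-in that replaces the LLM calls with a verbatim sentence match, and count_outcomes is a helper name invented for this sketch.

```python
import asyncio
from typing import List, Tuple


async def decompose_and_verify(premise: str, text: str) -> List[bool]:
    """Hypothetical stand-in: split `text` into sentence claims and mark each
    one supported when it appears verbatim in `premise`."""
    await asyncio.sleep(0)  # placeholder for the LLM round trips
    claims = [c.strip() for c in text.split(".") if c.strip()]
    return [claim in premise for claim in claims]


async def count_outcomes(
    reference: str, response: str, mode: str
) -> Tuple[int, int, int]:
    """Return (tp, fp, fn) for the given mode."""
    if mode == "precision":
        # One direction only: response claims checked against the reference.
        supported = await decompose_and_verify(reference, response)
        covered: List[bool] = []
    else:
        # recall / f1: both directions run concurrently.
        supported, covered = await asyncio.gather(
            decompose_and_verify(reference, response),
            decompose_and_verify(response, reference),
        )
    tp = sum(supported)               # response claims the reference supports
    fp = len(supported) - tp          # response claims it does not support
    fn = len(covered) - sum(covered)  # reference claims the response misses
    return tp, fp, fn


reference = "Paris is in France. Paris is the capital. Paris hosts the Louvre."
response = "Paris is in France. Paris is in Spain."

counts_f1 = asyncio.run(count_outcomes(reference, response, "f1"))
counts_precision = asyncio.run(count_outcomes(reference, response, "precision"))
print(counts_f1, counts_precision)
```

In precision mode the reverse pass is skipped, so the false-negative count stays at zero and only half as many verification calls are made.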
Usage
Import FactualCorrectness when you need to evaluate whether an LLM's response contains factually accurate information relative to a known reference answer. This metric is appropriate for question-answering systems, summarization tasks, and any RAG pipeline where faithfulness to source material matters. The metric requires an LLM to be set and expects response and reference fields in the evaluation sample.
Code Reference
Source Location
- Repository: Vibrantlabsai_Ragas
- File: src/ragas/metrics/_factual_correctness.py
Signature
@dataclass
class FactualCorrectness(MetricWithLLM, SingleTurnMetric):
    name: str = "factual_correctness"
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {MetricType.SINGLE_TURN: {"response", "reference"}}
    )
    output_type: t.Optional[MetricOutputType] = MetricOutputType.CONTINUOUS
    mode: t.Literal["precision", "recall", "f1"] = "f1"
    beta: float = 1.0
    atomicity: t.Literal["low", "high"] = "low"
    coverage: t.Literal["low", "high"] = "low"
    claim_decomposition_prompt: PydanticPrompt = field(
        default_factory=ClaimDecompositionPrompt
    )
    nli_prompt: PydanticPrompt = field(default_factory=NLIStatementPrompt)
    language: str = "english"

    async def decompose_claims(
        self, response: str, callbacks: Callbacks
    ) -> t.List[str]: ...

    async def verify_claims(
        self, premise: str, hypothesis_list: t.List[str], callbacks: Callbacks
    ) -> np.ndarray: ...

    async def _single_turn_ascore(
        self, sample: SingleTurnSample, callbacks: Callbacks
    ) -> float: ...

    async def decompose_and_verify_claims(
        self, reference: str, response: str, callbacks: Callbacks
    ) -> np.ndarray: ...
Import
from ragas.metrics import FactualCorrectness
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| response | str | Yes | The LLM-generated response text to evaluate |
| reference | str | Yes | The ground truth reference text to verify against |
Configuration Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| mode | Literal["precision", "recall", "f1"] | "f1" | Scoring mode; selects the score returned and whether verification runs in one direction (precision) or both (recall, f1) |
| beta | float | 1.0 | Beta parameter for the F-beta score, used in f1 mode; >1 favors recall, <1 favors precision |
| atomicity | Literal["low", "high"] | "low" | Granularity of claim decomposition; "high" produces more atomic claims |
| coverage | Literal["low", "high"] | "low" | Breadth of claim decomposition; "high" captures more aspects of the text |
| claim_decomposition_prompt | PydanticPrompt | ClaimDecompositionPrompt() | The prompt used for claim decomposition (customizable) |
| nli_prompt | PydanticPrompt | NLIStatementPrompt() | The prompt used for NLI verification (customizable) |
| language | str | "english" | Language of the evaluation content |
Outputs
| Name | Type | Description |
|---|---|---|
| score | float | A continuous score between 0.0 and 1.0 representing factual correctness (rounded to 2 decimal places) |
Usage Examples
Basic Usage (F1 Mode)
import asyncio

from ragas.metrics import FactualCorrectness
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import llm_factory

# An evaluator LLM must be attached before scoring.
llm = llm_factory("gpt-4o-mini")
metric = FactualCorrectness()
metric.llm = llm

sample = SingleTurnSample(
    response="Albert Einstein was a German theoretical physicist who developed the theory of relativity.",
    reference="Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity and received the 1921 Nobel Prize in Physics.",
)


async def main() -> None:
    score = await metric.single_turn_ascore(sample)
    print(f"Factual Correctness (F1): {score}")


# In a notebook, call `await main()` instead.
asyncio.run(main())
Precision-Only Mode
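# Reuses llm and sample from the basic example.
# Run in a notebook cell or inside an async function.
metric = FactualCorrectness(mode="precision")
metric.llm = llm
score = await metric.single_turn_ascore(sample)
print(f"Factual Correctness (Precision): {score}")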
High Atomicity and High Coverage
# Reuses llm and sample from the basic example.
# Run in a notebook cell or inside an async function.
metric = FactualCorrectness(
    atomicity="high",
    coverage="high",
    mode="f1",
    beta=1.0,
)
metric.llm = llm
score = await metric.single_turn_ascore(sample)
print(f"Factual Correctness (High Atomicity/Coverage): {score}")
Decomposition Types
The module defines four decomposition strategies through the DecompositionType enum, each with two few-shot examples:
| Decomposition Type | Atomicity | Coverage | Behavior |
|---|---|---|---|
| LOW_ATOMICITY_LOW_COVERAGE | Low | Low | Produces fewer, broader claims that may omit some details |
| LOW_ATOMICITY_HIGH_COVERAGE | Low | High | Produces fewer claims but tries to capture all information in each |
| HIGH_ATOMICITY_LOW_COVERAGE | High | Low | Produces many fine-grained claims but may skip some aspects |
| HIGH_ATOMICITY_HIGH_COVERAGE | High | High | Produces the most granular and comprehensive claim decomposition |
For example, given: "Charles Babbage was a French mathematician, philosopher, and food critic."
- LOW_ATOMICITY_LOW_COVERAGE: ["Charles Babbage was a mathematician and philosopher."]
- HIGH_ATOMICITY_HIGH_COVERAGE: ["Charles Babbage was a mathematician.", "Charles Babbage was a philosopher.", "Charles Babbage was a food critic.", "Charles Babbage was French."]
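One way the atomicity and coverage settings could map onto an enum member is sketched below. The member names come from the table above; the string values and the select_decomposition_type helper are assumptions made for this illustration and are not confirmed by the source.

```python
from enum import Enum


class DecompositionType(Enum):
    # Member names follow the table; the string values are assumed.
    LOW_ATOMICITY_LOW_COVERAGE = "low_atomicity_low_coverage"
    LOW_ATOMICITY_HIGH_COVERAGE = "low_atomicity_high_coverage"
    HIGH_ATOMICITY_LOW_COVERAGE = "high_atomicity_low_coverage"
    HIGH_ATOMICITY_HIGH_COVERAGE = "high_atomicity_high_coverage"


def select_decomposition_type(atomicity: str, coverage: str) -> DecompositionType:
    """Hypothetical helper: pick the enum member for a pair of settings."""
    allowed = ("low", "high")
    if atomicity not in allowed or coverage not in allowed:
        raise ValueError("atomicity and coverage must each be 'low' or 'high'")
    # Lookup by value; relies on the assumed value naming scheme above.
    return DecompositionType(f"{atomicity}_atomicity_{coverage}_coverage")


chosen = select_decomposition_type("high", "low")
print(chosen.name)
```

A lookup like this keeps the two user-facing settings independent while still selecting a single set of few-shot examples.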
Scoring Algorithm
The scoring follows standard information retrieval metrics:
# TP = claims in response verified by reference
# FP = claims in response NOT verified by reference
# FN = claims in reference NOT covered by response (only for recall/f1)
if mode == "precision":
    score = tp / (tp + fp + 1e-8)
elif mode == "recall":
    score = tp / (tp + fn + 1e-8)
else:  # f1
    score = fbeta_score(tp, fp, fn, beta)
The 1e-8 term in the denominators avoids division by zero when a text yields no claims. The fbeta_score utility from ragas.metrics.utils computes the weighted harmonic mean of precision and recall.
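As a worked illustration of how beta shifts the balance, the sketch below implements the textbook F-beta formula, (1 + beta^2) * P * R / (beta^2 * P + R). The function fbeta_from_counts is a hypothetical helper written for this example; it is not the ragas utility, and its zero-handling is an assumption.

```python
def fbeta_from_counts(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (illustrative helper)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0  # assumed convention when nothing is supported
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Two of four response claims are supported; no reference claim is missed.
tp, fp, fn = 2, 2, 0  # precision = 0.5, recall = 1.0

f_half = fbeta_from_counts(tp, fp, fn, beta=0.5)  # leans toward precision
f_one = fbeta_from_counts(tp, fp, fn)             # standard F1
f_two = fbeta_from_counts(tp, fp, fn, beta=2.0)   # leans toward recall

print(round(f_half, 3), round(f_one, 3), round(f_two, 3))
```

Because recall is the stronger of the two components here, the score rises as beta grows, which matches the rule that beta > 1 favors recall.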
Related Pages
- PydanticPrompt - Base prompt class for ClaimDecompositionPrompt and NLIStatementPrompt
- NLIStatementPrompt - The Natural Language Inference prompt imported from the faithfulness module
- MetricWithLLM - Mixin providing LLM integration for metrics
- SingleTurnMetric - Base class for single-turn evaluation metrics
- fbeta_score - Utility function for computing F-beta scores
- Vibrantlabsai_Ragas_ContextPrecision - Another Ragas evaluation metric for retrieval quality