Implementation:Vibrantlabsai Ragas AnswerCorrectness

From Leeroopedia
Knowledge Sources
Domains Evaluation, Metrics
Last Updated 2026-02-12 00:00 GMT

Overview

AnswerCorrectness measures the correctness of a generated answer compared to a ground truth reference by combining factual statement overlap (via an F1-like score) with semantic similarity.

Description

The AnswerCorrectness metric scores how closely a generated response matches a known reference answer. It combines two components, factuality and semantic similarity, into a single weighted score.

The factuality component works by first decomposing both the response and the reference into simplified atomic statements using an LLM-based StatementGeneratorPrompt. These statement lists are then fed into a CorrectnessClassifier prompt that categorizes each statement as a True Positive (TP, present in both answer and ground truth), False Positive (FP, present only in the answer), or False Negative (FN, present only in the ground truth). From these counts, a configurable F-beta score is computed, where the beta parameter controls the balance between precision and recall.
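The F-beta step described above can be sketched in plain Python. This is a minimal illustration, not the library's code: the function name fbeta_from_counts and the zero-handling for empty counts are assumptions made for the example.

```python
def fbeta_from_counts(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta from statement counts (hypothetical helper, not the ragas API)."""
    # Precision: share of answer statements that are supported by the reference.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    # Recall: share of reference statements that the answer covers.
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0  # assumption: treat the undefined case as a score of zero
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Made-up counts: 3 matched statements, 1 extra in the answer, 2 missed.
for beta in (0.5, 1.0, 2.0):
    print(beta, round(fbeta_from_counts(tp=3, fp=1, fn=2, beta=beta), 3))
```

With these counts precision is 0.75 and recall is 0.6, so beta = 0.5 pulls the score toward precision (about 0.714) and beta = 2.0 pulls it toward recall (about 0.625).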

The semantic similarity component delegates to the AnswerSimilarity metric, which computes cosine similarity between embeddings of the response and the reference.
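Cosine similarity itself is a short computation. The sketch below uses made-up three-dimensional vectors in place of real embeddings; the helper name cosine_similarity and the zero-vector handling are assumptions for illustration and do not reflect how AnswerSimilarity is implemented.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (illustrative helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as having no similarity
    return dot / (norm_a * norm_b)


# Toy vectors standing in for the embeddings of a response and a reference.
response_vec = [0.2, 0.7, 0.1]
reference_vec = [0.25, 0.65, 0.05]
print(cosine_similarity(response_vec, reference_vec))
```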

The final score is a weighted average of the factuality F-beta score and the semantic similarity score. By default, the weights are [0.75, 0.25], giving 75% weight to factuality and 25% to semantic similarity. Both weights must be non-negative and at least one must be non-zero.
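The weight constraints and the weighted average can be sketched as follows. The function combine_scores and its error messages are hypothetical. Only the constraints themselves (two weights, non-negative, not all zero) come from the description above.

```python
def combine_scores(factuality: float, similarity: float,
                   weights: tuple[float, float] = (0.75, 0.25)) -> float:
    """Weighted average of two component scores (hypothetical helper)."""
    if len(weights) != 2:
        raise ValueError("expected exactly two weights")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    if all(w == 0 for w in weights):
        raise ValueError("at least one weight must be non-zero")
    # Dividing by the weight total makes this a true weighted average,
    # so (1, 1) behaves the same as (0.5, 0.5).
    total = weights[0] + weights[1]
    return (weights[0] * factuality + weights[1] * similarity) / total


print(combine_scores(0.8, 0.4))                       # default 75/25 split
print(combine_scores(0.8, 0.4, weights=(1.0, 0.0)))   # factuality only
```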

Usage

Use this metric when you need a comprehensive assessment of answer correctness that accounts for both factual accuracy (whether the right facts are stated) and semantic meaning (whether the answer conveys the right meaning). It is particularly useful for question-answering evaluation tasks where both precision and recall of factual content matter.

Code Reference

Source Location

Signature

@dataclass
class AnswerCorrectness(MetricWithLLM, MetricWithEmbeddings, SingleTurnMetric):
    name: str = "answer_correctness"
    output_type = MetricOutputType.CONTINUOUS
    correctness_prompt: PydanticPrompt = field(default_factory=CorrectnessClassifier)
    statement_generator_prompt: PydanticPrompt = field(
        default_factory=StatementGeneratorPrompt
    )
    weights: list[float] = field(default_factory=lambda: [0.75, 0.25])
    beta: float = 1.0
    answer_similarity: t.Optional[AnswerSimilarity] = None
    max_retries: int = 1

Import

from ragas.metrics import AnswerCorrectness

I/O Contract

Inputs

Name Type Required Description
user_input str Yes The question or prompt provided by the user
response str Yes The generated answer to be evaluated
reference str Yes The ground truth answer to compare against
weights list[float] No Metric field set at construction: two-element list of weights for factuality and semantic similarity (default [0.75, 0.25])
beta float No Metric field set at construction: beta parameter for the F-beta score; beta > 1 favors recall, beta < 1 favors precision (default 1.0)

Outputs

Name Type Description
score float A weighted average of factuality (F-beta) and semantic similarity, ranging from 0.0 to 1.0

Internal Components

CorrectnessClassifier Prompt

The CorrectnessClassifier is a PydanticPrompt that accepts a QuestionAnswerGroundTruth input (containing the question, answer statements, and ground truth statements) and produces a ClassificationWithReason output. Each statement is classified as TP, FP, or FN with a reason for the classification.
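To make the classification output concrete, the sketch below shows one way such a result could be turned into TP/FP/FN counts. The dictionary layout and its key names ("TP", "FP", "FN", "statement", "reason") are assumptions for illustration, not the actual ClassificationWithReason schema. The statements and reasons are invented from the sun example used elsewhere on this page.

```python
# Hypothetical classifier output for:
#   answer:    "The sun is powered by nuclear fusion."
#   reference: "The sun is powered by nuclear fusion, where hydrogen atoms
#               fuse to form helium."
classification = {
    "TP": [
        {"statement": "The sun is powered by nuclear fusion.",
         "reason": "Stated in both the answer and the ground truth."},
    ],
    "FP": [],
    "FN": [
        {"statement": "Hydrogen atoms fuse to form helium.",
         "reason": "Present in the ground truth but missing from the answer."},
    ],
}

# The counts are the lengths of the three lists.
tp, fp, fn = (len(classification[key]) for key in ("TP", "FP", "FN"))
print(tp, fp, fn)
```

These three counts are the only values the F-beta computation needs from the classification step.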

Statement Generation

Both the response and the reference are first decomposed into atomic statements using the StatementGeneratorPrompt (imported from the faithfulness module). This simplification step breaks complex text into individual factual claims for more granular comparison.

Score Computation

The factuality score is computed using the fbeta_score utility function from ragas.metrics.utils:

score = fbeta_score(tp, fp, fn, self.beta)

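The final score combines factuality and similarity; f1_score here holds the factuality F-beta value, and similarity_score holds the semantic similarity: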

score = np.average([f1_score, similarity_score], weights=self.weights)
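Written out as formulas, the two steps look like this. The symbols P, R, s, w_1 and w_2 are notation introduced for this sketch, and the numbers in the worked line are assumed values (one true positive, one false negative, beta = 1, similarity 0.9), not measured results.

```latex
% Precision and recall over classified statements
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}

% Factuality score
F_\beta = \frac{(1 + \beta^2)\, P\, R}{\beta^2 P + R}

% Final score, with s the semantic similarity and (w_1, w_2) the weights
\text{score} = \frac{w_1 F_\beta + w_2\, s}{w_1 + w_2}

% Worked example: TP = 1, FP = 0, FN = 1, \beta = 1, s = 0.9, weights (0.75, 0.25)
P = 1, \quad R = 0.5, \quad F_1 = \frac{2 \cdot 1 \cdot 0.5}{1 + 0.5} \approx 0.667,
\quad \text{score} \approx 0.75 \cdot 0.667 + 0.25 \cdot 0.9 = 0.725
```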

Usage Examples

Basic Usage

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import AnswerCorrectness

# Create a dataset for evaluation
data = {
    "user_input": ["What powers the sun?"],
    "response": ["The sun is powered by nuclear fusion."],
    "reference": [
        "The sun is powered by nuclear fusion, where hydrogen atoms fuse to form helium."
    ],
}
dataset = Dataset.from_dict(data)

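# Evaluate using AnswerCorrectness. The metric calls an LLM and an embedding
# model, which can be supplied through evaluate()'s llm and embeddings arguments.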
results = evaluate(dataset, metrics=[AnswerCorrectness()])
print(results)

Custom Weights

from ragas.metrics import AnswerCorrectness

# Give full weight to factuality, ignore semantic similarity
correctness = AnswerCorrectness(weights=[1.0, 0.0])

# Give equal weight to factuality and semantic similarity
correctness_balanced = AnswerCorrectness(weights=[0.5, 0.5])
