Implementation:Vibrantlabsai Ragas AnswerCorrectness
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Metrics |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
AnswerCorrectness measures the correctness of a generated answer compared to a ground truth reference by combining factual statement overlap (via an F1-like score) with semantic similarity.
Description
The AnswerCorrectness metric evaluates how correct a generated response is when compared to a known reference answer. It blends two components, factuality and semantic similarity, into a single weighted score.
The factuality component works by first decomposing both the response and the reference into simplified atomic statements using an LLM-based StatementGeneratorPrompt. These statement lists are then fed into a CorrectnessClassifier prompt that categorizes each statement as a True Positive (TP, present in both answer and ground truth), False Positive (FP, present only in the answer), or False Negative (FN, present only in the ground truth). From these counts, a configurable F-beta score is computed, where the beta parameter controls the balance between precision and recall.
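The F-beta step can be illustrated with a minimal sketch, assuming the standard F-beta definition over TP, FP, and FN counts. The helper name `fbeta_from_counts` and its handling of the zero-TP case are illustrative assumptions, not the library's own utility.

```python
def fbeta_from_counts(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """Standard F-beta from counts (hypothetical helper, not the ragas utility)."""
    if tp == 0:
        # Assumption: no true positives yields 0.0 instead of a division error.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# The answer states 2 correct facts, no wrong ones, and misses 2 reference facts,
# so precision is 1.0 and recall is 0.5.
f1 = fbeta_from_counts(tp=2, fp=0, fn=2, beta=1.0)
f2 = fbeta_from_counts(tp=2, fp=0, fn=2, beta=2.0)      # weights recall more
f_half = fbeta_from_counts(tp=2, fp=0, fn=2, beta=0.5)  # weights precision more
```

Because recall is the weaker side in this example, raising beta lowers the score and lowering beta raises it.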
The semantic similarity component delegates to the AnswerSimilarity metric, which computes cosine similarity between embeddings of the response and the reference.
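As a rough illustration of the cosine similarity computation, the sketch below works on plain Python lists. The toy three-dimensional vectors and the `cosine_similarity` helper are assumptions made for this example; the actual metric obtains its vectors from the configured embeddings model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (illustrative helper)."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy "embeddings"; real embedding vectors have many more dimensions.
response_vec = [1.0, 2.0, 0.0]
reference_vec = [2.0, 4.0, 0.0]  # same direction as response_vec
unrelated_vec = [0.0, 0.0, 3.0]  # orthogonal to response_vec

same_direction = cosine_similarity(response_vec, reference_vec)
orthogonal = cosine_similarity(response_vec, unrelated_vec)
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.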
The final score is a weighted average of the factuality F-beta score and the semantic similarity score. By default, the weights are [0.75, 0.25], giving 75% weight to factuality and 25% to semantic similarity. Both weights must be non-negative and at least one must be non-zero.
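The weighting rule above can be sketched in plain Python. The helper name `combine_scores` and the example component scores (0.6 and 0.9) are hypothetical; the sketch divides by the sum of the weights, which is how a weighted average behaves when the weights do not add up to 1.

```python
from typing import Sequence


def combine_scores(
    factuality: float,
    similarity: float,
    weights: Sequence[float] = (0.75, 0.25),
) -> float:
    """Weighted average of the two component scores (hypothetical helper)."""
    if len(weights) != 2:
        raise ValueError("Expected two weights: factuality, then similarity")
    if any(w < 0 for w in weights):
        raise ValueError("Weights must be non-negative")
    if all(w == 0 for w in weights):
        raise ValueError("At least one weight must be non-zero")
    w_fact, w_sim = weights
    # Dividing by the weight sum normalizes weights that do not add up to 1.
    return (w_fact * factuality + w_sim * similarity) / (w_fact + w_sim)


default_score = combine_scores(0.6, 0.9)                      # 0.75 * 0.6 + 0.25 * 0.9
factual_only = combine_scores(0.6, 0.9, weights=(1.0, 0.0))   # ignores similarity
unnormalized = combine_scores(0.6, 0.9, weights=(3.0, 1.0))   # same 3:1 ratio as default
```

Only the ratio between the two weights matters in this sketch, so `(3.0, 1.0)` gives the same result as the default `(0.75, 0.25)`.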
Usage
Use this metric when you need a comprehensive assessment of answer correctness that accounts for both factual accuracy (whether the right facts are stated) and semantic meaning (whether the answer conveys the right meaning). It is particularly useful for question-answering evaluation tasks where both precision and recall of factual content matter.
Code Reference
Source Location
- Repository: Vibrantlabsai_Ragas
- File: src/ragas/metrics/_answer_correctness.py
Signature
@dataclass
class AnswerCorrectness(MetricWithLLM, MetricWithEmbeddings, SingleTurnMetric):
    name: str = "answer_correctness"
    output_type = MetricOutputType.CONTINUOUS
    correctness_prompt: PydanticPrompt = field(default_factory=CorrectnessClassifier)
    statement_generator_prompt: PydanticPrompt = field(
        default_factory=StatementGeneratorPrompt
    )
    weights: list[float] = field(default_factory=lambda: [0.75, 0.25])
    beta: float = 1.0
    answer_similarity: t.Optional[AnswerSimilarity] = None
    max_retries: int = 1
Import
from ragas.metrics import AnswerCorrectness
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| user_input | str | Yes | The question or prompt provided by the user |
| response | str | Yes | The generated answer to be evaluated |
| reference | str | Yes | The ground truth answer to compare against |
| weights | list[float] | No | Two-element list of weights for factuality and semantic similarity (default [0.75, 0.25]) |
| beta | float | No | Beta parameter for the F-beta score; beta > 1 favors recall, beta < 1 favors precision (default 1.0) |
Outputs
| Name | Type | Description |
|---|---|---|
| score | float | A weighted average of factuality (F-beta) and semantic similarity, ranging from 0.0 to 1.0 |
Internal Components
CorrectnessClassifier Prompt
The CorrectnessClassifier is a PydanticPrompt that accepts a QuestionAnswerGroundTruth input (containing the question, answer statements, and ground truth statements) and produces a ClassificationWithReason output. Each statement is classified as TP, FP, or FN with a reason for the classification.
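To make the shape of the classifier output concrete, here is a minimal sketch using plain dataclasses. The types `StatementVerdict` and `Classification` are hypothetical stand-ins written for this example, not the library's own models, and the classified statements are hand-written, not produced by an LLM.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StatementVerdict:
    """One classified statement with its justification (hypothetical type)."""
    statement: str
    reason: str


@dataclass
class Classification:
    """Statements grouped by category (hypothetical stand-in for the prompt output)."""
    TP: List[StatementVerdict] = field(default_factory=list)
    FP: List[StatementVerdict] = field(default_factory=list)
    FN: List[StatementVerdict] = field(default_factory=list)


# Hand-written example of what a classification could look like.
result = Classification(
    TP=[
        StatementVerdict(
            "The sun is powered by nuclear fusion.",
            "Stated in both the answer and the ground truth.",
        )
    ],
    FN=[
        StatementVerdict(
            "Hydrogen atoms fuse to form helium.",
            "Present in the ground truth but missing from the answer.",
        )
    ],
)

# Only the number of statements in each category feeds the F-beta computation.
tp, fp, fn = len(result.TP), len(result.FP), len(result.FN)
```

The reasons are useful for inspecting why a statement landed in a category; the score itself depends only on the three counts.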
Statement Generation
Both the response and the reference are first decomposed into atomic statements using the StatementGeneratorPrompt (imported from the faithfulness module). This simplification step breaks complex text into individual factual claims for more granular comparison.
Score Computation
The factuality score is computed using the fbeta_score utility function from ragas.metrics.utils:
score = fbeta_score(tp, fp, fn, self.beta)
The final score combines the factuality score (held in f1_score, even when beta differs from 1) with the similarity score as a weighted average:
score = np.average([f1_score, similarity_score], weights=self.weights)
Usage Examples
Basic Usage
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import AnswerCorrectness

# Create a one-row dataset for evaluation
data = {
    "user_input": ["What powers the sun?"],
    "response": ["The sun is powered by nuclear fusion."],
    "reference": [
        "The sun is powered by nuclear fusion, where hydrogen atoms fuse to form helium."
    ],
}
dataset = Dataset.from_dict(data)

# Evaluate using AnswerCorrectness. The metric needs an LLM and an embeddings
# model; this call assumes both are available to evaluate() in your environment.
results = evaluate(dataset, metrics=[AnswerCorrectness()])
print(results)
Custom Weights
from ragas.metrics import AnswerCorrectness
# Give full weight to factuality, ignore semantic similarity
correctness = AnswerCorrectness(weights=[1.0, 0.0])
# Give equal weight to factuality and semantic similarity
correctness_balanced = AnswerCorrectness(weights=[0.5, 0.5])