Implementation:Vibrantlabsai Ragas AnswerRelevance

From Leeroopedia
Knowledge Sources
Domains Evaluation, Metrics
Last Updated 2026-02-12 00:00 GMT

Overview

AnswerRelevancy (also known as ResponseRelevancy) scores how relevant a generated answer is to the original question by generating reverse questions from the answer and measuring their cosine similarity to the original question.

Description

The AnswerRelevancy metric evaluates the relevance of an answer with a reverse-question approach. Rather than comparing the answer to the question directly, the metric uses an LLM to generate questions that the answer could plausibly be responding to. These generated questions are then compared against the original question using the cosine similarity of their embedding vectors.

The core idea is that a highly relevant answer should yield generated questions that closely match the original question. Answers that are incomplete, or that contain redundant or unnecessary information, yield generated questions that diverge from the original, which lowers the similarity scores.

The metric also incorporates a noncommittal detection mechanism. If the answer is determined to be evasive, vague, or ambiguous (e.g., "I don't know" or "I'm not sure"), the answer is flagged as noncommittal and the score is set to zero regardless of the cosine similarity.

The strictness parameter controls how many questions are generated per answer (default 3). The final score is the mean cosine similarity across all generated questions, multiplied by a binary indicator that is 0 when the answer is flagged as noncommittal and 1 otherwise.
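
The scoring rule described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the Ragas implementation: toy_embed is a hypothetical bag-of-words stand-in for a real embedding model, and the generated questions with their noncommittal flags are passed in by hand instead of coming from an LLM.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical embedder: word counts instead of a learned vector."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def relevancy_sketch(question, generated):
    """Score a list of (generated_question, noncommittal_flag) pairs.

    The list length plays the role of strictness.
    """
    sims = [cosine(toy_embed(question), toy_embed(q)) for q, _ in generated]
    all_noncommittal = all(flag == 1 for _, flag in generated)
    # Mean similarity, zeroed out when every generation is noncommittal.
    return (sum(sims) / len(sims)) * int(not all_noncommittal)


question = "Where was Albert Einstein born?"
on_topic = relevancy_sketch(
    question,
    [(question, 0), ("In which country was Einstein born?", 0), (question, 0)],
)
evasive = relevancy_sketch(question, [(question, 1)] * 3)
```

Here on_topic stays high because the generated questions share most of their words with the original, while evasive is exactly zero even though the similarities are perfect, since every flag is 1.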

The class hierarchy consists of ResponseRelevancy (the base implementation) and AnswerRelevancy (a subclass alias). The pre-instantiated answer_relevancy singleton is available for convenience.

Usage

Use this metric when you want to evaluate whether an LLM's response actually addresses the question that was asked, without requiring a reference answer. It is particularly effective for detecting off-topic responses, overly generic answers, or responses that dodge the question.

Code Reference

Source Location

Signature

@dataclass
class ResponseRelevancy(MetricWithLLM, MetricWithEmbeddings, SingleTurnMetric):
    name: str = "answer_relevancy"
    output_type = MetricOutputType.CONTINUOUS
    question_generation: PydanticPrompt = ResponseRelevancePrompt()
    strictness: int = 3

class AnswerRelevancy(ResponseRelevancy):
    ...

Import

from ragas.metrics import AnswerRelevancy

I/O Contract

Inputs

Name | Type | Required | Description
user_input | str | Yes | The original question or prompt from the user
response | str | Yes | The generated answer to evaluate for relevance
strictness | int | No | Attribute of the metric rather than a field of the sample: number of questions generated per answer for similarity comparison (default 3, ideal range 3-5)

Outputs

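Name | Type | Description
score | float | Relevance score, typically between 0.0 and 1.0 (mean cosine similarity of the generated questions to the original question, set to zero if the answer is noncommittal). Cosine similarity itself ranges from -1 to 1, so the lower bound is not mathematically guaranteed.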

Internal Components

ResponseRelevancePrompt

The ResponseRelevancePrompt is a PydanticPrompt that takes a ResponseRelevanceInput (containing the response text) and produces a ResponseRelevanceOutput with a generated question and a noncommittal flag (0 or 1).
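
To make the input and output shapes concrete, here is a minimal sketch that uses standard-library dataclasses as stand-ins for the Pydantic models. The class names carry a Sketch suffix to mark them as hypothetical, the field names (response, question, noncommittal) are assumptions based on the description above, and stub_generate is a hard-coded placeholder for the LLM call.

```python
from dataclasses import dataclass


@dataclass
class ResponseRelevanceInputSketch:
    response: str  # assumed field name: the answer text to analyse


@dataclass
class ResponseRelevanceOutputSketch:
    question: str  # assumed field name: the reverse-generated question
    noncommittal: int  # 1 if the answer is evasive, 0 otherwise


def stub_generate(inp):
    """Hypothetical stand-in for the LLM; only the output shape matters."""
    text = inp.response.lower()
    evasive = "i don't know" in text or "i'm not sure" in text
    return ResponseRelevanceOutputSketch(
        question="What is the speaker unsure about?"
        if evasive
        else "What does the response state?",
        noncommittal=int(evasive),
    )


committal = stub_generate(
    ResponseRelevanceInputSketch("Albert Einstein was born in Germany.")
)
vague = stub_generate(ResponseRelevanceInputSketch("I don't know."))
```

With strictness set to 3, the metric would collect three such outputs per answer before computing similarities.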

Similarity Calculation

The calculate_similarity method computes cosine similarity between the original question embedding and all generated question embeddings:

def calculate_similarity(self, question: str, generated_questions: list[str]):
    question_vec = np.asarray(self.embeddings.embed_query(question)).reshape(1, -1)
    gen_question_vec = np.asarray(
        self.embeddings.embed_documents(generated_questions)
    ).reshape(len(generated_questions), -1)
    norm = np.linalg.norm(gen_question_vec, axis=1) * np.linalg.norm(
        question_vec, axis=1
    )
    return np.dot(gen_question_vec, question_vec.T).reshape(-1) / norm
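
The shapes involved are easier to follow with a toy run. The sketch below repeats the same arithmetic as a free function and feeds it a hypothetical ToyEmbeddings class that exposes only the two calls used above, embed_query and embed_documents. The vocabulary and the word-count vectors are assumptions made purely for illustration.

```python
import numpy as np


class ToyEmbeddings:
    """Hypothetical embedder: counts of a fixed six-word vocabulary."""

    vocab = ["where", "was", "einstein", "born", "capital", "france"]

    def embed_query(self, text):
        words = text.lower().replace("?", "").split()
        return [float(words.count(w)) for w in self.vocab]

    def embed_documents(self, texts):
        return [self.embed_query(t) for t in texts]


def calculate_similarity(embeddings, question, generated_questions):
    # Shape (1, d): the original question as a single row vector.
    question_vec = np.asarray(embeddings.embed_query(question)).reshape(1, -1)
    # Shape (n, d): one row per generated question.
    gen_question_vec = np.asarray(
        embeddings.embed_documents(generated_questions)
    ).reshape(len(generated_questions), -1)
    # Shape (n,): product of the two vector norms for each pair.
    norm = np.linalg.norm(gen_question_vec, axis=1) * np.linalg.norm(
        question_vec, axis=1
    )
    # (n, d) @ (d, 1) gives (n, 1), flattened to (n,) before dividing.
    return np.dot(gen_question_vec, question_vec.T).reshape(-1) / norm


sims = calculate_similarity(
    ToyEmbeddings(),
    "Where was Einstein born?",
    ["Where was Einstein born?", "Capital of France?"],
)
```

The result holds one similarity per generated question. Note that this arithmetic divides by the norms directly, so an all-zero embedding would produce a division by zero; that is unlikely with a real embedding model but easy to trigger with a toy one.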

Score Computation

The final score multiplies the mean cosine similarity by int(not all_noncommittal), which evaluates to 0 when all_noncommittal is true and to 1 otherwise:

cosine_sim = self.calculate_similarity(question, gen_questions)
score = cosine_sim.mean() * int(not all_noncommittal)

Usage Examples

Basic Usage

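from datasets import Dataset

from ragas import evaluate
from ragas.metrics import AnswerRelevancy

# One row per sample; the column names match the metric's inputs.
data = {
    "user_input": ["Where was Albert Einstein born?"],
    "response": ["Albert Einstein was born in Germany."],
}
dataset = Dataset.from_dict(data)

# The metric needs an LLM (to generate questions) and an embedding model
# (to compare them). Pass them via evaluate(..., llm=..., embeddings=...)
# if the defaults of your Ragas installation are not configured.
results = evaluate(dataset, metrics=[AnswerRelevancy()])
print(results)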

Custom Strictness

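from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import AnswerRelevancy

# A higher strictness generates more questions per answer, which makes the
# mean similarity more robust at the cost of additional LLM calls.
relevancy = AnswerRelevancy()
relevancy.strictness = 5

sample = SingleTurnSample(
    user_input="What is photosynthesis?",
    response="Photosynthesis is the process by which plants convert sunlight into energy.",
)

# Scoring needs relevancy.llm and relevancy.embeddings to be set first, e.g.:
# score = await relevancy.single_turn_ascore(sample)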
