Implementation:Vibrantlabsai Ragas ResponseGroundednessV2

Knowledge Sources	Vibrantlabsai_Ragas
Domains	Metrics, Evaluation, Groundedness, RAG
Last Updated	2026-02-12 00:00 GMT

Overview

ResponseGroundedness is a metric that evaluates how well a response is grounded in the retrieved contexts, using a dual-judge evaluation approach.

Description

ResponseGroundedness extends BaseMetric and implements a dual-judge evaluation strategy inspired by NVIDIA's approach. It uses two distinct judge prompts (ResponseGroundednessJudge1Prompt and ResponseGroundednessJudge2Prompt) to evaluate groundedness from different perspectives, then averages their scores for a more robust result.

Each judge rates the response on an integer scale from 0 to 2: 0 (not grounded), 1 (partially grounded), 2 (fully grounded). Each raw rating is divided by 2.0 to map it onto a 0.0 to 1.0 float scale, and the final score is the average of the two normalized scores.

The metric retries a judge when the LLM returns an invalid rating or raises an exception, up to a configurable maximum (max_retries, default 5). If a judge still has no valid rating after all retries, its score is NaN. The _average_scores method handles NaN by using whichever judge returned a valid score; if both judges failed, the result is NaN.

Before evaluation, the metric validates that both response and retrieved_contexts are non-empty. Inputs that contain only whitespace (empty after stripping) receive a score of 0.0.
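The retry behaviour described above can be illustrated with a minimal sketch. This is not the library's implementation: `get_judge_rating`, `make_scripted_judge`, and `VALID_RATINGS` are hypothetical names, and the judge is modelled as a plain async callable rather than a real LLM prompt.

```python
import asyncio
import math
from typing import Awaitable, Callable, List, Union

# Assumption: the only valid raw ratings are the integers 0, 1 and 2.
VALID_RATINGS = {0, 1, 2}


async def get_judge_rating(
    judge: Callable[[], Awaitable[int]], max_retries: int = 5
) -> float:
    """Return a normalized score in [0.0, 1.0], or NaN if every attempt fails."""
    for _ in range(max_retries):
        try:
            rating = await judge()
        except Exception:
            continue  # an error from the judge counts as a failed attempt
        if rating in VALID_RATINGS:
            return rating / 2.0  # map 0, 1, 2 onto 0.0, 0.5, 1.0
    return math.nan


def make_scripted_judge(
    outputs: List[Union[int, Exception]],
) -> Callable[[], Awaitable[int]]:
    """Hypothetical stand-in for an LLM judge that replays scripted outputs."""
    remaining = iter(outputs)

    async def judge() -> int:
        value = next(remaining)
        if isinstance(value, Exception):
            raise value
        return value

    return judge


# One exception, one out-of-range rating, then a valid rating of 2.
recovered_score = asyncio.run(
    get_judge_rating(make_scripted_judge([RuntimeError("timeout"), 7, 2]))
)
# Every attempt returns an out-of-range rating, so the score is NaN.
failed_score = asyncio.run(
    get_judge_rating(make_scripted_judge([9, 9, 9]), max_retries=3)
)
```

In this sketch an exception and an out-of-range rating are treated the same way: both consume one attempt, and NaN is returned only once all attempts are exhausted.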

Usage

Use this metric to evaluate the groundedness of LLM-generated responses against retrieved context documents in RAG (Retrieval-Augmented Generation) pipelines. It is particularly useful when you need a robust groundedness score that mitigates single-judge bias.

Code Reference

Source Location

Repository: Vibrantlabsai_Ragas
File: src/ragas/metrics/collections/response_groundedness/metric.py

Signature

class ResponseGroundedness(BaseMetric):
    llm: "InstructorBaseRagasLLM"

    def __init__(
        self,
        llm: "InstructorBaseRagasLLM",
        name: str = "response_groundedness",
        max_retries: int = 5,
        **kwargs,
    ):

Import

from ragas.metrics.collections.response_groundedness.metric import ResponseGroundedness

I/O Contract

Inputs (init)

Name	Type	Required	Description
llm	InstructorBaseRagasLLM	Yes	Modern instructor-based LLM used for dual-judge evaluation
name	str	No	The metric name; defaults to "response_groundedness"
max_retries	int	No	Maximum retry attempts for invalid ratings; defaults to 5

Inputs (ascore)

Name	Type	Required	Description
response	str	Yes	The response text to evaluate for groundedness
retrieved_contexts	List[str]	Yes	The retrieved context documents to check groundedness against

Outputs

Name	Type	Description
ascore return	MetricResult	A MetricResult whose value is a float between 0.0 and 1.0 representing groundedness (higher is better), or NaN if both judges fail to return a valid rating
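Because a score can be NaN when both judges fail, callers that aggregate many results may want to skip NaN values explicitly. The following is a minimal sketch under that assumption; `mean_valid_score` is a hypothetical helper, not part of Ragas, and it operates on plain floats such as those read from `result.value`.

```python
import math
from typing import Iterable, Optional


def mean_valid_score(values: Iterable[float]) -> Optional[float]:
    """Average the scores that are not NaN; return None if none are valid."""
    valid = [v for v in values if not math.isnan(v)]
    if not valid:
        return None
    return sum(valid) / len(valid)


# Hypothetical scores collected from several ascore calls.
scores = [1.0, 0.5, math.nan, 0.75]
summary = mean_valid_score(scores)
no_valid = mean_valid_score([math.nan, math.nan])
```

Returning None rather than NaN for the all-failed case is a design choice of this sketch; it makes the "nothing could be scored" case harder to overlook downstream.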

Key Methods

Method	Description
`ascore(response, retrieved_contexts)`	Main evaluation method: validates inputs, runs both judge prompts, averages scores, and returns a MetricResult
`_get_judge_rating(prompt_obj, response, context)`	Gets a rating from a single judge with retry logic; returns the rating as a float or NaN on failure
`_average_scores(score1, score2)`	Averages two judge scores with NaN handling; falls back to whichever score is valid

Scoring Details

Raw Rating	Meaning	Normalized Score
0	Not grounded	0.0
1	Partially grounded	0.5
2	Fully grounded	1.0

The final score is the average of both judges' normalized scores.
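The normalization and NaN-aware averaging can be sketched as follows. The function names are illustrative assumptions and the code only mirrors the behaviour described on this page; it is not copied from the metric's source.

```python
import math


def normalize(raw_rating: int) -> float:
    """Map a raw 0-2 rating onto the 0.0-1.0 scale."""
    return raw_rating / 2.0


def average_scores(score1: float, score2: float) -> float:
    """Average two judge scores, ignoring NaN; NaN if both are NaN."""
    valid = [s for s in (score1, score2) if not math.isnan(s)]
    if not valid:
        return math.nan
    return sum(valid) / len(valid)


# Judge 1 says "fully grounded" (2), judge 2 says "partially grounded" (1).
both_valid = average_scores(normalize(2), normalize(1))
# Judge 1 failed (NaN), so the score falls back to judge 2 alone.
one_failed = average_scores(math.nan, normalize(2))
# Both judges failed.
both_failed = average_scores(math.nan, math.nan)
```

With two valid judges the final score can take the values 0.0, 0.25, 0.5, 0.75, or 1.0.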

Usage Examples

Basic Usage

import asyncio

from openai import AsyncOpenAI
from ragas.llms.base import llm_factory
from ragas.metrics.collections.response_groundedness.metric import ResponseGroundedness


async def main() -> None:
    # Set up the LLM (AsyncOpenAI reads OPENAI_API_KEY from the environment)
    client = AsyncOpenAI()
    llm = llm_factory("gpt-4o", client=client)

    # Create the metric
    metric = ResponseGroundedness(llm=llm)

    # Evaluate groundedness; ascore is a coroutine and must be awaited
    result = await metric.ascore(
        response="Einstein was born in Germany in 1879.",
        retrieved_contexts=[
            "Albert Einstein was born in Ulm, Germany on March 14, 1879."
        ],
    )
    print(f"Groundedness score: {result.value}")


asyncio.run(main())

Multiple Contexts

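from ragas.metrics.collections.response_groundedness.metric import ResponseGroundedness

# `my_llm` is an InstructorBaseRagasLLM instance, e.g. built with llm_factory as in Basic Usage.
# The `await` below must run inside an async function (or a notebook that supports top-level await).
metric = ResponseGroundedness(llm=my_llm, max_retries=3)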

result = await metric.ascore(
    response="Python was created by Guido van Rossum and first released in 1991.",
    retrieved_contexts=[
        "Python is a high-level programming language created by Guido van Rossum.",
        "Python was first released in 1991 as a successor to the ABC language.",
    ],
)
print(f"Groundedness: {result.value:.2f}")  # Expected: high score

Related Pages

Environment:Vibrantlabsai_Ragas_Python_3_9_Core_Environment
