Implementation:Vibrantlabsai Ragas ResponseGroundednessV2

From Leeroopedia
Knowledge Sources
Domains: Metrics, Evaluation, Groundedness, RAG
Last Updated: 2026-02-12 00:00 GMT

Overview

ResponseGroundedness is a metric that evaluates how well a response is grounded in its retrieved contexts, using two LLM judges whose scores are averaged.

Description

ResponseGroundedness extends BaseMetric and implements a dual-judge evaluation strategy inspired by NVIDIA's approach. It uses two distinct judge prompts (ResponseGroundednessJudge1Prompt and ResponseGroundednessJudge2Prompt) to evaluate groundedness from different perspectives, then averages their scores for a more robust result.

Each judge rates on a 0-2 integer scale: 0 (not grounded), 1 (partially grounded), 2 (fully grounded). Each raw rating is converted to a float between 0.0 and 1.0 by dividing by 2.0, and the final score is the average of the two converted scores.

The metric retries a judge call when the LLM returns an invalid rating or raises an exception, up to a configurable maximum number of retries (default 5). If a judge still has no valid rating after all retries, its score is NaN. The _average_scores method handles NaN by using whichever judge returned a valid score, and returns NaN only if both judges failed.

Before evaluation, the metric validates that both response and retrieved_contexts are non-empty. If the inputs are empty after whitespace is stripped, it returns a score of 0.0.
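The normalization and NaN handling described above can be illustrated with a minimal sketch. The function names normalize_rating and average_scores are hypothetical stand-ins written for this page; they are not the library's code, and the actual _average_scores method may be implemented differently.

```python
import math


def normalize_rating(raw_rating: int) -> float:
    """Map a 0-2 judge rating onto the 0.0-1.0 scale; NaN for anything else."""
    if raw_rating not in (0, 1, 2):
        return math.nan
    return raw_rating / 2.0


def average_scores(score1: float, score2: float) -> float:
    """Average two judge scores, falling back to the valid one if the other is NaN."""
    valid = [s for s in (score1, score2) if not math.isnan(s)]
    if not valid:
        return math.nan  # both judges failed
    return sum(valid) / len(valid)


# Judge 1 says "fully grounded" (2), judge 2 says "partially grounded" (1).
print(average_scores(normalize_rating(2), normalize_rating(1)))  # 0.75
# Judge 2 failed, so judge 1's score is used on its own.
print(average_scores(normalize_rating(2), math.nan))  # 1.0
```

Filtering out NaN before averaging means a single failed judge does not turn the whole result into NaN.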

Usage

Use this metric to evaluate the groundedness of LLM-generated responses against retrieved context documents in RAG (Retrieval-Augmented Generation) pipelines. It is particularly useful when you need a robust groundedness score that mitigates single-judge bias.

Code Reference

Source Location

  • Repository: Vibrantlabsai_Ragas
  • File: src/ragas/metrics/collections/response_groundedness/metric.py

Signature

class ResponseGroundedness(BaseMetric):
    llm: "InstructorBaseRagasLLM"

    def __init__(
        self,
        llm: "InstructorBaseRagasLLM",
        name: str = "response_groundedness",
        max_retries: int = 5,
        **kwargs,
    ):

Import

from ragas.metrics.collections.response_groundedness.metric import ResponseGroundedness

I/O Contract

Inputs (__init__)

Name | Type | Required | Description
llm | InstructorBaseRagasLLM | Yes | Modern instructor-based LLM used for dual-judge evaluation
name | str | No | The metric name; defaults to "response_groundedness"
max_retries | int | No | Maximum retry attempts for invalid ratings; defaults to 5

Inputs (ascore)

Name | Type | Required | Description
response | str | Yes | The response text to evaluate for groundedness
retrieved_contexts | List[str] | Yes | The retrieved context documents to check groundedness against
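The input checks mentioned in the Description might look roughly like the sketch below. This is an illustration under stated assumptions, not the library's code: precheck_inputs is a hypothetical helper, raising ValueError for missing inputs is an assumption, and joining the contexts with newlines is also an assumption.

```python
from typing import List, Optional


def precheck_inputs(response: str, retrieved_contexts: List[str]) -> Optional[float]:
    """Return 0.0 for blank inputs, or None when evaluation should proceed.

    Assumption: missing inputs raise ValueError. The real metric may signal
    this differently.
    """
    if not response:
        raise ValueError("response must be non-empty")
    if not retrieved_contexts:
        raise ValueError("retrieved_contexts must be non-empty")

    # Assumption: contexts are merged into one string before judging.
    context = "\n".join(retrieved_contexts)
    if not response.strip() or not context.strip():
        return 0.0  # empty-after-strip edge case
    return None


print(precheck_inputs("   ", ["Some context."]))  # 0.0
print(precheck_inputs("An answer.", ["Some context."]))  # None
```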

Outputs

Name | Type | Description
ascore return | MetricResult | A MetricResult with a float value between 0.0 and 1.0 representing groundedness; higher is better

Key Methods

Method | Description
ascore(response, retrieved_contexts) | Main evaluation method: validates inputs, runs both judge prompts, averages scores, and returns a MetricResult
_get_judge_rating(prompt_obj, response, context) | Gets a rating from a single judge with retry logic; returns the rating as a float or NaN on failure
_average_scores(score1, score2) | Averages two judge scores with NaN handling; falls back to whichever score is valid
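The retry behavior attributed to _get_judge_rating can be sketched as a loop that retries on both exceptions and out-of-range ratings. The names get_rating_with_retries, ask_judge, and fake_judge are hypothetical; a stub judge replaces the real LLM call so the example runs offline, and the library's actual control flow may differ.

```python
import asyncio
import math
from typing import Awaitable, Callable


async def get_rating_with_retries(
    ask_judge: Callable[[], Awaitable[int]],
    max_retries: int = 5,
) -> float:
    """Call a judge up to max_retries times; return a 0.0-1.0 score or NaN."""
    for _ in range(max_retries):
        try:
            rating = await ask_judge()
        except Exception:
            continue  # treat an LLM error like an invalid rating and retry
        if rating in (0, 1, 2):
            return rating / 2.0
    return math.nan  # no valid rating after all attempts


async def demo() -> float:
    replies = iter([7, 2])  # first reply is out of range, second is valid

    async def fake_judge() -> int:
        return next(replies)

    return await get_rating_with_retries(fake_judge)


print(asyncio.run(demo()))  # 1.0
```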

Scoring Details

Raw Rating | Meaning | Normalized Score
0 | Not grounded | 0.0
1 | Partially grounded | 0.5
2 | Fully grounded | 1.0

The final score is the average of both judges' normalized scores.
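One consequence of this scheme is that the final score can only take a handful of values. The short sketch below enumerates every pair of valid judge ratings to show the resulting granularity. The helper final_score is hypothetical and assumes both judges returned a valid rating.

```python
from itertools import product

RATINGS = (0, 1, 2)


def final_score(rating1: int, rating2: int) -> float:
    """Normalize each 0-2 rating to 0.0-1.0, then average the two."""
    return (rating1 / 2.0 + rating2 / 2.0) / 2.0


# Collect the distinct final scores over all nine rating pairs.
possible = sorted({final_score(a, b) for a, b in product(RATINGS, repeat=2)})
print(possible)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

A score of 0.25 or 0.75 therefore indicates that the two judges disagreed by one rating step.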

Usage Examples

Basic Usage

import asyncio

from openai import AsyncOpenAI
from ragas.llms.base import llm_factory
from ragas.metrics.collections.response_groundedness.metric import ResponseGroundedness


async def main() -> None:
    # Set up the LLM (AsyncOpenAI reads OPENAI_API_KEY from the environment)
    client = AsyncOpenAI()
    llm = llm_factory("gpt-4o", client=client)

    # Create the metric
    metric = ResponseGroundedness(llm=llm)

    # Evaluate groundedness; ascore is a coroutine, so it must be awaited
    result = await metric.ascore(
        response="Einstein was born in Germany in 1879.",
        retrieved_contexts=[
            "Albert Einstein was born in Ulm, Germany on March 14, 1879."
        ],
    )
    print(f"Groundedness score: {result.value}")


asyncio.run(main())

Multiple Contexts

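import asyncio

from openai import AsyncOpenAI
from ragas.llms.base import llm_factory
from ragas.metrics.collections.response_groundedness.metric import ResponseGroundedness


async def main() -> None:
    # Build the LLM as in Basic Usage, then lower the retry limit to 3
    my_llm = llm_factory("gpt-4o", client=AsyncOpenAI())
    metric = ResponseGroundedness(llm=my_llm, max_retries=3)

    # Each fact in the response is supported by one of the two contexts
    result = await metric.ascore(
        response="Python was created by Guido van Rossum and first released in 1991.",
        retrieved_contexts=[
            "Python is a high-level programming language created by Guido van Rossum.",
            "Python was first released in 1991 as a successor to the ABC language.",
        ],
    )
    print(f"Groundedness: {result.value:.2f}")  # Expected: high score


asyncio.run(main())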
