Overview
ResponseGroundedness is a metric that evaluates how well a response is grounded in its retrieved contexts, using a dual-judge evaluation approach.
Description
ResponseGroundedness extends BaseMetric and implements a dual-judge evaluation strategy modeled on NVIDIA's approach. It uses two distinct judge prompts (ResponseGroundednessJudge1Prompt and ResponseGroundednessJudge2Prompt) to evaluate groundedness from different perspectives, then averages their scores so that no single prompt determines the result.
Each judge rates on a 0-2 integer scale: 0 (not grounded), 1 (partially grounded), 2 (fully grounded). Each raw rating is divided by 2.0 to give a float between 0.0 and 1.0, and the final score is the average of the two normalized scores.
The metric retries a judge call when the LLM returns an invalid rating or raises an exception, up to a configurable maximum number of retries (default 5). A judge that still has no valid rating after all retries yields NaN. The _average_scores method handles NaN by using whichever judge returned a valid score, and returns NaN only if both judges failed.
Before evaluation, the metric validates that both response and retrieved_contexts are non-empty. For inputs that are empty after whitespace is stripped, it returns a score of 0.0.
Usage
Use this metric to evaluate the groundedness of LLM-generated responses against retrieved context documents in RAG (Retrieval-Augmented Generation) pipelines. It is particularly useful when you need a robust groundedness score that mitigates single-judge bias.
Code Reference
Source Location
- Repository: Vibrantlabsai_Ragas
- File: src/ragas/metrics/collections/response_groundedness/metric.py
Signature
class ResponseGroundedness(BaseMetric):
    llm: "InstructorBaseRagasLLM"
    def __init__(
        self,
        llm: "InstructorBaseRagasLLM",
        name: str = "response_groundedness",
        max_retries: int = 5,
        **kwargs,
    ):
Import
from ragas.metrics.collections.response_groundedness.metric import ResponseGroundedness
I/O Contract
Inputs (__init__)
| Name | Type | Required | Description |
|---|---|---|---|
| llm | InstructorBaseRagasLLM | Yes | Modern instructor-based LLM used for dual-judge evaluation |
| name | str | No | The metric name; defaults to "response_groundedness" |
| max_retries | int | No | Maximum retry attempts for invalid ratings; defaults to 5 |
Inputs (ascore)
| Name | Type | Required | Description |
|---|---|---|---|
| response | str | Yes | The response text to evaluate for groundedness |
| retrieved_contexts | List[str] | Yes | The retrieved context documents to check groundedness against |
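The input handling summarized in the Description (reject empty inputs, score whitespace-only inputs as 0.0) can be sketched as a small pre-check. This is an illustration only, not the library's code: `precheck_inputs` is a hypothetical helper, and both the use of `ValueError` and the newline join of the contexts are assumptions.

```python
from typing import List, Optional


def precheck_inputs(response: str, retrieved_contexts: List[str]) -> Optional[float]:
    """Return 0.0 for whitespace-only inputs, or None when judging should proceed."""
    if not response or not retrieved_contexts:
        # Assumption: the real metric may signal missing inputs differently.
        raise ValueError("response and retrieved_contexts must be non-empty")

    # Assumption: contexts are combined into one string before the check.
    context_text = "\n".join(retrieved_contexts)
    if not response.strip() or not context_text.strip():
        return 0.0  # nothing to ground, so score as not grounded

    return None  # inputs look usable; hand off to the judges
```

A `None` return here simply means "no shortcut applies", so the caller would continue with the two judge calls.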
Outputs
| Name | Type | Description |
|---|---|---|
| ascore return | MetricResult | A MetricResult with a float value between 0.0 and 1.0 representing groundedness; higher is better |
Key Methods
| Method | Description |
|---|---|
| ascore(response, retrieved_contexts) | Main evaluation method: validates inputs, runs both judge prompts, averages scores, and returns a MetricResult |
| _get_judge_rating(prompt_obj, response, context) | Gets a rating from a single judge with retry logic; returns the rating as a float or NaN on failure |
| _average_scores(score1, score2) | Averages two judge scores with NaN handling; falls back to whichever score is valid |
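The retry behavior described for _get_judge_rating can be approximated as follows. This is a minimal sketch under stated assumptions rather than the actual implementation: `get_rating_with_retries` and `ask_judge` are hypothetical names, the judge is modeled as any async callable returning an integer, and `max_retries` is treated here as the total number of attempts.

```python
import asyncio
import math
from typing import Awaitable, Callable


async def get_rating_with_retries(
    ask_judge: Callable[[], Awaitable[int]],
    max_retries: int = 5,
) -> float:
    """Call a judge up to max_retries times; return a normalized score or NaN."""
    for _ in range(max_retries):
        try:
            rating = await ask_judge()
        except Exception:
            continue  # treat an LLM error like an invalid rating and try again
        if rating in (0, 1, 2):
            return rating / 2.0  # normalize the 0-2 rating to 0.0-1.0
    return math.nan  # every attempt failed or was out of range


async def _demo() -> float:
    replies = iter([7, 2])  # first reply is out of range, second is valid

    async def flaky_judge() -> int:
        return next(replies)

    return await get_rating_with_retries(flaky_judge, max_retries=3)


demo_score = asyncio.run(_demo())
```

Returning NaN instead of raising lets the caller decide how to combine a failed judge with a successful one.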
Scoring Details
| Raw Rating | Meaning | Normalized Score |
|---|---|---|
| 0 | Not grounded | 0.0 |
| 1 | Partially grounded | 0.5 |
| 2 | Fully grounded | 1.0 |
The final score is the average of both judges' normalized scores.
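The arithmetic is small enough to sketch directly. The helper names below (`normalize_rating`, `average_scores`) are hypothetical stand-ins for illustration and are not the library's own functions; the NaN fallback follows the behavior described for _average_scores.

```python
import math


def normalize_rating(raw_rating: int) -> float:
    """Map a 0-2 judge rating onto the 0.0-1.0 scale; NaN for anything else."""
    if raw_rating in (0, 1, 2):
        return raw_rating / 2.0
    return math.nan


def average_scores(score1: float, score2: float) -> float:
    """Average two normalized scores, ignoring a judge that returned NaN."""
    valid = [s for s in (score1, score2) if not math.isnan(s)]
    if not valid:
        return math.nan  # both judges failed
    return sum(valid) / len(valid)
```

For example, ratings of 2 and 1 normalize to 1.0 and 0.5, which average to 0.75. If one judge fails, the other judge's score is used unchanged.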
Usage Examples
Basic Usage
import asyncio
from openai import AsyncOpenAI
from ragas.llms.base import llm_factory
from ragas.metrics.collections.response_groundedness.metric import ResponseGroundedness
# Set up the LLM (AsyncOpenAI reads OPENAI_API_KEY from the environment)
client = AsyncOpenAI()
llm = llm_factory("gpt-4o", client=client)
# Create the metric
metric = ResponseGroundedness(llm=llm)
async def main():
    # Evaluate groundedness of the response against the retrieved context
    result = await metric.ascore(
        response="Einstein was born in Germany in 1879.",
        retrieved_contexts=[
            "Albert Einstein was born in Ulm, Germany on March 14, 1879."
        ],
    )
    print(f"Groundedness score: {result.value}")
# ascore must be awaited; in a notebook, call `await main()` instead
asyncio.run(main())
Multiple Contexts
from ragas.metrics.collections.response_groundedness.metric import ResponseGroundedness
# my_llm is an InstructorBaseRagasLLM instance, e.g. the llm created in Basic Usage
metric = ResponseGroundedness(llm=my_llm, max_retries=3)
# Run inside an async function or a notebook cell, since ascore must be awaited
result = await metric.ascore(
    response="Python was created by Guido van Rossum and first released in 1991.",
    retrieved_contexts=[
        "Python is a high-level programming language created by Guido van Rossum.",
        "Python was first released in 1991 as a successor to the ABC language.",
    ],
)
print(f"Groundedness: {result.value:.2f}")  # Expected: high score
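Scoring Many Samples
In a RAG pipeline there are usually many response/context pairs to score. Since ascore is a coroutine, one possible pattern is to run the calls concurrently with asyncio.gather. The helper below is a hypothetical sketch, not part of ragas: `score_samples` and the dictionary layout of each sample are assumptions, and it relies only on the `ascore(response=..., retrieved_contexts=...)` call and the `.value` attribute shown above.

```python
import asyncio
from typing import Any, Dict, List


async def score_samples(metric: Any, samples: List[Dict[str, Any]]) -> List[float]:
    """Score each sample concurrently and return the values in input order."""
    results = await asyncio.gather(
        *(
            metric.ascore(
                response=sample["response"],
                retrieved_contexts=sample["retrieved_contexts"],
            )
            for sample in samples
        )
    )
    # Each result is expected to expose the score as `.value`.
    return [result.value for result in results]
```

Because a failed evaluation can surface as NaN, it may be worth filtering with math.isnan before aggregating the returned values.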