Implementation:Vibrantlabsai Ragas ContextRelevanceV2
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Metrics |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
Evaluates whether retrieved contexts are pertinent to the user's question using a dual-judge LLM evaluation system that averages two independent assessments for robust scoring.
Description
The ContextRelevance metric (V2 collections implementation) measures how relevant retrieved context documents are to the user's input question. It uses a dual-judge evaluation system inspired by NVIDIA's approach:
1. Judge 1 evaluates context relevance using a direct relevance prompt (ContextRelevanceJudge1Prompt).
2. Judge 2 evaluates from an alternative perspective for fairness (ContextRelevanceJudge2Prompt).
3. The final score is the average of both judges' ratings.
Each judge assigns a rating on a 0-1-2 scale:
- 0 - Not relevant
- 1 - Partially relevant
- 2 - Fully relevant
The raw ratings are converted to the 0.0-1.0 scale by dividing by 2, then averaged. If one judge fails to produce a valid rating after retries, the other judge's score is used. If both fail, the result is NaN.
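The normalization and fallback rules above can be sketched as a small helper. This is only an illustration of the arithmetic: the function name combine_judge_ratings and the use of None to mark a failed judge are assumptions, not names taken from the Ragas source.

```python
import math
from typing import Optional


def combine_judge_ratings(rating1: Optional[int], rating2: Optional[int]) -> float:
    """Combine two 0-1-2 ratings into one 0.0-1.0 score (hypothetical helper)."""
    # Normalize each valid rating by dividing by 2; None marks a failed judge.
    scores = [r / 2.0 for r in (rating1, rating2) if r is not None]
    if not scores:
        return math.nan  # both judges failed to produce a valid rating
    # Average of both scores, or the single surviving score if one judge failed.
    return sum(scores) / len(scores)
```

For example, ratings of 2 and 1 normalize to 1.0 and 0.5, which average to 0.75. If only one judge returns a rating of 2, the score is 1.0.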
The metric includes retry logic (configurable via max_retries, default 5) to handle cases where the LLM returns an invalid rating. It also short-circuits several edge cases: empty inputs return 0.0, as do cases where the context exactly matches, or is contained within, the user input.
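A minimal sketch of the retry behaviour follows, under stated assumptions. The names rate_with_retries and ask_judge are hypothetical, and treating max_retries as the total number of attempts is an assumption. The actual implementation may count attempts differently.

```python
import asyncio
from typing import Awaitable, Callable, Optional

VALID_RATINGS = {0, 1, 2}


async def rate_with_retries(
    ask_judge: Callable[[], Awaitable[object]],  # hypothetical judge call
    max_retries: int = 5,
) -> Optional[int]:
    """Query a judge until it returns 0, 1, or 2, or the attempts run out."""
    for _ in range(max_retries):
        rating = await ask_judge()
        # Accept only plain integers on the 0-1-2 scale.
        if type(rating) is int and rating in VALID_RATINGS:
            return rating
    return None  # the caller treats None as a failed judge
```

Returning None rather than raising lets the caller fall back to the other judge's score, which matches the fallback behaviour described earlier.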
The retrieved contexts list is joined with newline characters into a single string before evaluation. Structured prompts use ContextRelevanceInput and ContextRelevanceOutput data classes for communication with the LLM.
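The joining step and the edge-case checks described above might look like the sketch below. The helper name short_circuit_score is hypothetical, and stripping whitespace before comparing is an assumption rather than documented behaviour.

```python
from typing import List, Optional


def short_circuit_score(user_input: str, retrieved_contexts: List[str]) -> Optional[float]:
    """Return 0.0 for the documented edge cases, or None if the judges should run."""
    # Join all contexts into a single newline-separated string.
    context = "\n".join(retrieved_contexts).strip()
    question = user_input.strip()  # stripping is an assumption of this sketch
    if not question or not context:
        return 0.0  # empty input
    if context == question or context in question:
        return 0.0  # the context only repeats (part of) the question
    return None  # no edge case applies; proceed to the dual-judge evaluation
```

A context that merely echoes the question carries no new information, which is presumably why these cases score 0.0 without calling the LLM.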
Usage
Use this metric to evaluate the quality of a retrieval system in a RAG pipeline. A high score indicates that the retrieved documents are relevant to the user's question, while a low score suggests the retrieval is returning irrelevant content.
This is the V2 collections version, which uses modern instructor-based LLMs with structured output and a dual-judge system for more robust evaluation than the V1 single-judge approach.
Code Reference
Source Location
- Repository: Vibrantlabsai_Ragas
- File: src/ragas/metrics/collections/context_relevance/metric.py
Signature
class ContextRelevance(BaseMetric):
    def __init__(
        self,
        llm: "InstructorBaseRagasLLM",
        name: str = "context_relevance",
        max_retries: int = 5,
        **kwargs,
    ): ...

    async def ascore(
        self, user_input: str, retrieved_contexts: List[str]
    ) -> MetricResult: ...
Import
from ragas.metrics.collections import ContextRelevance
I/O Contract
Constructor Parameters
| Name | Type | Required | Description |
|---|---|---|---|
| llm | InstructorBaseRagasLLM | Yes | Modern instructor-based LLM used for dual-judge evaluation |
| name | str | No | Metric name (default: "context_relevance") |
| max_retries | int | No | Maximum retry attempts when the LLM returns an invalid rating (default: 5) |
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| user_input | str | Yes | The original question posed by the user. Must be non-empty |
| retrieved_contexts | List[str] | Yes | List of retrieved context strings to evaluate for relevance. Must be non-empty |
Outputs
| Name | Type | Description |
|---|---|---|
| score | MetricResult (float value) | Context relevance score between 0.0 and 1.0. Higher is better. May be NaN if both judges fail to produce valid ratings |
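Because the score may be NaN, aggregating results over a dataset needs some care: a single NaN would otherwise propagate into the mean. One possible approach, shown here as a sketch, is to drop NaN values first. The helper mean_ignoring_nan is hypothetical and not part of Ragas.

```python
import math
from typing import Iterable


def mean_ignoring_nan(scores: Iterable[float]) -> float:
    """Average the valid scores, skipping NaN entries (hypothetical helper)."""
    valid = [s for s in scores if not math.isnan(s)]
    # If every evaluation failed, report NaN rather than a misleading 0.0.
    return sum(valid) / len(valid) if valid else math.nan
```

Whether to skip failed evaluations or count them as 0.0 is a design choice. Skipping them keeps judge failures from being mistaken for irrelevant retrievals.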
Usage Examples
Basic Usage
import asyncio

from openai import AsyncOpenAI
from ragas.llms.base import llm_factory
from ragas.metrics.collections import ContextRelevance

# Setup dependencies
client = AsyncOpenAI()
llm = llm_factory("openai", client=client, model="gpt-4o")

# Create metric instance
metric = ContextRelevance(llm=llm)


async def main():
    # Single evaluation; ascore is a coroutine, so it must be awaited
    result = await metric.ascore(
        user_input="When was Einstein born?",
        retrieved_contexts=["Albert Einstein was born on March 14, 1879 in Ulm, Germany."],
    )
    print(f"Context Relevance: {result.value}")


asyncio.run(main())
Multiple Contexts
import asyncio

from ragas.metrics.collections import ContextRelevance

# Reuses the `llm` created in Basic Usage
metric = ContextRelevance(llm=llm)


async def main():
    # All contexts are scored together as one joined string
    result = await metric.ascore(
        user_input="What are the health benefits of green tea?",
        retrieved_contexts=[
            "Green tea contains antioxidants called catechins that may reduce inflammation.",
            "The history of tea drinking dates back to ancient China.",
            "Studies suggest green tea may lower the risk of heart disease.",
        ],
    )
    print(f"Context Relevance: {result.value}")


asyncio.run(main())