Implementation:Evidentlyai Evidently Context Relevance
| Knowledge Sources | |
|---|---|
| Domains | Descriptors, LLM Evaluation, RAG, NLP |
| Last Updated | 2026-02-14 12:00 GMT |
Overview
Descriptor that evaluates the relevance of retrieved context to input questions using either semantic similarity (sentence transformers) or LLM-based scoring, with configurable aggregation methods.
Description
This module implements the ContextRelevance descriptor and its supporting scoring/aggregation infrastructure for evaluating RAG (Retrieval-Augmented Generation) pipelines.
Scoring Methods:
- semantic_similarity_scoring -- Computes cosine similarity between question and context embeddings using the all-MiniLM-L6-v2 SentenceTransformer model. Supports multiple contexts per row by exploding each context list before scoring, and returns the per-context similarity scores as a list column.
- llm_scoring -- Uses an LLM (default: gpt-4o-mini via OpenAI) to classify each context as "RELEVANT" or "IRRELEVANT" relative to the question. Uses a BinaryClassificationPromptTemplate with reasoning and scoring enabled. Returns per-context scores as a list column.
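The semantic-similarity path reduces to a cosine similarity between one question embedding and each context embedding. The following is a minimal sketch of that computation only, assuming toy 3-dimensional vectors in place of real all-MiniLM-L6-v2 embeddings; `cosine_similarity` and `score_contexts` are hypothetical helper names used for illustration, not Evidently functions.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption made for this sketch: zero vectors score 0.0.
        return 0.0
    return dot / (norm_a * norm_b)


def score_contexts(
    question_vec: Sequence[float], context_vecs: Sequence[Sequence[float]]
) -> List[float]:
    """One similarity score per context, mirroring the list-column output."""
    return [cosine_similarity(question_vec, c) for c in context_vecs]


# Toy embeddings (hypothetical values, not real model output).
question = [1.0, 0.0, 0.0]
contexts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
scores = score_contexts(question, contexts)
```

Each row of the dataset would yield one such list of scores, one entry per retrieved context.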
Aggregation Methods:
- MeanAggregation -- Computes the average of all context scores (returns Numerical column).
- HitAggregation -- Returns 1 if any context score meets or exceeds a threshold (default: 0.8), else 0 (returns Categorical column).
- HitShareAggregation -- Returns the fraction of context scores meeting or exceeding a threshold (returns Numerical column, consistent with its float return type).
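The three aggregation rules described above can be sketched as plain functions. This is an illustration of the arithmetic only, assuming a non-empty score list; the function names are hypothetical and the real classes carry additional configuration such as `column_type`.

```python
from typing import List


def mean_aggregation(scores: List[float]) -> float:
    """Average of all per-context scores."""
    return sum(scores) / len(scores)


def hit_aggregation(scores: List[float], threshold: float = 0.8) -> int:
    """1 if at least one score meets or exceeds the threshold, else 0."""
    return int(any(s >= threshold for s in scores))


def hit_share_aggregation(scores: List[float], threshold: float = 0.8) -> float:
    """Fraction of scores that meet or exceed the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


# Hypothetical per-context scores for a single row.
scores = [0.9, 0.5, 0.85, 0.2]
mean_score = mean_aggregation(scores)
hit = hit_aggregation(scores)
hit_share = hit_share_aggregation(scores)
```

With these scores, two of the four contexts clear the default 0.8 threshold, so the hit is 1 and the hit share is 0.5.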
ContextRelevance is a Descriptor subclass that orchestrates the full pipeline: it selects a scoring method and aggregation method, computes per-context scores, aggregates them, and optionally outputs raw scores alongside the aggregated result.
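The orchestration described above (pick a scorer by name, pick an aggregator by name, optionally return the raw scores too) can be sketched as follows. This is not Evidently's code: the registries, the `context_relevance` function, and the token-overlap scorer are all hypothetical stand-ins chosen so the example runs without a model or an API key.

```python
from statistics import mean
from typing import Callable, Dict, List, Union

Scorer = Callable[[str, List[str]], List[float]]
Aggregator = Callable[[List[float]], float]


def token_overlap_scorer(question: str, contexts: List[str]) -> List[float]:
    """Stand-in scorer (NOT Evidently's): Jaccard overlap of lowercase tokens."""
    q_tokens = set(question.lower().split())
    scores = []
    for ctx in contexts:
        c_tokens = set(ctx.lower().split())
        union = q_tokens | c_tokens
        scores.append(len(q_tokens & c_tokens) / len(union) if union else 0.0)
    return scores


# Hypothetical name-to-callable registries.
SCORERS: Dict[str, Scorer] = {"token_overlap": token_overlap_scorer}
AGGREGATORS: Dict[str, Aggregator] = {
    "mean": lambda s: float(mean(s)),
    "hit": lambda s: float(any(x >= 0.8 for x in s)),
}


def context_relevance(
    question: str,
    contexts: List[str],
    method: str = "token_overlap",
    aggregation_method: str = "mean",
    output_scores: bool = False,
) -> Union[float, Dict[str, object]]:
    """Score each context, aggregate, and optionally expose the raw scores."""
    scores = SCORERS[method](question, contexts)
    aggregated = AGGREGATORS[aggregation_method](scores)
    if output_scores:
        return {"aggregated": aggregated, "scores": scores}
    return aggregated


result = context_relevance(
    "what is evidently",
    ["evidently is what", "unrelated text"],
    output_scores=True,
)
```

The return shape loosely mirrors the descriptor's contract: a single aggregated value by default, or a mapping that also carries the per-context scores when `output_scores` is enabled.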
Usage
Use ContextRelevance to evaluate how well retrieved documents/contexts match user queries in RAG systems. Add it as a descriptor to a Dataset to compute relevance scores per row, then use Evidently metrics to monitor these scores over time.
Code Reference
Source Location
- Repository: Evidentlyai_Evidently
- File: src/evidently/descriptors/_context_relevance.py
Signature
def semantic_similarity_scoring(
    question: DatasetColumn, context: DatasetColumn, options: Options
) -> DatasetColumn: ...

def llm_scoring(
    question: DatasetColumn, context: DatasetColumn, options: Options,
    model: str = "gpt-4o-mini", provider: str = "openai",
) -> DatasetColumn: ...

class AggregationMethod(Generic[T]):
    column_type: ColumnType

    @abc.abstractmethod
    def do(self, scores: List[float]) -> T: ...

class MeanAggregation(AggregationMethod[float]): ...
class HitAggregation(AggregationMethod[int]): ...
class HitShareAggregation(AggregationMethod[float]): ...

class ContextRelevance(Descriptor):
    input: str
    contexts: str
    method: str = "semantic_similarity"
    method_params: Optional[Dict[str, object]] = None
    aggregation_method: Optional[str] = None
    aggregation_method_params: Optional[Dict[str, object]] = None
    output_scores: bool = False

    def __init__(self, input: str, contexts: str, method: str = "semantic_similarity", ...):
    def generate_data(self, dataset: Dataset, options: Options) -> Union[DatasetColumn, Dict[DisplayName, DatasetColumn]]:
    def list_input_columns(self) -> Optional[List[str]]:
Import
from evidently.descriptors._context_relevance import ContextRelevance
from evidently.descriptors._context_relevance import MeanAggregation, HitAggregation, HitShareAggregation
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input | str | Yes | Column name containing the input/question text |
| contexts | str | Yes | Column name containing context text (list of strings per row) |
| method | str | No (default: "semantic_similarity") | Scoring method: "semantic_similarity" or "llm" |
| method_params | Optional[Dict[str, object]] | No | Additional parameters passed to the scoring method (e.g., model, provider for LLM) |
| aggregation_method | Optional[str] | No | Aggregation strategy: "mean", "hit", or "hit_share"; when omitted, MeanAggregation is used |
| aggregation_method_params | Optional[Dict[str, object]] | No | Parameters for aggregation (e.g., threshold for hit/hit_share) |
| output_scores | bool | No (default: False) | Whether to also output raw per-context scores alongside aggregated score |
| alias | Optional[str] | No | Custom display name; defaults to "Ranking for {input} with {contexts}" |
| tests | Optional[List[AnyDescriptorTest]] | No | Tests to apply to the computed descriptor values |
Outputs
| Name | Type | Description |
|---|---|---|
| return | Union[DatasetColumn, Dict[DisplayName, DatasetColumn]] | Single aggregated score column, or dict with aggregated scores and raw scores if output_scores=True |
Usage Examples
from evidently.descriptors._context_relevance import ContextRelevance

# Semantic similarity scoring with mean aggregation (defaults)
semantic_descriptor = ContextRelevance(
    input="question",
    contexts="retrieved_docs",
)

# LLM-based scoring with hit aggregation; also emits the raw per-context scores
llm_descriptor = ContextRelevance(
    input="question",
    contexts="retrieved_docs",
    method="llm",
    method_params={"model": "gpt-4o-mini", "provider": "openai"},
    aggregation_method="hit",
    aggregation_method_params={"threshold": 0.7},
    output_scores=True,
)

# Add to a dataset (`dataset` is assumed to be an existing Evidently Dataset
# with "question" and "retrieved_docs" columns)
dataset.add_descriptors([llm_descriptor])
Related Pages
- Environment:Evidentlyai_Evidently_Python_Core_Environment
- Evidentlyai_Evidently_Custom_Descriptors - Custom descriptor base classes
- Evidentlyai_Evidently_Text_Length_Descriptor - Another descriptor implementation