Implementation:Evidentlyai Evidently Context Relevance

Knowledge Sources	Evidentlyai_Evidently
Domains	Descriptors, LLM Evaluation, RAG, NLP
Last Updated	2026-02-14 12:00 GMT

Overview

A descriptor that evaluates how relevant retrieved contexts are to input questions, using either semantic similarity (sentence transformers) or LLM-based scoring, with a configurable aggregation method.

Description

This module implements the ContextRelevance descriptor and its supporting scoring/aggregation infrastructure for evaluating RAG (Retrieval-Augmented Generation) pipelines.

Scoring Methods:

semantic_similarity_scoring -- Computes cosine similarity between question and context embeddings using the all-MiniLM-L6-v2 SentenceTransformer model. Each row may hold several contexts: the context list is exploded so every context is scored individually, and the per-context similarity scores are returned as a list column.
llm_scoring -- Uses an LLM (default: gpt-4o-mini via OpenAI) to classify each context as "RELEVANT" or "IRRELEVANT" relative to the question. Uses a BinaryClassificationPromptTemplate with reasoning and scoring enabled. Returns per-context scores as a list column.
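To make the semantic-similarity path concrete, the following is a minimal sketch of per-context cosine scoring. It is not the module's code: the three-dimensional vectors are made-up stand-ins for embeddings that the real implementation obtains from all-MiniLM-L6-v2, and the names cosine_similarity and score_contexts are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch only: a zero vector scores 0.
        return 0.0
    return dot / (norm_a * norm_b)


def score_contexts(
    question_vec: Sequence[float], context_vecs: Sequence[Sequence[float]]
) -> List[float]:
    """Return one similarity score per context for a single question."""
    return [cosine_similarity(question_vec, c) for c in context_vecs]


# Toy embeddings (hypothetical values, not model output).
question = [1.0, 0.0, 0.0]
contexts = [
    [1.0, 0.0, 0.0],  # same direction as the question
    [0.0, 1.0, 0.0],  # orthogonal to the question
    [1.0, 1.0, 0.0],  # 45 degrees away from the question
]
scores = score_contexts(question, contexts)
print(scores)
```

The result is one list of scores per row, which matches the "list column" shape described above; an aggregation method then reduces each list to a single value.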

Aggregation Methods:

MeanAggregation -- Computes the average of all context scores (returns Numerical column).
HitAggregation -- Returns 1 if any context score meets or exceeds a threshold (default: 0.8), else 0 (returns Categorical column).
HitShareAggregation -- Returns the fraction of context scores meeting or exceeding a threshold (returns Categorical column).
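The three aggregation strategies can be sketched as plain functions. This is an illustration under stated assumptions rather than the library's classes: the function names are hypothetical, and the `>=` comparison follows the "meets or exceeds" wording above, which may differ from the comparison the source actually uses.

```python
from typing import List


def mean_aggregation(scores: List[float]) -> float:
    """Average of all context scores (raises ZeroDivisionError on an empty list)."""
    return sum(scores) / len(scores)


def hit_aggregation(scores: List[float], threshold: float = 0.8) -> int:
    """1 if at least one score reaches the threshold, else 0."""
    return 1 if any(s >= threshold for s in scores) else 0


def hit_share_aggregation(scores: List[float], threshold: float = 0.8) -> float:
    """Fraction of scores that reach the threshold."""
    return sum(1 for s in scores if s >= threshold) / len(scores)


# Per-context scores for one row (made-up values).
scores = [0.9, 0.5, 0.1]
mean_value = mean_aggregation(scores)
hit_value = hit_aggregation(scores)
hit_share_value = hit_share_aggregation(scores)
print(mean_value, hit_value, hit_share_value)
```

Mean aggregation rewards rows where most contexts are relevant, while hit aggregation only asks whether at least one useful context was retrieved; hit share sits between the two.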

ContextRelevance is a Descriptor subclass that orchestrates the full pipeline: it selects a scoring method and aggregation method, computes per-context scores, aggregates them, and optionally outputs raw scores alongside the aggregated result.
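The orchestration described above (score each context, aggregate per row, optionally return the raw scores too) can be sketched without any Evidently dependency. Everything here is hypothetical: context_relevance, word_overlap, and the dictionary keys are illustrative names, and the word-overlap scorer is a deliberately simple stand-in for the semantic-similarity and LLM scorers.

```python
from typing import Callable, Dict, List, Union

Scorer = Callable[[str, str], float]
Aggregator = Callable[[List[float]], float]


def word_overlap(question: str, context: str) -> float:
    """Stand-in scorer: share of question words that also appear in the context."""
    q_words = set(question.lower().split())
    c_words = set(context.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


def mean(scores: List[float]) -> float:
    return sum(scores) / len(scores)


def context_relevance(
    questions: List[str],
    contexts: List[List[str]],
    scorer: Scorer,
    aggregator: Aggregator,
    output_scores: bool = False,
) -> Union[List[float], Dict[str, list]]:
    """Score every context per row, aggregate, and optionally keep raw scores."""
    per_row_scores = [
        [scorer(q, c) for c in row_contexts]
        for q, row_contexts in zip(questions, contexts)
    ]
    aggregated = [aggregator(row) for row in per_row_scores]
    if output_scores:
        # Mirrors the "dict of columns" output shape, with hypothetical keys.
        return {"aggregated": aggregated, "scores": per_row_scores}
    return aggregated


questions = ["what is evidently"]
contexts = [["evidently is a monitoring library", "bananas are yellow"]]
result = context_relevance(questions, contexts, word_overlap, mean, output_scores=True)
print(result)
```

Passing the scorer and aggregator as callables keeps the two choices independent, which is the same separation the descriptor exposes through its method and aggregation_method parameters.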

Usage

Use ContextRelevance to evaluate how well retrieved documents/contexts match user queries in RAG systems. Add it as a descriptor to a Dataset to compute relevance scores per row, then use Evidently metrics to monitor these scores over time.

Code Reference

Source Location

Repository: Evidentlyai_Evidently
File: src/evidently/descriptors/_context_relevance.py

Signature

def semantic_similarity_scoring(
    question: DatasetColumn, context: DatasetColumn, options: Options
) -> DatasetColumn:

def llm_scoring(
    question: DatasetColumn, context: DatasetColumn, options: Options,
    model: str = "gpt-4o-mini", provider: str = "openai",
) -> DatasetColumn:

class AggregationMethod(Generic[T]):
    column_type: ColumnType
    @abc.abstractmethod
    def do(self, scores: List[float]) -> T:

class MeanAggregation(AggregationMethod[float]):
class HitAggregation(AggregationMethod[int]):
class HitShareAggregation(AggregationMethod[float]):

class ContextRelevance(Descriptor):
    input: str
    contexts: str
    method: str = "semantic_similarity"
    method_params: Optional[Dict[str, object]] = None
    aggregation_method: Optional[str] = None
    aggregation_method_params: Optional[Dict[str, object]] = None
    output_scores: bool = False

    def __init__(self, input: str, contexts: str, method: str = "semantic_similarity", ...):
    def generate_data(self, dataset: Dataset, options: Options) -> Union[DatasetColumn, Dict[DisplayName, DatasetColumn]]:
    def list_input_columns(self) -> Optional[List[str]]:

Import

from evidently.descriptors._context_relevance import ContextRelevance
from evidently.descriptors._context_relevance import MeanAggregation, HitAggregation, HitShareAggregation

I/O Contract

Inputs

Name	Type	Required	Description
input	str	Yes	Column name containing the input/question text
contexts	str	Yes	Column name containing context text (list of strings per row)
method	str	No (default: "semantic_similarity")	Scoring method: "semantic_similarity" or "llm"
method_params	Optional[Dict[str, object]]	No	Additional parameters passed to the scoring method (e.g., model, provider for LLM)
aggregation_method	Optional[str]	No	Aggregation strategy: "mean", "hit", or "hit_share"; defaults to MeanAggregation
aggregation_method_params	Optional[Dict[str, object]]	No	Parameters for aggregation (e.g., threshold for hit/hit_share)
output_scores	bool	No (default: False)	Whether to also output raw per-context scores alongside aggregated score
alias	Optional[str]	No	Custom display name; defaults to "Ranking for {input} with {contexts}"
tests	Optional[List[AnyDescriptorTest]]	No	Tests to apply to the computed descriptor values

Outputs

Name	Type	Description
return	Union[DatasetColumn, Dict[DisplayName, DatasetColumn]]	Single aggregated score column, or dict with aggregated scores and raw scores if output_scores=True

Usage Examples

from evidently.descriptors._context_relevance import ContextRelevance

# Semantic similarity scoring with mean aggregation (defaults)
descriptor = ContextRelevance(
    input="question",
    contexts="retrieved_docs",
)

# LLM-based scoring with hit aggregation
descriptor = ContextRelevance(
    input="question",
    contexts="retrieved_docs",
    method="llm",
    method_params={"model": "gpt-4o-mini", "provider": "openai"},
    aggregation_method="hit",
    aggregation_method_params={"threshold": 0.7},
    output_scores=True,
)

# Add to an existing evidently Dataset (`dataset` is assumed to be created elsewhere)
dataset.add_descriptors([descriptor])

Related Pages

Environment:Evidentlyai_Evidently_Python_Core_Environment
Evidentlyai_Evidently_Custom_Descriptors - Custom descriptor base classes
Evidentlyai_Evidently_Text_Length_Descriptor - Another descriptor implementation
