Implementation:Evidentlyai Evidently Context Relevance

From Leeroopedia
Revision as of 12:27, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Evidentlyai_Evidently_Context_Relevance.md)
Knowledge Sources
Domains: Descriptors, LLM Evaluation, RAG, NLP
Last Updated: 2026-02-14 12:00 GMT

Overview

Descriptor that evaluates the relevance of retrieved context to input questions using either semantic similarity (sentence transformers) or LLM-based scoring, with configurable aggregation methods.

Description

This module implements the ContextRelevance descriptor and its supporting scoring/aggregation infrastructure for evaluating RAG (Retrieval-Augmented Generation) pipelines.

Scoring Methods:

  • semantic_similarity_scoring -- Computes cosine similarity between question and context embeddings using the all-MiniLM-L6-v2 SentenceTransformer model. Handles exploded context lists (multiple contexts per row) and returns per-context similarity scores as a list column.
  • llm_scoring -- Uses an LLM (default: gpt-4o-mini via OpenAI) to classify each context as "RELEVANT" or "IRRELEVANT" relative to the question. Uses a BinaryClassificationPromptTemplate with reasoning and scoring enabled. Returns per-context scores as a list column.
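
The per-context similarity step can be illustrated with a minimal sketch. It assumes the question and each context have already been turned into embedding vectors; the toy 3-dimensional vectors and the helper names `cosine_similarity` and `score_contexts` are hypothetical and are not part of Evidently, which obtains its embeddings from the all-MiniLM-L6-v2 model.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_contexts(question_vec: List[float], context_vecs: List[List[float]]) -> List[float]:
    """Return one similarity score per context, mirroring a list-valued column cell."""
    return [cosine_similarity(question_vec, c) for c in context_vecs]


# Toy "embeddings" (hypothetical values, not real model output).
question_vec = [1.0, 0.0, 0.0]
context_vecs = [
    [1.0, 0.0, 0.0],  # same direction as the question
    [0.0, 1.0, 0.0],  # orthogonal to the question
    [1.0, 1.0, 0.0],  # 45 degrees away from the question
]

scores = score_contexts(question_vec, context_vecs)
print(scores)
```

Each row ends up with a list of scores, one per context, which is the shape the aggregation step consumes.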

Aggregation Methods:

  • MeanAggregation -- Computes the average of all context scores (returns Numerical column).
  • HitAggregation -- Returns 1 if any context score meets or exceeds a threshold (default: 0.8), else 0 (returns Categorical column).
  • HitShareAggregation -- Returns the fraction of context scores meeting or exceeding a threshold, as a float between 0 and 1 (returns Numerical column).
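
The three aggregation rules can be sketched as plain functions. This is an illustration of the arithmetic described above, not the library's classes: the function names are hypothetical, and the sketch assumes a non-empty score list.

```python
from typing import List


def mean_aggregation(scores: List[float]) -> float:
    """Average of all context scores."""
    return sum(scores) / len(scores)


def hit_aggregation(scores: List[float], threshold: float = 0.8) -> int:
    """1 if at least one score meets or exceeds the threshold, else 0."""
    return 1 if any(s >= threshold for s in scores) else 0


def hit_share_aggregation(scores: List[float], threshold: float = 0.8) -> float:
    """Fraction of scores that meet or exceed the threshold."""
    return sum(1 for s in scores if s >= threshold) / len(scores)


# Hypothetical per-context scores for one row.
scores = [1.0, 0.5, 0.75, 0.25]

print(mean_aggregation(scores))       # 0.625
print(hit_aggregation(scores))        # 1 (the score 1.0 clears the 0.8 threshold)
print(hit_share_aggregation(scores))  # 0.25 (one of four scores clears 0.8)
```

Lowering the threshold to 0.75 leaves the hit result at 1 but raises the hit share to 0.5, which shows how the threshold controls the stricter two aggregations without affecting the mean.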

ContextRelevance is a Descriptor subclass that orchestrates the full pipeline: it selects a scoring method and aggregation method, computes per-context scores, aggregates them, and optionally outputs raw scores alongside the aggregated result.
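
The orchestration described above (select a scorer by name, score every context, aggregate, optionally return the raw scores too) might look roughly like the following sketch. Everything here is hypothetical scaffolding: `token_overlap` is a deliberately simple stand-in for the real scoring methods, and the registry names, the `context_relevance` function and the dict keys are assumptions made for illustration, not Evidently's internals.

```python
from typing import Callable, Dict, List, Union

Scorer = Callable[[str, str], float]
Aggregator = Callable[[List[float]], float]


def token_overlap(question: str, context: str) -> float:
    """Stand-in scorer: fraction of question tokens that also appear in the context."""
    q = set(question.lower().split())
    c = set(context.lower().split())
    return len(q & c) / len(q) if q else 0.0


# Hypothetical registries mapping string names to implementations.
SCORERS: Dict[str, Scorer] = {"token_overlap": token_overlap}
AGGREGATORS: Dict[str, Aggregator] = {
    "mean": lambda s: sum(s) / len(s),
    "hit": lambda s: 1.0 if any(x >= 0.8 for x in s) else 0.0,
}


def context_relevance(
    rows: List[dict],
    method: str = "token_overlap",
    aggregation_method: str = "mean",
    output_scores: bool = False,
) -> Union[List[float], Dict[str, list]]:
    score = SCORERS[method]
    aggregate = AGGREGATORS[aggregation_method]
    # One list of per-context scores for every row.
    per_row = [[score(r["question"], c) for c in r["contexts"]] for r in rows]
    aggregated = [aggregate(s) for s in per_row]
    if output_scores:
        # Return both the aggregated column and the raw score lists.
        return {"aggregated": aggregated, "scores": per_row}
    return aggregated


rows = [
    {
        "question": "capital of france",
        "contexts": ["paris is the capital of france", "berlin is in germany"],
    }
]

print(context_relevance(rows))                      # [0.5]
print(context_relevance(rows, output_scores=True))  # aggregated plus raw scores
```

Keeping scoring and aggregation behind separate lookups is what lets the two choices vary independently, which matches the separate `method` and `aggregation_method` parameters on the descriptor.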

Usage

Use ContextRelevance to evaluate how well retrieved documents/contexts match user queries in RAG systems. Add it as a descriptor to a Dataset to compute relevance scores per row, then use Evidently metrics to monitor these scores over time.

Code Reference

Source Location

Signature

def semantic_similarity_scoring(
    question: DatasetColumn, context: DatasetColumn, options: Options
) -> DatasetColumn:

def llm_scoring(
    question: DatasetColumn, context: DatasetColumn, options: Options,
    model: str = "gpt-4o-mini", provider: str = "openai",
) -> DatasetColumn:

class AggregationMethod(Generic[T]):
    column_type: ColumnType
    @abc.abstractmethod
    def do(self, scores: List[float]) -> T:

class MeanAggregation(AggregationMethod[float]):
class HitAggregation(AggregationMethod[int]):
class HitShareAggregation(AggregationMethod[float]):

class ContextRelevance(Descriptor):
    input: str
    contexts: str
    method: str = "semantic_similarity"
    method_params: Optional[Dict[str, object]] = None
    aggregation_method: Optional[str] = None
    aggregation_method_params: Optional[Dict[str, object]] = None
    output_scores: bool = False

    def __init__(self, input: str, contexts: str, method: str = "semantic_similarity", ...):
    def generate_data(self, dataset: Dataset, options: Options) -> Union[DatasetColumn, Dict[DisplayName, DatasetColumn]]:
    def list_input_columns(self) -> Optional[List[str]]:

Import

from evidently.descriptors._context_relevance import ContextRelevance
from evidently.descriptors._context_relevance import MeanAggregation, HitAggregation, HitShareAggregation

I/O Contract

Inputs

Name | Type | Required | Description
input | str | Yes | Column name containing the input/question text
contexts | str | Yes | Column name containing context text (list of strings per row)
method | str | No (default: "semantic_similarity") | Scoring method: "semantic_similarity" or "llm"
method_params | Optional[Dict[str, object]] | No | Additional parameters passed to the scoring method (e.g., model, provider for LLM)
aggregation_method | Optional[str] | No | Aggregation strategy: "mean", "hit", or "hit_share"; defaults to MeanAggregation
aggregation_method_params | Optional[Dict[str, object]] | No | Parameters for aggregation (e.g., threshold for hit/hit_share)
output_scores | bool | No (default: False) | Whether to also output raw per-context scores alongside the aggregated score
alias | Optional[str] | No | Custom display name; defaults to "Ranking for {input} with {contexts}"
tests | Optional[List[AnyDescriptorTest]] | No | Tests to apply to the computed descriptor values

Outputs

Name | Type | Description
return | Union[DatasetColumn, Dict[DisplayName, DatasetColumn]] | Single aggregated score column, or a dict with aggregated scores and raw scores if output_scores=True

Usage Examples

from evidently.descriptors._context_relevance import ContextRelevance

# Semantic similarity scoring with mean aggregation (defaults)
descriptor = ContextRelevance(
    input="question",
    contexts="retrieved_docs",
)

# LLM-based scoring with hit aggregation
descriptor = ContextRelevance(
    input="question",
    contexts="retrieved_docs",
    method="llm",
    method_params={"model": "gpt-4o-mini", "provider": "openai"},
    aggregation_method="hit",
    aggregation_method_params={"threshold": 0.7},
    output_scores=True,
)

# Add to a dataset (`dataset` is assumed to be an existing Evidently Dataset created earlier)
dataset.add_descriptors([descriptor])

Related Pages
