Implementation:Run llama Llama index ContextRelevancyEvaluator

From Leeroopedia
Knowledge Sources
Domains: Evaluation, Relevancy
Last Updated: 2026-02-11 19:00 GMT

Overview

Evaluates whether retrieved contexts are relevant to a user query by constructing a SummaryIndex over the contexts and using an LLM to score their relevance.

Description

The ContextRelevancyEvaluator is a concrete implementation of BaseEvaluator that assesses how well the retrieved context documents match a given query. Unlike the AnswerRelevancyEvaluator, which evaluates a response string, this evaluator judges only the quality of the retrieved contexts.

The evaluation uses a two-phase approach:

  1. It wraps the provided context strings into Document objects and builds a SummaryIndex from them.
  2. It creates a query engine from that index using the evaluation and refinement templates, then queries it with the user's question.
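The two phases above can be sketched roughly as follows. This is a minimal illustration, not the evaluator's actual source: the helper name build_context_query_engine is hypothetical, and the keyword arguments passed to as_query_engine are an assumption about how the templates are wired in.

```python
from typing import Any, Sequence


def build_context_query_engine(
    contexts: Sequence[str],
    llm: Any,
    eval_template: Any,
    refine_template: Any,
) -> Any:
    """Hypothetical helper mirroring the two phases described above."""
    # Imported lazily so this module loads even without llama-index installed.
    from llama_index.core import Document, SummaryIndex

    # Phase 1: wrap each context string in a Document and index them.
    docs = [Document(text=context) for context in contexts]
    index = SummaryIndex.from_documents(docs)

    # Phase 2: build a query engine that answers with the evaluation prompt
    # and falls back to the refine prompt when further chunks must be folded in.
    return index.as_query_engine(
        llm=llm,
        text_qa_template=eval_template,
        refine_template=refine_template,
    )
```

Querying the returned engine with the user's question (for example via its async aquery method) would then yield the raw LLM verdict that is parsed into a score.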

The default evaluation template scores on two criteria, each worth 2 points (with partial marks allowed):

  1. Does the retrieved context match the subject matter of the query?
  2. Can the retrieved context be used exclusively to provide a full answer to the query?

The default score_threshold is 4.0, and the final score is normalized by dividing the raw score by this threshold. The DEFAULT_REFINE_TEMPLATE is used when multiple context chunks require synthesized evaluation through the refine step.
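As a worked example of this normalization, the sketch below divides a raw rubric score by the threshold. The function name normalize_score is hypothetical and is not part of the library; it only illustrates the arithmetic described above.

```python
from typing import Optional


def normalize_score(
    raw_score: Optional[float], score_threshold: float = 4.0
) -> Optional[float]:
    """Scale a raw rubric score into the 0.0-1.0 range."""
    if raw_score is None:
        # Nothing could be parsed from the LLM output, so there is no score.
        return None
    return raw_score / score_threshold


# Full marks on the first criterion (2.0), partial marks on the second (1.5).
raw = 2.0 + 1.5
print(normalize_score(raw))  # 0.875
```

If a custom eval_template uses a different maximum, score_threshold should be set to that maximum so the normalized score still tops out at 1.0.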

The LLM output is parsed using a configurable parser_function that extracts a [RESULT] tag followed by a float value, along with the preceding text as feedback. The response parameter to aevaluate is intentionally ignored since this evaluator only assesses context relevance, not response quality.
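A parser that satisfies this contract might look like the sketch below. It is an illustration written from the description above, not a copy of the library's default parser; the name parse_feedback_and_score and the exact regular expression are assumptions.

```python
import re
from typing import Optional, Tuple

# Feedback text, then the literal "[RESULT]" tag, then a number.
_RESULT_PATTERN = re.compile(r"([\s\S]+)\[RESULT\]\s*([\d.]+)")


def parse_feedback_and_score(output: str) -> Tuple[Optional[float], Optional[str]]:
    """Return (score, feedback), or (None, None) if the output cannot be parsed."""
    match = _RESULT_PATTERN.search(output)
    if match is None:
        return None, None
    feedback, score_text = match.groups()
    try:
        score = float(score_text)
    except ValueError:
        # Digits and dots that do not form a valid float, e.g. "1.2.3".
        return None, None
    return score, feedback.strip()


score, feedback = parse_feedback_and_score(
    "The context names the capital directly.\n[RESULT] 4.0"
)
print(score)     # 4.0
print(feedback)  # The context names the capital directly.
```

Any callable with the same signature could be supplied as parser_function, for instance to support a template that emits its score in a different format.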

Usage

Use this evaluator when you need to assess the quality of your retrieval pipeline independently of the response generation step. It is particularly useful for diagnosing whether poor RAG performance is caused by the retriever or the generator. It accepts any LLM that implements the LlamaIndex LLM interface and is commonly paired with AnswerRelevancyEvaluator for comprehensive RAG evaluation.

Code Reference

Source Location

  • Repository: Run_llama_Llama_index
  • File: llama-index-core/llama_index/core/evaluation/context_relevancy.py

Signature

class ContextRelevancyEvaluator(BaseEvaluator):
    def __init__(
        self,
        llm: Optional[LLM] = None,
        raise_error: bool = False,
        eval_template: str | BasePromptTemplate | None = None,
        refine_template: str | BasePromptTemplate | None = None,
        score_threshold: float = 4.0,
        parser_function: Callable[
            [str], Tuple[Optional[float], Optional[str]]
        ] = _default_parser_function,
    ) -> None: ...

    async def aevaluate(
        self,
        query: str | None = None,
        response: str | None = None,
        contexts: Sequence[str] | None = None,
        sleep_time_in_seconds: int = 0,
        **kwargs: Any,
    ) -> EvaluationResult: ...

Import

from llama_index.core.evaluation.context_relevancy import ContextRelevancyEvaluator

I/O Contract

Inputs

Name | Type | Required | Description
llm | Optional[LLM] | No | The LLM to use for evaluation. Defaults to Settings.llm.
raise_error | bool | No | Whether to raise a ValueError on unparseable output. Defaults to False.
eval_template | str or BasePromptTemplate or None | No | Custom evaluation prompt template. Defaults to the built-in template.
refine_template | str or BasePromptTemplate or None | No | Custom refinement prompt template for multi-chunk contexts. Defaults to the built-in refine template.
score_threshold | float | No | The maximum raw score used for normalization. Defaults to 4.0.
parser_function | Callable | No | Function to parse LLM output into (score, feedback). Defaults to regex-based parser.
query | str | Yes (aevaluate) | The user query to evaluate against.
contexts | Sequence[str] | Yes (aevaluate) | The retrieved context strings to evaluate.
sleep_time_in_seconds | int | No (aevaluate) | Delay before evaluation for rate limiting. Defaults to 0.

Outputs

Name | Type | Description
result | EvaluationResult | Contains the query, contexts, normalized score (0.0-1.0), raw LLM feedback, and invalid_result/invalid_reason if parsing failed.
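When aggregating many results, it is usually sensible to skip those flagged as invalid rather than treat a missing score as zero. The sketch below shows one way to do that; ResultStub is a hypothetical stand-in that exposes only the score, invalid_result, and invalid_reason fields mentioned above, and mean_valid_score is likewise an illustrative name.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Iterable, Optional


@dataclass
class ResultStub:
    """Hypothetical stand-in exposing only the fields used below."""

    score: Optional[float] = None
    invalid_result: bool = False
    invalid_reason: Optional[str] = None


def mean_valid_score(results: Iterable[ResultStub]) -> Optional[float]:
    """Average the normalized scores, ignoring results that failed to parse."""
    scores = [
        r.score for r in results if not r.invalid_result and r.score is not None
    ]
    return mean(scores) if scores else None


results = [
    ResultStub(score=1.0),
    ResultStub(score=0.5),
    ResultStub(invalid_result=True, invalid_reason="unparseable output"),
]
print(mean_valid_score(results))  # 0.75
```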

Usage Examples

import asyncio

from llama_index.core.evaluation.context_relevancy import ContextRelevancyEvaluator
from llama_index.llms.openai import OpenAI  # requires the llama-index-llms-openai package


async def main() -> None:
    # Create the evaluator
    evaluator = ContextRelevancyEvaluator(
        llm=OpenAI(model="gpt-4"),
    )

    # Evaluate retrieved contexts; the response argument is not needed
    result = await evaluator.aevaluate(
        query="What is the capital of France?",
        contexts=[
            "France is a country in Western Europe. Its capital is Paris.",
            "Paris has a population of over 2 million people.",
        ],
    )

    print(f"Score: {result.score}")       # e.g., 0.875 (normalized)
    print(f"Feedback: {result.feedback}")  # Raw LLM feedback


# Top-level await only works in notebooks; scripts need an event loop.
asyncio.run(main())
