Implementation:Run llama Llama index PairwiseComparisonEvaluator

Knowledge Sources	Run_llama_Llama_index
Domains	Evaluation, Pairwise
Last Updated	2026-02-11 19:00 GMT

Overview

Compares two responses to a query using an LLM judge to determine which is better, with optional consensus enforcement through answer-order flipping.

Description

The PairwiseComparisonEvaluator is a concrete implementation of BaseEvaluator that evaluates the relative quality of a response against a second response for a given query. An optional reference answer can be supplied as additional context for the judge; it is separate from the second response. The evaluator uses a chat-based prompt with system and user message templates that instruct the LLM to act as an impartial judge.

The evaluation produces one of three outcomes:

A: The first response is better (score=1.0, passing=True)
B: The second response is better (score=0.0, passing=False)
C: A tie (score=0.5, passing=None)
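The mapping above is what a parser_function has to produce. The following is a minimal sketch of a custom parser with the same return shape, not the library's own parser. The function name parse_verdict and the [[A]] / [[B]] / [[C]] tags are assumptions made for illustration; match the tags to whatever your prompt asks the judge to emit.

```python
from typing import Optional, Tuple


def parse_verdict(raw: str) -> Tuple[Optional[bool], Optional[float], Optional[str]]:
    """Map a judge reply to (passing, score, feedback).

    Hypothetical helper: the bracket tags below are an assumed convention.
    """
    if "[[A]]" in raw:
        return True, 1.0, raw   # first response is better
    if "[[B]]" in raw:
        return False, 0.0, raw  # second response is better
    if "[[C]]" in raw:
        return None, 0.5, raw   # tie
    # No recognizable verdict: report nothing rather than guess.
    return None, None, None
```

A function with this shape could be passed as parser_function when constructing the evaluator.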

The evaluator supports consensus enforcement (enabled by default via enforce_consensus=True), which mitigates position bias by running the evaluation twice, the second time with the answer order flipped. The _resolve_results method then reconciles the two evaluations:

If both evaluations agree on a winner, that result is returned.
If both evaluations report a tie (score=0.5), the tie result is returned.
If the evaluations contradict each other, an inconclusive result is returned with score=0.5 and pairwise_source set to EvaluationSource.NEITHER.

The EvaluationSource enum tracks the provenance of the final result: ORIGINAL (from the first evaluation), FLIPPED (from the flipped evaluation), or NEITHER (inconclusive).
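The reconciliation rules above can be sketched as a small vote count. This is an illustrative sketch, not the library's _resolve_results. Verdict and Source are hypothetical stand-ins for EvaluationResult and EvaluationSource. It assumes each round's score is relative to the order shown in that round, so a flipped score of 0.0 is a vote for the original first response. How a tie in one round combined with a winner in the other is handled here (the winner's round is returned) is also an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):  # hypothetical stand-in for EvaluationSource
    ORIGINAL = "original"
    FLIPPED = "flipped"
    NEITHER = "neither"


@dataclass
class Verdict:  # hypothetical stand-in for EvaluationResult
    score: float  # 1.0 = first-shown answer wins, 0.0 = second-shown wins, 0.5 = tie
    source: Source


def resolve(original: Verdict, flipped: Verdict) -> Verdict:
    # Votes for the original first response: a win in the original round
    # plus a loss for the first-shown answer in the flipped round.
    votes_first = original.score + (1.0 - flipped.score)
    votes_second = (1.0 - original.score) + flipped.score

    if votes_first > votes_second:
        # Return whichever round actually voted for the first response.
        return original if original.score == 1.0 else flipped
    if votes_second > votes_first:
        # Return whichever round actually voted for the second response.
        return original if original.score == 0.0 else flipped
    if original.score == 0.5:
        return original  # both rounds reported a tie
    # The two rounds contradict each other: inconclusive.
    return Verdict(score=0.5, source=Source.NEITHER)
```

When the judge simply prefers whichever answer is shown first, both rounds score 1.0, the votes cancel out, and the sketch returns the NEITHER verdict.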

The default prompt template instructs the judge to consider helpfulness, relevance, accuracy, depth, creativity, and level of detail, while explicitly warning against length bias and position bias.

Usage

Use this evaluator to compare the quality of two response generation approaches (e.g., different prompts, models, or RAG configurations) on the same query. It is commonly used in A/B testing scenarios and model comparison benchmarks. Consensus enforcement makes the verdict more robust against ordering effects, at the cost of a second judge call per comparison.
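For an A/B comparison over many queries, the per-query scores can be tallied into win counts. The helper below is a hypothetical sketch and not part of the library. It assumes the scores were collected from result.score across runs and that a score of None (an unparsed verdict) should be left out of the tally.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def tally(scores: Iterable[Optional[float]]) -> Dict[str, int]:
    """Count outcomes from pairwise scores (hypothetical helper)."""
    counts = Counter(s for s in scores if s is not None)
    return {
        "a_wins": counts[1.0],                # first response preferred
        "b_wins": counts[0.0],                # second response preferred
        "ties_or_inconclusive": counts[0.5],  # tie, or contradictory rounds
    }
```

Because ties and inconclusive results share score=0.5, inspect pairwise_source on each result when the distinction matters.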

Code Reference

Source Location

Repository: Run_llama_Llama_index
File: llama-index-core/llama_index/core/evaluation/pairwise.py

Signature

class PairwiseComparisonEvaluator(BaseEvaluator):
    def __init__(
        self,
        llm: Optional[LLM] = None,
        eval_template: Optional[Union[BasePromptTemplate, str]] = None,
        parser_function: Callable[
            [str], Tuple[Optional[bool], Optional[float], Optional[str]]
        ] = _default_parser_function,
        enforce_consensus: bool = True,
    ) -> None: ...

    async def aevaluate(
        self,
        query: Optional[str] = None,
        response: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        second_response: Optional[str] = None,
        reference: Optional[str] = None,
        sleep_time_in_seconds: int = 0,
        **kwargs: Any,
    ) -> EvaluationResult: ...

Import

from llama_index.core.evaluation.pairwise import PairwiseComparisonEvaluator

I/O Contract

Inputs

Name	Type	Required	Description
llm	Optional[LLM]	No	The LLM to use as judge. Defaults to Settings.llm.
eval_template	Optional[Union[BasePromptTemplate, str]]	No	Custom evaluation prompt template. Defaults to a ChatPromptTemplate with system and user messages.
parser_function	Callable	No	Function to parse LLM output into (passing, score, feedback). Defaults to bracket-tag parser.
enforce_consensus	bool	No	Whether to flip answer order and require consistent results. Defaults to True.
query	str	Yes (aevaluate)	The user query for which both responses were generated.
response	str	Yes (aevaluate)	The first response (Assistant A) to compare.
second_response	str	Yes (aevaluate)	The second response (Assistant B) to compare against.
reference	Optional[str]	No (aevaluate)	An optional reference answer to provide context for the judge.
sleep_time_in_seconds	int	No (aevaluate)	Delay before evaluation for rate limiting. Defaults to 0.

Outputs

Name	Type	Description
result	EvaluationResult	Contains the query, score (1.0=A wins, 0.0=B wins, 0.5=tie or inconclusive), passing (True=A, False=B, None=tie or inconclusive), feedback, and pairwise_source indicating which evaluation round produced the result (ORIGINAL, FLIPPED, or NEITHER).

Usage Examples

import asyncio

from llama_index.core.evaluation.pairwise import PairwiseComparisonEvaluator
# The OpenAI LLM ships in the separate llama-index-llms-openai package
from llama_index.llms.openai import OpenAI

# Create the evaluator with consensus enforcement
evaluator = PairwiseComparisonEvaluator(
    llm=OpenAI(model="gpt-4"),
    enforce_consensus=True,
)


async def main() -> None:
    # Compare two responses; aevaluate is a coroutine, so it must be awaited
    result = await evaluator.aevaluate(
        query="Explain quantum computing in simple terms.",
        response="Quantum computing uses qubits that can be 0 and 1 simultaneously.",
        second_response="Quantum computers use quantum mechanics to process information faster.",
        reference="Quantum computing leverages quantum mechanical phenomena like superposition.",
    )

    print(f"Score: {result.score}")        # 1.0, 0.0, or 0.5
    print(f"Passing: {result.passing}")    # True (A wins), False (B wins), or None (tie or inconclusive)
    print(f"Source: {result.pairwise_source}")  # ORIGINAL, FLIPPED, or NEITHER


asyncio.run(main())

Related Pages

Environment:Run_llama_Llama_index_Python_LlamaIndex_Core
