Implementation:Run llama Llama index PairwiseComparisonEvaluator

From Leeroopedia
Revision as of 11:48, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Run_llama_Llama_index_PairwiseComparisonEvaluator.md)
Knowledge Sources
Domains: Evaluation, Pairwise
Last Updated: 2026-02-11 19:00 GMT

Overview

Compares two responses to a query using an LLM judge to determine which is better, with optional consensus enforcement through answer-order flipping.

Description

The PairwiseComparisonEvaluator is a concrete implementation of BaseEvaluator that evaluates the quality of one response relative to a second response for a given query, optionally informed by a reference answer. It uses a chat-based prompt with system and user message templates that instruct the LLM to act as an impartial judge.

The evaluation produces one of three outcomes:

  • A: The first response is better (score=1.0, passing=True)
  • B: The second response is better (score=0.0, passing=False)
  • C: A tie (score=0.5, passing=None)
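
The mapping above can be sketched as a small parser that returns the same (passing, score, feedback) tuple shape that parser_function expects. This is an illustrative sketch, not the library's default parser: the function name parse_verdict and the [[A]], [[B]], [[C]] tag format are assumptions made for the example.

```python
from typing import Optional, Tuple


def parse_verdict(
    eval_response: str,
) -> Tuple[Optional[bool], Optional[float], Optional[str]]:
    """Map a judge's raw text to (passing, score, feedback).

    Hypothetical helper: assumes the judge marks its verdict with a
    bracketed tag such as [[A]], [[B]], or [[C]].
    """
    if "[[A]]" in eval_response:
        return True, 1.0, eval_response   # first response is better
    if "[[B]]" in eval_response:
        return False, 0.0, eval_response  # second response is better
    if "[[C]]" in eval_response:
        return None, 0.5, eval_response   # tie
    # No recognizable verdict: report nothing rather than guess
    return None, None, None
```

A function with this shape could be passed as parser_function when the judge prompt uses a different verdict format.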

The evaluator supports consensus enforcement (enabled by default via enforce_consensus=True), which mitigates position bias by running the evaluation twice, the second time with the answer order flipped. The _resolve_results method then reconciles the two evaluations:

  • If both evaluations agree on a winner, that result is returned.
  • If both evaluations report a tie (score=0.5), the tie result is returned.
  • If the evaluations contradict each other, an inconclusive result is returned with score=0.5 and pairwise_source set to EvaluationSource.NEITHER.

The EvaluationSource enum tracks the provenance of the final result: ORIGINAL (from the first evaluation), FLIPPED (from the flipped evaluation), or NEITHER (inconclusive).
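
The three reconciliation rules can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's _resolve_results: Source and Verdict are hypothetical stand-ins for EvaluationSource and EvaluationResult, and every score is first normalized so that 1.0 always means the first response wins. Combinations the rules above do not mention (for example, one tie and one winner) are treated as inconclusive here, which may differ from the library's behavior.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Source(str, Enum):
    # Hypothetical stand-in for EvaluationSource
    ORIGINAL = "original"
    FLIPPED = "flipped"
    NEITHER = "neither"


@dataclass
class Verdict:
    # Hypothetical stand-in for EvaluationResult, reduced to three fields.
    # score is relative to the first response: 1.0 = first wins,
    # 0.0 = second wins, 0.5 = tie.
    score: float
    passing: Optional[bool]
    source: Source


def from_flipped_round(raw_score: float) -> Verdict:
    """Normalize a flipped-round score back to the original answer order."""
    # In the flipped round the judge sees the second response first,
    # so a raw 1.0 there means the second response won.
    score = 1.0 - raw_score
    passing = None if score == 0.5 else score == 1.0
    return Verdict(score, passing, Source.FLIPPED)


def resolve(original: Verdict, flipped: Verdict) -> Verdict:
    """Reconcile the original and flipped evaluations."""
    if original.score == flipped.score:
        # Same winner in both rounds, or both rounds report a tie
        return original
    # The rounds disagree: report an inconclusive result
    return Verdict(score=0.5, passing=None, source=Source.NEITHER)
```

For example, resolve(Verdict(1.0, True, Source.ORIGINAL), from_flipped_round(1.0)) yields an inconclusive verdict, because the judge picked whichever answer was shown first in both rounds.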

The default prompt template instructs the judge to consider helpfulness, relevance, accuracy, depth, creativity, and level of detail, while explicitly warning against length bias and position bias.

Usage

Use this evaluator when comparing the quality of two different response generation approaches (e.g., different prompts, models, or RAG configurations) on the same query. It is commonly used in A/B testing scenarios and model comparison benchmarks. The consensus enforcement feature makes it more robust against ordering effects.
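
In an A/B test, the per-query scores are typically aggregated into a win rate. The sketch below shows one way to do that; the helper name summarize, the returned keys, and the handling of a missing score (None) as "unparsed" are assumptions for illustration, not part of the library.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def summarize(scores: Iterable[Optional[float]]) -> Dict[str, Optional[float]]:
    """Tally pairwise scores: 1.0 = A wins, 0.0 = B wins, 0.5 = tie or inconclusive."""
    labels = {1.0: "a_wins", 0.0: "b_wins", 0.5: "ties"}
    # Anything outside the three known scores (including None) counts as unparsed
    counts = Counter(labels.get(s, "unparsed") for s in scores)
    decided = counts["a_wins"] + counts["b_wins"]
    return {
        "a_wins": counts["a_wins"],
        "b_wins": counts["b_wins"],
        "ties": counts["ties"],
        "unparsed": counts["unparsed"],
        # Win rate of A over decided comparisons only; None if nothing was decided
        "a_win_rate": counts["a_wins"] / decided if decided else None,
    }
```

Reporting ties and unparsed results separately keeps inconclusive comparisons from silently inflating or deflating either side's win rate.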

Code Reference

Signature

class PairwiseComparisonEvaluator(BaseEvaluator):
    def __init__(
        self,
        llm: Optional[LLM] = None,
        eval_template: Optional[Union[BasePromptTemplate, str]] = None,
        parser_function: Callable[
            [str], Tuple[Optional[bool], Optional[float], Optional[str]]
        ] = _default_parser_function,
        enforce_consensus: bool = True,
    ) -> None: ...

    async def aevaluate(
        self,
        query: Optional[str] = None,
        response: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        second_response: Optional[str] = None,
        reference: Optional[str] = None,
        sleep_time_in_seconds: int = 0,
        **kwargs: Any,
    ) -> EvaluationResult: ...

Import

from llama_index.core.evaluation.pairwise import PairwiseComparisonEvaluator

I/O Contract

Inputs

Name | Type | Required | Description
llm | Optional[LLM] | No | The LLM to use as judge. Defaults to Settings.llm.
eval_template | Optional[Union[BasePromptTemplate, str]] | No | Custom evaluation prompt template. Defaults to a ChatPromptTemplate with system and user messages.
parser_function | Callable | No | Function to parse LLM output into (passing, score, feedback). Defaults to bracket-tag parser.
enforce_consensus | bool | No | Whether to flip answer order and require consistent results. Defaults to True.
query | str | Yes (aevaluate) | The user query for which both responses were generated.
response | str | Yes (aevaluate) | The first response (Assistant A) to compare.
second_response | str | Yes (aevaluate) | The second response (Assistant B) to compare against.
reference | Optional[str] | No (aevaluate) | An optional reference answer to provide context for the judge.
sleep_time_in_seconds | int | No (aevaluate) | Delay before evaluation for rate limiting. Defaults to 0.

Outputs

Name | Type | Description
result | EvaluationResult | Contains the query, score (1.0=A wins, 0.0=B wins, 0.5=tie/inconclusive), passing (True=A, False=B, None=tie), feedback, and pairwise_source indicating which evaluation round produced the result.

Usage Examples

import asyncio

from llama_index.core.evaluation.pairwise import PairwiseComparisonEvaluator
from llama_index.llms.openai import OpenAI  # provided by the llama-index-llms-openai package


async def main() -> None:
    # Create the evaluator with consensus enforcement
    evaluator = PairwiseComparisonEvaluator(
        llm=OpenAI(model="gpt-4"),
        enforce_consensus=True,
    )

    # Compare two responses
    result = await evaluator.aevaluate(
        query="Explain quantum computing in simple terms.",
        response="Quantum computing uses qubits that can be 0 and 1 simultaneously.",
        second_response="Quantum computers use quantum mechanics to process information faster.",
        reference="Quantum computing leverages quantum mechanical phenomena like superposition.",
    )

    print(f"Score: {result.score}")        # 1.0, 0.0, or 0.5
    print(f"Passing: {result.passing}")    # True (A wins), False (B wins), or None (tie)
    print(f"Source: {result.pairwise_source}")  # ORIGINAL, FLIPPED, or NEITHER


# aevaluate is a coroutine, so it must run inside an event loop.
# In a notebook, call `await main()` instead.
asyncio.run(main())
