Implementation:Run llama Llama index PairwiseComparisonEvaluator
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Pairwise |
| Last Updated | 2026-02-11 19:00 GMT |
Overview
Compares two responses to a query using an LLM judge to determine which is better, with optional consensus enforcement through answer-order flipping.
Description
The PairwiseComparisonEvaluator is a concrete implementation of BaseEvaluator that compares a response against a second response (second_response) for a given query, optionally taking a reference answer into account. It uses a chat-based prompt with system and user message templates that instruct the LLM to act as an impartial judge.
The evaluation produces one of three outcomes:
- A: The first response is better (score=1.0, passing=True)
- B: The second response is better (score=0.0, passing=False)
- C: A tie (score=0.5, passing=None)
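The mapping above can be sketched as a small parser with the same shape as the parser_function argument, returning (passing, score, feedback). This is a minimal illustration, not the library's default parser: the function name parse_verdict, the verdict tags [[A]], [[B]], and [[C]], and the choice to return the raw judge reply as feedback are all assumptions made for the example.

```python
from typing import Optional, Tuple


def parse_verdict(
    eval_response: str,
) -> Tuple[Optional[bool], Optional[float], Optional[str]]:
    """Map a judge reply to (passing, score, feedback).

    Assumes the judge ends its reply with one of the tags
    [[A]], [[B]], or [[C]] (an assumption of this sketch).
    """
    if "[[A]]" in eval_response:
        return True, 1.0, eval_response   # first response is better
    if "[[B]]" in eval_response:
        return False, 0.0, eval_response  # second response is better
    if "[[C]]" in eval_response:
        return None, 0.5, eval_response   # tie
    # No recognizable verdict: report nothing rather than guess.
    return None, None, None
```

A function like this could be supplied as parser_function when the judge prompt uses a different verdict format.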
The evaluator supports consensus enforcement (enabled by default via enforce_consensus=True), which mitigates position bias by running the judge twice: once with the answers in their original order and once with the order flipped. The _resolve_results method then reconciles the two evaluations:
- If both evaluations agree on a winner (after accounting for the flipped order), that result is returned.
- If both evaluations report a tie (score=0.5), the tie result is returned.
- If the evaluations contradict each other, an inconclusive result is returned with score=0.5 and pairwise_source set to EvaluationSource.NEITHER.
The EvaluationSource enum tracks the provenance of the final result: ORIGINAL (from the first evaluation), FLIPPED (from the flipped evaluation), or NEITHER (inconclusive).
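The reconciliation rules can be sketched with a simple vote tally. This is an illustrative sketch under stated assumptions, not the body of _resolve_results: Source is a hypothetical stand-in for EvaluationSource, resolve is a hypothetical helper, and the returned score is always expressed relative to the original answer order. The text above does not say what happens when one round reports a tie and the other picks a winner; this sketch awards the result to the side with more votes, which is an assumption.

```python
from enum import Enum
from typing import Optional, Tuple


class Source(Enum):
    """Hypothetical stand-in for EvaluationSource."""

    ORIGINAL = "original"
    FLIPPED = "flipped"
    NEITHER = "neither"


def resolve(
    original_score: float, flipped_score: float
) -> Tuple[float, Optional[bool], Source]:
    """Reconcile two judge rounds into (score, passing, source).

    Each input score is relative to the order shown in that round:
    1.0 = first-shown answer wins, 0.0 = second-shown wins, 0.5 = tie.
    """
    # In the flipped round the first response is shown second,
    # so a flipped score of 0.0 is a vote for the first response.
    votes_first = original_score + (1.0 - flipped_score)
    votes_second = (1.0 - original_score) + flipped_score

    if votes_first > votes_second:
        source = Source.ORIGINAL if original_score == 1.0 else Source.FLIPPED
        return 1.0, True, source
    if votes_second > votes_first:
        source = Source.ORIGINAL if original_score == 0.0 else Source.FLIPPED
        return 0.0, False, source
    if original_score == 0.5:
        # Equal votes because both rounds reported a tie.
        return 0.5, None, Source.ORIGINAL
    # Equal votes because each round favored a different response.
    return 0.5, None, Source.NEITHER
```

For example, resolve(1.0, 1.0) models a judge that picked whichever answer was shown first in both rounds, which is the position-bias case that consensus enforcement is meant to catch; it yields an inconclusive result with Source.NEITHER.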
The default prompt template instructs the judge to consider helpfulness, relevance, accuracy, depth, creativity, and level of detail, while explicitly warning against length bias and position bias.
Usage
Use this evaluator to compare two response generation approaches (e.g., different prompts, models, or RAG configurations) on the same query. It is commonly used in A/B testing scenarios and model comparison benchmarks. Consensus enforcement makes the verdict more robust against ordering effects, at the cost of a second judge call per comparison.
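In an A/B test, the per-query scores are usually aggregated into an overall tally. The following is a minimal sketch of such an aggregation; summarize is a hypothetical helper that is not part of the library, and treating a score of None as "unparsed" is an assumption based on the Optional[float] return type of the parser.

```python
from typing import Dict, Iterable, Optional


def summarize(scores: Iterable[Optional[float]]) -> Dict[str, float]:
    """Tally pairwise scores collected over many queries."""
    summary = {"a_wins": 0, "b_wins": 0, "ties": 0, "unparsed": 0}
    total = 0.0
    for score in scores:
        if score is None:
            summary["unparsed"] += 1  # judge output could not be parsed
            continue
        total += score
        if score == 1.0:
            summary["a_wins"] += 1
        elif score == 0.0:
            summary["b_wins"] += 1
        else:
            summary["ties"] += 1  # 0.5 covers both ties and inconclusive runs
    parsed = summary["a_wins"] + summary["b_wins"] + summary["ties"]
    # Mean score above 0.5 favors A, below 0.5 favors B.
    summary["mean_score"] = total / parsed if parsed else float("nan")
    return summary
```

Note that a score of 0.5 does not distinguish a genuine tie from an inconclusive consensus run; pairwise_source would be needed to separate the two.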
Code Reference
Source Location
- Repository: Run_llama_Llama_index
- File: llama-index-core/llama_index/core/evaluation/pairwise.py
Signature
class PairwiseComparisonEvaluator(BaseEvaluator):
    def __init__(
        self,
        llm: Optional[LLM] = None,
        eval_template: Optional[Union[BasePromptTemplate, str]] = None,
        parser_function: Callable[
            [str], Tuple[Optional[bool], Optional[float], Optional[str]]
        ] = _default_parser_function,
        enforce_consensus: bool = True,
    ) -> None: ...

    async def aevaluate(
        self,
        query: Optional[str] = None,
        response: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        second_response: Optional[str] = None,
        reference: Optional[str] = None,
        sleep_time_in_seconds: int = 0,
        **kwargs: Any,
    ) -> EvaluationResult: ...
Import
from llama_index.core.evaluation.pairwise import PairwiseComparisonEvaluator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| llm | Optional[LLM] | No | The LLM to use as judge. Defaults to Settings.llm. |
| eval_template | Optional[Union[BasePromptTemplate, str]] | No | Custom evaluation prompt template. Defaults to a ChatPromptTemplate with system and user messages. |
| parser_function | Callable[[str], Tuple[Optional[bool], Optional[float], Optional[str]]] | No | Function to parse LLM output into (passing, score, feedback). Defaults to _default_parser_function, a bracket-tag parser. |
| enforce_consensus | bool | No | Whether to flip answer order and require consistent results. Defaults to True. |
| query | str | Yes (aevaluate) | The user query for which both responses were generated. |
| response | str | Yes (aevaluate) | The first response (Assistant A) to compare. |
| second_response | str | Yes (aevaluate) | The second response (Assistant B) to compare against. |
| reference | Optional[str] | No (aevaluate) | An optional reference answer to provide context for the judge. |
| sleep_time_in_seconds | int | No (aevaluate) | Delay before evaluation for rate limiting. Defaults to 0. |
Outputs
| Name | Type | Description |
|---|---|---|
| result | EvaluationResult | Contains the query, score (1.0=A wins, 0.0=B wins, 0.5=tie/inconclusive), passing (True=A, False=B, None=tie), feedback, and pairwise_source indicating which evaluation round produced the result. |
Usage Examples
import asyncio

from llama_index.core.evaluation.pairwise import PairwiseComparisonEvaluator
# The OpenAI LLM lives in the separate llama-index-llms-openai package
from llama_index.llms.openai import OpenAI

# Create the evaluator with consensus enforcement
evaluator = PairwiseComparisonEvaluator(
    llm=OpenAI(model="gpt-4"),
    enforce_consensus=True,
)


async def main() -> None:
    # Compare two responses
    result = await evaluator.aevaluate(
        query="Explain quantum computing in simple terms.",
        response="Quantum computing uses qubits that can be 0 and 1 simultaneously.",
        second_response="Quantum computers use quantum mechanics to process information faster.",
        reference="Quantum computing leverages quantum mechanical phenomena like superposition.",
    )
    print(f"Score: {result.score}")  # 1.0, 0.0, or 0.5
    print(f"Passing: {result.passing}")  # True (A wins), False (B wins), or None (tie)
    print(f"Source: {result.pairwise_source}")  # ORIGINAL, FLIPPED, or NEITHER


# aevaluate is a coroutine, so it must run inside an event loop
asyncio.run(main())