Principle:Run llama Llama index Batch Evaluation Execution

Overview

Batch Evaluation Execution is the runtime phase in which configured evaluators are applied to a set of queries or pre-computed responses. This principle covers the three execution modes provided by BatchEvalRunner: evaluating through a query engine (end-to-end, evaluate_queries), evaluating pre-existing Response objects (evaluate_responses), and evaluating raw response strings with explicit contexts (evaluate_response_strs). The modes differ in how much of the pipeline they exercise and in what the caller must supply, so the choice between them determines what an evaluation run measures and how cheaply it can be repeated.

Executing Batch Evaluations Over Query Sets

The most common evaluation scenario involves running a set of test queries through a query engine and then evaluating the resulting responses. This end-to-end approach:

Captures the full pipeline behavior — retrieval, synthesis, and any post-processing all contribute to the evaluated response
Provides source context automatically — the query engine's response includes retrieved source nodes, which the evaluators use as context
Mirrors production usage — the evaluation exercises the same code path that real users would trigger

The flow is:

Step	Action	Result
1	Submit query to query engine	Response with source nodes
2	Extract response text and contexts	Inputs for evaluators
3	Run all registered evaluators	EvaluationResult per evaluator per query
4	Collect results into structured dictionary	Results keyed by evaluator name
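
The four steps in the table can be sketched as a plain sequential loop. This is an illustration of the data flow only, not BatchEvalRunner's implementation: the function name evaluate_sequentially is hypothetical, and the objects are assumed to expose query(), source_nodes, get_content() and evaluate() in the way LlamaIndex query engines, responses and evaluators do.

```python
from typing import Any, Dict, List


def evaluate_sequentially(
    query_engine: Any,
    evaluators: Dict[str, Any],
    queries: List[str],
) -> Dict[str, List[Any]]:
    """Sequential, unoptimized version of the four-step flow (illustration only)."""
    results: Dict[str, List[Any]] = {name: [] for name in evaluators}
    for query in queries:
        # Step 1: submit the query to the query engine.
        response = query_engine.query(query)
        # Step 2: extract the response text and the retrieved contexts.
        response_text = str(response)
        contexts = [node.get_content() for node in response.source_nodes]
        # Step 3: run every registered evaluator on the same inputs.
        for name, evaluator in evaluators.items():
            result = evaluator.evaluate(
                query=query, response=response_text, contexts=contexts
            )
            # Step 4: collect results keyed by evaluator name, in query order.
            results[name].append(result)
    return results
```

The batch runner produces the same result shape while running the work concurrently.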

Running Queries Through a Query Engine

The evaluate_queries method represents the highest-level execution mode. The caller provides a query engine and a list of query strings. The runner handles:

Query execution — submitting each query to the engine
Response extraction — pulling response text and source contexts from the engine's output
Evaluator dispatch — sending the query-response-context tuple to each registered evaluator
Result collection — aggregating all results into a structured dictionary

This is the preferred method when you want to evaluate the entire RAG pipeline as a unit.
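
A minimal sketch of this mode follows. It assumes llama-index is installed and a default judge LLM is configured. The helper names build_runner and run_end_to_end and the evaluator labels are illustrative choices, not part of the library.

```python
from typing import Any, Dict, List


def build_runner(workers: int = 4) -> Any:
    """Create a BatchEvalRunner with two LLM-judged evaluators.

    Assumption: llama-index is installed and a default LLM is configured.
    The import is local so this module loads even without the package.
    """
    from llama_index.core.evaluation import (
        BatchEvalRunner,
        FaithfulnessEvaluator,
        RelevancyEvaluator,
    )

    evaluators = {
        "faithfulness": FaithfulnessEvaluator(),
        "relevancy": RelevancyEvaluator(),
    }
    return BatchEvalRunner(evaluators, workers=workers, show_progress=True)


def run_end_to_end(
    runner: Any, query_engine: Any, queries: List[str]
) -> Dict[str, list]:
    """Run every query through the engine, then through every evaluator."""
    # The returned dictionary maps each evaluator name to a list of
    # results, one per query, in the same order as `queries`.
    return runner.evaluate_queries(query_engine, queries=queries)
```

Inside an already running event loop (for example a notebook), the async counterpart aevaluate_queries can be awaited instead.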

Pre-Computed Response Evaluation

Sometimes you already have query engine responses (for example, from a prior batch run or from a logging system). The evaluate_responses method accepts:

A list of query strings
A list of corresponding Response objects (which include source nodes)

This mode is useful for:

Offline evaluation — evaluating responses collected from production logs
Comparative evaluation — running the same responses through different evaluators or different judge LLMs
Reproducible evaluation — re-evaluating stored responses without re-running the query engine
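
The comparative case can be sketched as follows, assuming one runner has been built per judge configuration. The helper evaluate_stored_responses and the labels used as dictionary keys are hypothetical. Only the evaluate_responses call comes from BatchEvalRunner.

```python
from typing import Any, Dict, List


def evaluate_stored_responses(
    runners_by_label: Dict[str, Any],
    queries: List[str],
    responses: List[Any],
) -> Dict[str, Dict[str, list]]:
    """Score the same stored responses with several runner configurations."""
    if len(queries) != len(responses):
        # The two lists are parallel: responses[i] answers queries[i].
        raise ValueError("queries and responses must have the same length")
    # The query engine is never called here, so every configuration
    # judges exactly the same response objects and source nodes.
    return {
        label: runner.evaluate_responses(queries=queries, responses=responses)
        for label, runner in runners_by_label.items()
    }
```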

Raw String Evaluation

The evaluate_response_strs method provides the lowest-level execution mode, accepting:

A list of query strings
A list of response strings (plain text, not Response objects)
A list of context lists (explicit contexts for each query)

This mode is useful for:

Cross-framework evaluation — evaluating responses from non-LlamaIndex systems
Custom pipeline evaluation — when responses come from custom code that does not produce LlamaIndex Response objects
Testing — providing hand-crafted responses and contexts for unit testing evaluators
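
As a sketch of the cross-framework case, the helper below converts records from another system into the three parallel lists and passes them to evaluate_response_strs. The record keys "question", "answer" and "passages" and the helper name are assumptions made for illustration.

```python
from typing import Any, Dict, List


def evaluate_external_records(
    runner: Any, records: List[Dict[str, Any]]
) -> Dict[str, list]:
    """Evaluate plain-text answers produced outside LlamaIndex."""
    # Build three parallel lists; index i in each refers to the same record.
    queries = [record["question"] for record in records]
    response_strs = [record["answer"] for record in records]
    # One list of context strings per query (a list of lists).
    contexts_list = [list(record["passages"]) for record in records]
    return runner.evaluate_response_strs(
        queries=queries,
        response_strs=response_strs,
        contexts_list=contexts_list,
    )
```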

Parallel Execution with Worker Pools

All execution methods leverage the workers parameter configured during BatchEvalRunner initialization. The parallel execution model works as follows:

Task creation — each (query, evaluator) pair becomes an independent task
Worker pool — tasks are dispatched to a pool of async workers
Concurrent execution — multiple tasks execute simultaneously, bounded by the worker count
Result aggregation — as tasks complete, results are collected and organized by evaluator name

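For example, with 50 queries and 3 evaluators at workers=4:

Total evaluation tasks: 150 (50 queries x 3 evaluators)
Maximum concurrent tasks: 4
Each task involves at least one LLM call to the judge model when the evaluator is LLM-based
With evaluate_queries, the 50 query engine calls are additional work on top of the 150 evaluation tasks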
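
The bounded-concurrency pattern described above can be sketched with the standard library alone. This is an illustration using an asyncio semaphore, not BatchEvalRunner's source. The function run_bounded is hypothetical, and the short sleep stands in for a judge LLM call.

```python
import asyncio
from typing import Dict, List, Tuple


async def run_bounded(
    num_queries: int, evaluator_names: List[str], workers: int
) -> Tuple[Dict[str, List[int]], int]:
    """Run one task per (query, evaluator) pair, at most `workers` at a time."""
    semaphore = asyncio.Semaphore(workers)
    running = 0  # tasks currently inside the semaphore
    peak = 0     # highest concurrency observed

    async def task(name: str, query_idx: int) -> Tuple[str, int]:
        nonlocal running, peak
        async with semaphore:
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.001)  # stand-in for a judge LLM call
            running -= 1
            return name, query_idx

    # Task creation: every (query, evaluator) pair is independent.
    jobs = [task(name, i) for i in range(num_queries) for name in evaluator_names]
    finished = await asyncio.gather(*jobs)

    # Result aggregation: group by evaluator name, preserving query order.
    results: Dict[str, List[int]] = {name: [] for name in evaluator_names}
    for name, query_idx in finished:
        results[name].append(query_idx)
    return results, peak
```

Calling asyncio.run(run_bounded(50, ["faithfulness", "relevancy", "correctness"], 4)) schedules 150 tasks and never has more than 4 in flight.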

Evaluation Workflows

End-to-End Evaluation Workflow

The simplest workflow for evaluating a RAG pipeline:

Generate evaluation questions from documents
Initialize evaluators and batch runner
Run evaluate_queries with the query engine and generated questions
Analyze results

A/B Testing Workflow

Comparing two pipeline configurations:

Run the same queries through Pipeline A and Pipeline B
Store the responses from each
Run evaluate_responses on both sets of responses using the same evaluators
Compare metrics side by side
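
The side-by-side comparison in the last step can be sketched as below. It assumes each result object has a passing attribute that is True, False or None, as LlamaIndex EvaluationResult objects do. The helpers pass_rates and compare_ab are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def pass_rates(results: Dict[str, List[Any]]) -> Dict[str, float]:
    """Fraction of passing results per evaluator, ignoring missing verdicts."""
    rates: Dict[str, float] = {}
    for name, items in results.items():
        flags = [r.passing for r in items if r.passing is not None]
        if flags:  # skip evaluators that produced no pass/fail verdicts
            rates[name] = sum(flags) / len(flags)
    return rates


def compare_ab(
    results_a: Dict[str, List[Any]], results_b: Dict[str, List[Any]]
) -> Dict[str, Tuple[float, float, float]]:
    """Return (rate_a, rate_b, rate_b - rate_a) for evaluators present in both."""
    rates_a, rates_b = pass_rates(results_a), pass_rates(results_b)
    return {
        name: (rates_a[name], rates_b[name], rates_b[name] - rates_a[name])
        for name in rates_a
        if name in rates_b
    }
```

Because both pipelines are judged on the same queries by the same evaluators, a difference in pass rate can be attributed to the pipeline change rather than to the evaluation setup.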

Incremental Evaluation Workflow

For ongoing production monitoring:

Collect query-response pairs, together with the contexts retrieved for each query, from production logs
Periodically run evaluate_response_strs on the new records
Track metrics over time to detect quality degradation
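
One simple way to track a metric across runs is sketched below. The helpers mean_score and degraded and the default tolerance of 0.05 are illustrative assumptions. The only thing taken from LlamaIndex is that each result carries an optional numeric score attribute.

```python
from statistics import mean
from typing import Any, List, Optional


def mean_score(results_for_evaluator: List[Any]) -> Optional[float]:
    """Average the numeric scores of one evaluator's results for one run."""
    scores = [r.score for r in results_for_evaluator if r.score is not None]
    return mean(scores) if scores else None


def degraded(history: List[float], latest: float, tolerance: float = 0.05) -> bool:
    """Flag a run whose score falls more than `tolerance` below the past mean."""
    if not history:
        return False  # nothing to compare against yet
    return latest < mean(history) - tolerance
```

The per-run averages would be appended to a stored history after each periodic evaluation, so the comparison always uses earlier runs as the baseline.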

Keyword Arguments Propagation

All execution methods accept **eval_kwargs_lists for passing evaluator-specific parameters. This is important for:

CorrectnessEvaluator — which requires reference answers not available from the query engine
Custom evaluators — which may need additional context or configuration per query

The kwargs are structured as lists parallel to the query list: each keyword maps to a list with one entry per query, in the same order as the queries. This allows per-query customization of evaluation parameters.
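
As a sketch of the reference-answer case, the helper below splits question and answer pairs into two parallel lists and forwards the answers through the reference keyword. The helper name evaluate_with_references and the pair format are assumptions. The runner is assumed to include an evaluator, such as CorrectnessEvaluator, that consumes reference.

```python
from typing import Any, Dict, List, Tuple


def evaluate_with_references(
    runner: Any,
    query_engine: Any,
    qa_pairs: List[Tuple[str, str]],
) -> Dict[str, list]:
    """Evaluate queries while supplying a reference answer for each one."""
    queries = [question for question, _ in qa_pairs]
    # Parallel to `queries`: references[i] is the expected answer to queries[i].
    references = [answer for _, answer in qa_pairs]
    # Extra keyword lists are forwarded to the evaluators per query.
    return runner.evaluate_queries(
        query_engine, queries=queries, reference=references
    )
```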
