Principle:Run llama Llama index Batch Evaluation Execution
Overview
Batch Evaluation Execution is the runtime phase where configured evaluators are applied to a set of queries or pre-computed responses. This principle covers the three execution modes provided by BatchEvalRunner: evaluating through a query engine (end-to-end), evaluating pre-existing response objects, and evaluating raw response strings with explicit contexts. Understanding these execution modes and their trade-offs is essential for building flexible, efficient evaluation workflows.
Executing Batch Evaluations Over Query Sets
The most common evaluation scenario involves running a set of test queries through a query engine and then evaluating the resulting responses. This end-to-end approach:
- Captures the full pipeline behavior — retrieval, synthesis, and any post-processing all contribute to the evaluated response
- Provides source context automatically — the query engine's response includes retrieved source nodes, which the evaluators use as context
- Mirrors production usage — the evaluation exercises the same code path that real users would trigger
The flow is:
| Step | Action | Result |
|---|---|---|
| 1 | Submit query to query engine | Response with source nodes |
| 2 | Extract response text and contexts | Inputs for evaluators |
| 3 | Run all registered evaluators | EvaluationResult per evaluator per query |
| 4 | Collect results into structured dictionary | Results keyed by evaluator name |
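The four steps in the table can be sketched in plain Python. This is a conceptual illustration only, not the BatchEvalRunner implementation: `StubResponse`, `run_flow`, and the callable evaluator signature are hypothetical stand-ins, and the sketch runs sequentially with no concurrency.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StubResponse:
    """Hypothetical stand-in for a query engine response."""
    text: str
    contexts: List[str] = field(default_factory=list)


# Hypothetical evaluator signature: (query, response_text, contexts) -> passed?
Evaluator = Callable[[str, str, List[str]], bool]


def run_flow(
    engine: Callable[[str], StubResponse],
    evaluators: Dict[str, Evaluator],
    queries: List[str],
) -> Dict[str, List[bool]]:
    results: Dict[str, List[bool]] = defaultdict(list)
    for query in queries:
        response = engine(query)                             # step 1: run the query
        text, contexts = response.text, response.contexts    # step 2: extract inputs
        for name, evaluate in evaluators.items():            # step 3: run every evaluator
            results[name].append(evaluate(query, text, contexts))
    return dict(results)                                     # step 4: keyed by evaluator name
```

Each list in the returned dictionary has one entry per query, in query order, which is the shape the rest of this page assumes when it talks about results "keyed by evaluator name".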
Running Queries Through a Query Engine
The evaluate_queries method is the highest-level execution mode. The caller provides a query engine and a list of query strings, and the runner handles:
- Query execution — submitting each query to the engine
- Response extraction — pulling response text and source contexts from the engine's output
- Evaluator dispatch — sending the query-response-context tuple to each registered evaluator
- Result collection — aggregating all results into a structured dictionary
This is the preferred method when you want to evaluate the entire RAG pipeline as a unit.
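A minimal sketch of this mode follows. The helper names `evaluate_end_to_end` and `pass_rates` are hypothetical; the runner is passed in as an argument so the sketch stays independent of any particular evaluator or judge LLM setup. The commented lines show one assumed way to construct the runner with llama-index installed.

```python
from typing import Any, Dict, List, Sequence

# Assumed setup (needs llama-index and a configured judge LLM):
# from llama_index.core.evaluation import (
#     BatchEvalRunner, FaithfulnessEvaluator, RelevancyEvaluator,
# )
# runner = BatchEvalRunner(
#     {"faithfulness": FaithfulnessEvaluator(), "relevancy": RelevancyEvaluator()},
#     workers=4,
# )


def pass_rates(results: Dict[str, List[Any]]) -> Dict[str, float]:
    """Fraction of results per evaluator whose `passing` flag is truthy."""
    return {
        name: (sum(bool(r.passing) for r in evals) / len(evals)) if evals else 0.0
        for name, evals in results.items()
    }


def evaluate_end_to_end(
    runner: Any, query_engine: Any, queries: Sequence[str]
) -> Dict[str, float]:
    # The runner executes each query, extracts response text and contexts,
    # dispatches to every registered evaluator, and returns a dictionary
    # keyed by evaluator name.
    results = runner.evaluate_queries(query_engine, queries=list(queries))
    return pass_rates(results)
```

Summarizing by pass rate is only one choice; results also carry a score and feedback text that may be more informative for some evaluators.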
Pre-Computed Response Evaluation
Sometimes you already have query engine responses (for example, from a prior batch run or from a logging system). The evaluate_responses method accepts:
- A list of query strings
- A list of corresponding Response objects (which include source nodes)
This mode is useful for:
- Offline evaluation — evaluating responses collected from production logs
- Comparative evaluation — running the same responses through different evaluators or different judge LLMs
- Reproducible evaluation — re-evaluating stored responses without re-running the query engine
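As an illustration of the comparative use case, the sketch below scores one stored set of responses with several runners, for example runners built around different judge LLMs. The function name `rescore_stored_responses` and the `runners` mapping are hypothetical; each value is assumed to be a BatchEvalRunner, and the query engine is never invoked.

```python
from typing import Any, Dict, List, Sequence


def rescore_stored_responses(
    runners: Dict[str, Any],
    queries: Sequence[str],
    responses: Sequence[Any],
) -> Dict[str, Dict[str, List[Any]]]:
    """Score the same stored responses with each runner, keyed by runner label."""
    if len(queries) != len(responses):
        raise ValueError("queries and responses must have the same length")
    return {
        label: runner.evaluate_responses(
            queries=list(queries), responses=list(responses)
        )
        for label, runner in runners.items()
    }
```

Because the responses are fixed, any difference between the per-label results comes from the evaluators or judge models, not from the pipeline.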
Raw String Evaluation
The evaluate_response_strs method provides the lowest-level execution mode, accepting:
- A list of query strings
- A list of response strings (plain text, not Response objects)
- A list of context lists (explicit contexts for each query)
This mode is useful for:
- Cross-framework evaluation — evaluating responses from non-LlamaIndex systems
- Custom pipeline evaluation — when responses come from custom code that does not produce LlamaIndex Response objects
- Testing — providing hand-crafted responses and contexts for unit testing evaluators
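The sketch below shows how records from a non-LlamaIndex system might be unpacked into the three parallel lists this mode expects. `LoggedAnswer` and `evaluate_raw` are hypothetical names introduced for illustration; the runner is assumed to be a BatchEvalRunner.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Sequence


@dataclass
class LoggedAnswer:
    """Hypothetical record produced by some external pipeline."""
    query: str
    answer: str
    contexts: List[str]


def evaluate_raw(runner: Any, records: Sequence[LoggedAnswer]) -> Dict[str, List[Any]]:
    # Build three lists of equal length: entry i of each list describes record i.
    queries = [r.query for r in records]
    response_strs = [r.answer for r in records]
    contexts_list = [list(r.contexts) for r in records]
    return runner.evaluate_response_strs(
        queries=queries,
        response_strs=response_strs,
        contexts_list=contexts_list,
    )
```

For unit tests, the same function can be fed a handful of hand-written records whose expected verdicts are known in advance.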
Parallel Execution with Worker Pools
All execution methods use the workers parameter set when the BatchEvalRunner is constructed. The parallel execution model works as follows:
- Task creation — each (query, evaluator) pair becomes an independent task
- Worker pool — tasks are dispatched to a pool of async workers
- Concurrent execution — multiple tasks execute simultaneously, bounded by the worker count
- Result aggregation — as tasks complete, results are collected and organized by evaluator name
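For example, with 50 queries and 3 evaluators at workers=4:
- Total tasks: 150 (50 queries × 3 evaluators)
- Maximum concurrent tasks: 4
- Each task that uses an LLM-based evaluator involves at least one call to the judge model, so the worker count also bounds the number of in-flight judge requests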
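The bounded-concurrency model can be sketched with an asyncio semaphore. This is an illustration of the technique under stated assumptions, not the library's code: `run_bounded` is a hypothetical name, and `asyncio.sleep(0)` stands in for the judge call.

```python
import asyncio
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


async def run_bounded(
    queries: Sequence[str], evaluator_names: Sequence[str], workers: int
) -> Tuple[Dict[str, List[str]], int]:
    """Run one task per (query, evaluator) pair with at most `workers` in flight."""
    semaphore = asyncio.Semaphore(workers)
    active = 0
    peak = 0

    async def eval_task(name: str, query: str) -> Tuple[str, str]:
        nonlocal active, peak
        async with semaphore:            # blocks while `workers` tasks are running
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0)       # placeholder for the judge model call
            active -= 1
            return name, f"{name}:{query}"

    # Task creation: every (query, evaluator) pair is independent.
    tasks = [eval_task(name, q) for q in queries for name in evaluator_names]

    # Result aggregation: group outcomes by evaluator name.
    results: Dict[str, List[str]] = defaultdict(list)
    for name, outcome in await asyncio.gather(*tasks):
        results[name].append(outcome)
    return dict(results), peak
```

Calling `asyncio.run(run_bounded(queries, names, workers=4))` with 50 queries and 3 evaluator names creates 150 tasks, and the recorded peak never exceeds the worker count.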
Evaluation Workflows
End-to-End Evaluation Workflow
The simplest workflow for evaluating a RAG pipeline:
- Generate evaluation questions from documents
- Initialize evaluators and batch runner
- Run evaluate_queries with the query engine and generated questions
- Analyze results
A/B Testing Workflow
Comparing two pipeline configurations:
- Run the same queries through Pipeline A and Pipeline B
- Store the responses from each
- Run evaluate_responses on both sets of responses using the same evaluators
- Compare metrics side by side
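The final comparison step might look like the sketch below. `compare_runs` and `mean_pass_rate` are hypothetical helpers; the two inputs are assumed to be the dictionaries returned by evaluate_responses for Pipeline A and Pipeline B, with results exposing a `passing` attribute.

```python
from typing import Any, Dict, List


def mean_pass_rate(evals: List[Any]) -> float:
    """Fraction of results whose `passing` flag is truthy (0.0 for an empty list)."""
    return sum(bool(r.passing) for r in evals) / len(evals) if evals else 0.0


def compare_runs(
    results_a: Dict[str, List[Any]], results_b: Dict[str, List[Any]]
) -> Dict[str, Dict[str, float]]:
    """Per-evaluator pass rates for A and B, plus the difference B minus A."""
    comparison: Dict[str, Dict[str, float]] = {}
    # Only evaluators present in both runs can be compared side by side.
    for name in sorted(set(results_a) & set(results_b)):
        rate_a = mean_pass_rate(results_a[name])
        rate_b = mean_pass_rate(results_b[name])
        comparison[name] = {"a": rate_a, "b": rate_b, "delta": rate_b - rate_a}
    return comparison
```

A positive delta means Pipeline B passed more often on that evaluator; with small query sets the difference may not be meaningful.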
Incremental Evaluation Workflow
For ongoing production monitoring:
- Collect queries, responses, and the contexts retrieved for each response from production logs
- Periodically run evaluate_response_strs on the new records
- Track metrics over time to detect quality degradation
Keyword Arguments Propagation
All execution methods accept **eval_kwargs_lists for passing evaluator-specific parameters. This is important for:
- CorrectnessEvaluator — which requires reference answers not available from the query engine
- Custom evaluators — which may need additional context or configuration per query
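The kwargs are structured as lists parallel to the query list: element i of each list is used when evaluating query i, so every list must have the same length as the query list. This allows per-query customization of evaluation parameters.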
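A sketch of passing reference answers alongside the queries follows. `evaluate_with_references` is a hypothetical helper, and the keyword name `reference` is assumed to match the parameter the registered evaluators accept for a reference answer; the runner is assumed to be a BatchEvalRunner.

```python
from typing import Any, Dict, List, Sequence, Tuple


def evaluate_with_references(
    runner: Any,
    query_engine: Any,
    qa_pairs: Sequence[Tuple[str, str]],
) -> Dict[str, List[Any]]:
    """Run queries end to end and pass one reference answer per query."""
    # Split (question, reference answer) pairs into two parallel lists,
    # so that references[i] belongs to queries[i].
    queries = [question for question, _ in qa_pairs]
    references = [answer for _, answer in qa_pairs]
    return runner.evaluate_queries(
        query_engine,
        queries=queries,
        reference=references,   # forwarded through **eval_kwargs_lists
    )
```

Building both lists from the same sequence of pairs keeps them aligned by construction, which avoids the length mismatches that separate bookkeeping tends to introduce.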