Principle:Run llama Llama index Batch Evaluation Setup

Overview

Batch Evaluation Setup addresses the challenge of orchestrating multiple evaluation metrics across an entire set of queries in a systematic, efficient manner. Rather than running each evaluator individually against each query in serial loops, LlamaIndex's BatchEvalRunner provides a structured way to register multiple evaluators, configure parallelism, and execute evaluation passes with controlled concurrency.

Setting up batch evaluation correctly is critical for production RAG systems where evaluation must be comprehensive (multiple metrics), efficient (parallel execution), and consistent (uniform configuration across all evaluators).

RAG Evaluation, Batch Processing, Pipeline Orchestration, Evaluation Infrastructure

Orchestrating Batch Evaluation Across Multiple Metrics

A complete RAG evaluation requires checking multiple quality dimensions simultaneously:

Faithfulness: is the response grounded in the retrieved context?
Relevancy: are the response and the retrieved context relevant to the query?
Correctness: does the response match the expected (reference) answer?

Running these evaluators one at a time in sequence is:

Slow — each evaluator makes LLM calls, and serial execution multiplies wall-clock time
Error-prone — manual iteration over queries and evaluators invites bugs
Hard to manage — tracking which queries have been evaluated by which evaluators becomes complex

The BatchEvalRunner solves these problems by accepting a dictionary of named evaluators and managing the execution lifecycle for all of them.

Running Multiple Evaluators in Parallel

The batch runner executes evaluators concurrently using an async worker pool. The key design decisions in setup are:

Evaluator Registration

Evaluators are registered as a dictionary mapping string names to evaluator instances:

Key	Value	Purpose
A descriptive string name (e.g., "faithfulness")	An initialized evaluator instance	The name becomes the key in the results dictionary

This naming convention enables:

Clear result identification — results are keyed by evaluator name, not index
Selective analysis — downstream code can access specific metric results by name
Flexible composition — evaluators can be added or removed without changing other code
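
The registration described above could be sketched as follows. This is a minimal sketch, assuming llama-index 0.10 or later (where the evaluation classes are importable from `llama_index.core.evaluation`) and a judge LLM instance supplied by the caller. The helper name `build_runner` is hypothetical, and the import is deferred into the function only so the snippet loads without the library installed.

```python
def build_runner(judge_llm, workers=2, show_progress=False):
    """Return a BatchEvalRunner with three named evaluators registered.

    Assumption: llama-index >= 0.10; `judge_llm` is any LlamaIndex LLM instance.
    """
    # Deferred import: keeps this sketch loadable without llama-index present.
    from llama_index.core.evaluation import (
        BatchEvalRunner,
        CorrectnessEvaluator,
        FaithfulnessEvaluator,
        RelevancyEvaluator,
    )

    # The keys chosen here become the keys of the results dictionary.
    evaluators = {
        "faithfulness": FaithfulnessEvaluator(llm=judge_llm),
        "relevancy": RelevancyEvaluator(llm=judge_llm),
        "correctness": CorrectnessEvaluator(llm=judge_llm),
    }
    return BatchEvalRunner(evaluators, workers=workers, show_progress=show_progress)
```

Adding or dropping a metric is then a one-line change to the `evaluators` dictionary; code that reads results by name is unaffected.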

Worker Configuration

The workers parameter controls how many evaluation calls execute concurrently:

Too few workers — evaluation is slow, LLM API throughput is underutilized
Too many workers — risk of hitting API rate limits, causing failures and retries
Optimal value — depends on your LLM provider's rate limits and the number of queries
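
To make the effect of a worker cap concrete, the sketch below bounds concurrency with an `asyncio.Semaphore`. It is a standard-library stand-in for illustration, not the library's own code; `run_bounded` and `FakeJudge` are hypothetical names, and the sleep simulates the latency of a judge LLM call.

```python
import asyncio


async def run_bounded(jobs, workers=2):
    """Run coroutine factories with at most `workers` in flight at once."""
    semaphore = asyncio.Semaphore(workers)

    async def guarded(job):
        async with semaphore:
            return await job()

    # gather returns results in the same order as `jobs`.
    return await asyncio.gather(*(guarded(job) for job in jobs))


class FakeJudge:
    """Hypothetical stand-in for an LLM call that records peak concurrency."""

    def __init__(self):
        self.in_flight = 0
        self.peak = 0

    async def evaluate(self):
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        await asyncio.sleep(0.01)  # simulated network latency
        self.in_flight -= 1
        return True


judge = FakeJudge()
results = asyncio.run(run_bounded([judge.evaluate] * 10, workers=3))
```

All ten calls complete, but `judge.peak` never exceeds the cap of 3, which is the behavior the worker setting is meant to provide.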

Rate Limiting and Worker Management

Rate limiting is a practical concern for batch evaluation because each evaluation call invokes the judge LLM:

For 100 queries with 3 evaluators, that is 300 LLM calls minimum
Each faithfulness evaluation may require multiple calls for long contexts (refinement)
API providers impose rate limits on tokens-per-minute and requests-per-minute
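
The arithmetic above can be written out as a quick estimate. The function names and the 60 requests-per-minute cap are hypothetical, chosen only to illustrate the calculation; real limits vary by provider and account.

```python
def min_judge_calls(num_queries: int, num_evaluators: int) -> int:
    """Lower bound: one judge call per (query, evaluator) pair."""
    return num_queries * num_evaluators


def min_minutes_at_limit(total_calls: int, requests_per_minute: int) -> float:
    """Shortest wall-clock time if the provider's request cap is the only bottleneck."""
    return total_calls / requests_per_minute


calls = min_judge_calls(100, 3)            # 100 queries x 3 evaluators
minutes = min_minutes_at_limit(calls, 60)  # hypothetical 60 RPM cap
```

Here `calls` is 300 and `minutes` is 5.0. Multi-call evaluations such as refinement over long contexts push the real figure higher.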

The workers parameter in BatchEvalRunner serves as a coarse rate limiter by controlling concurrent execution. For more granular control, the underlying LLM can be configured with its own rate limiting.

Best practices for worker configuration:

Start with workers=2 (the default) and increase gradually
Monitor for rate limit errors (HTTP 429) and reduce workers if they occur
For OpenAI GPT-4 on a standard-tier account, 2–4 workers is typically a safe starting range; the actual ceiling depends on the account's current rate limits
For high-throughput evaluation, consider using a dedicated API key with higher rate limits
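
The "reduce workers on rate limit errors" advice could be automated roughly as below. This is a sketch under stated assumptions: `RateLimitError` is a hypothetical placeholder for whatever exception the LLM client raises on HTTP 429, `run_batch` is a caller-supplied function that runs one evaluation pass with a given worker count, and re-running the whole batch after a failure is a simplification.

```python
class RateLimitError(Exception):
    """Hypothetical placeholder for a provider's HTTP 429 exception."""


def run_with_fewer_workers_on_429(run_batch, start_workers=4, min_workers=1):
    """Call run_batch(workers); halve workers after each rate-limit failure.

    Returns (result, workers_used). Re-raises if min_workers also fails.
    """
    workers = start_workers
    while True:
        try:
            return run_batch(workers), workers
        except RateLimitError:
            if workers <= min_workers:
                raise  # nothing left to reduce: surface the error
            workers = max(min_workers, workers // 2)


# Usage with a fake batch that only succeeds at 2 or fewer workers.
def fake_batch(workers):
    if workers > 2:
        raise RateLimitError("429 Too Many Requests")
    return f"ok with {workers} workers"


result, used = run_with_fewer_workers_on_429(fake_batch, start_workers=8)
```

Starting from 8, the loop tries 8, then 4, then succeeds at 2.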

Structured Evaluation Pipelines

Setting up batch evaluation is one step in a larger evaluation pipeline:

Pipeline Stage	Component	Output
1. Dataset Generation	RagDatasetGenerator	List of evaluation queries with ground truth
2. Evaluator Setup	BatchEvalRunner initialization	Configured runner with registered evaluators
3. Evaluation Execution	BatchEvalRunner.evaluate_queries (or the async aevaluate_queries)	Dictionary mapping each evaluator name to a list of EvaluationResult objects
4. Result Analysis	Aggregation and reporting	Pass rates, scores, failure analysis

The setup stage (this principle) focuses on stage 2: correctly initializing the runner with the right evaluators, appropriate worker counts, and optional progress tracking.
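
For context on what the setup feeds into, stage 4 of the table above could be sketched as a per-metric pass rate. The sketch assumes each result object exposes a boolean-like `passing` attribute, as LlamaIndex's `EvaluationResult` does; `FakeResult` and `pass_rates` are hypothetical names used so the example runs without the library.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FakeResult:
    """Hypothetical stand-in exposing the `passing` and `score` fields."""
    passing: Optional[bool]
    score: Optional[float] = None


def pass_rates(results_by_metric: Dict[str, List[FakeResult]]) -> Dict[str, float]:
    """Fraction of results with a truthy `passing` value, per metric name."""
    rates = {}
    for name, results in results_by_metric.items():
        if not results:
            rates[name] = 0.0  # no data: report 0.0 rather than divide by zero
            continue
        passed = sum(1 for r in results if r.passing)
        rates[name] = passed / len(results)
    return rates


sample = {
    "faithfulness": [FakeResult(True), FakeResult(True), FakeResult(False), FakeResult(True)],
    "relevancy": [FakeResult(True), FakeResult(False)],
}
rates = pass_rates(sample)
```

Because results are keyed by the names chosen at registration, this kind of aggregation needs no knowledge of which evaluator classes were used.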

Progress Tracking

The show_progress parameter enables a progress bar during batch evaluation, which is valuable for:

Long-running evaluations — evaluating hundreds of queries can take minutes to hours
Debugging — identifying if evaluation is stuck on a particular query
User experience — providing feedback during interactive evaluation sessions
