Principle:Run llama Llama index Batch Evaluation Setup

From Leeroopedia

Overview

Batch Evaluation Setup addresses the challenge of orchestrating multiple evaluation metrics across an entire set of queries in a systematic, efficient manner. Rather than running each evaluator individually against each query in serial loops, LlamaIndex's BatchEvalRunner provides a structured way to register multiple evaluators, configure parallelism, and execute evaluation passes with controlled concurrency.

Setting up batch evaluation correctly is critical for production RAG systems where evaluation must be comprehensive (multiple metrics), efficient (parallel execution), and consistent (uniform configuration across all evaluators).

RAG Evaluation, Batch Processing, Pipeline Orchestration, Evaluation Infrastructure

Orchestrating Batch Evaluation Across Multiple Metrics

A complete RAG evaluation requires checking multiple quality dimensions simultaneously:

  • Faithfulness: is the response grounded in the retrieved context?
  • Relevancy: are the response and the retrieved context relevant to the query?
  • Correctness: does the response match the expected (reference) answer?

Running these evaluators one at a time in sequence is:

  • Slow — each evaluator makes LLM calls, and serial execution multiplies wall-clock time
  • Error-prone — manual iteration over queries and evaluators invites bugs
  • Hard to manage — tracking which queries have been evaluated by which evaluators becomes complex

The BatchEvalRunner solves these problems by accepting a dictionary of named evaluators and managing the execution lifecycle for all of them.

Running Multiple Evaluators in Parallel

The batch runner executes evaluators concurrently using an async worker pool. The key design decisions in setup are:

Evaluator Registration

Evaluators are registered as a dictionary mapping string names to evaluator instances:

Key | Value | Purpose
A descriptive string name (e.g., "faithfulness") | An initialized evaluator instance | The name becomes the key in the results dictionary

This naming convention enables:

  • Clear result identification — results are keyed by evaluator name, not index
  • Selective analysis — downstream code can access specific metric results by name
  • Flexible composition — evaluators can be added or removed without changing other code
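As a minimal sketch of the registration described above, the helper below builds a runner with three named evaluators. The function name build_runner and its defaults are hypothetical; the import path assumes the llama_index.core package layout, and passing llm=None assumes each evaluator falls back to the globally configured LLM.

```python
def build_runner(llm=None, workers=2, show_progress=True):
    """Return a BatchEvalRunner with three named evaluators registered.

    build_runner is a hypothetical helper, not part of LlamaIndex.
    """
    # Imported lazily so this module still loads where llama-index is absent.
    # Assumption: the llama_index.core package layout is installed.
    from llama_index.core.evaluation import (
        BatchEvalRunner,
        CorrectnessEvaluator,
        FaithfulnessEvaluator,
        RelevancyEvaluator,
    )

    # The dictionary keys become the keys of the results dictionary.
    # Assumption: llm=None makes each evaluator use the global default LLM.
    evaluators = {
        "faithfulness": FaithfulnessEvaluator(llm=llm),
        "relevancy": RelevancyEvaluator(llm=llm),
        "correctness": CorrectnessEvaluator(llm=llm),
    }
    return BatchEvalRunner(
        evaluators,
        workers=workers,
        show_progress=show_progress,
    )
```

With a query engine and a list of query strings in hand, a call such as runner.evaluate_queries(query_engine, queries=queries) would then return results keyed by "faithfulness", "relevancy", and "correctness". Dropping an evaluator only requires removing its dictionary entry.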

Worker Configuration

The workers parameter controls how many evaluation calls execute concurrently:

  • Too few workers — evaluation is slow, LLM API throughput is underutilized
  • Too many workers — risk of hitting API rate limits, causing failures and retries
  • Optimal value — depends on your LLM provider's rate limits and the number of queries
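To make the effect of a worker cap concrete, here is a standard-library sketch that bounds concurrency with an asyncio semaphore. It is an illustration of the general technique, not BatchEvalRunner's actual implementation; run_bounded, demo, and the fake evaluation job are all hypothetical names.

```python
import asyncio


async def run_bounded(job_factories, workers=2):
    """Run async jobs with at most `workers` of them in flight at once."""
    semaphore = asyncio.Semaphore(workers)

    async def bounded(factory):
        # Each job waits here until one of the `workers` slots is free.
        async with semaphore:
            return await factory()

    # gather preserves the order of the input jobs in its result list.
    return await asyncio.gather(*(bounded(f) for f in job_factories))


def demo(num_jobs=6, workers=2):
    """Run fake evaluation jobs and record the peak concurrency observed."""
    state = {"in_flight": 0, "peak": 0}

    def make_job(i):
        async def fake_eval():
            # Stand-in for one judge-LLM call.
            state["in_flight"] += 1
            state["peak"] = max(state["peak"], state["in_flight"])
            await asyncio.sleep(0.01)
            state["in_flight"] -= 1
            return i

        return fake_eval

    jobs = [make_job(i) for i in range(num_jobs)]
    results = asyncio.run(run_bounded(jobs, workers))
    return results, state["peak"]
```

Calling demo(6, 2) runs six fake jobs while never letting more than two overlap, which is the trade-off the workers setting expresses: a higher cap shortens wall-clock time but sends more simultaneous requests to the provider.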

Rate Limiting and Worker Management

Rate limiting is a practical concern for batch evaluation because each evaluation call invokes the judge LLM:

  • For 100 queries with 3 evaluators, that is 300 LLM calls minimum
  • Each faithfulness evaluation may require multiple calls for long contexts (refinement)
  • API providers impose rate limits on tokens-per-minute and requests-per-minute
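The arithmetic above can be turned into a rough planning estimate. The sketch below is a back-of-the-envelope calculation only; both function names are hypothetical, and the requests-per-minute figure is an example value, not any provider's real limit.

```python
import math


def min_judge_calls(num_queries, num_evaluators):
    """Lower bound on judge-LLM calls: one call per query per evaluator."""
    return num_queries * num_evaluators


def min_minutes(total_calls, requests_per_minute):
    """Lower bound on run time if the requests-per-minute limit is binding."""
    return math.ceil(total_calls / requests_per_minute)


calls = min_judge_calls(100, 3)   # 100 queries, 3 evaluators
minutes = min_minutes(calls, 60)  # example limit of 60 requests per minute
```

Both numbers are lower bounds: refinement over long contexts and retries after rate-limit errors add calls, and token-per-minute limits can be reached before request limits.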

The workers parameter in BatchEvalRunner serves as a coarse rate limiter by controlling concurrent execution. For more granular control, the underlying LLM can be configured with its own rate limiting.

Best practices for worker configuration:

  • Start with workers=2 (the default) and increase gradually
  • Monitor for rate limit errors (HTTP 429) and reduce workers if they occur
  • For OpenAI GPT-4, 2–4 workers are often a reasonable starting point on standard-tier accounts, but the safe value depends on the account's actual rate limits
  • For high-throughput evaluation, consider using a dedicated API key with higher rate limits
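One way to act on the "reduce workers on HTTP 429" advice is to rerun the batch with a smaller worker count whenever a rate-limit error surfaces. The sketch below is an assumption-laden illustration: RateLimitError is a hypothetical stand-in for whatever exception your LLM client raises, and run_batch is a hypothetical callable that takes a worker count and runs the evaluation.

```python
class RateLimitError(Exception):
    """Hypothetical stand-in for a client's HTTP 429 exception."""


def run_reducing_workers(run_batch, workers, min_workers=1):
    """Call run_batch(workers), halving workers after each rate-limit error.

    Returns (result, workers_used). Re-raises the error once the worker
    count cannot be reduced any further.
    """
    while True:
        try:
            return run_batch(workers), workers
        except RateLimitError:
            if workers <= min_workers:
                raise
            # Halve the concurrency, but never drop below min_workers.
            workers = max(min_workers, workers // 2)
```

In practice run_batch could wrap the construction of a new runner with the given worker count followed by an evaluation call. Rerunning the whole batch repeats calls that already succeeded, so for large query sets a per-call retry with backoff inside the LLM client may be the cheaper choice.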

Structured Evaluation Pipelines

Setting up batch evaluation is one step in a larger evaluation pipeline:

Pipeline Stage | Component | Output
1. Dataset Generation | RagDatasetGenerator | List of evaluation queries with ground truth
2. Evaluator Setup | BatchEvalRunner initialization | Configured runner with registered evaluators
3. Evaluation Execution | BatchEvalRunner.evaluate_queries | Dictionary mapping each evaluator name to a list of EvaluationResult objects
4. Result Analysis | Aggregation and reporting | Pass rates, scores, failure analysis

The setup stage (this principle) focuses on stage 2: correctly initializing the runner with the right evaluators, appropriate worker counts, and optional progress tracking.
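Because the evaluator names chosen at setup become the keys used in stage 4, a small aggregation sketch shows why the naming matters. FakeResult below is a hypothetical stand-in carrying only the passing and score fields assumed to be present on real evaluation results; pass_rates is likewise an illustrative helper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FakeResult:
    """Hypothetical stand-in for one evaluation result."""

    passing: Optional[bool] = None
    score: Optional[float] = None


def pass_rates(results: Dict[str, List[FakeResult]]) -> Dict[str, float]:
    """Compute the fraction of passing results per evaluator name."""
    rates = {}
    for name, items in results.items():
        # Skip results where the evaluator produced no pass/fail verdict.
        judged = [r for r in items if r.passing is not None]
        if judged:
            rates[name] = sum(1 for r in judged if r.passing) / len(judged)
        else:
            rates[name] = 0.0
    return rates
```

Downstream reporting code can then look up rates["faithfulness"] directly, without depending on the order in which evaluators were registered.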

Progress Tracking

The show_progress parameter enables a progress bar during batch evaluation, which is valuable for:

  • Long-running evaluations — evaluating hundreds of queries can take minutes to hours
  • Debugging — identifying if evaluation is stuck on a particular query
  • User experience — providing feedback during interactive evaluation sessions

Knowledge Sources

LlamaIndex Evaluation LlamaIndex BatchEvalRunner

2026-02-11 00:00 GMT