Principle:Run llama Llama index Batch Evaluation Execution

From Leeroopedia

Overview

Batch Evaluation Execution is the runtime phase where configured evaluators are applied to a set of queries or pre-computed responses. This principle covers the three execution modes provided by BatchEvalRunner: evaluate_queries, which runs queries through a query engine end to end; evaluate_responses, which evaluates pre-existing response objects; and evaluate_response_strs, which evaluates raw response strings with explicit contexts. The modes differ in how much of the pipeline they exercise and in which inputs the caller must supply, and those two factors determine which mode fits a given evaluation workflow.

RAG Evaluation, Batch Processing, Pipeline Execution, Evaluation Workflows

Executing Batch Evaluations Over Query Sets

The most common evaluation scenario involves running a set of test queries through a query engine and then evaluating the resulting responses. This end-to-end approach:

  • Captures the full pipeline behavior — retrieval, synthesis, and any post-processing all contribute to the evaluated response
  • Provides source context automatically — the query engine's response includes retrieved source nodes, which the evaluators use as context
  • Mirrors production usage — the evaluation exercises the same code path that real users would trigger

The flow is:

Step | Action                                     | Result
1    | Submit query to query engine               | Response with source nodes
2    | Extract response text and contexts         | Inputs for evaluators
3    | Run all registered evaluators              | EvaluationResult per evaluator per query
4    | Collect results into structured dictionary | Results keyed by evaluator name
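
The four steps in the table above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: StubResponse, stub_query_engine, and the two toy evaluators are hypothetical stand-ins, and a boolean takes the place of a full EvaluationResult.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StubResponse:
    """Hypothetical stand-in for a query engine response."""
    text: str
    source_texts: List[str] = field(default_factory=list)


def stub_query_engine(query: str) -> StubResponse:
    # A real engine would retrieve nodes and synthesize an answer here.
    return StubResponse(text=f"answer to {query}",
                        source_texts=[f"context for {query}"])


# Hypothetical evaluators: (query, response text, contexts) -> pass/fail.
Evaluator = Callable[[str, str, List[str]], bool]

evaluators: Dict[str, Evaluator] = {
    "non_empty": lambda q, r, c: bool(r.strip()),
    "has_context": lambda q, r, c: len(c) > 0,
}


def run_flow(queries: List[str]) -> Dict[str, List[bool]]:
    results: Dict[str, List[bool]] = {name: [] for name in evaluators}
    for query in queries:
        response = stub_query_engine(query)                     # step 1
        text, contexts = response.text, response.source_texts  # step 2
        for name, evaluate in evaluators.items():               # step 3
            results[name].append(evaluate(query, text, contexts))
    return results                                              # step 4


results = run_flow(["What is RAG?", "What is a judge LLM?"])
```

The returned dictionary has one key per evaluator, and each list holds one entry per query in the original query order.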

Running Queries Through a Query Engine

The evaluate_queries method is the highest-level execution mode. The caller provides a query engine and a list of query strings. The runner handles:

  • Query execution — submitting each query to the engine
  • Response extraction — pulling response text and source contexts from the engine's output
  • Evaluator dispatch — sending the query-response-context tuple to each registered evaluator
  • Result collection — aggregating all results into a structured dictionary

This is the preferred method when you want to evaluate the entire RAG pipeline as a unit.
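
A hedged usage sketch follows. It assumes the llama-index-core 0.10+ import path (llama_index.core.evaluation) and that a default judge LLM is already configured for the evaluators; run_end_to_end is a hypothetical helper name, and the choice of two evaluators is only an example. The library import sits inside the function so the module can be loaded without llama-index installed.

```python
from typing import Any, Dict, List


def run_end_to_end(query_engine: Any, queries: List[str],
                   workers: int = 4) -> Dict[str, float]:
    """Evaluate a query engine end to end; return pass rate per evaluator."""
    # Lazy import: assumes llama-index-core is installed when this is called.
    from llama_index.core.evaluation import (
        BatchEvalRunner,
        FaithfulnessEvaluator,
        RelevancyEvaluator,
    )

    runner = BatchEvalRunner(
        {
            # Both evaluators fall back to the globally configured LLM.
            "faithfulness": FaithfulnessEvaluator(),
            "relevancy": RelevancyEvaluator(),
        },
        workers=workers,
    )
    # Runs the queries through the engine, then each evaluator on each response.
    eval_results = runner.evaluate_queries(query_engine, queries=queries)
    return {
        name: sum(bool(r.passing) for r in results) / len(results)
        for name, results in eval_results.items()
        if results
    }
```

Calling run_end_to_end(index.as_query_engine(), questions) would then give one pass rate per registered evaluator, where index and questions are whatever index and query list the caller already has.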

Pre-Computed Response Evaluation

Sometimes you already have query engine responses (for example, from a prior batch run or from a logging system). The evaluate_responses method accepts:

  • A list of query strings
  • A list of corresponding Response objects (which include source nodes)

This mode is useful for:

  • Offline evaluation — evaluating responses collected from production logs
  • Comparative evaluation — running the same responses through different evaluators or different judge LLMs
  • Reproducible evaluation — re-evaluating stored responses without re-running the query engine

Raw String Evaluation

The evaluate_response_strs method provides the lowest-level execution mode, accepting:

  • A list of query strings
  • A list of response strings (plain text, not Response objects)
  • A list of context lists (explicit contexts for each query)

This mode is useful for:

  • Cross-framework evaluation — evaluating responses from non-LlamaIndex systems
  • Custom pipeline evaluation — when responses come from custom code that does not produce LlamaIndex Response objects
  • Testing — providing hand-crafted responses and contexts for unit testing evaluators
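
Because this mode takes three parallel lists, it helps to check their alignment before starting a batch. The sketch below uses a hypothetical helper, zip_eval_inputs, to illustrate the expected shape of hand-crafted inputs; it is not part of BatchEvalRunner.

```python
from typing import List, Tuple


def zip_eval_inputs(
    queries: List[str],
    response_strs: List[str],
    contexts_list: List[List[str]],
) -> List[Tuple[str, str, List[str]]]:
    """Pair each query with its response string and its own context list."""
    if not (len(queries) == len(response_strs) == len(contexts_list)):
        raise ValueError(
            "queries, response_strs and contexts_list must have equal length"
        )
    return list(zip(queries, response_strs, contexts_list))


# Hand-crafted inputs, as might be used when unit testing an evaluator.
queries = ["What is the capital of France?"]
response_strs = ["Paris is the capital of France."]
contexts_list = [["France's capital city is Paris."]]

rows = zip_eval_inputs(queries, response_strs, contexts_list)
```

Each row carries everything a single evaluator call needs, so a mismatch in list lengths is caught before any judge LLM call is made.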

Parallel Execution with Worker Pools

All execution methods honor the workers parameter set when the BatchEvalRunner is initialized. The parallel execution model works as follows:

  • Task creation — each (query, evaluator) pair becomes an independent task
  • Worker pool — tasks are dispatched to a pool of async workers
  • Concurrent execution — multiple tasks execute simultaneously, bounded by the worker count
  • Result aggregation — as tasks complete, results are collected and organized by evaluator name

For example, with 50 queries and 3 evaluators at workers=4:

  • Total tasks: 150 (50 queries x 3 evaluators)
  • Maximum concurrent tasks: 4
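  • Each task run by an LLM-based evaluator involves at least one call to the judge model, so the worker count also limits how many judge requests are in flight at once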
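
The task arithmetic and the concurrency bound can be reproduced with a small asyncio sketch. It assumes a semaphore-bounded model and uses asyncio.sleep(0) as a stand-in for the judge LLM call; run_bounded and the evaluator names are illustrative, not library code.

```python
import asyncio
from typing import Dict, List, Tuple


async def run_bounded(
    num_queries: int, evaluator_names: List[str], workers: int
) -> Tuple[int, int, Dict[str, List[int]]]:
    semaphore = asyncio.Semaphore(workers)  # caps concurrent tasks
    active = 0
    peak = 0
    results: Dict[str, List[int]] = {name: [] for name in evaluator_names}

    async def task(name: str, query_idx: int) -> None:
        nonlocal active, peak
        async with semaphore:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0)  # stand-in for the judge LLM call
            active -= 1
        results[name].append(query_idx)  # aggregate by evaluator name

    # One independent task per (query, evaluator) pair.
    jobs = [task(name, i)
            for i in range(num_queries)
            for name in evaluator_names]
    await asyncio.gather(*jobs)
    return len(jobs), peak, results


total, peak, results = asyncio.run(
    run_bounded(50, ["faithfulness", "relevancy", "correctness"], workers=4)
)
```

Here total is 150 and peak never exceeds 4, matching the figures above. Raising workers shortens wall-clock time only until the judge model's rate limits become the bottleneck.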

Evaluation Workflows

End-to-End Evaluation Workflow

The simplest workflow for evaluating a RAG pipeline:

  • Generate evaluation questions from documents
  • Initialize evaluators and batch runner
  • Run evaluate_queries with the query engine and generated questions
  • Analyze results

A/B Testing Workflow

Comparing two pipeline configurations:

  • Run the same queries through Pipeline A and Pipeline B
  • Store the responses from each
  • Run evaluate_responses on both sets of responses using the same evaluators
  • Compare metrics side by side
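
The side-by-side comparison in the last step can be as simple as a pass rate per evaluator. In the sketch below the boolean lists stand in for the passing flag of each evaluation result, and the values are made up for illustration only.

```python
from typing import Dict, List


def pass_rates(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Fraction of passing evaluations per evaluator name."""
    return {
        name: sum(flags) / len(flags)
        for name, flags in results.items()
        if flags
    }


# Illustrative (made-up) outcomes for the same four queries on each pipeline.
results_a = {
    "faithfulness": [True, True, False, True],
    "relevancy": [True, False, False, True],
}
results_b = {
    "faithfulness": [True, True, True, True],
    "relevancy": [True, True, False, True],
}

rates_a = pass_rates(results_a)
rates_b = pass_rates(results_b)

# Positive delta means Pipeline B passed more often on that evaluator.
deltas = {name: rates_b[name] - rates_a[name] for name in rates_a}
```

Because both pipelines are scored on the same queries by the same evaluators, the deltas isolate the effect of the pipeline change.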

Incremental Evaluation Workflow

For ongoing production monitoring:

  • Collect query-response pairs from production logs, together with the retrieved contexts if the evaluators need them
  • Periodically run evaluate_response_strs on new pairs
  • Track metrics over time to detect quality degradation
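
One simple way to detect degradation is to compare each new batch against the mean of earlier batches. The helper name, the tolerance value, and the history figures below are all illustrative assumptions, not recommended settings.

```python
from typing import List


def degraded(history: List[float], latest: float,
             tolerance: float = 0.1) -> bool:
    """True if `latest` falls more than `tolerance` below the mean of `history`."""
    if not history:
        return False  # no baseline yet
    baseline = sum(history) / len(history)
    return latest < baseline - tolerance


# Made-up pass rates from three earlier evaluation batches.
history = [0.9, 0.92, 0.88]

small_dip = degraded(history, 0.85)  # within tolerance of the baseline
large_drop = degraded(history, 0.70)  # well below the baseline
```

A fixed tolerance is the crudest option; a rolling window or a statistical test may suit noisier metrics better.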

Keyword Arguments Propagation

All execution methods accept **eval_kwargs_lists for passing evaluator-specific parameters. This is important for:

  • CorrectnessEvaluator — which requires reference answers not available from the query engine
  • Custom evaluators — which may need additional context or configuration per query

The kwargs are structured as lists parallel to the query list, allowing per-query customization of evaluation parameters.
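
The parallel-list structure means that the value at position i of every keyword list belongs to query i. The sketch below shows that slicing with a hypothetical helper, kwargs_for_query; the reference key is assumed here to be the keyword under which a correctness evaluator receives its reference answer.

```python
from typing import Any, Dict, List


def kwargs_for_query(eval_kwargs_lists: Dict[str, List[Any]],
                     idx: int) -> Dict[str, Any]:
    """Pick the idx-th value from every keyword list."""
    return {key: values[idx] for key, values in eval_kwargs_lists.items()}


queries = ["What is the capital of France?", "What is 2 + 2?"]

# One reference answer per query, in the same order as `queries`.
eval_kwargs_lists = {"reference": ["Paris", "4"]}

per_query = [kwargs_for_query(eval_kwargs_lists, i)
             for i in range(len(queries))]
```

Each entry of per_query is the keyword dictionary that would accompany the evaluator call for the matching query, which is why every keyword list must be as long as the query list.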

Knowledge Sources

  • LlamaIndex Evaluation
  • LlamaIndex BatchEvalRunner

2026-02-11 00:00 GMT
