
Principle:Arize AI Phoenix Evaluation Execution

From Leeroopedia
Knowledge Sources
Domains LLM Evaluation, Pipeline Orchestration, Batch Processing
Last Updated 2026-02-14 00:00 GMT

Overview

Evaluation execution orchestrates running one or more evaluators against every row of a dataset, managing concurrency, retries, error collection, and progress reporting to produce augmented results at scale.

Description

Individual evaluator calls are straightforward -- pass a dictionary of input fields and receive a list of Score objects. The challenge arises when evaluations must be performed across hundreds or thousands of records, potentially with multiple evaluators per record, while interacting with rate-limited LLM APIs. Ad-hoc loops quickly become insufficient due to the need for:

  • Batch orchestration: iterating over all (row, evaluator) combinations systematically.
  • Concurrency control: running multiple LLM calls in parallel (async) without overwhelming the provider's rate limits.
  • Retry logic: automatically retrying transient failures (API timeouts, rate-limit rejections) with configurable maximum retries.
  • Error isolation: continuing evaluation of remaining rows when individual evaluations fail, rather than aborting the entire pipeline.
  • Progress reporting: providing real-time feedback on how many evaluations have completed, via tqdm progress bars.
  • Result aggregation: collecting all scores and execution metadata into a structured output that augments the original dataset.

Evaluation execution encapsulates these concerns into two entry-point functions -- evaluate_dataframe() (synchronous) and async_evaluate_dataframe() (asynchronous) -- backed by dedicated executor classes (SyncExecutor and AsyncExecutor) that handle the mechanics of parallel execution, retry, and error reporting.

Usage

Use evaluation execution when you need to:

  • Run one or more evaluators across an entire DataFrame of evaluation cases.
  • Control the degree of parallelism for LLM-based evaluators to balance throughput against rate limits.
  • Retry failed evaluations automatically without manual intervention.
  • Collect detailed execution metadata (status, exceptions, timing) alongside evaluation scores.
  • Produce an augmented DataFrame that can be directly analyzed or uploaded to a monitoring platform.

Theoretical Basis

Execution Model

The execution pipeline follows a task-based model where each (row_index, evaluator_index) pair is an independent task:

Input DataFrame (N rows) x M Evaluators = N * M Tasks

For each task (i, j):
  1. Extract row i as a dictionary (eval_input)
  2. Call evaluator j's evaluate() or async_evaluate() with eval_input
  3. Collect List[Score] on success, or record exception on failure
  4. Retry up to max_retries times on transient errors
  5. Record execution_details (status, exception info, timing)
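
The five steps above can be sketched as a plain Python loop. This is an illustrative sketch, not the actual SyncExecutor source: the function and variable names are hypothetical, and plain dicts stand in for DataFrame rows and Score objects.

```python
import time

def run_tasks(rows, evaluators, max_retries=10):
    """Run every (row, evaluator) pair as an independent task."""
    results = {}  # (row_index, evaluator_index) -> outcome record
    for i, row in enumerate(rows):                 # step 1: row i as eval_input
        for j, evaluator in enumerate(evaluators):
            start = time.monotonic()
            exceptions = []
            for _attempt in range(max_retries + 1):
                try:
                    scores = evaluator(row)        # step 2: call the evaluator
                    status = "success"             # step 3: collect the scores
                    break
                except Exception as exc:           # step 4: retry on failure
                    exceptions.append(str(exc))
            else:                                  # all retries exhausted
                scores, status = None, "error"
            results[(i, j)] = {                    # step 5: execution details
                "status": status,
                "scores": scores,
                "exceptions": exceptions,
                "execution_time_sec": time.monotonic() - start,
            }
    return results

def exact_match(row):
    return [{"name": "exact_match",
             "score": 1.0 if row["output"] == row["expected"] else 0.0}]

out = run_tasks([{"output": "4", "expected": "4"}], [exact_match])
```

Because each (i, j) task is independent, a failure in one cell of the N x M grid never corrupts the results of any other cell.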

Synchronous vs. Asynchronous Execution

Property-by-property comparison of evaluate_dataframe() (sync) versus async_evaluate_dataframe() (async):

  • Executor: SyncExecutor vs. AsyncExecutor.
  • Parallelism: sequential task execution vs. concurrent execution with a configurable concurrency cap (default 3).
  • Evaluator method called: evaluator.evaluate() vs. evaluator.async_evaluate().
  • Ideal for: code-based evaluators, small datasets, and debugging vs. LLM-based evaluators, large datasets, and production workloads.
  • Rate limiting: the LLM client's built-in rate limiters vs. the LLM client's async rate limiters plus the concurrency cap.
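
On the async side, a semaphore is the standard way to enforce a concurrency cap like the default of 3 noted above. The sketch below uses hypothetical helper names and a toy evaluator; the real AsyncExecutor's internals may differ.

```python
import asyncio

async def async_run_tasks(rows, evaluators, concurrency=3):
    """Fan out all (row, evaluator) tasks, capped by a semaphore."""
    sem = asyncio.Semaphore(concurrency)

    async def one_task(i, j, row, evaluator):
        async with sem:  # at most `concurrency` evaluator calls in flight
            return (i, j), await evaluator(row)

    tasks = [one_task(i, j, row, ev)
             for i, row in enumerate(rows)
             for j, ev in enumerate(evaluators)]
    return dict(await asyncio.gather(*tasks))

async def toy_evaluator(row):
    await asyncio.sleep(0)  # stands in for an async LLM call
    return [{"name": "nonempty", "score": 1.0 if row["output"] else 0.0}]

results = asyncio.run(
    async_run_tasks([{"output": "hi"}, {"output": ""}], [toy_evaluator])
)
```

The concurrency cap limits in-flight requests, while the LLM client's own rate limiter governs requests per unit time; both are needed to stay within provider limits.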

Error Handling Strategy

The exit_on_error parameter controls failure behavior:

  • exit_on_error=True (default for SyncExecutor): the pipeline stops at the first unrecoverable error. This is useful during development to surface issues quickly.
  • exit_on_error=False: the pipeline records the error in execution_details and continues processing remaining tasks. This is recommended for production workloads where partial results are preferable to no results.

Each task is retried up to max_retries times (default 10) before being marked as failed. The retry mechanism is especially important for LLM evaluators that encounter transient API errors.
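
The two failure policies can be sketched as follows (illustrative only, not the Phoenix implementation; tasks here are zero-argument callables):

```python
def run_with_policy(tasks, exit_on_error=False):
    """Run tasks in order, recording per-task execution details."""
    details = []
    for task in tasks:
        try:
            task()
            details.append({"status": "success", "exceptions": []})
        except Exception as exc:
            details.append({"status": "error", "exceptions": [str(exc)]})
            if exit_on_error:
                raise  # development mode: stop at the first failure
    return details

def flaky():
    raise ValueError("rate-limit rejection")

# Production-style run: the failing task is recorded, the rest continue.
details = run_with_policy([lambda: None, flaky, lambda: None],
                          exit_on_error=False)
```

With exit_on_error=True the same call would re-raise at the second task, leaving only the first result, which is exactly the fast-fail behavior that is useful during development.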

Result Augmentation

After all tasks complete, the results are merged back into a copy of the original DataFrame with two types of additional columns per evaluator:

{evaluator.name}_execution_details  -- JSON string with:
  {
    "status": "success" | "error",
    "exceptions": [...],              # list of exception messages (if any)
    "execution_time_sec": float       # wall-clock time for this evaluation
  }

{score.name}_score                  -- JSON string with Score.to_dict() for each Score

This design ensures that the original data is never modified (a copy is returned), and all metadata needed for debugging, auditing, and analysis is preserved alongside the scores.
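
The merge step can be sketched with plain dicts standing in for DataFrame rows (the helper name and outcome shape are assumptions for illustration; only the column-naming pattern follows the description above):

```python
import copy
import json

def augment(rows, evaluator_name, outcomes):
    """Attach execution-details and score columns without mutating rows."""
    augmented = copy.deepcopy(rows)  # the original data is left untouched
    for row, outcome in zip(augmented, outcomes):
        row[f"{evaluator_name}_execution_details"] = json.dumps({
            "status": outcome["status"],
            "exceptions": outcome["exceptions"],
            "execution_time_sec": outcome["execution_time_sec"],
        })
        for score in outcome.get("scores") or []:
            row[f"{score['name']}_score"] = json.dumps(score)
    return augmented

rows = [{"output": "4"}]
outcomes = [{"status": "success", "exceptions": [],
             "execution_time_sec": 0.01,
             "scores": [{"name": "exact_match", "score": 1.0}]}]
new_rows = augment(rows, "correctness", outcomes)
```

Storing the metadata as JSON strings keeps the augmented output serializable in a single flat table, at the cost of a json.loads() call when analyzing it later.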

Related Pages

Implemented By

Uses Heuristic
