Principle: MLflow GenAI Evaluation Execution
| Knowledge Sources | |
|---|---|
| Domains | ML_Ops, LLM_Evaluation |
| Last Updated | 2026-02-13 20:00 GMT |
Overview
Orchestrating the end-to-end evaluation of a generative AI application by binding together a prepared dataset, a set of scorers, and an optional predict function within a tracked experiment run.
Description
Evaluation of a generative AI application is more than running scorers over data. It requires coordinating several concerns in the correct order: converting raw data into a canonical format, optionally invoking the application under test to generate outputs and traces, dispatching each row to every scorer, collecting structured assessments, aggregating per-row scores into summary metrics, and persisting all results to an experiment tracking system for later comparison.
The evaluation execution principle defines this orchestration contract. A single evaluation run proceeds through well-defined phases:
- Data normalisation -- The input data, regardless of its original format, is converted into a standardised DataFrame with the canonical columns (inputs, outputs, expectations, trace, tags).
- Prediction (optional) -- If a predict function is provided, it is called for each row to generate outputs and traces. The predict function is validated, wrapped for tracing if needed, and executed with the inputs from each row.
- Scoring -- Each scorer is invoked on each row. The harness introspects the scorer's call signature to pass only the parameters it requires. Scorer failures on individual rows are captured as error assessments rather than aborting the entire run.
- Aggregation -- Per-row feedback values are aggregated according to each scorer's aggregation policy to produce summary metrics (e.g., `correctness/mean`, `safety/mean`).
- Persistence -- All metrics, per-row assessments, and traces are logged to an MLflow experiment run, enabling cross-run comparison and historical tracking.
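The scoring phase's signature introspection can be sketched in a few lines. This is a hypothetical harness helper (the `call_scorer` name and the plain-callable scorer are illustrative, not MLflow's internal API): it inspects the scorer's parameters and passes only the canonical row fields the scorer declares.

```python
import inspect

def call_scorer(scorer, row):
    """Hypothetical harness step: pass only the parameters that the
    scorer's call signature declares, drawn from the canonical row keys
    (inputs, outputs, expectations, trace, tags)."""
    params = inspect.signature(scorer).parameters
    kwargs = {name: row[name] for name in params if name in row}
    return scorer(**kwargs)

# A scorer that only asks for outputs and expectations; the harness
# never passes it trace or tags.
def exact_match(outputs, expectations):
    return 1.0 if outputs == expectations else 0.0

row = {"inputs": "2+2?", "outputs": "4", "expectations": "4",
       "trace": None, "tags": {}}
score = call_scorer(exact_match, row)  # -> 1.0
```

Because dispatch is keyed on parameter names, adding a new canonical column never breaks existing scorers: they simply do not receive it.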
This phased approach ensures that each concern is handled independently, failures are isolated to individual rows or scorers, and the full provenance of every assessment is recorded.
Usage
Execute an evaluation whenever you need to measure the quality of a generative AI application against defined criteria. Common triggers include: iterating on prompt templates during development, running regression checks in CI pipelines, comparing model versions before deployment, and auditing deployed applications using recorded traces. The execution principle applies regardless of whether outputs are pre-computed or generated on the fly.
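In practice these triggers reduce to a single evaluation call. The sketch below assumes MLflow 3's `mlflow.genai` API (`mlflow.genai.evaluate` plus the built-in `Correctness` and `Safety` scorers); the exact names and signature should be checked against the installed version. The import is kept inside the function so the dataset shape can be inspected on its own.

```python
def run_eval(eval_dataset, predict_fn):
    # Imported lazily; assumes MLflow 3's mlflow.genai API is available.
    import mlflow
    from mlflow.genai.scorers import Correctness, Safety  # assumed built-ins

    # One tracked run: metrics, per-row assessments, and traces land together.
    return mlflow.genai.evaluate(
        data=eval_dataset,
        scorers=[Correctness(), Safety()],
        predict_fn=predict_fn,  # omit to score pre-computed outputs instead
    )

# Each record carries the canonical inputs/expectations keys.
eval_dataset = [
    {"inputs": {"question": "What is MLflow?"},
     "expectations": {"expected_response": "An open-source ML platform."}},
    {"inputs": {"question": "What does a scorer return?"},
     "expectations": {"expected_response": "A Feedback object."}},
]
```

The same call serves both modes from the Usage paragraph: with `predict_fn` the outputs are generated on the fly; without it, the harness scores `outputs` already present in the data.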
Theoretical Basis
The evaluation execution follows a pipeline architecture where each phase transforms or enriches the data flowing through it:
raw data
-> normalise(data) [Phase 1: canonical DataFrame]
-> predict(df, predict_fn) [Phase 2: add outputs + traces]
-> score(df, scorers) [Phase 3: per-row Feedback objects]
-> aggregate(feedbacks) [Phase 4: summary metrics dict]
-> persist(run_id, metrics, df) [Phase 5: log to tracking store]
-> EvaluationResult
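The transformation phases above can be sketched as plain functions composed in order. This is a minimal stand-in, not MLflow's implementation: rows are dicts standing in for the canonical DataFrame, scorers are bare callables, and the aggregation policy is hard-coded to the mean.

```python
from statistics import mean

def normalise(records):
    """Phase 1: coerce raw records into rows with the canonical keys."""
    keys = ("inputs", "outputs", "expectations", "trace", "tags")
    return [{k: r.get(k) for k in keys} for r in records]

def score(rows, scorers):
    """Phase 3: one feedback value per (scorer, row)."""
    return {name: [fn(row) for row in rows] for name, fn in scorers.items()}

def aggregate(feedbacks):
    """Phase 4: apply a mean aggregation policy per scorer."""
    return {f"{name}/mean": mean(values) for name, values in feedbacks.items()}

raw = [
    {"outputs": "4", "expectations": "4"},
    {"outputs": "5", "expectations": "4"},
]
scorers = {
    "correctness": lambda row: 1.0 if row["outputs"] == row["expectations"] else 0.0,
}
metrics = aggregate(score(normalise(raw), scorers))
# metrics == {"correctness/mean": 0.5}
```

Note how each phase consumes only the previous phase's output, which is what lets the real harness isolate failures per phase and per row.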
Key design properties:
- Idempotent scoring: Scorers receive immutable snapshots of each row. Scorer execution order does not affect results.
- Fault isolation: A scorer that raises an exception on one row produces an error assessment for that row but does not prevent other scorers or other rows from completing.
- Run context: The entire pipeline executes within a single MLflow run, either creating a new run or reusing an active one. This ties all artefacts -- metrics, traces, assessments -- to a single auditable unit.
- Parallelism: Scoring can be parallelised across rows using a configurable thread pool, controlled by the `MLFLOW_GENAI_EVAL_MAX_WORKERS` environment variable.
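The fault-isolation and parallelism properties combine naturally in a thread-pool scoring loop. The sketch below is illustrative (the `score_rows` helper and error-assessment dict shape are hypothetical); it reads the worker count from the `MLFLOW_GENAI_EVAL_MAX_WORKERS` environment variable named above and converts a per-row scorer exception into an error assessment rather than a crash.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def score_rows(rows, scorer, default_workers=4):
    """Score rows in parallel; a row where the scorer raises yields an
    error assessment instead of aborting the run (fault isolation)."""
    max_workers = int(os.environ.get("MLFLOW_GENAI_EVAL_MAX_WORKERS",
                                     default_workers))

    def safe(row):
        try:
            return {"value": scorer(row), "error": None}
        except Exception as exc:  # captured, not propagated
            return {"value": None, "error": str(exc)}

    # pool.map preserves row order, so results align with the input rows.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe, rows))

def brittle_scorer(row):
    return len(row["outputs"]) / 10  # raises KeyError on malformed rows

results = score_rows([{"outputs": "fine"}, {}], brittle_scorer)
# results[0] carries a value; results[1] carries an error assessment
```

Because scorers receive immutable per-row snapshots, this parallel loop is consistent with the idempotent-scoring property: results do not depend on execution order.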