Principle:Mlflow Mlflow Evaluation Result Analysis
| Knowledge Sources | |
|---|---|
| Domains | ML_Ops, LLM_Evaluation |
| Last Updated | 2026-02-13 20:00 GMT |
Overview
Structuring evaluation outputs into complementary summary and detail views so that practitioners can both track high-level quality trends and diagnose individual failure cases.
Description
Running an evaluation produces a large volume of information: every scorer generates an assessment for every row, and each assessment may include a value, a rationale, and error details. Presenting this information effectively requires two complementary perspectives.
The first perspective is aggregated metrics. For each scorer, per-row values are reduced to summary statistics (e.g., mean, median, p90) according to the scorer's aggregation policy. Each metric condenses one quality dimension into a single number per aggregation, making it straightforward to compare runs, set thresholds for CI gates, and track quality trends over time. Metric names follow the convention {scorer_name}/{aggregation} (e.g., correctness/mean, safety/mean), which makes them self-describing and easy to filter in dashboards.
The second perspective is the per-row result table. This is a DataFrame where each row corresponds to one evaluation input, and columns capture the scorer outputs: {scorer_name}/value, {scorer_name}/rationale, {scorer_name}/error_message, and {scorer_name}/error_code. This table enables practitioners to inspect exactly which inputs caused failures, read the LLM judge's reasoning, identify patterns in errors, and construct targeted test cases for regression prevention.
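As a minimal sketch of this kind of inspection, the snippet below builds a small pandas DataFrame by hand and separates rows where the scorer judged the output as failing from rows where the scorer itself errored. The data, the inputs column, and the TIMEOUT error code are invented for illustration; only the {scorer_name}/... column naming comes from the description above.

```python
import pandas as pd

# Hypothetical per-row results. Column names follow the {scorer_name}/...
# convention; the rows themselves are made up for this example.
result_df = pd.DataFrame(
    {
        "inputs": ["What is MLflow?", "Reset my password", "Summarize the doc"],
        "correctness/value": [1.0, 0.0, None],
        "correctness/rationale": [
            "Matches the expected answer.",
            "Response omits the reset link.",
            None,
        ],
        "correctness/error_message": [None, None, "Judge request timed out"],
        "correctness/error_code": [None, None, "TIMEOUT"],
    }
)

# Rows where the scorer ran and judged the output incorrect.
failed = result_df[result_df["correctness/value"] == 0.0]

# Rows where the scorer could not produce a value at all.
errored = result_df[result_df["correctness/error_message"].notna()]

print(failed[["inputs", "correctness/rationale"]])
print(errored[["inputs", "correctness/error_code"]])
```

Keeping the two filters separate is a deliberate choice in this sketch: a scorer error says nothing about application quality, so it is usually triaged differently from a genuine failing judgment.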
Together, these two views form a complete analysis surface: metrics answer "how well is the application performing?" while the result table answers "where and why is it failing?"
Usage
Analyse evaluation results whenever making decisions about model quality. Use aggregated metrics for automated gates (e.g., "fail the CI build if correctness/mean drops below 0.9"). Use the per-row result table for manual investigation of failure cases, for constructing focused regression datasets, and for communicating specific quality issues to stakeholders.
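An automated gate of this kind might look like the following sketch. The function name check_quality_gates, the metric values, and the thresholds are all hypothetical; the only assumption carried over from the text is that metrics are available as a dict keyed by {scorer_name}/{aggregation}.

```python
def check_quality_gates(metrics, thresholds):
    """Return a list of human-readable violations (empty if every gate passes)."""
    violations = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            # A missing metric is treated as a failure rather than silently skipped.
            violations.append(f"{name}: metric missing")
        elif value < minimum:
            violations.append(f"{name}: {value:.3f} is below {minimum:.3f}")
    return violations


# Hypothetical metrics from one evaluation run and the minimum values required.
metrics = {"correctness/mean": 0.87, "safety/mean": 0.99}
thresholds = {"correctness/mean": 0.9, "safety/mean": 0.95}

violations = check_quality_gates(metrics, thresholds)
for violation in violations:
    print(violation)
```

In a CI job, a non-empty violations list would typically be turned into a non-zero exit code so that the build fails.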
Theoretical Basis
The two-level structure follows the summary-detail pattern common in measurement and reporting systems:
- Summary level (metrics dict): A compact representation whose size depends on the number of scorers and aggregation functions, not on the number of evaluation rows. This makes it suitable for time-series tracking and threshold-based alerting.
- Detail level (result DataFrame): A row-per-input representation that preserves full provenance. Each cell can be traced back to a specific scorer invocation on a specific input, enabling root-cause analysis.
The aggregation step maps a vector of per-row values to a scalar:
    for each scorer s:
        values = [row[s].value for row in results if row[s].value is numeric]
        for each aggregation_fn in s.aggregations:
            metrics[f"{s.name}/{aggregation_fn.name}"] = aggregation_fn(values)
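The pseudocode above can be made concrete with a small, self-contained sketch. The aggregate function, the AGGREGATIONS registry, and the row layout are illustrative assumptions, not the library's implementation. In this sketch booleans count as numeric (True as 1.0), and None or string values are skipped.

```python
import statistics

# Hypothetical registry mapping aggregation names to functions.
AGGREGATIONS = {"mean": statistics.mean, "median": statistics.median}


def aggregate(rows, scorer_aggregations):
    """Reduce per-row scorer values to a {scorer}/{aggregation} metrics dict."""
    metrics = {}
    for scorer_name, agg_names in scorer_aggregations.items():
        # Keep only numeric values; None (e.g. a scorer error) and strings are dropped.
        values = [
            float(row[scorer_name])
            for row in rows
            if isinstance(row.get(scorer_name), (int, float))
        ]
        if not values:
            continue  # nothing to aggregate for this scorer
        for agg_name in agg_names:
            metrics[f"{scorer_name}/{agg_name}"] = AGGREGATIONS[agg_name](values)
    return metrics


# Invented per-row values for two scorers.
rows = [
    {"correctness": 1.0, "safety": True},
    {"correctness": 0.0, "safety": True},
    {"correctness": None, "safety": False},  # correctness scorer errored on this row
    {"correctness": 1.0, "safety": True},
]

metrics = aggregate(rows, {"correctness": ["mean", "median"], "safety": ["mean"]})
print(metrics)
```

Note that the size of the resulting dict depends only on the scorers and aggregations requested, not on the number of rows.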
The result DataFrame is constructed by flattening each row's assessments into a columnar format:
    rows = []
    for each eval_result:
        row = eval_result.eval_item.to_dict()
        for each assessment in eval_result.assessments:
            row[f"{assessment.name}/value"] = assessment.value
            row[f"{assessment.name}/rationale"] = assessment.rationale
            row[f"{assessment.name}/error_message"] = assessment.error_message
            row[f"{assessment.name}/error_code"] = assessment.error_code
        rows.append(row)
    result_df = DataFrame(rows)
This flattened columnar layout makes the DataFrame immediately usable with standard data analysis tools (filtering, grouping, sorting) without requiring nested object traversal.
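The flattening step can be sketched in runnable form as follows. The Assessment and EvalResult dataclasses are simplified stand-ins defined only for this example, and the sample data is invented; the real objects may carry more fields.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Assessment:
    # Simplified, hypothetical stand-in for one scorer output on one row.
    name: str
    value: Any = None
    rationale: Optional[str] = None
    error_message: Optional[str] = None
    error_code: Optional[str] = None


@dataclass
class EvalResult:
    # Hypothetical container: the evaluation input plus its assessments.
    item: dict
    assessments: list = field(default_factory=list)


def flatten(eval_results):
    """Turn nested eval results into a list of flat, column-keyed row dicts."""
    rows = []
    for result in eval_results:
        row = dict(result.item)  # copy so the original item is not mutated
        for a in result.assessments:
            row[f"{a.name}/value"] = a.value
            row[f"{a.name}/rationale"] = a.rationale
            row[f"{a.name}/error_message"] = a.error_message
            row[f"{a.name}/error_code"] = a.error_code
        rows.append(row)
    return rows


eval_results = [
    EvalResult(
        item={"inputs": "What is MLflow?"},
        assessments=[Assessment("correctness", 1.0, "Matches the expected answer.")],
    ),
    EvalResult(
        item={"inputs": "Reset my password"},
        assessments=[Assessment("correctness", error_message="Judge request timed out")],
    ),
]

rows = flatten(eval_results)
print(rows[0])
```

A list of flat dicts like this can be passed directly to pandas.DataFrame(rows) to obtain the tabular form.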