Implementation:Mlflow Mlflow Evaluation Result
| Knowledge Sources | |
|---|---|
| Domains | ML_Ops, LLM_Evaluation |
| Last Updated | 2026-02-13 20:00 GMT |
Overview
Concrete tool for accessing and interpreting evaluation results -- including aggregated metrics and per-row scores -- provided by the MLflow library.
Description
EvaluationResult is the dataclass returned by mlflow.genai.evaluate(). It bundles three pieces of information:
- run_id -- The MLflow run ID under which the evaluation was logged. This links the results to the experiment tracking store, enabling cross-run comparison and artefact retrieval.
- metrics -- A dictionary mapping metric names to aggregated float values. Metric names follow the convention {scorer_name}/{aggregation} (e.g., correctness/mean, safety/mean). This dictionary is also logged to the MLflow run, making it available in the MLflow UI and via the search API.
- result_df -- A pandas DataFrame containing per-row evaluation results. Each row corresponds to one evaluation input and includes the original inputs, outputs, expectations, and scorer assessment columns: {scorer_name}/value, {scorer_name}/rationale, {scorer_name}/error_message, and {scorer_name}/error_code.
For backwards compatibility, a tables property is provided that returns {"eval_results": result_df}.
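The {scorer_name}/{aggregation} naming convention described above makes it straightforward to regroup the flat metrics dictionary by scorer. The following is a minimal sketch, not part of MLflow: the helper name group_metrics_by_scorer and the example values are hypothetical, and it assumes only that keys follow the convention stated above.

```python
from collections import defaultdict


def group_metrics_by_scorer(metrics: dict[str, float]) -> dict[str, dict[str, float]]:
    """Group 'scorer/aggregation' keys into {scorer: {aggregation: value}}.

    Hypothetical helper for illustration; not an MLflow API.
    """
    grouped: dict[str, dict[str, float]] = defaultdict(dict)
    for key, value in metrics.items():
        # Split on the last "/" so the aggregation is always the final segment.
        scorer, sep, aggregation = key.rpartition("/")
        if not sep:
            # A key without "/" does not follow the convention;
            # keep it under its own name with an empty aggregation.
            scorer, aggregation = key, ""
        grouped[scorer][aggregation] = value
    return dict(grouped)


# Illustrative values only, shaped like result.metrics.
example_metrics = {"correctness/mean": 0.8, "safety/mean": 1.0}
grouped = group_metrics_by_scorer(example_metrics)
print(grouped)
```

In real use, the argument would be result.metrics from an evaluation run rather than the hand-written dictionary.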
Usage
Access the EvaluationResult returned by mlflow.genai.evaluate() to inspect aggregated quality metrics, drill into per-row scores for failure analysis, and use the run ID to retrieve associated artefacts from the tracking store. The metrics dict is suitable for automated threshold checks in CI pipelines, while result_df supports interactive debugging and reporting.
Code Reference
Source Location
- Repository: mlflow
- File: mlflow/genai/evaluation/entities.py
- Lines: L196-222
Signature
@dataclass
class EvaluationResult:
    run_id: str
    metrics: dict[str, float]
    result_df: pd.DataFrame | None

    @property
    def tables(self) -> dict[str, pd.DataFrame]:
        """For backwards compatibility."""
        return {"eval_results": self.result_df} if self.result_df is not None else {}
Import
# Typically obtained as the return value of mlflow.genai.evaluate():
import mlflow.genai
result = mlflow.genai.evaluate(data=..., scorers=[...])
# Direct import (rarely needed):
from mlflow.genai.evaluation.entities import EvaluationResult
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| (constructed internally) | -- | -- | EvaluationResult is created by the evaluation harness. Users do not construct it directly. |
Outputs
| Name | Type | Description |
|---|---|---|
| run_id | str | The MLflow run ID for the evaluation. Use with mlflow.get_run(run_id) to retrieve the full run record. |
| metrics | dict[str, float] | Aggregated evaluation metrics. Keys follow {scorer_name}/{aggregation} convention (e.g., "correctness/mean"). Values are floats. |
| result_df | pd.DataFrame or None | Per-row results DataFrame. Columns include inputs, outputs, expectations, trace, and for each scorer: {scorer}/value, {scorer}/rationale, {scorer}/error_message, {scorer}/error_code. |
| tables | dict[str, pd.DataFrame] | Backwards-compatible property returning {"eval_results": result_df}. |
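Because each scorer contributes a fixed set of columns, the scorers present in a result can be recovered from the column names, and rows where a scorer failed can be separated from rows it actually scored. The sketch below works on plain row dictionaries, such as those produced by pandas' DataFrame.to_dict("records"), so it runs without MLflow. The helper names and sample rows are hypothetical, and it assumes that {scorer}/error_message is empty (None, an empty string, or NaN) when the scorer succeeded.

```python
def is_missing(value):
    """True for None, an empty string, or a float NaN (a common pandas gap marker)."""
    return value is None or value == "" or (isinstance(value, float) and value != value)


def scorer_names(columns):
    """Infer scorer names from '<scorer>/value' column names."""
    suffix = "/value"
    return sorted(col[: -len(suffix)] for col in columns if col.endswith(suffix))


def split_rows(records, scorer):
    """Partition rows into (scored, errored) for one scorer.

    Assumption: a non-empty '<scorer>/error_message' means the scorer failed.
    """
    scored, errored = [], []
    for row in records:
        if is_missing(row.get(f"{scorer}/error_message")):
            scored.append(row)
        else:
            errored.append(row)
    return scored, errored


# Illustrative rows only; real rows would come from result.result_df.
rows = [
    {"inputs": "q1", "correctness/value": "yes", "correctness/error_message": None},
    {"inputs": "q2", "correctness/value": "no", "correctness/error_message": float("nan")},
    {"inputs": "q3", "correctness/value": None,
     "correctness/error_message": "illustrative failure message"},
]

names = scorer_names(rows[0].keys())
scored, errored = split_rows(rows, "correctness")
print(names, len(scored), len(errored))
```

Separating errored rows first keeps scorer failures from being mistaken for genuine "no" verdicts during failure analysis.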
Usage Examples
Basic Usage
import mlflow.genai
from mlflow.genai.scorers import Correctness, Safety
data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "outputs": "MLflow is an open-source ML platform.",
        "expectations": {"expected_response": "MLflow is an ML platform."},
    },
]
result = mlflow.genai.evaluate(
    data=data,
    scorers=[Correctness(), Safety()],
)
# Access aggregated metrics
print(result.metrics)
# Example: {"correctness/mean": 1.0, "safety/mean": 1.0}
# Access per-row results
print(result.result_df.columns.tolist())
# Example: ["inputs", "outputs", "expectations", "correctness/value",
# "correctness/rationale", "safety/value", "safety/rationale", ...]
# Filter failing rows (result_df can be None per the signature, so check it first in production code)
df = result.result_df
failures = df[df["correctness/value"] == "no"]
print(failures[["inputs", "outputs", "correctness/rationale"]])
# Access the MLflow run
print(f"Run ID: {result.run_id}")
Threshold Checks for CI
import mlflow.genai
from mlflow.genai.scorers import Correctness, Safety

# `data` is the evaluation dataset defined in the Basic Usage example
result = mlflow.genai.evaluate(data=data, scorers=[Correctness(), Safety()])
# Automated quality gate
assert result.metrics["correctness/mean"] >= 0.9, (
    f"Correctness dropped to {result.metrics['correctness/mean']}"
)
assert result.metrics["safety/mean"] >= 0.95, (
    f"Safety dropped to {result.metrics['safety/mean']}"
)
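Bare assert statements stop at the first failure and are skipped entirely when Python runs with the -O flag, so a CI gate may prefer an explicit check that reports every violated threshold. The following is one possible sketch; check_thresholds and the sample values are hypothetical and not part of MLflow.

```python
def check_thresholds(metrics, thresholds):
    """Return a list of human-readable failures; an empty list means the gate passed.

    Hypothetical helper: `metrics` is shaped like result.metrics and
    `thresholds` maps metric names to their minimum acceptable values.
    """
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            # A missing metric is treated as a failure rather than silently ignored.
            failures.append(f"{name}: metric missing from results")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} is below the minimum {minimum:.3f}")
    return failures


# Illustrative values only.
example_metrics = {"correctness/mean": 0.85, "safety/mean": 1.0}
thresholds = {"correctness/mean": 0.9, "safety/mean": 0.95}

failures = check_thresholds(example_metrics, thresholds)
for line in failures:
    print(line)
```

In a CI script, a non-empty failures list would typically be followed by a non-zero exit, for example raise SystemExit(1).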
Backwards-Compatible Tables Access
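import mlflow.genai
from mlflow.genai.scorers import Correctness

# `data` is the evaluation dataset defined in the Basic Usage example
result = mlflow.genai.evaluate(data=data, scorers=[Correctness()])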
# tables property for legacy code
eval_df = result.tables["eval_results"]
print(eval_df.head())
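According to the signature shown earlier, tables returns an empty dictionary when result_df is None, so indexing with ["eval_results"] would raise a KeyError in that case. A defensive lookup with dict.get avoids this. The sketch below uses a hypothetical stand-in class that mirrors the documented signature (it is not the MLflow class) so that it runs without MLflow or pandas installed.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StandInResult:
    """Hypothetical stand-in mirroring the documented signature; not the MLflow class."""

    run_id: str
    metrics: dict
    result_df: Optional[Any]

    @property
    def tables(self) -> dict:
        # Same expression as the documented property.
        return {"eval_results": self.result_df} if self.result_df is not None else {}


def get_eval_table(result):
    """Return the per-row table, or None when no table is available."""
    return result.tables.get("eval_results")


# Illustrative objects only; a plain list stands in for the DataFrame.
empty = StandInResult(run_id="run-without-rows", metrics={}, result_df=None)
full = StandInResult(run_id="run-with-rows", metrics={}, result_df=[{"inputs": "q"}])

print(get_eval_table(empty))
print(get_eval_table(full))
```

The same get_eval_table helper would apply unchanged to a real EvaluationResult, since it relies only on the tables property.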