Implementation:Mlflow Mlflow Evaluation Result

Knowledge Sources	MLflow MLflow GenAI API
Domains	ML_Ops, LLM_Evaluation
Last Updated	2026-02-13 20:00 GMT

Overview

Concrete tool, provided by the MLflow library, for accessing and interpreting evaluation results, including aggregated metrics and per-row scores.

Description

EvaluationResult is the dataclass returned by mlflow.genai.evaluate(). It bundles three pieces of information:

run_id -- The MLflow run ID under which the evaluation was logged. This links the results to the experiment tracking store, enabling cross-run comparison and artefact retrieval.
metrics -- A dictionary mapping metric names to aggregated float values. Metric names follow the convention {scorer_name}/{aggregation} (e.g., correctness/mean, safety/mean). This dictionary is also logged to the MLflow run, making it available in the MLflow UI and via the search API.
result_df -- A pandas DataFrame containing per-row evaluation results, or None (the field is typed pd.DataFrame | None). Each row corresponds to one evaluation input and includes the original inputs, outputs, expectations, and scorer assessment columns: {scorer_name}/value, {scorer_name}/rationale, {scorer_name}/error_message, and {scorer_name}/error_code.

For backwards compatibility, a tables property is provided that returns {"eval_results": result_df}, or an empty dict when result_df is None.
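
The {scorer_name}/{aggregation} key convention described above makes it straightforward to regroup the flat metrics dictionary per scorer. The following is a minimal sketch under that assumption; group_metrics_by_scorer is a hypothetical helper, not part of MLflow, and the sample values are illustrative rather than real evaluation output.

```python
from collections import defaultdict


def group_metrics_by_scorer(metrics: dict[str, float]) -> dict[str, dict[str, float]]:
    """Regroup {"scorer/aggregation": value} into {"scorer": {"aggregation": value}}.

    Hypothetical helper, not part of MLflow. Splits on the last "/" so that a
    scorer name containing "/" stays intact; keys without any "/" are filed
    under the placeholder aggregation name "value".
    """
    grouped: dict[str, dict[str, float]] = defaultdict(dict)
    for key, value in metrics.items():
        scorer, sep, aggregation = key.rpartition("/")
        if not sep:
            # No "/" in the key: treat the whole key as the scorer name.
            scorer, aggregation = key, "value"
        grouped[scorer][aggregation] = value
    return dict(grouped)


# Illustrative values only; in practice pass result.metrics.
sample = {"correctness/mean": 0.8, "safety/mean": 1.0}
by_scorer = group_metrics_by_scorer(sample)
```

Splitting on the last separator is a design choice for this sketch; if scorer names never contain "/", a plain split would work equally well.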

Usage

Access the EvaluationResult returned by mlflow.genai.evaluate() to inspect aggregated quality metrics, drill into per-row scores for failure analysis, and use the run ID to retrieve associated artefacts from the tracking store. The metrics dict is suitable for automated threshold checks in CI pipelines, while result_df supports interactive debugging and reporting.

Code Reference

Source Location

Repository: mlflow
File: mlflow/genai/evaluation/entities.py
Lines: L196-222

Signature

@dataclass
class EvaluationResult:
    run_id: str
    metrics: dict[str, float]
    result_df: pd.DataFrame | None

    @property
    def tables(self) -> dict[str, pd.DataFrame]:
        """For backwards compatibility."""
        return {"eval_results": self.result_df} if self.result_df is not None else {}

Import

# Typically obtained as the return value of mlflow.genai.evaluate():
import mlflow.genai
result = mlflow.genai.evaluate(data=..., scorers=[...])

# Direct import (rarely needed):
from mlflow.genai.evaluation.entities import EvaluationResult

I/O Contract

Inputs

Name	Type	Required	Description
(constructed internally)	--	--	`EvaluationResult` is created by the evaluation harness. Users do not construct it directly.

Outputs

Name	Type	Description
run_id	`str`	The MLflow run ID for the evaluation. Use with `mlflow.get_run(run_id)` to retrieve the full run record.
metrics	`dict[str, float]`	Aggregated evaluation metrics. Keys follow `{scorer_name}/{aggregation}` convention (e.g., `"correctness/mean"`). Values are floats.
result_df	`pd.DataFrame | None`	Per-row results DataFrame, or `None`. Columns include `inputs`, `outputs`, `expectations`, `trace`, and for each scorer: `{scorer}/value`, `{scorer}/rationale`, `{scorer}/error_message`, `{scorer}/error_code`.
tables	`dict[str, pd.DataFrame]`	Backwards-compatible property returning `{"eval_results": result_df}`, or `{}` when `result_df` is `None`.

Usage Examples

Basic Usage

import mlflow.genai
from mlflow.genai.scorers import Correctness, Safety

data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "outputs": "MLflow is an open-source ML platform.",
        "expectations": {"expected_response": "MLflow is an ML platform."},
    },
]

result = mlflow.genai.evaluate(
    data=data,
    scorers=[Correctness(), Safety()],
)

# Access aggregated metrics
print(result.metrics)
# Example: {"correctness/mean": 1.0, "safety/mean": 1.0}

# Access per-row results
print(result.result_df.columns.tolist())
# Example: ["inputs", "outputs", "expectations", "correctness/value",
#           "correctness/rationale", "safety/value", "safety/rationale", ...]

# Filter failing rows
failures = result.result_df[result.result_df["correctness/value"] == "no"]
print(failures[["inputs", "outputs", "correctness/rationale"]])

# Access the MLflow run
print(f"Run ID: {result.run_id}")
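
The filter above treats every row as if the scorer ran successfully. Since the per-row columns also include {scorer}/error_message, it can help to separate rows the scorer judged as failing from rows where the scorer itself errored. The sketch below works on plain records (for example the output of result.result_df.to_dict("records")) so it does not depend on pandas; triage_rows is a hypothetical helper, the "no" failing value mirrors the example above, and the sample records are made up for illustration.

```python
def triage_rows(records: list[dict], scorer: str, failing_value="no"):
    """Split per-row records into (failures, errors) for one scorer.

    Hypothetical helper, not part of MLflow. `records` is assumed to come
    from `result.result_df.to_dict("records")` and to carry the
    `{scorer}/value` and `{scorer}/error_message` columns.
    """
    failures, errors = [], []
    for row in records:
        message = row.get(f"{scorer}/error_message")
        # Missing values may appear as None or NaN, so only a non-empty
        # string counts as a real error message.
        if isinstance(message, str) and message:
            errors.append(row)
        elif row.get(f"{scorer}/value") == failing_value:
            failures.append(row)
    return failures, errors


# Made-up records for illustration only.
records = [
    {"inputs": {"question": "q1"}, "correctness/value": "yes", "correctness/error_message": None},
    {"inputs": {"question": "q2"}, "correctness/value": "no", "correctness/error_message": float("nan")},
    {"inputs": {"question": "q3"}, "correctness/value": None, "correctness/error_message": "judge timed out"},
]
failures, errors = triage_rows(records, "correctness")
```

Rows in errors usually call for a retry or a scorer fix, whereas rows in failures point at the evaluated outputs themselves.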

Threshold Checks for CI

# Reuses `data` and the scorer imports from Basic Usage above
result = mlflow.genai.evaluate(data=data, scorers=[Correctness(), Safety()])

# Automated quality gate
assert result.metrics["correctness/mean"] >= 0.9, (
    f"Correctness dropped to {result.metrics['correctness/mean']}"
)
assert result.metrics["safety/mean"] >= 0.95, (
    f"Safety dropped to {result.metrics['safety/mean']}"
)
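
Bare assert statements stop at the first failure, raise KeyError if a metric is missing, and are skipped entirely when Python runs with -O. As an alternative, the sketch below collects every violation before failing. check_thresholds is a hypothetical helper, not part of MLflow, and the metric names and thresholds are illustrative.

```python
def check_thresholds(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one message per metric that is missing or below its minimum.

    Hypothetical helper, not part of MLflow. An empty list means the
    quality gate passed.
    """
    violations = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            violations.append(f"{name}: metric missing")
        elif value < minimum:
            violations.append(f"{name}: {value} < {minimum}")
    return violations


# Illustrative values only; in practice pass result.metrics.
violations = check_thresholds(
    {"correctness/mean": 0.85, "safety/mean": 1.0},
    {"correctness/mean": 0.9, "safety/mean": 0.95},
)
```

In a CI job, a non-empty list could be turned into a failure with raise SystemExit("\n".join(violations)), which keeps working under -O.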

Backwards-Compatible Tables Access

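# Reuses `data` and the Correctness import from Basic Usage above
result = mlflow.genai.evaluate(data=data, scorers=[Correctness()])

# tables property for legacy code; the dict is empty when result_df is None,
# so this lookup raises KeyError in that case
eval_df = result.tables["eval_results"]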
print(eval_df.head())
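
Because the tables property returns an empty dict when result_df is None (see the Signature section), legacy code that indexes "eval_results" directly can fail with KeyError. A minimal defensive sketch follows; eval_table_or_none is a hypothetical helper, and the SimpleNamespace objects are stand-ins for an EvaluationResult so the snippet runs without MLflow installed.

```python
from types import SimpleNamespace


def eval_table_or_none(result):
    """Return the "eval_results" table, or None when no table is present.

    Hypothetical helper, not part of MLflow. Relies only on `result.tables`
    being a dict, as the `tables` property is documented to return.
    """
    return result.tables.get("eval_results")


# Stand-ins for EvaluationResult; a real result would hold a DataFrame here.
empty_result = SimpleNamespace(tables={})
filled_result = SimpleNamespace(tables={"eval_results": ["row-0", "row-1"]})

missing_table = eval_table_or_none(empty_result)
present_table = eval_table_or_none(filled_result)
```

Callers can then branch on `is None` instead of wrapping the lookup in a try/except block.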

Related Pages

Implements Principle

Principle:Mlflow_Mlflow_Evaluation_Result_Analysis
