Implementation:Mlflow Mlflow Evaluation Result

From Leeroopedia
Knowledge Sources
Domains ML_Ops, LLM_Evaluation
Last Updated 2026-02-13 20:00 GMT

Overview

Concrete tool for accessing and interpreting evaluation results -- including aggregated metrics and per-row scores -- provided by the MLflow library.

Description

EvaluationResult is the dataclass returned by mlflow.genai.evaluate(). It bundles three pieces of information:

  1. run_id -- The MLflow run ID under which the evaluation was logged. This links the results to the experiment tracking store, enabling cross-run comparison and artefact retrieval.
  2. metrics -- A dictionary mapping metric names to aggregated float values. Metric names follow the convention {scorer_name}/{aggregation} (e.g., correctness/mean, safety/mean). This dictionary is also logged to the MLflow run, making it available in the MLflow UI and via the search API.
  3. result_df -- A pandas DataFrame containing per-row evaluation results. Each row corresponds to one evaluation input and includes the original inputs, outputs, expectations, and scorer assessment columns: {scorer_name}/value, {scorer_name}/rationale, {scorer_name}/error_message, and {scorer_name}/error_code.

For backwards compatibility, a tables property is provided. It returns {"eval_results": result_df}, or an empty dict when result_df is None.
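Because metric keys follow the {scorer_name}/{aggregation} convention, the flat metrics dictionary can be regrouped per scorer for reporting. The following is a minimal sketch, not part of MLflow: the helper name group_metrics_by_scorer is hypothetical, the sample values are made up, and the handling of keys without a "/" is an assumption.

```python
from collections import defaultdict


def group_metrics_by_scorer(metrics: dict[str, float]) -> dict[str, dict[str, float]]:
    """Regroup flat '{scorer_name}/{aggregation}' keys as {scorer: {aggregation: value}}.

    Hypothetical helper for illustration; it is not an MLflow API.
    """
    grouped: dict[str, dict[str, float]] = defaultdict(dict)
    for key, value in metrics.items():
        # Split on the last '/' so a scorer name containing '/' stays intact.
        scorer, sep, aggregation = key.rpartition("/")
        if not sep:
            # Assumption: a key without '/' is filed under an empty aggregation name.
            scorer, aggregation = key, ""
        grouped[scorer][aggregation] = value
    return dict(grouped)


# Illustrative values only, not real evaluation output.
example_metrics = {"correctness/mean": 0.8, "safety/mean": 1.0}
grouped = group_metrics_by_scorer(example_metrics)
print(grouped)
```

With a real result, the same helper could be called as group_metrics_by_scorer(result.metrics).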

Usage

Access the EvaluationResult returned by mlflow.genai.evaluate() to inspect aggregated quality metrics, drill into per-row scores for failure analysis, and use the run ID to retrieve associated artefacts from the tracking store. The metrics dict is suitable for automated threshold checks in CI pipelines, while result_df supports interactive debugging and reporting.

Code Reference

Source Location

  • Repository: mlflow
  • File: mlflow/genai/evaluation/entities.py
  • Lines: L196-222

Signature

@dataclass
class EvaluationResult:
    run_id: str
    metrics: dict[str, float]
    result_df: pd.DataFrame | None

    @property
    def tables(self) -> dict[str, pd.DataFrame]:
        """For backwards compatibility."""
        return {"eval_results": self.result_df} if self.result_df is not None else {}

Import

# Typically obtained as the return value of mlflow.genai.evaluate():
import mlflow.genai
result = mlflow.genai.evaluate(data=..., scorers=[...])

# Direct import (rarely needed):
from mlflow.genai.evaluation.entities import EvaluationResult

I/O Contract

Inputs

Name | Type | Required | Description
(constructed internally) | -- | -- | EvaluationResult is created by the evaluation harness. Users do not construct it directly.

Outputs

Name | Type | Description
run_id | str | The MLflow run ID for the evaluation. Use with mlflow.get_run(run_id) to retrieve the full run record.
metrics | dict[str, float] | Aggregated evaluation metrics. Keys follow the {scorer_name}/{aggregation} convention (e.g., "correctness/mean"). Values are floats.
result_df | pd.DataFrame or None | Per-row results DataFrame. Columns include inputs, outputs, expectations, trace, and for each scorer: {scorer}/value, {scorer}/rationale, {scorer}/error_message, {scorer}/error_code.
tables | dict[str, pd.DataFrame] | Backwards-compatible property returning {"eval_results": result_df}, or an empty dict when result_df is None.
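The per-scorer column names listed above can be generated rather than typed by hand, which helps when selecting a scorer's columns from result_df. This is a small sketch based only on the naming pattern described on this page. The names ASSESSMENT_FIELDS, scorer_columns, and present_scorer_columns are hypothetical and not part of MLflow.

```python
# Assessment fields that this page lists for every scorer.
ASSESSMENT_FIELDS = ("value", "rationale", "error_message", "error_code")


def scorer_columns(scorer_name: str) -> list[str]:
    """Build the '{scorer}/{field}' column names for one scorer (hypothetical helper)."""
    return [f"{scorer_name}/{field}" for field in ASSESSMENT_FIELDS]


def present_scorer_columns(scorer_name: str, available_columns) -> list[str]:
    """Keep only the scorer columns that actually exist in the given column list."""
    available = set(available_columns)
    return [column for column in scorer_columns(scorer_name) if column in available]


# Illustrative column list; a real one would come from result.result_df.columns.
columns = ["inputs", "outputs", "correctness/value", "correctness/rationale"]
print(scorer_columns("correctness"))
print(present_scorer_columns("correctness", columns))
```

Filtering against the available columns avoids a KeyError if a given MLflow version omits one of the fields.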

Usage Examples

Basic Usage

import mlflow.genai
from mlflow.genai.scorers import Correctness, Safety

data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "outputs": "MLflow is an open-source ML platform.",
        "expectations": {"expected_response": "MLflow is an ML platform."},
    },
]

result = mlflow.genai.evaluate(
    data=data,
    scorers=[Correctness(), Safety()],
)

# Access aggregated metrics
print(result.metrics)
# Example: {"correctness/mean": 1.0, "safety/mean": 1.0}

# Access per-row results
print(result.result_df.columns.tolist())
# Example: ["inputs", "outputs", "expectations", "correctness/value",
#           "correctness/rationale", "safety/value", "safety/rationale", ...]

# Filter failing rows
failures = result.result_df[result.result_df["correctness/value"] == "no"]
print(failures[["inputs", "outputs", "correctness/rationale"]])

# Access the MLflow run
print(f"Run ID: {result.run_id}")
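For failure analysis it can help to separate rows that a scorer judged as failing from rows where the scorer itself errored. The sketch below works on plain row records so that it runs without MLflow or pandas. With a real result, records could be obtained via result.result_df.to_dict(orient="records"). The helper triage_rows is hypothetical, the sample rows are made up, and treating "no" as the failing value is an assumption carried over from the filter above.

```python
from typing import Any


def triage_rows(
    rows: list[dict[str, Any]],
    scorer_name: str,
    failing_value: Any = "no",
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Split row records into (failed, errored) for one scorer.

    Hypothetical helper for illustration; it is not an MLflow API.
    """
    value_col = f"{scorer_name}/value"
    error_col = f"{scorer_name}/error_message"
    failed, errored = [], []
    for row in rows:
        error = row.get(error_col)
        # Only a non-empty string counts as an error; None and NaN
        # (pandas' usual marker for a missing value) are ignored.
        if isinstance(error, str) and error:
            errored.append(row)
        elif row.get(value_col) == failing_value:
            failed.append(row)
    return failed, errored


# Made-up records shaped like the per-row columns described on this page.
rows = [
    {"correctness/value": "yes", "correctness/error_message": None},
    {"correctness/value": "no", "correctness/error_message": None},
    {"correctness/value": None, "correctness/error_message": "Judge model timed out"},
    {"correctness/value": "no", "correctness/error_message": float("nan")},
]
failed, errored = triage_rows(rows, "correctness")
print(f"{len(failed)} failed, {len(errored)} errored")
```

Keeping errored rows apart from failed rows prevents scorer outages from being mistaken for quality regressions.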

Threshold Checks for CI

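import mlflow.genai
from mlflow.genai.scorers import Correctness, Safety

# `data` is the list of records defined under Basic Usage.
result = mlflow.genai.evaluate(data=data, scorers=[Correctness(), Safety()])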

# Automated quality gate
assert result.metrics["correctness/mean"] >= 0.9, (
    f"Correctness dropped to {result.metrics['correctness/mean']}"
)
assert result.metrics["safety/mean"] >= 0.95, (
    f"Safety dropped to {result.metrics['safety/mean']}"
)
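A chain of assert statements stops at the first failure and raises KeyError if a metric is absent. One possible alternative, sketched here under the assumption that every threshold is a minimum, collects all violations before failing the job. The helpers check_thresholds and enforce are hypothetical and the sample values are made up.

```python
def check_thresholds(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one message per metric that is missing or below its minimum.

    Hypothetical helper for illustration; it is not an MLflow API.
    """
    violations = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            violations.append(f"{name}: metric missing from results")
        elif value < minimum:
            violations.append(f"{name}: {value:.3f} is below the required {minimum:.3f}")
    return violations


def enforce(metrics: dict[str, float], thresholds: dict[str, float]) -> None:
    """Exit with a non-zero status, listing every violation, if any check fails."""
    violations = check_thresholds(metrics, thresholds)
    if violations:
        raise SystemExit("\n".join(violations))


# Illustrative values only; in CI, pass result.metrics instead.
thresholds = {"correctness/mean": 0.9, "safety/mean": 0.95}
print(check_thresholds({"correctness/mean": 0.85}, thresholds))
enforce({"correctness/mean": 0.95, "safety/mean": 1.0}, thresholds)  # passes silently
```

Unlike assert, this check is not skipped when Python runs with the -O flag.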

Backwards-Compatible Tables Access

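import mlflow.genai
from mlflow.genai.scorers import Correctness

# `data` is the list of records defined under Basic Usage.
result = mlflow.genai.evaluate(data=data, scorers=[Correctness()])

# tables property for legacy code; it is an empty dict when result_df is None
eval_df = result.tables.get("eval_results")
if eval_df is not None:
    print(eval_df.head())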

Related Pages

Implements Principle
