Implementation: mlflow.genai.evaluate
| Knowledge Sources | |
|---|---|
| Domains | ML_Ops, LLM_Evaluation |
| Last Updated | 2026-02-13 20:00 GMT |
Overview
Concrete tool for running comprehensive evaluation of LLM applications using datasets and scorers provided by the MLflow library.
Description
mlflow.genai.evaluate() is the primary public API for evaluating generative AI models and applications. It accepts an evaluation dataset, a list of scorers, and an optional predict function, then orchestrates the full evaluation pipeline: data normalisation, optional prediction, scoring, metric aggregation, and result persistence.
The function supports three main usage modes:
- Trace-based evaluation -- The dataset contains a trace column (e.g., from mlflow.search_traces()). Scorers extract inputs, outputs, and context directly from the trace objects. No predict function is needed.
- Pre-computed outputs -- The dataset contains inputs and outputs columns. Scorers evaluate the pre-computed outputs. No predict function is needed.
- Live prediction -- The dataset contains only inputs (and optionally expectations). A predict_fn is provided and called for each row to generate outputs and traces on the fly.
Internally, the function delegates to _run_harness, which validates scorers, converts the dataset, manages the MLflow run context, executes predictions and scoring, aggregates metrics, and returns an EvaluationResult. Scoring can be parallelised via the MLFLOW_GENAI_EVAL_MAX_WORKERS environment variable.
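Since scoring concurrency is controlled by an environment variable rather than a function argument, it must be set before the call. A minimal sketch (the worker count of 8 is an illustrative value, not a recommended default):

```python
import os

# Set scoring parallelism before calling mlflow.genai.evaluate().
# The harness reads MLFLOW_GENAI_EVAL_MAX_WORKERS at evaluation time;
# "8" here is just an example value.
os.environ["MLFLOW_GENAI_EVAL_MAX_WORKERS"] = "8"
print(os.environ["MLFLOW_GENAI_EVAL_MAX_WORKERS"])
```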
Usage
Call mlflow.genai.evaluate() during development iteration, CI pipeline evaluation, or production quality auditing. Use it with pre-recorded traces for offline analysis, with pre-computed outputs for deterministic regression testing, or with a live predict function for end-to-end evaluation.
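For the deterministic regression-testing use case, a custom scorer can be a plain comparison function. A minimal sketch of such a scorer body; in MLflow 3 this would typically be wrapped with the scorer decorator from mlflow.genai.scorers before being passed to evaluate() (the function name and signature here are illustrative, not a prescribed schema):

```python
# Hedged sketch: a deterministic custom scorer body for regression testing.
# Assumption: parameter names mirror the dataset's "outputs" and
# "expectations" fields described in this document.
def exact_match(outputs: str, expectations: dict) -> bool:
    """Return True when the output matches the expected response exactly."""
    return outputs == expectations["expected_response"]

print(exact_match(
    "MLflow is an ML platform.",
    {"expected_response": "MLflow is an ML platform."},
))
```

Deterministic scorers like this complement LLM-judged scorers (e.g., Correctness) because their results are stable across runs.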
Code Reference
Source Location
- Repository: mlflow
- File: mlflow/genai/evaluation/base.py
- Lines: L55-302 (public API), L305-441 (internal _run_harness)
Signature
def evaluate(
    data: "EvaluationDatasetTypes",
    scorers: list[Scorer],
    predict_fn: Callable[..., Any] | None = None,
    model_id: str | None = None,
) -> "EvaluationResult":
Import
import mlflow.genai
# or
from mlflow.genai.evaluation.base import evaluate
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data | EvaluationDatasetTypes | Yes | Evaluation dataset. Accepts pd.DataFrame, list[dict], list[Trace], EvaluationDataset, pyspark.sql.DataFrame, or ConversationSimulator. |
| scorers | list[Scorer] | Yes | List of scorer instances (built-in or custom) that produce evaluation assessments. |
| predict_fn | Callable[..., Any] | No | Callable whose keyword arguments match the keys of each row's inputs dict. Required when the dataset has no outputs or trace column. |
| model_id | str | No | Model identifier (e.g., "m-074689...") to associate the evaluation run with a logged model. Can also be set via mlflow.set_active_model(). |
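The predict_fn contract above means each row's inputs dict is unpacked into keyword arguments. A minimal sketch of that calling convention, with a stand-in function in place of a real model call:

```python
# Hedged sketch of the predict_fn contract: keyword arguments must match the
# keys of each row's "inputs" dict, so the harness can call predict_fn(**inputs).
def predict_fn(question: str) -> str:
    # Stand-in for a real model call (e.g., an LLM chat completion).
    return f"Answer to: {question}"

row = {"inputs": {"question": "What is MLflow?"}}
output = predict_fn(**row["inputs"])  # equivalent to predict_fn(question=...)
print(output)
```

A mismatch between the inputs keys and the function's parameter names would raise a TypeError at call time, so keeping them aligned is part of the dataset design.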
Outputs
| Name | Type | Description |
|---|---|---|
| result | EvaluationResult | Object containing run_id (MLflow run ID), metrics (aggregated scores dict), and result_df (per-row DataFrame with scorer values, rationales, and error info). |
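The per-row result_df can be filtered to surface failing rows for manual review. A minimal sketch using a hand-built stand-in DataFrame in place of a real result_df (the column names here are illustrative assumptions, not the exact schema):

```python
import pandas as pd

# Stand-in for EvaluationResult.result_df; column names are illustrative
# assumptions based on the per-row values/rationales described above.
result_df = pd.DataFrame(
    {
        "correctness/value": ["yes", "no"],
        "correctness/rationale": ["matches expectation", "missing key fact"],
    }
)

# Surface failing rows, with rationales, for manual review.
failures = result_df[result_df["correctness/value"] == "no"]
print(len(failures))
```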
Usage Examples
Basic Usage
import mlflow.genai
from mlflow.genai.scorers import Correctness, Safety
data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "outputs": "MLflow is an open-source ML platform.",
        "expectations": {"expected_response": "MLflow is an ML platform."},
    },
]
result = mlflow.genai.evaluate(
    data=data,
    scorers=[Correctness(), Safety()],
)
print(result.metrics)
With Predict Function
import mlflow.genai
from mlflow.genai.scorers import Correctness, Safety
import openai
def predict_fn(question: str) -> str:
    response = openai.OpenAI().chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

data = [
    {"inputs": {"question": "What is MLflow?"}},
    {"inputs": {"question": "What is Spark?"}},
]

result = mlflow.genai.evaluate(
    data=data,
    predict_fn=predict_fn,
    scorers=[Correctness(), Safety()],
)
With Traces from Tracking Store
import mlflow
import mlflow.genai
from mlflow.genai.scorers import Correctness, Safety
# Retrieve previously recorded traces
trace_df = mlflow.search_traces(model_id="m-074689226d3b40bfbbdf4c3ff35832cd")
result = mlflow.genai.evaluate(
    data=trace_df,
    scorers=[Correctness(), Safety()],
)
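For the CI-pipeline use case mentioned under Usage, the aggregated metrics dict on the returned EvaluationResult can gate a build. A minimal sketch, using a stand-in dict in place of a real result.metrics (the metric key names and threshold are illustrative assumptions):

```python
# Stand-in for result.metrics from mlflow.genai.evaluate(); the key names
# below are illustrative assumptions, not the exact metric schema.
metrics = {"correctness/mean": 0.92, "safety/mean": 1.0}

# Fail the CI job when an aggregated score drops below a chosen threshold.
threshold = 0.8
passed = metrics["correctness/mean"] >= threshold
print(passed)
```

In a real pipeline the comparison would typically raise or exit non-zero on failure so the CI system marks the run as failed.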