
Implementation: MLflow mlflow.genai.evaluate

From Leeroopedia
Knowledge Sources
Domains ML_Ops, LLM_Evaluation
Last Updated 2026-02-13 20:00 GMT

Overview

A concrete tool for running comprehensive evaluations of LLM applications, using datasets and scorers provided by the MLflow library.

Description

mlflow.genai.evaluate() is the primary public API for evaluating generative AI models and applications. It accepts an evaluation dataset, a list of scorers, and an optional predict function, then orchestrates the full evaluation pipeline: data normalisation, optional prediction, scoring, metric aggregation, and result persistence.

The function supports three main usage modes:

  1. Trace-based evaluation -- The dataset contains a trace column (e.g., from mlflow.search_traces()). Scorers extract inputs, outputs, and context directly from the trace objects. No predict function is needed.
  2. Pre-computed outputs -- The dataset contains inputs and outputs columns. Scorers evaluate the pre-computed outputs. No predict function is needed.
  3. Live prediction -- The dataset contains only inputs (and optionally expectations). A predict_fn is provided and called for each row to generate outputs and traces on the fly.

Internally, the function delegates to _run_harness, which validates scorers, converts the dataset, manages the MLflow run context, executes predictions and scoring, aggregates metrics, and returns an EvaluationResult. Scoring can be parallelised via the MLFLOW_GENAI_EVAL_MAX_WORKERS environment variable.
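For example, the parallelism cap can be raised before calling the harness. A minimal sketch using the environment variable named above; the value 4 is an arbitrary illustration, not a recommended default:

```python
import os

# Cap concurrent scoring workers for subsequent mlflow.genai.evaluate() calls.
# The value 4 is an arbitrary illustration; tune it to your provider's rate limits.
os.environ["MLFLOW_GENAI_EVAL_MAX_WORKERS"] = "4"
```

Set the variable before the evaluation starts; the harness reads it when scoring begins.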

Usage

Call mlflow.genai.evaluate() during development iteration, CI pipeline evaluation, or production quality auditing. Use it with pre-recorded traces for offline analysis, with pre-computed outputs for deterministic regression testing, or with a live predict function for end-to-end evaluation.
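In a CI pipeline, the aggregated metrics dict can back a simple regression gate. A minimal sketch, assuming a metric key such as "correctness/mean"; the exact keys depend on the scorers used, so inspect result.metrics to confirm them:

```python
def meets_threshold(metrics: dict, key: str, threshold: float) -> bool:
    """Return True when an aggregated metric clears the CI bar."""
    # A missing key counts as a failure rather than raising.
    value = metrics.get(key)
    return value is not None and value >= threshold

# Hypothetical metrics dict shaped like EvaluationResult.metrics:
ok = meets_threshold({"correctness/mean": 0.95}, "correctness/mean", 0.9)
```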

Code Reference

Source Location

  • Repository: mlflow
  • File: mlflow/genai/evaluation/base.py
  • Lines: L55-302 (public API), L305-441 (internal _run_harness)

Signature

def evaluate(
    data: "EvaluationDatasetTypes",
    scorers: list[Scorer],
    predict_fn: Callable[..., Any] | None = None,
    model_id: str | None = None,
) -> "EvaluationResult":

Import

import mlflow.genai
# or
from mlflow.genai.evaluation.base import evaluate

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| data | EvaluationDatasetTypes | Yes | Evaluation dataset. Accepts pd.DataFrame, list[dict], list[Trace], EvaluationDataset, pyspark.sql.DataFrame, or ConversationSimulator. |
| scorers | list[Scorer] | Yes | List of scorer instances (built-in or custom) that produce evaluation assessments. |
| predict_fn | Callable[..., Any] | No | Optional callable whose keyword arguments match inputs dict keys. Required when dataset has no outputs or trace column. |
| model_id | str | No | Optional model identifier (e.g., "m-074689...") to associate the evaluation run with a logged model. Can also be set via mlflow.set_active_model(). |
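As a sketch of what a custom scorer boils down to, the callable below checks a row's output against its expectation. In recent MLflow versions such a function would additionally be wrapped with the @scorer decorator from mlflow.genai.scorers; treat that decorator name as an assumption and check your installed version:

```python
def exact_match(outputs: str, expectations: dict) -> bool:
    # Pass when the generated output matches the expected response verbatim.
    return outputs == expectations.get("expected_response")
```

The parameter names mirror the dataset columns: scorers receive the row's inputs, outputs, and expectations fields by keyword.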

Outputs

| Name | Type | Description |
|------|------|-------------|
| result | EvaluationResult | Object containing run_id (MLflow run ID), metrics (aggregated scores dict), and result_df (per-row DataFrame with scorer values, rationales, and error info). |
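The per-row frame can be post-processed with ordinary pandas. A minimal sketch on a hand-built stand-in frame; the column names here are illustrative, not the exact schema MLflow emits:

```python
import pandas as pd

# Stand-in for EvaluationResult.result_df (illustrative columns only).
result_df = pd.DataFrame(
    {
        "correctness/value": ["yes", "no", "yes"],
        "correctness/rationale": [
            "matches expectation",
            "missing a key fact",
            "matches expectation",
        ],
    }
)

# Aggregate in the same spirit as the metrics dict: fraction of passing rows.
pass_rate = (result_df["correctness/value"] == "yes").mean()
```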

Usage Examples

Basic Usage

import mlflow.genai
from mlflow.genai.scorers import Correctness, Safety

data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "outputs": "MLflow is an open-source ML platform.",
        "expectations": {"expected_response": "MLflow is an ML platform."},
    },
]

result = mlflow.genai.evaluate(
    data=data,
    scorers=[Correctness(), Safety()],
)
print(result.metrics)

With Predict Function

import mlflow.genai
from mlflow.genai.scorers import Correctness, Safety
import openai

def predict_fn(question: str) -> str:
    response = openai.OpenAI().chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

data = [
    {"inputs": {"question": "What is MLflow?"}},
    {"inputs": {"question": "What is Spark?"}},
]

result = mlflow.genai.evaluate(
    data=data,
    predict_fn=predict_fn,
    scorers=[Correctness(), Safety()],
)

With Traces from Tracking Store

import mlflow
import mlflow.genai
from mlflow.genai.scorers import Correctness, Safety

# Retrieve previously recorded traces
trace_df = mlflow.search_traces(model_id="m-074689226d3b40bfbbdf4c3ff35832cd")

result = mlflow.genai.evaluate(
    data=trace_df,
    scorers=[Correctness(), Safety()],
)

Related Pages

Implements Principle

Requires Environment
