Implementation: mlflow.genai.evaluate
| Knowledge Sources | |
|---|---|
| Domains | ML_Ops, LLM_Evaluation |
| Last Updated | 2026-02-13 20:00 GMT |
Overview
Concrete tool for running comprehensive evaluation of LLM applications using datasets and scorers provided by the MLflow library.
Description
mlflow.genai.evaluate() is the primary public API for evaluating generative AI models and applications. It accepts an evaluation dataset, a list of scorers, and an optional predict function, then orchestrates the full evaluation pipeline: data normalisation, optional prediction, scoring, metric aggregation, and result persistence.
The function supports three main usage modes:
- Trace-based evaluation -- The dataset contains a trace column (e.g., from mlflow.search_traces()). Scorers extract inputs, outputs, and context directly from the trace objects. No predict function is needed.
- Pre-computed outputs -- The dataset contains inputs and outputs columns. Scorers evaluate the pre-computed outputs. No predict function is needed.
- Live prediction -- The dataset contains only inputs (and optionally expectations). A predict_fn is provided and called for each row to generate outputs and traces on the fly.
Internally, the function delegates to _run_harness, which validates scorers, converts the dataset, manages the MLflow run context, executes predictions and scoring, aggregates metrics, and returns an EvaluationResult. Scoring can be parallelised via the MLFLOW_GENAI_EVAL_MAX_WORKERS environment variable.
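Since scoring concurrency is controlled by an environment variable rather than a function argument, it must be set before the call. A minimal sketch (the worker count of 8 is an illustrative value, not a recommended default):

```python
import os

# Set scoring parallelism before calling mlflow.genai.evaluate().
# The harness reads MLFLOW_GENAI_EVAL_MAX_WORKERS at evaluation time;
# "8" here is just an example value.
os.environ["MLFLOW_GENAI_EVAL_MAX_WORKERS"] = "8"
print(os.environ["MLFLOW_GENAI_EVAL_MAX_WORKERS"])
```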
Usage
Call mlflow.genai.evaluate() during development iteration, CI pipeline evaluation, or production quality auditing. Use it with pre-recorded traces for offline analysis, with pre-computed outputs for deterministic regression testing, or with a live predict function for end-to-end evaluation.
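For the deterministic regression-testing use case, a custom scorer can be a plain comparison function. A minimal sketch of such a scorer body; in MLflow 3 this would typically be wrapped with the scorer decorator from mlflow.genai.scorers before being passed to evaluate() (the function name and signature here are illustrative, not a prescribed schema):

```python
# Hedged sketch: a deterministic custom scorer body for regression testing.
# Assumption: parameter names mirror the dataset's "outputs" and
# "expectations" fields described in this document.
def exact_match(outputs: str, expectations: dict) -> bool:
    """Return True when the output matches the expected response exactly."""
    return outputs == expectations["expected_response"]

print(exact_match(
    "MLflow is an ML platform.",
    {"expected_response": "MLflow is an ML platform."},
))
```

Deterministic scorers like this complement LLM-judged scorers (e.g., Correctness) because their results are stable across runs.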
Code Reference
Source Location
- Repository: mlflow
- File: mlflow/genai/evaluation/base.py
- Lines: L55-302 (public API), L305-441 (internal _run_harness)
Signature
def evaluate(
    data: "EvaluationDatasetTypes",
    scorers: list[Scorer],
    predict_fn: Callable[..., Any] | None = None,
    model_id: str | None = None,
) -> "EvaluationResult":
Import
import mlflow.genai
# or
from mlflow.genai.evaluation.base import evaluate
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data | EvaluationDatasetTypes | Yes | Evaluation dataset. Accepts pd.DataFrame, list[dict], list[Trace], EvaluationDataset, pyspark.sql.DataFrame, or ConversationSimulator. |
| scorers | list[Scorer] | Yes | List of scorer instances (built-in or custom) that produce evaluation assessments. |
| predict_fn | Callable[..., Any] | No | Callable whose keyword arguments match the keys of each row's inputs dict. Required when the dataset has no outputs or trace column. |
| model_id | str | No | Model identifier (e.g., "m-074689...") to associate the evaluation run with a logged model. Can also be set via mlflow.set_active_model(). |
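The predict_fn contract above means each row's inputs dict is unpacked into keyword arguments. A minimal sketch of that calling convention, with a stand-in function in place of a real model call:

```python
# Hedged sketch of the predict_fn contract: keyword arguments must match the
# keys of each row's "inputs" dict, so the harness can call predict_fn(**inputs).
def predict_fn(question: str) -> str:
    # Stand-in for a real model call (e.g., an LLM chat completion).
    return f"Answer to: {question}"

row = {"inputs": {"question": "What is MLflow?"}}
output = predict_fn(**row["inputs"])  # equivalent to predict_fn(question=...)
print(output)
```

A mismatch between the inputs keys and the function's parameter names would raise a TypeError at call time, so keeping them aligned is part of the dataset design.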
Outputs
| Name | Type | Description |
|---|---|---|
| result | EvaluationResult | Object containing run_id (MLflow run ID), metrics (aggregated scores dict), and result_df (per-row DataFrame with scorer values, rationales, and error info). |
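The per-row result_df can be filtered to surface failing rows for manual review. A minimal sketch using a hand-built stand-in DataFrame in place of a real result_df (the column names here are illustrative assumptions, not the exact schema):

```python
import pandas as pd

# Stand-in for EvaluationResult.result_df; column names are illustrative
# assumptions based on the per-row values/rationales described above.
result_df = pd.DataFrame(
    {
        "correctness/value": ["yes", "no"],
        "correctness/rationale": ["matches expectation", "missing key fact"],
    }
)

# Surface failing rows, with rationales, for manual review.
failures = result_df[result_df["correctness/value"] == "no"]
print(len(failures))
```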
Usage Examples
Basic Usage
import mlflow.genai
from mlflow.genai.scorers import Correctness, Safety
data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "outputs": "MLflow is an open-source ML platform.",
        "expectations": {"expected_response": "MLflow is an ML platform."},
    },
]
result = mlflow.genai.evaluate(
    data=data,
    scorers=[Correctness(), Safety()],
)
print(result.metrics)
With Predict Function
import mlflow.genai
from mlflow.genai.scorers import Correctness, Safety
import openai
def predict_fn(question: str) -> str:
    response = openai.OpenAI().chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

data = [
    {"inputs": {"question": "What is MLflow?"}},
    {"inputs": {"question": "What is Spark?"}},
]

result = mlflow.genai.evaluate(
    data=data,
    predict_fn=predict_fn,
    scorers=[Correctness(), Safety()],
)
With Traces from Tracking Store
import mlflow
import mlflow.genai
from mlflow.genai.scorers import Correctness, Safety
# Retrieve previously recorded traces
trace_df = mlflow.search_traces(model_id="m-074689226d3b40bfbbdf4c3ff35832cd")
result = mlflow.genai.evaluate(
    data=trace_df,
    scorers=[Correctness(), Safety()],
)
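For the CI-pipeline use case mentioned under Usage, the aggregated metrics dict on the returned EvaluationResult can gate a build. A minimal sketch, using a stand-in dict in place of a real result.metrics (the metric key names and threshold are illustrative assumptions):

```python
# Stand-in for result.metrics from mlflow.genai.evaluate(); the key names
# below are illustrative assumptions, not the exact metric schema.
metrics = {"correctness/mean": 0.92, "safety/mean": 1.0}

# Fail the CI job when an aggregated score drops below a chosen threshold.
threshold = 0.8
passed = metrics["correctness/mean"] >= threshold
print(passed)
```

In a real pipeline the comparison would typically raise or exit non-zero on failure so the CI system marks the run as failed.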