Workflow:Mlflow Mlflow GenAI Evaluation

From Leeroopedia
Knowledge Sources
Domains LLM_Ops, Evaluation, GenAI
Last Updated 2026-02-13 20:00 GMT

Overview

End-to-end process for evaluating LLM applications, prompts, and AI agents using MLflow's GenAI evaluation framework with built-in and custom scorers.

Description

This workflow covers the procedure for systematically evaluating the quality of LLM-powered applications. MLflow's GenAI evaluation harness either runs a dataset of inputs through a predict function or takes pre-computed outputs, then applies a suite of scorers (built-in LLM judges and custom evaluation functions) to assess response quality. Built-in scorers cover correctness, safety, fluency, relevance, retrieval groundedness, and tool call efficiency. Results are logged as metrics and can be viewed in the MLflow evaluation UI.

Key capabilities:

  • Built-in LLM judge scorers for common quality dimensions
  • Custom scorer support via Python functions
  • Trace-based evaluation from existing logged traces
  • Conversation simulation for multi-turn agent testing
  • Session-level scoring for conversational quality assessment

Usage

Execute this workflow when you need to measure the quality of an LLM application, compare prompt variants, validate agent behavior, or establish quality baselines before deployment. This applies to question-answering systems, RAG pipelines, chatbots, AI agents, and any GenAI application where output quality matters.

Execution Steps

Step 1: Prepare Evaluation Dataset

Assemble a dataset of test inputs with optional expected outputs (ground truth). The dataset can be a list of dictionaries, a pandas DataFrame, or traces retrieved from MLflow. Each record should contain the input fields that the application expects and, optionally, the expectations that scorers compare the output against.

Key considerations:

  • Each record uses the fields inputs, outputs (optional), and expectations (optional)
  • For trace-based evaluation, retrieve traces from logged experiments
  • Ground truth expectations enable correctness and equivalence scoring
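
As a minimal sketch of the record layout described above, the dataset below is a plain list of dictionaries. The field names inputs, outputs, and expectations come from this page; the inner keys question and expected_response and the helper validate_records are illustrative assumptions, not a required schema.

```python
# Illustrative evaluation dataset: one dict per test case.
# "question" and "expected_response" are assumed key names for this sketch.
eval_dataset = [
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {"expected_response": "An open source platform for the ML lifecycle."},
    },
    {
        # A row with a pre-computed output, so no predict function is needed for it.
        "inputs": {"question": "What does a scorer do?"},
        "outputs": "A scorer assesses the quality of a response.",
    },
]


def validate_records(records):
    """Hypothetical helper: check that every dict-style record carries inputs."""
    for i, record in enumerate(records):
        if not isinstance(record.get("inputs"), dict):
            raise ValueError(f"record {i} needs an 'inputs' dictionary")
    return len(records)


print(validate_records(eval_dataset))
```

Checking records up front like this is optional; it only surfaces malformed rows before an evaluation run spends time on LLM calls.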

Step 2: Define or Select Scorers

Choose from built-in scorers or create custom scoring functions. Built-in scorers use LLM judges to assess dimensions such as correctness, safety, fluency, and relevance. Custom scorers can implement any evaluation logic as Python functions wrapped with MLflow's scorer decorator.

Key considerations:

  • Built-in scorers include Correctness, Safety, Fluency, RelevanceToQuery, Guidelines, and more
  • Guidelines scorer allows custom rubric-based evaluation with free-text criteria
  • Custom scorers receive inputs, outputs, expectations, and optional trace data
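
The custom scorer idea can be sketched with two plain Python functions. In MLflow these would be wrapped with the scorer decorator (importable from mlflow.genai.scorers); the decorator is left out here so the sketch runs without MLflow installed. The names exact_match, is_concise, and MAX_CHARS, and the expected_response key, are assumptions made for illustration.

```python
MAX_CHARS = 200  # illustrative length budget, not an MLflow setting


def exact_match(outputs, expectations):
    """True when the response equals the expected response, ignoring case and outer whitespace."""
    expected = expectations.get("expected_response", "")
    return str(outputs).strip().lower() == str(expected).strip().lower()


def is_concise(outputs):
    """True when the response fits within the assumed length budget."""
    return len(str(outputs)) <= MAX_CHARS


# Direct calls make it easy to unit-test scoring logic before an evaluation run.
print(exact_match(outputs=" Paris ", expectations={"expected_response": "paris"}))
print(is_concise(outputs="A short answer."))
```

Each function declares only the arguments it needs, which mirrors the point above that a custom scorer receives inputs, outputs, expectations, and optional trace data.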

Step 3: Define Predict Function (Optional)

If the evaluation dataset does not contain pre-computed outputs, define a predict function that takes input fields and returns the application response. The evaluation harness calls this function for each input row and captures the output along with any generated traces.

Key considerations:

  • The predict function signature should accept keyword arguments matching the input fields
  • Both synchronous and asynchronous predict functions are supported
  • If outputs are already present in the dataset, no predict function is needed
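
A rough illustration of the keyword-argument convention follows. The stand-in application simply echoes the question; a real predict function would call an LLM or agent. The function names and the question field are hypothetical.

```python
import asyncio


def predict_fn(question: str) -> str:
    """Stand-in application; replace the body with a real LLM or agent call."""
    return f"You asked: {question}"


async def async_predict_fn(question: str) -> str:
    """Asynchronous variant of the same stand-in."""
    await asyncio.sleep(0)  # placeholder for awaiting a real client
    return f"You asked: {question}"


row = {"inputs": {"question": "What is MLflow?"}}

# The input fields of a row are unpacked into keyword arguments,
# so the parameter names must match the keys under "inputs".
response = predict_fn(**row["inputs"])
async_response = asyncio.run(async_predict_fn(**row["inputs"]))
print(response)
```

A mismatch between the keys under inputs and the function's parameter names would fail with a TypeError, so it is worth calling the function once on a sample row before a full run.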

Step 4: Run Evaluation

Execute the evaluation by calling the evaluate function with the dataset, scorers, and optional predict function. The harness iterates over dataset rows, optionally generates predictions, and applies all scorers to produce per-row and aggregate results.

Key considerations:

  • Evaluation runs within an MLflow experiment for result tracking
  • Built-in scorers call an LLM judge (configurable) to assess quality
  • Results include per-row scores and aggregate metrics across the dataset
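
The loop described in this step can be approximated by the toy harness below, followed by a sketch of the corresponding MLflow call. toy_evaluate is a simplified stand-in written for this page, not MLflow's implementation; it passes every scorer the same three positional arguments for brevity. run_with_mlflow assumes MLflow 3.x, a configured judge LLM, and rows that carry expectations for Correctness; the experiment name and guideline text are hypothetical.

```python
from statistics import mean


def toy_evaluate(data, scorers, predict_fn=None):
    """Simplified stand-in for an evaluation harness (not MLflow's code)."""
    rows = []
    for record in data:
        outputs = record.get("outputs")
        if outputs is None:
            # No pre-computed output: generate one from the input fields.
            if predict_fn is None:
                raise ValueError("a record has no outputs and no predict_fn was given")
            outputs = predict_fn(**record["inputs"])
        scores = {
            name: fn(record["inputs"], outputs, record.get("expectations", {}))
            for name, fn in scorers.items()
        }
        rows.append({**record, "outputs": outputs, "scores": scores})
    # Aggregate each scorer across rows (booleans count as 0.0 or 1.0).
    aggregates = {name: mean(float(r["scores"][name]) for r in rows) for name in scorers}
    return rows, aggregates


def run_with_mlflow(data, predict_fn=None):
    """Sketch of the real call; needs MLflow 3.x and a judge LLM, so it is not run here."""
    import mlflow  # imported lazily so the toy harness works without MLflow
    from mlflow.genai.scorers import Correctness, Guidelines

    mlflow.set_experiment("genai-eval-demo")  # hypothetical experiment name
    scorers = [
        Correctness(),
        Guidelines(name="concise", guidelines="The response must be at most three sentences."),
    ]
    return mlflow.genai.evaluate(data=data, predict_fn=predict_fn, scorers=scorers)


data = [
    {"inputs": {"question": "2+2?"}, "expectations": {"expected_response": "4"}},
    {"inputs": {"question": "Capital of France?"}, "outputs": "Paris",
     "expectations": {"expected_response": "Paris"}},
]


def answer(question):
    return "5"  # deliberately wrong stand-in application


def exact(inputs, outputs, expectations):
    return outputs == expectations.get("expected_response")


rows, aggregates = toy_evaluate(data, {"exact": exact}, predict_fn=answer)
print(aggregates)
```

The first row has no output, so the stand-in application is called and its wrong answer scores False; the second row reuses its pre-computed output and scores True.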

Step 5: Analyze Results

Review evaluation results in the MLflow UI or programmatically. Results include per-row scorer outputs with justifications, aggregate metrics (mean, median, etc.), and the full evaluation dataset with scores appended. Compare results across evaluation runs to track quality changes.

Key considerations:

  • The Evaluations tab in the MLflow UI shows a comparison matrix
  • Each built-in judge produces a score (numeric or categorical) with a text justification; a custom scorer can return a bare value instead
  • Results are logged as metrics to the active MLflow run for tracking
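
For programmatic review, comparing two evaluation runs can be as simple as summarizing per-row scores and taking the difference. The sketch below uses only the standard library; the score lists are made-up illustrative values rather than real results, and summarize and compare_runs are hypothetical helpers, not MLflow functions.

```python
from statistics import mean, median

# Illustrative per-row scores (1.0 = pass, 0.0 = fail) for two runs.
baseline_scores = [1.0, 0.0, 1.0, 1.0]
candidate_scores = [1.0, 1.0, 1.0, 1.0]


def summarize(scores):
    """Aggregate a list of per-row scores into summary metrics."""
    return {"mean": mean(scores), "median": median(scores)}


def compare_runs(baseline, candidate):
    """Return candidate minus baseline for every metric present in both summaries."""
    return {key: candidate[key] - baseline[key] for key in baseline.keys() & candidate.keys()}


delta = compare_runs(summarize(baseline_scores), summarize(candidate_scores))
print(delta)
```

A positive delta indicates the candidate run scored higher on that metric; with small datasets, the per-row scores and justifications should be read alongside the aggregates.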
