Implementation: MLflow Scorer Class
| Knowledge Sources | |
|---|---|
| Domains | ML_Ops, LLM_Evaluation |
| Last Updated | 2026-02-13 20:00 GMT |
Overview
A concrete tool, provided by the MLflow library, for defining evaluation scorers -- both built-in LLM judges and custom scoring functions.
Description
MLflow provides a Scorer base class and several mechanisms for creating scorers:
Scorer base class -- A Pydantic BaseModel subclass that defines the shared interface. Every scorer has a name, optional description, and optional aggregations list. Subclasses implement __call__ with keyword-only arguments drawn from inputs, outputs, expectations, trace, and session.
@scorer decorator -- A convenience decorator that wraps a plain function into a Scorer instance. The decorated function's parameter names determine which evaluation fields are injected. The decorator preserves the function's signature for introspection and supports optional name, description, and aggregations overrides.
Built-in scorers -- Pre-packaged scorers that delegate to MLflow's LLM judge infrastructure:
- Correctness checks whether the model's response is consistent with the expected facts or expected response.
- Safety checks whether the response contains harmful, offensive, or toxic content.
- Guidelines checks whether the response adheres to user-specified guidelines or constraints.
Each built-in scorer accepts an optional model parameter to override the default LLM judge model.
Usage
Use Correctness when ground-truth expectations are available and factual accuracy matters. Use Safety to screen outputs for harmful content. Use Guidelines when custom rules or constraints must be enforced. Use the @scorer decorator when evaluation logic requires custom code, external API calls, or trace inspection. Combine multiple scorers in a list passed to mlflow.genai.evaluate().
Code Reference
Source Location
- Repository: mlflow
- File: `mlflow/genai/scorers/base.py` (Scorer class: L179-658, @scorer decorator: L1020-1204)
- File: `mlflow/genai/scorers/builtin_scorers.py` (Correctness: L1664-1857, Safety: L1558-1661, Guidelines: L1133-1278)
Signature
```python
# Base class
class Scorer(BaseModel):
    name: str
    aggregations: list[_AggregationType] | None = None
    description: str | None = None

    def __call__(
        self,
        *,
        inputs: dict[str, Any] | None = None,
        outputs: Any | None = None,
        expectations: dict[str, Any] | None = None,
        trace: Trace | None = None,
    ) -> int | float | bool | str | Feedback | list[Feedback]: ...

# Decorator
def scorer(
    func: Callable | None = None,
    *,
    name: str | None = None,
    description: str | None = None,
    aggregations: list[_AggregationType] | None = None,
) -> Scorer | Callable[[Callable], Scorer]: ...

# Built-in scorers
class Correctness(BuiltInScorer):
    name: str = "correctness"
    model: str | None = None

class Safety(BuiltInScorer):
    name: str = "safety"
    model: str | None = None

class Guidelines(BuiltInScorer):
    name: str = "guidelines"
    guidelines: str | list[str]
    model: str | None = None
```
Import
```python
from mlflow.genai.scorers import Scorer, scorer, Correctness, Safety, Guidelines
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| inputs | dict[str, Any] | No | Dictionary of model input key-value pairs. |
| outputs | Any | No | Model output for the evaluation row. |
| expectations | dict[str, Any] | No | Ground-truth values (e.g., expected_response, expected_facts). |
| trace | Trace | No | MLflow Trace object for the prediction. |
| name (constructor) | str | No | Custom name for the scorer (defaults to function name or built-in name). |
| guidelines (Guidelines) | str or list[str] | Yes (for Guidelines) | The guideline text(s) to check adherence against. |
| model (built-in) | str | No | Override the default LLM judge model name. |
| aggregations (constructor) | list | No | Aggregation functions: "min", "max", "mean", "median", "variance", "p90", or a callable. |
Outputs
| Name | Type | Description |
|---|---|---|
| result | Feedback | Structured assessment with value (bool/numeric/string), rationale (string explanation), and optional error_message / error_code. |
| result | list[Feedback] | Multiple assessments from a single scorer call. |
| result | bool / int / float / str | Primitive value automatically wrapped into a Feedback object by the harness. |
Usage Examples
Basic Usage
```python
from mlflow.genai.scorers import Correctness, Safety, Guidelines

# Built-in scorers
correctness = Correctness()
safety = Safety()
english_only = Guidelines(
    name="english_only",
    guidelines=["The response must be in English"],
)

# Use in evaluation
import mlflow.genai

data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "outputs": "MLflow is an open-source ML platform.",
        "expectations": {"expected_response": "MLflow is an ML platform."},
    },
]

result = mlflow.genai.evaluate(
    data=data,
    scorers=[correctness, safety, english_only],
)
```
Custom Scorer with Decorator
```python
from mlflow.genai.scorers import scorer

@scorer
def not_empty(outputs) -> bool:
    return outputs != ""

@scorer
def exact_match(outputs, expectations) -> bool:
    return outputs == expectations["expected_response"]

@scorer
def num_tool_calls(trace) -> int:
    spans = trace.search_spans(name="tool_call")
    return len(spans)
```
Custom Scorer with Feedback Object
```python
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, AssessmentSource

@scorer
def custom_judge(inputs, outputs) -> Feedback:
    # Custom LLM-based evaluation logic;
    # check_relevance is a placeholder for your own judge call.
    is_relevant = check_relevance(inputs["question"], outputs)
    return Feedback(
        value=is_relevant,
        rationale="Response addresses the question directly.",
        source=AssessmentSource(
            source_type="LLM_JUDGE",
            source_id="custom-judge",
        ),
    )
```