
Implementation:Mlflow Mlflow Scorer Class

From Leeroopedia
Knowledge Sources
Domains ML_Ops, LLM_Evaluation
Last Updated 2026-02-13 20:00 GMT

Overview

Concrete tool for defining evaluation scorers -- both built-in LLM judges and custom scoring functions -- provided by the MLflow library.

Description

MLflow provides a Scorer base class and several mechanisms for creating scorers:

Scorer base class -- A Pydantic BaseModel subclass that defines the shared interface. Every scorer has a name, optional description, and optional aggregations list. Subclasses implement __call__ with keyword-only arguments drawn from inputs, outputs, expectations, trace, and session.

@scorer decorator -- A convenience decorator that wraps a plain function into a Scorer instance. The decorated function's parameter names determine which evaluation fields are injected. The decorator preserves the function's signature for introspection and supports optional name, description, and aggregations overrides.

Built-in scorers -- Pre-packaged scorers that delegate to MLflow's LLM judge infrastructure:

  • Correctness checks whether the model's response is consistent with the expected facts or expected response.
  • Safety checks whether the response contains harmful, offensive, or toxic content.
  • Guidelines checks whether the response adheres to user-specified guidelines or constraints.

Each built-in scorer accepts an optional model parameter to override the default LLM judge model.

Usage

Use Correctness when ground-truth expectations are available and factual accuracy matters. Use Safety to screen outputs for harmful content. Use Guidelines when custom rules or constraints must be enforced. Use the @scorer decorator when evaluation logic requires custom code, external API calls, or trace inspection. Combine multiple scorers in a list passed to mlflow.genai.evaluate().

Code Reference

Source Location

  • Repository: mlflow
  • File: mlflow/genai/scorers/base.py (Scorer class: L179-658, @scorer decorator: L1020-1204)
  • File: mlflow/genai/scorers/builtin_scorers.py (Correctness: L1664-1857, Safety: L1558-1661, Guidelines: L1133-1278)

Signature

# Base class
class Scorer(BaseModel):
    name: str
    aggregations: list[_AggregationType] | None = None
    description: str | None = None

    def __call__(
        self,
        *,
        inputs: dict[str, Any] | None = None,
        outputs: Any | None = None,
        expectations: dict[str, Any] | None = None,
        trace: Trace | None = None,
    ) -> int | float | bool | str | Feedback | list[Feedback]: ...

# Decorator
def scorer(
    func: Callable | None = None,
    *,
    name: str | None = None,
    description: str | None = None,
    aggregations: list[_AggregationType] | None = None,
) -> Scorer | Callable[[Callable], Scorer]: ...

# Built-in scorers
class Correctness(BuiltInScorer):
    name: str = "correctness"
    model: str | None = None

class Safety(BuiltInScorer):
    name: str = "safety"
    model: str | None = None

class Guidelines(BuiltInScorer):
    name: str = "guidelines"
    guidelines: str | list[str]
    model: str | None = None

Import

from mlflow.genai.scorers import Scorer, scorer, Correctness, Safety, Guidelines

I/O Contract

Inputs

Name Type Required Description
inputs dict[str, Any] No Dictionary of model input key-value pairs.
outputs Any No Model output for the evaluation row.
expectations dict[str, Any] No Ground-truth values (e.g., expected_response, expected_facts).
trace Trace No MLflow Trace object for the prediction.
name (constructor) str No Custom name for the scorer (defaults to function name or built-in name).
guidelines (Guidelines) str or list[str] Yes (for Guidelines) The guideline text(s) to check adherence against.
model (built-in) str No Override the default LLM judge model name.
aggregations (constructor) list No Aggregation functions: "min", "max", "mean", "median", "variance", "p90", or a callable.

Outputs

Name Type Description
result Feedback Structured assessment with value (bool/numeric/string), rationale (string explanation), and optional error_message / error_code.
result list[Feedback] Multiple assessments from a single scorer call.
result bool / int / float / str Primitive value automatically wrapped into a Feedback object by the harness.

Usage Examples

Basic Usage

from mlflow.genai.scorers import Correctness, Safety, Guidelines

# Built-in scorers
correctness = Correctness()
safety = Safety()
english_only = Guidelines(
    name="english_only",
    guidelines=["The response must be in English"],
)

# Use in evaluation
import mlflow.genai

data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "outputs": "MLflow is an open-source ML platform.",
        "expectations": {"expected_response": "MLflow is an ML platform."},
    },
]
result = mlflow.genai.evaluate(
    data=data,
    scorers=[correctness, safety, english_only],
)

Custom Scorer with Decorator

from mlflow.genai.scorers import scorer

@scorer
def not_empty(outputs) -> bool:
    return outputs != ""

@scorer
def exact_match(outputs, expectations) -> bool:
    return outputs == expectations["expected_response"]

@scorer
def num_tool_calls(trace) -> int:
    spans = trace.search_spans(name="tool_call")
    return len(spans)

Custom Scorer with Feedback Object

from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, AssessmentSource

@scorer
def custom_judge(inputs, outputs) -> Feedback:
    # Custom LLM-based evaluation logic; check_relevance is a
    # user-supplied helper (not shown here)
    is_relevant = check_relevance(inputs["question"], outputs)
    return Feedback(
        value=is_relevant,
        rationale="Response addresses the question directly.",
        source=AssessmentSource(
            source_type="LLM_JUDGE",
            source_id="custom-judge",
        ),
    )

Related Pages

Implements Principle

Requires Environment
