Implementation: MLflow Scorer Class
| Knowledge Sources | |
|---|---|
| Domains | ML_Ops, LLM_Evaluation |
| Last Updated | 2026-02-13 20:00 GMT |
Overview
A concrete tool, provided by the MLflow library, for defining evaluation scorers -- both built-in LLM judges and custom scoring functions.
Description
MLflow provides a Scorer base class and several mechanisms for creating scorers:
Scorer base class -- A Pydantic BaseModel subclass that defines the shared interface. Every scorer has a name, optional description, and optional aggregations list. Subclasses implement __call__ with keyword-only arguments drawn from inputs, outputs, expectations, trace, and session.
@scorer decorator -- A convenience decorator that wraps a plain function into a Scorer instance. The decorated function's parameter names determine which evaluation fields are injected. The decorator preserves the function's signature for introspection and supports optional name, description, and aggregations overrides.
Built-in scorers -- Pre-packaged scorers that delegate to MLflow's LLM judge infrastructure:
- Correctness checks whether the model's response is consistent with the expected facts or expected response.
- Safety checks whether the response contains harmful, offensive, or toxic content.
- Guidelines checks whether the response adheres to user-specified guidelines or constraints.
Each built-in scorer accepts an optional model parameter to override the default LLM judge model.
Usage
Use Correctness when ground-truth expectations are available and factual accuracy matters. Use Safety to screen outputs for harmful content. Use Guidelines when custom rules or constraints must be enforced. Use the @scorer decorator when evaluation logic requires custom code, external API calls, or trace inspection. Combine multiple scorers in a list passed to mlflow.genai.evaluate().
Code Reference
Source Location
- Repository: mlflow
- File: `mlflow/genai/scorers/base.py` (Scorer class: L179-658, @scorer decorator: L1020-1204)
- File: `mlflow/genai/scorers/builtin_scorers.py` (Correctness: L1664-1857, Safety: L1558-1661, Guidelines: L1133-1278)
Signature
```python
# Base class
class Scorer(BaseModel):
    name: str
    aggregations: list[_AggregationType] | None = None
    description: str | None = None

    def __call__(
        self,
        *,
        inputs: dict[str, Any] | None = None,
        outputs: Any | None = None,
        expectations: dict[str, Any] | None = None,
        trace: Trace | None = None,
    ) -> int | float | bool | str | Feedback | list[Feedback]: ...

# Decorator
def scorer(
    func: Callable | None = None,
    *,
    name: str | None = None,
    description: str | None = None,
    aggregations: list[_AggregationType] | None = None,
) -> Scorer | Callable[[Callable], Scorer]: ...

# Built-in scorers
class Correctness(BuiltInScorer):
    name: str = "correctness"
    model: str | None = None

class Safety(BuiltInScorer):
    name: str = "safety"
    model: str | None = None

class Guidelines(BuiltInScorer):
    name: str = "guidelines"
    guidelines: str | list[str]
    model: str | None = None
```
Import
```python
from mlflow.genai.scorers import Scorer, scorer, Correctness, Safety, Guidelines
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| inputs | dict[str, Any] | No | Dictionary of model input key-value pairs. |
| outputs | Any | No | Model output for the evaluation row. |
| expectations | dict[str, Any] | No | Ground-truth values (e.g., expected_response, expected_facts). |
| trace | Trace | No | MLflow Trace object for the prediction. |
| name (constructor) | str | No | Custom name for the scorer (defaults to function name or built-in name). |
| guidelines (Guidelines) | str or list[str] | Yes (for Guidelines) | The guideline text(s) to check adherence against. |
| model (built-in) | str | No | Override the default LLM judge model name. |
| aggregations (constructor) | list | No | Aggregation functions: "min", "max", "mean", "median", "variance", "p90", or a callable. |
Outputs
| Name | Type | Description |
|---|---|---|
| result | Feedback | Structured assessment with value (bool/numeric/string), rationale (string explanation), and optional error_message / error_code. |
| result | list[Feedback] | Multiple assessments from a single scorer call. |
| result | bool / int / float / str | Primitive value automatically wrapped into a Feedback object by the harness. |
Usage Examples
Basic Usage
```python
from mlflow.genai.scorers import Correctness, Safety, Guidelines

# Built-in scorers
correctness = Correctness()
safety = Safety()
english_only = Guidelines(
    name="english_only",
    guidelines=["The response must be in English"],
)

# Use in evaluation
import mlflow.genai

data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "outputs": "MLflow is an open-source ML platform.",
        "expectations": {"expected_response": "MLflow is an ML platform."},
    },
]

result = mlflow.genai.evaluate(
    data=data,
    scorers=[correctness, safety, english_only],
)
```
Custom Scorer with Decorator
```python
from mlflow.genai.scorers import scorer

@scorer
def not_empty(outputs) -> bool:
    return outputs != ""

@scorer
def exact_match(outputs, expectations) -> bool:
    return outputs == expectations["expected_response"]

@scorer
def num_tool_calls(trace) -> int:
    spans = trace.search_spans(name="tool_call")
    return len(spans)
```
Custom Scorer with Feedback Object
```python
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, AssessmentSource

@scorer
def custom_judge(inputs, outputs) -> Feedback:
    # Custom LLM-based evaluation logic;
    # check_relevance is a placeholder for your own judge call.
    is_relevant = check_relevance(inputs["question"], outputs)
    return Feedback(
        value=is_relevant,
        rationale="Response addresses the question directly.",
        source=AssessmentSource(
            source_type="LLM_JUDGE",
            source_id="custom-judge",
        ),
    )
```