Principle: MLflow Scorer Definition
| Knowledge Sources | |
|---|---|
| Domains | ML_Ops, LLM_Evaluation |
| Last Updated | 2026-02-13 20:00 GMT |
Overview
Encapsulating evaluation criteria as composable, callable scoring units that produce structured quality assessments from model inputs, outputs, and contextual information.
Description
Evaluating a generative AI application requires measuring multiple dimensions of quality -- correctness, safety, guideline adherence, latency, and domain-specific criteria. The scorer definition principle establishes a uniform contract that every evaluation criterion must follow, regardless of whether the check is performed by an LLM judge, a deterministic rule, or a statistical computation. Each scorer is a callable object that receives a standardised set of keyword arguments (inputs, outputs, expectations, trace) and returns a structured assessment.
This design separates what is being measured from how evaluation is orchestrated. Scorers can be built-in (pre-packaged criteria like correctness or safety that delegate to LLM judges), or custom (user-defined functions decorated to conform to the scorer protocol). Built-in scorers encapsulate prompt engineering, judge model selection, and result parsing behind a clean interface. Custom scorers give users full flexibility: they can call external APIs, apply heuristic rules, inspect trace spans, or combine multiple signals.
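To make the contract concrete, here is a minimal sketch of two custom scorers written as plain functions. The function names and the `expected_response` key are illustrative assumptions, not part of any library API; the point is that each scorer declares only the keyword arguments it needs and returns a primitive quality signal.

```python
def exact_match(outputs, expectations):
    """Deterministic custom scorer: pass/fail on exact equality
    against an expected answer stored in the expectations dict."""
    return outputs == expectations.get("expected_response")

def response_length(outputs):
    """Numeric custom scorer: raw character length of the response.
    Inspects only `outputs`, so the harness passes nothing else."""
    return len(outputs)
```

Because the harness introspects each scorer's signature, `exact_match` receives `outputs` and `expectations` while `response_length` receives `outputs` alone.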
The scorer contract also defines how individual per-row scores are aggregated into summary metrics. Each scorer can specify one or more aggregation strategies (mean, median, min, max, variance, p90, or a custom function) that reduce per-row values into a single metric. This two-level structure -- per-row assessments plus aggregated metrics -- enables both fine-grained debugging and high-level performance tracking.
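The two-level structure can be sketched as a reduction step over per-row values. The `AGGREGATIONS` table and `aggregate` function below are hypothetical names for illustration; the percentile formula is a simple nearest-rank approximation, not the library's exact implementation.

```python
import statistics

# Map strategy names to reduction functions over per-row values.
AGGREGATIONS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "min": min,
    "max": max,
    "variance": statistics.pvariance,
    # Nearest-rank 90th percentile (illustrative approximation).
    "p90": lambda xs: sorted(xs)[max(0, int(round(0.9 * len(xs))) - 1)],
}

def aggregate(per_row_values, strategies):
    """Reduce one scorer's per-row values into named summary metrics."""
    return {name: AGGREGATIONS[name](per_row_values) for name in strategies}
```

A custom aggregation is just another callable added to the table, which is what lets each scorer choose its own reduction policy.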
Usage
Define scorers whenever establishing evaluation criteria for a generative AI application. Use built-in scorers for common quality dimensions where LLM-based judgment is appropriate. Create custom scorers when the evaluation logic requires domain-specific rules, external API calls, trace inspection, or deterministic computation. Combine multiple scorers in a single evaluation run to measure several quality dimensions simultaneously.
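Combining several scorers in one run can be sketched as a small harness loop. This is a self-contained illustration of the pattern under assumed names (`run_evaluation`, plain-dict rows), not the library's actual evaluation entry point.

```python
import inspect

def run_evaluation(rows, scorers):
    """Apply every scorer to every row, passing each scorer only
    the row fields its signature declares. Results are collected
    per scorer name for later aggregation."""
    results = {}
    for s in scorers:
        wanted = inspect.signature(s).parameters
        results[s.__name__] = [
            s(**{k: row[k] for k in wanted if k in row}) for row in rows
        ]
    return results
```

Each scorer thus measures its own quality dimension over the same dataset, yielding parallel per-row score lists.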
Theoretical Basis
The theoretical foundation rests on treating evaluation as a measurement framework with pluggable instruments. Each scorer is analogous to a measurement instrument with defined:
- Input specification: which subset of (inputs, outputs, expectations, trace) the scorer inspects.
- Measurement function: the logic that maps inputs to a quality signal.
- Output type: boolean (pass/fail), numeric (continuous score), string (categorical label), or a structured Feedback object with value, rationale, and provenance.
- Aggregation policy: how per-row measurements combine into summary statistics.
Scorers follow a strategy pattern, where each scorer is an interchangeable strategy evaluated by the harness. The harness introspects each scorer's call signature to pass only the parameters it requests, enabling scorers to declare their dependencies explicitly.
Pseudocode for scorer invocation:
function invoke_scorer(scorer, eval_item):
    params = introspect_parameters(scorer)
    kwargs = {}
    for p in params:
        kwargs[p] = extract_field(eval_item, p)
    result = scorer(**kwargs)
    return normalise_to_feedback(result)
The normalisation step converts primitive return values (bool, int, float, str) into structured Feedback objects, ensuring downstream consumers always receive a uniform assessment format.
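The pseudocode above can be realised in a few lines of Python. The `Feedback` dataclass here is a minimal stand-in for the structured assessment object described earlier (value, rationale, provenance), not the library's actual class.

```python
import inspect
from dataclasses import dataclass
from typing import Any

@dataclass
class Feedback:
    """Minimal stand-in for a structured assessment object."""
    name: str
    value: Any
    rationale: str = ""

def invoke_scorer(scorer, eval_item):
    """Pass only the fields the scorer's signature requests, then
    normalise a primitive return value into a Feedback object."""
    wanted = inspect.signature(scorer).parameters
    kwargs = {k: eval_item[k] for k in wanted if k in eval_item}
    result = scorer(**kwargs)
    if isinstance(result, Feedback):
        return result  # already structured; pass through unchanged
    return Feedback(name=scorer.__name__, value=result)
```

A scorer returning `True` and one returning a full `Feedback` thus look identical to downstream consumers, which is the uniformity the normalisation step guarantees.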