Principle:Mlflow Mlflow Run Management
| Knowledge Sources | |
|---|---|
| Domains | ML_Ops, Experiment_Tracking |
| Last Updated | 2026-02-13 20:00 GMT |
Overview
Run management treats the individual experiment run as the fundamental unit of recorded work in an experiment tracking system.
Description
A run represents a single execution of a machine learning workflow: one training attempt, one hyperparameter configuration, or one evaluation pass. It is the atomic unit of record in experiment tracking, serving as the container into which parameters, metrics, artifacts, and metadata are logged. Without a well-defined run boundary, there is no way to associate a particular set of results with the conditions that produced them.
Run management encompasses the full lifecycle of this unit: creation, activation, data logging (during the run's active window), and termination. Each run carries a unique identifier, belongs to exactly one experiment, and moves from the running state to one of three terminal states (finished, failed, or killed). The system maintains a notion of the "active run," which is the implicit target for all logging operations within the current execution context.
Runs may also form hierarchical relationships. A parent run can represent a high-level orchestration (such as a hyperparameter sweep), while child runs capture the individual trials. This nesting provides both organizational clarity and the ability to aggregate or compare results at different levels of granularity.
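The aggregation idea can be sketched without a tracking server. The following is a minimal illustration that assumes run records have already been fetched as plain dictionaries; the record layout, run IDs, metric name, and values are all hypothetical. The only MLflow-specific detail is `mlflow.parentRunId`, the tag key MLflow uses to link a child run to its parent.

```python
from collections import defaultdict

# Hypothetical flat run records; IDs, metric names, and values are made up.
# "mlflow.parentRunId" is the tag key MLflow uses for the parent link.
runs = [
    {"run_id": "sweep", "tags": {}, "metrics": {}},
    {"run_id": "trial-a", "tags": {"mlflow.parentRunId": "sweep"}, "metrics": {"val_loss": 0.42}},
    {"run_id": "trial-b", "tags": {"mlflow.parentRunId": "sweep"}, "metrics": {"val_loss": 0.31}},
    {"run_id": "trial-c", "tags": {"mlflow.parentRunId": "sweep"}, "metrics": {"val_loss": 0.37}},
]


def children_by_parent(records):
    """Group child runs under their parent ID using only the parent tag."""
    tree = defaultdict(list)
    for record in records:
        parent_id = record["tags"].get("mlflow.parentRunId")
        if parent_id is not None:
            tree[parent_id].append(record)
    return dict(tree)


def best_child(children, metric):
    """Return the child run with the lowest value of `metric`."""
    return min(children, key=lambda r: r["metrics"][metric])


tree = children_by_parent(runs)
best = best_child(tree["sweep"], "val_loss")
print(best["run_id"], best["metrics"]["val_loss"])  # trial-b 0.31
```

The same grouping works at any depth, because each record only needs to name its immediate parent.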
Usage
Create a run at the beginning of any training, evaluation, or data processing task that you wish to track. Use the run as a context manager to ensure clean termination regardless of whether the code succeeds or fails. Employ nested runs when orchestrating multiple related sub-experiments such as cross-validation folds or grid search iterations. Resume an existing run by its identifier when continuing interrupted work or appending late-arriving results. Enable system metrics logging when infrastructure utilization data (CPU, GPU, memory) is needed for performance analysis.
Theoretical Basis
Run management implements a scoped execution context pattern:
1. Activation: A run is created or resumed and pushed onto a thread-local run stack. This stack enables nested execution: the most recently pushed run is the implicit target for logging calls. The stack discipline ensures that ending a child run restores the parent run as the active context.
2. State Machine: Each run follows a state progression from RUNNING to a terminal state (FINISHED, FAILED, or KILLED). The transition to RUNNING happens at creation or resumption. The transition to a terminal state happens when the run is explicitly ended or when the context manager exits; in MLflow, a clean exit records FINISHED and an exit caused by an exception records FAILED. This state machine enables dashboards and queries to distinguish between in-progress and completed work.
3. Identity and Lineage: Every run receives a universally unique identifier at creation. Parent-child relationships are recorded via tags, allowing the system to reconstruct the hierarchy without requiring a dedicated relational schema for nesting. This tag-based approach keeps the run data model flat while still supporting arbitrary depth of nesting.
4. Automatic Metadata: At creation time, the system attaches contextual metadata (source file, git commit, user name, entry point) as system tags. This automatic enrichment reduces the burden on the practitioner while ensuring reproducibility-critical information is always captured.
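The first three points can be illustrated with a toy model built only on the standard library. This is a sketch of the pattern, not MLflow's implementation: `ToyRun`, `active_run`, and the `toy.parentRunId` tag key are hypothetical names introduced here for illustration.

```python
import threading
import uuid
from enum import Enum


class RunStatus(Enum):
    RUNNING = "RUNNING"
    FINISHED = "FINISHED"
    FAILED = "FAILED"
    KILLED = "KILLED"


_local = threading.local()  # each thread sees its own run stack


def _stack():
    if not hasattr(_local, "runs"):
        _local.runs = []
    return _local.runs


def active_run():
    """Return the most recently activated run in this thread, or None."""
    stack = _stack()
    return stack[-1] if stack else None


class ToyRun:
    """Hypothetical stand-in for a tracked run."""

    PARENT_TAG = "toy.parentRunId"  # illustrative tag key

    def __init__(self):
        self.run_id = uuid.uuid4().hex  # unique identifier assigned at creation
        self.status = None
        self.tags = {}

    def __enter__(self):
        stack = _stack()
        if stack:
            # Lineage is stored as a plain tag, so the run record stays flat.
            self.tags[self.PARENT_TAG] = stack[-1].run_id
        self.status = RunStatus.RUNNING
        stack.append(self)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Clean exit ends in FINISHED; an exception ends in FAILED.
        self.status = RunStatus.FAILED if exc_type else RunStatus.FINISHED
        _stack().pop()  # restores the parent as the active run
        return False  # never swallow the exception


with ToyRun() as parent:
    with ToyRun() as child:
        assert active_run() is child
    restored = active_run()  # the parent is active again

try:
    with ToyRun() as crashed:
        raise ValueError("simulated failure")
except ValueError:
    pass

print(parent.status.name, child.status.name, crashed.status.name)
# FINISHED FINISHED FAILED
```

Keeping the stack in `threading.local()` means runs started in one thread never become the implicit logging target in another.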