Heuristic:Mlflow Mlflow Nested Run Organization
| Knowledge Sources | |
|---|---|
| Domains | Experiment_Tracking, Best_Practices |
| Last Updated | 2026-02-13 20:00 GMT |
Overview
Organization pattern for hierarchical experiment tracking using nested MLflow runs with thread-local run stack management.
Description
MLflow maintains a thread-local stack of active runs that supports hierarchical (parent-child) run organization. Nested runs are useful for hyperparameter sweeps, cross-validation folds, or multi-stage training pipelines where each sub-task should be grouped under a parent run. The run stack is strictly thread-local, meaning each thread has its own isolated stack of active runs.
Usage
Use this heuristic when you need to organize related runs hierarchically, for example a hyperparameter search where each trial is a child run, or a pipeline with distinct training stages. It also matters when working with multi-threaded training code, because the active run is not shared across threads.
The Insight (Rule of Thumb)
- Action: Use `mlflow.start_run(nested=True)` or `mlflow.start_run(parent_run_id=...)` to create child runs.
- Value: The parent run must be in the ACTIVE state. Each thread can hold only one non-nested active run; call `end_run()` before starting another, or pass `nested=True`.
- Trade-off: Because run stacks are thread-local, `mlflow.active_run()` returns only the run started in the current thread. Multi-threaded code must manage runs explicitly, for example by passing the parent run ID to each worker thread.
Pattern:
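import mlflow

with mlflow.start_run() as parent_run:
    # Parameters logged here belong to the parent run
    mlflow.log_param("search_space", "lr=[1e-4, 1e-2]")
    for lr in [1e-4, 1e-3, 1e-2]:
        # Each trial becomes a child run grouped under the parent
        with mlflow.start_run(nested=True) as child_run:
            mlflow.log_param("lr", lr)
            # train model...
            acc = 0.0  # placeholder for the evaluated accuracy
            mlflow.log_metric("accuracy", acc)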
Reasoning
The thread-local run stack design ensures thread safety but creates constraints that developers need to understand:
Thread-local run stack from `mlflow/tracking/fluent.py:120-124`:
_active_run_stack = ThreadLocalVariable(default_factory=lambda: [])
_last_active_run_id = ThreadLocalVariable(default_factory=lambda: None)
_last_logged_model_id = ThreadLocalVariable(default_factory=lambda: None)
Parent run validation from `mlflow/tracking/fluent.py:325-374`:
def start_run(
    run_id=None, experiment_id=None, run_name=None,
    nested=False, parent_run_id=None, ...
):
    # Parent run must be in ACTIVE state
    # parent_run_id must match current active run if specified explicitly
    # Only one non-nested run per thread without end_run()
Thread-safety limitation documented in docstring at `mlflow/tracking/fluent.py:650-677`:
# active_run() is thread-local and returns only the active run
# in the current thread. If a run is started in a different thread,
# this API will not retrieve that run.
Cross-thread run access (internal) from `mlflow/tracking/fluent.py:743-754`:
def _get_latest_active_run():
    all_active_runs = [
        run for run_stack in _active_run_stack.get_all_thread_values().values()
        for run in run_stack
    ]
    if all_active_runs:
        return max(all_active_runs, key=lambda run: run.info.start_time)
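To make the interplay between per-thread stacks and the cross-thread lookup concrete, here is a standard-library model of the idea. It is a simplified stand-in, not MLflow's `ThreadLocalVariable`: the classes `FakeRun` and `PerThreadStacks` and the functions `latest_active_run` and `demo` are hypothetical, and the start times are hard-coded to keep the result deterministic.

```python
import threading
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class FakeRun:
    """Hypothetical stand-in for a run; holds only what the lookup needs."""
    run_id: str
    start_time: float


class PerThreadStacks:
    """Simplified model: one run stack per thread, also readable globally."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._stacks: Dict[int, List[FakeRun]] = {}

    def get(self) -> List[FakeRun]:
        # Each thread sees only the stack stored under its own identifier.
        with self._lock:
            return self._stacks.setdefault(threading.get_ident(), [])

    def get_all_thread_values(self) -> Dict[int, List[FakeRun]]:
        # Snapshot of every thread's stack, keyed by thread identifier.
        with self._lock:
            return dict(self._stacks)


def latest_active_run(stacks: PerThreadStacks) -> Optional[FakeRun]:
    """Flatten all stacks and return the run with the latest start time."""
    all_runs = [
        run for stack in stacks.get_all_thread_values().values() for run in stack
    ]
    return max(all_runs, key=lambda run: run.start_time) if all_runs else None


def demo() -> Tuple[int, str]:
    stacks = PerThreadStacks()
    stacks.get().append(FakeRun("main-run", start_time=1.0))
    seen_by_worker: List[int] = []

    def worker() -> None:
        # The worker's own stack is empty: the main thread's run is not visible.
        seen_by_worker.append(len(stacks.get()))
        stacks.get().append(FakeRun("worker-run", start_time=2.0))

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    latest = latest_active_run(stacks)
    return seen_by_worker[0], latest.run_id
```

In this model `demo()` returns `(0, "worker-run")`: the worker starts with an empty stack, while the global lookup still finds the most recently started run across all threads.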