Heuristic:MLflow Async Logging Best Practices

From Leeroopedia
Knowledge Sources
Domains Optimization, Experiment_Tracking
Last Updated 2026-02-13 20:00 GMT

Overview

Use asynchronous logging to avoid blocking the training loop when logging metrics, parameters, and artifacts to an MLflow tracking server.

Description

MLflow supports asynchronous logging via the `MLFLOW_ENABLE_ASYNC_LOGGING` environment variable. When enabled, calls to `log_metric()`, `log_param()`, `log_batch()`, and artifact logging operations are dispatched to a thread pool instead of executing synchronously. This prevents the training loop from being blocked by network I/O to the tracking server. The thread pool size is configurable, and an optional buffering window can batch log operations for improved throughput.
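A minimal configuration sketch. The variable names come from `mlflow/environment_variables.py` (quoted below); the 5-second buffering window is an illustrative choice, not a default (the buffering variable is unset by default):

```python
import os

# Enable async logging before the first MLflow logging call; setting these
# up front ensures MLflow sees them when it initializes its logging queue.
os.environ["MLFLOW_ENABLE_ASYNC_LOGGING"] = "true"

# Optional tuning knobs:
os.environ["MLFLOW_ASYNC_LOGGING_THREADPOOL_SIZE"] = "10"   # worker threads (10 is the default)
os.environ["MLFLOW_ASYNC_LOGGING_BUFFERING_SECONDS"] = "5"  # batch window; 5 s is illustrative
```

With these set, fluent calls such as `mlflow.log_metric()` return quickly; remember to call `mlflow.flush_async_logging()` before exit.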

Usage

Use this heuristic when you are logging metrics at high frequency during training (e.g., every step or every few steps) and the tracking server is remote (not file-based). The synchronous overhead of HTTP requests can measurably slow down training when logging thousands of metrics. This is especially relevant for distributed training scenarios where multiple workers log simultaneously.
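A back-of-envelope estimate of that overhead (the 50 ms round-trip and step counts are illustrative assumptions, not measurements):

```python
# Synchronous logging overhead against a remote tracking server.
round_trip_s = 0.050        # assumed 50 ms HTTP round-trip per logging call
steps = 10_000
log_every = 10              # log loss every 10 steps
calls = steps // log_every  # 1,000 logging calls
overhead_s = calls * round_trip_s
print(f"{overhead_s:.0f} s spent blocked on logging I/O")  # about 50 s
```

At higher logging frequencies or latencies the blocked time grows linearly, which is why async dispatch pays off.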

The Insight (Rule of Thumb)

  • Action: Set `MLFLOW_ENABLE_ASYNC_LOGGING=true` before starting your MLflow run.
  • Value: Thread pool default is 10 workers (`MLFLOW_ASYNC_LOGGING_THREADPOOL_SIZE=10`). Optionally set `MLFLOW_ASYNC_LOGGING_BUFFERING_SECONDS` to batch operations.
  • Trade-off: Logging errors are deferred and may not be immediately visible. You must call `mlflow.flush_async_logging()` before program exit to ensure all pending data is flushed.
  • Compatibility: Works with all fluent API logging functions (`log_metric`, `log_param`, `log_batch`, artifact logging).

Reasoning

Synchronous logging creates a network round-trip for every API call. In high-frequency logging scenarios this adds up: logging loss every 10 steps against a tracking server with a 50 ms round-trip costs about 5 seconds per 1,000 training steps (100 calls at 50 ms each), or 50 seconds per 1,000 logging calls. Async logging moves this I/O to a background thread pool, allowing the main training thread to continue uninterrupted. The buffering window can further reduce overhead by batching multiple log calls into fewer HTTP requests.
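The dispatch-and-flush pattern can be modeled with the standard library alone. This is a conceptual sketch of the mechanism, not MLflow's actual internals:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of the async-logging pattern: log calls enqueue work on a thread
# pool and return immediately; flush() blocks until everything pending
# has been sent.
class AsyncLogger:
    def __init__(self, max_workers=10):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._pending = []
        self.records = []  # stand-in for data received by the tracking server

    def log_metric(self, key, value, step):
        # Returns immediately; the "network I/O" runs on a worker thread.
        self._pending.append(self._pool.submit(self._send, key, value, step))

    def _send(self, key, value, step):
        self.records.append((key, value, step))  # stand-in for an HTTP request

    def flush(self):
        # Without this, data still in flight could be lost at interpreter exit.
        for fut in self._pending:
            fut.result()
        self._pending.clear()

logger = AsyncLogger()
for step in range(100):
    logger.log_metric("loss", 1.0 / (step + 1), step)
logger.flush()  # all 100 records are now guaranteed delivered
```

The same reasoning explains why MLflow requires an explicit flush: the main thread can finish (and the process can exit) while log operations are still queued on the pool.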

The MLflow codebase provides three flush functions that must be called at appropriate times:

  • `flush_async_logging()` — Flushes metric/param/tag logging
  • `flush_artifact_async_logging()` — Flushes artifact uploads
  • `flush_trace_async_logging()` — Flushes trace data

Code evidence from `mlflow/tracking/fluent.py:864-881`:

def flush_async_logging():
    """Flush all pending async logging."""
    ...

def _shut_down_async_logging():
    """Shut down all async logging threads."""
    ...

def flush_artifact_async_logging():
    """Flush pending artifact async logging."""
    ...

def flush_trace_async_logging(terminate=False):
    """Flush pending trace async logging."""
    ...
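The three flushes above can be combined into one shutdown helper. A hedged sketch, assuming these functions are re-exported at the top-level `mlflow` package as fluent API functions typically are (verify against your MLflow version):

```python
def flush_all_async_logging():
    """Flush every async logging channel before process exit (sketch)."""
    import mlflow  # deferred import; assumes MLflow is installed

    mlflow.flush_async_logging()                      # metrics, params, tags
    mlflow.flush_artifact_async_logging()             # artifact uploads
    mlflow.flush_trace_async_logging(terminate=True)  # trace data; stop trace workers
```

Registering this once, e.g. via `atexit.register(flush_all_async_logging)`, guards against data loss if the program exits early.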

Configuration from `mlflow/environment_variables.py`:

MLFLOW_ENABLE_ASYNC_LOGGING = _BooleanEnvironmentVariable(
    "MLFLOW_ENABLE_ASYNC_LOGGING", False
)
MLFLOW_ASYNC_LOGGING_THREADPOOL_SIZE = _EnvironmentVariable(
    "MLFLOW_ASYNC_LOGGING_THREADPOOL_SIZE", int, 10
)
MLFLOW_ASYNC_LOGGING_BUFFERING_SECONDS = _EnvironmentVariable(
    "MLFLOW_ASYNC_LOGGING_BUFFERING_SECONDS", int, None
)
