
Environment: MLflow GPU System Metrics Environment

From Leeroopedia
Domains: Infrastructure, Monitoring
Last Updated: 2026-02-13 20:00 GMT

Overview

A GPU and system metrics monitoring environment requiring `pynvml` (NVIDIA), `pyrsmi` (AMD ROCm), and `psutil` for hardware telemetry collection during MLflow runs.

Description

This environment provides the optional hardware monitoring capabilities for MLflow experiment tracking. When enabled via `MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING`, MLflow collects CPU utilization, memory usage, disk I/O, network statistics, and GPU metrics during training runs. NVIDIA GPU monitoring uses the `pynvml` library (from `nvidia-ml-py`), while AMD GPU monitoring uses `pyrsmi`. The base CPU/disk/network metrics require `psutil`.

Usage

Use this environment when you need hardware telemetry during model training or any MLflow run. Enable system metrics by setting `MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING=true` or passing `log_system_metrics=True` to `mlflow.start_run()`. GPU metrics are only collected when the corresponding GPU library is installed and a compatible GPU is detected.

System Requirements

| Category | Requirement | Notes |
| --- | --- | --- |
| OS | Linux (recommended), macOS, Windows | Full GPU support on Linux only |
| Hardware (NVIDIA) | NVIDIA GPU with NVML support | Any CUDA-capable GPU |
| Hardware (AMD) | AMD GPU with ROCm support | MI250x and compatible GPUs |
| Python | >= 3.10 | Same as core MLflow |

Dependencies

System Packages

  • NVIDIA GPU driver (for NVIDIA GPU monitoring)
  • ROCm driver (for AMD GPU monitoring)

Python Packages

  • `psutil` (CPU, memory, disk, network monitoring)
  • `nvidia-ml-py` (NVIDIA GPU monitoring, provides `pynvml`)
  • `pyrsmi` (AMD ROCm GPU monitoring)

Configuration

The following environment variables control system metrics collection:

  • `MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING`: Enable/disable system metrics (default: false)
  • `MLFLOW_SYSTEM_METRICS_SAMPLING_INTERVAL`: Sampling interval in seconds (default: 10)
  • `MLFLOW_SYSTEM_METRICS_SAMPLES_BEFORE_LOGGING`: Number of samples to aggregate before logging (default: 1)
  • `MLFLOW_SYSTEM_METRICS_NODE_ID`: Node identifier for distributed training scenarios
  • `MLFLOW_DEFAULT_PREDICTION_DEVICE`: Device for prediction ("cpu" or "cuda")
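For example, a training script can pin faster sampling before any run starts (the specific values below are illustrative, not recommended defaults):

```python
import os

# Environment variables must be set before mlflow.start_run() is called;
# all values are strings, as with any environment variable.
os.environ["MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING"] = "true"
os.environ["MLFLOW_SYSTEM_METRICS_SAMPLING_INTERVAL"] = "5"       # sample every 5 s
os.environ["MLFLOW_SYSTEM_METRICS_SAMPLES_BEFORE_LOGGING"] = "3"  # average 3 samples per logged point
```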

Quick Install

# Install system metrics dependencies
pip install psutil

# For NVIDIA GPU monitoring
pip install nvidia-ml-py

# For AMD GPU monitoring
pip install pyrsmi

# Enable system metrics logging
export MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING=true
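To confirm which backends an environment actually has, a quick probe (a sketch independent of MLflow's own import checks) is:

```python
import importlib.util

# Map each optional monitoring backend to the telemetry it unlocks.
backends = {
    "psutil": "CPU, memory, disk, network",
    "pynvml": "NVIDIA GPU (installed via nvidia-ml-py)",
    "pyrsmi": "AMD ROCm GPU",
}
available = {name: importlib.util.find_spec(name) is not None for name in backends}
for name, role in backends.items():
    print(f"{name:7s} covers {role}: {'available' if available[name] else 'missing'}")
```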

Code Evidence

GPU monitor initialization with error handling from `mlflow/system_metrics/metrics/gpu_monitor.py:10-32`:

import sys

try:
    import pynvml
except ImportError:
    pass

class GPUMonitor(BaseMetricsMonitor):
    def __init__(self):
        if "pynvml" not in sys.modules:
            raise ImportError(
                "`nvidia-ml-py` is not installed, to log GPU metrics please run "
                "`pip install nvidia-ml-py` to install it."
            )
        try:
            pynvml.nvmlInit()
        except pynvml.NVMLError as e:
            raise RuntimeError(
                f"Failed to initialize NVML, skip logging GPU metrics: {e}"
            )

GPU metrics collection from `mlflow/system_metrics/metrics/gpu_monitor.py:38-60`:

def collect_metrics(self):
    for i, handle in enumerate(self.gpu_handles):
        memory = pynvml.nvmlDeviceGetMemoryInfo(handle)
        self._metrics[f"gpu_{i}_memory_usage_percentage"].append(
            round(memory.used / memory.total * 100, 1)
        )
        self._metrics[f"gpu_{i}_memory_usage_megabytes"].append(memory.used / 1e6)
        device_utilization = pynvml.nvmlDeviceGetUtilizationRates(handle)
        self._metrics[f"gpu_{i}_utilization_percentage"].append(device_utilization.gpu)
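The same NVML calls can be exercised outside MLflow. This sketch guards against a missing library, driver, or GPU, so it degrades to a no-op on CPU-only machines:

```python
# Query per-GPU memory and utilization directly through NVML, mirroring
# what GPUMonitor.collect_metrics() samples during a run.
try:
    import pynvml

    pynvml.nvmlInit()
    gpu_count = pynvml.nvmlDeviceGetCount()
    for i in range(gpu_count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"gpu_{i}: {mem.used / 1e6:.0f} MB used "
              f"({mem.used / mem.total * 100:.1f}%), {util.gpu}% busy")
    pynvml.nvmlShutdown()
except Exception as exc:  # ImportError, or pynvml.NVMLError on hosts without a GPU
    gpu_count = 0
    print(f"GPU metrics unavailable: {exc}")
```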

System metrics flag from `mlflow/tracking/fluent.py:372-374`:

log_system_metrics: bool, defaults to None. If True, system metrics will be logged
    to MLflow, e.g., cpu/gpu utilization. If None, we will check environment variable
    `MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING` to determine whether to log system metrics.

Common Errors

| Error Message | Cause | Solution |
| --- | --- | --- |
| `nvidia-ml-py is not installed, to log GPU metrics...` | `pynvml` package not installed | `pip install nvidia-ml-py` |
| `Failed to initialize NVML, skip logging GPU metrics` | No NVIDIA GPU or driver issue | Install NVIDIA drivers or verify GPU availability |
| `NVMLError_LibraryNotFound` | NVML shared library not found | Install NVIDIA GPU driver |
| `ImportError: pyrsmi` | AMD ROCm monitoring library not installed | `pip install pyrsmi` |

Compatibility Notes

  • NVIDIA GPUs: Metrics include memory usage (percentage and MB), utilization percentage, power usage (watts and percentage). Requires NVIDIA driver with NVML support.
  • AMD GPUs (ROCm): Similar metrics to NVIDIA but uses the ROCm SMI Python interface. Requires ROCm driver installation.
  • CPU-only systems: GPU monitors are silently skipped when no GPU libraries are installed. CPU, disk, and network metrics still work with just `psutil`.
  • Distributed training: Use `MLFLOW_SYSTEM_METRICS_NODE_ID` to distinguish metrics from different nodes.
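For the distributed case, each worker can derive its own node identifier before starting its run. The hostname-based ID below is an assumption for illustration; any string that is unique per node works:

```python
import os
import socket

# Tag this node's system metrics so multi-node runs stay distinguishable;
# must be set before the worker calls mlflow.start_run().
os.environ["MLFLOW_SYSTEM_METRICS_NODE_ID"] = socket.gethostname()
```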
