Environment: MLflow GPU System Metrics Environment
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, Monitoring |
| Last Updated | 2026-02-13 20:00 GMT |
Overview
GPU and system metrics monitoring environment requiring pynvml (NVIDIA), pyrsmi (AMD ROCm), and psutil for hardware telemetry collection during MLflow runs.
Description
This environment provides the optional hardware monitoring capabilities for MLflow experiment tracking. When enabled via `MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING`, MLflow collects CPU utilization, memory usage, disk I/O, network statistics, and GPU metrics during training runs. NVIDIA GPU monitoring uses the `pynvml` library (from `nvidia-ml-py`), while AMD GPU monitoring uses `pyrsmi`. The base CPU/disk/network metrics require `psutil`.
Usage
Use this environment when you need hardware telemetry during model training or any MLflow run. Enable system metrics by setting `MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING=true` or passing `log_system_metrics=True` to `mlflow.start_run()`. GPU metrics are only collected when the corresponding GPU library is installed and a compatible GPU is detected.
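A minimal way to enable collection is shown below; the process-wide toggle is plain `os.environ`, while the per-run override is sketched in comments since it assumes MLflow is installed:

```python
import os

# Process-wide toggle, equivalent to `export MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING=true`
os.environ["MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING"] = "true"

# Per-run override (sketch; requires mlflow installed):
# import mlflow
# with mlflow.start_run(log_system_metrics=True):
#     ...  # training code; hardware metrics are sampled in the background
```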
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (recommended), macOS, Windows | Full GPU support on Linux only |
| Hardware (NVIDIA) | NVIDIA GPU with NVML support | Any CUDA-capable GPU |
| Hardware (AMD) | AMD GPU with ROCm support | MI250x and compatible GPUs |
| Python | >= 3.10 | Same as core MLflow |
Dependencies
System Packages
- NVIDIA GPU driver (for NVIDIA GPU monitoring)
- ROCm driver (for AMD GPU monitoring)
Python Packages
- `psutil` (CPU, memory, disk, network monitoring)
- `nvidia-ml-py` (NVIDIA GPU monitoring, provides `pynvml`)
- `pyrsmi` (AMD ROCm GPU monitoring)
Configuration
The following environment variables control system metrics collection:
- `MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING`: Enable/disable system metrics (default: false)
- `MLFLOW_SYSTEM_METRICS_SAMPLING_INTERVAL`: Sampling interval in seconds (default: 10)
- `MLFLOW_SYSTEM_METRICS_SAMPLES_BEFORE_LOGGING`: Number of samples to aggregate before logging (default: 1)
- `MLFLOW_SYSTEM_METRICS_NODE_ID`: Node identifier for distributed training scenarios
- `MLFLOW_DEFAULT_PREDICTION_DEVICE`: Device for prediction ("cpu" or "cuda")
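The two sampling variables interact: one data point is logged roughly every `SAMPLING_INTERVAL × SAMPLES_BEFORE_LOGGING` seconds, with the intermediate samples aggregated. A sketch of that cadence (the function name is illustrative, not an MLflow internal):

```python
def logging_cadence_seconds(sampling_interval: float = 10.0,
                            samples_before_logging: int = 1) -> float:
    """Seconds between logged data points: each logged value aggregates
    `samples_before_logging` samples taken `sampling_interval` apart."""
    return sampling_interval * samples_before_logging

# Defaults: one sample every 10 s, logged immediately
print(logging_cadence_seconds())      # 10.0
# Sample every 5 s, aggregate 6 samples -> one logged point every 30 s
print(logging_cadence_seconds(5, 6))  # 30.0
```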
Quick Install
```bash
# Install system metrics dependencies
pip install psutil

# For NVIDIA GPU monitoring
pip install nvidia-ml-py

# For AMD GPU monitoring
pip install pyrsmi

# Enable system metrics logging
export MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING=true
```
Code Evidence
GPU monitor initialization with error handling from `mlflow/system_metrics/metrics/gpu_monitor.py:10-32`:
```python
try:
    import pynvml
except ImportError:
    pass


class GPUMonitor(BaseMetricsMonitor):
    def __init__(self):
        if "pynvml" not in sys.modules:
            raise ImportError(
                "`nvidia-ml-py` is not installed, to log GPU metrics please run "
                "`pip install nvidia-ml-py` to install it."
            )
        try:
            pynvml.nvmlInit()
        except pynvml.NVMLError as e:
            raise RuntimeError(
                f"Failed to initialize NVML, skip logging GPU metrics: {e}"
            )
```
GPU metrics collection from `mlflow/system_metrics/metrics/gpu_monitor.py:38-60`:
```python
def collect_metrics(self):
    for i, handle in enumerate(self.gpu_handles):
        memory = pynvml.nvmlDeviceGetMemoryInfo(handle)
        self._metrics[f"gpu_{i}_memory_usage_percentage"].append(
            round(memory.used / memory.total * 100, 1)
        )
        self._metrics[f"gpu_{i}_memory_usage_megabytes"].append(memory.used / 1e6)

        device_utilization = pynvml.nvmlDeviceGetUtilizationRates(handle)
        self._metrics[f"gpu_{i}_utilization_percentage"].append(device_utilization.gpu)
```
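The memory-percentage arithmetic in `collect_metrics` can be checked in isolation; this standalone sketch substitutes a fake handle object for `pynvml` (all names here are illustrative):

```python
from types import SimpleNamespace

# Fake NVML memory info: 8 GiB total, 2 GiB used (bytes, like nvmlDeviceGetMemoryInfo)
memory = SimpleNamespace(used=2 * 1024**3, total=8 * 1024**3)

usage_pct = round(memory.used / memory.total * 100, 1)  # percentage, 1 decimal
usage_mb = memory.used / 1e6                            # decimal megabytes

print(usage_pct)           # 25.0
print(round(usage_mb, 1))  # 2147.5
```

Note the unit mix: the percentage is computed from raw bytes, while the megabyte figure divides by 1e6 (decimal MB), so 2 GiB reports as ~2147 MB rather than 2048.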
System metrics flag from `mlflow/tracking/fluent.py:372-374`:
```text
log_system_metrics: bool, defaults to None. If True, system metrics will be logged
to MLflow, e.g., cpu/gpu utilization. If None, we will check environment variable
`MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING` to determine whether to log system metrics.
```
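That resolution order (explicit flag first, environment variable as fallback) can be expressed as a small helper; `should_log` is a sketch, not an MLflow function:

```python
import os

def should_log(log_system_metrics=None):
    """Explicit argument wins; otherwise fall back to the environment variable."""
    if log_system_metrics is not None:
        return log_system_metrics
    return os.environ.get("MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING", "false").lower() == "true"

os.environ["MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING"] = "true"
print(should_log())       # True  (env var fallback)
print(should_log(False))  # False (explicit flag overrides env)
```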
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `nvidia-ml-py is not installed, to log GPU metrics...` | pynvml package not installed | `pip install nvidia-ml-py` |
| `Failed to initialize NVML, skip logging GPU metrics` | No NVIDIA GPU or driver issue | Install NVIDIA drivers or verify GPU availability |
| `NVMLError_LibraryNotFound` | NVML shared library not found | Install NVIDIA GPU driver |
| `ImportError: pyrsmi` | AMD ROCm monitoring library not installed | `pip install pyrsmi` |
Compatibility Notes
- NVIDIA GPUs: Metrics include memory usage (percentage and MB), utilization percentage, power usage (watts and percentage). Requires NVIDIA driver with NVML support.
- AMD GPUs (ROCm): Similar metrics to NVIDIA but uses the ROCm SMI Python interface. Requires ROCm driver installation.
- CPU-only systems: GPU monitors are silently skipped when no GPU libraries are installed. CPU, disk, and network metrics still work with just `psutil`.
- Distributed training: Use `MLFLOW_SYSTEM_METRICS_NODE_ID` to distinguish metrics from different nodes.
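In a multi-node job, each worker can derive its node ID from the launcher's environment before the run starts; `GROUP_RANK` here is a torchrun-style variable used only as an example:

```python
import os

# Derive a stable per-node identifier from the launcher environment
# (torchrun sets GROUP_RANK per node; default to "0" for single-node runs)
node_id = os.environ.get("GROUP_RANK", "0")
os.environ["MLFLOW_SYSTEM_METRICS_NODE_ID"] = f"node-{node_id}"
print(os.environ["MLFLOW_SYSTEM_METRICS_NODE_ID"])
```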