Implementation:NVIDIA NeMo Curator Metrics Utils
| Knowledge Sources | |
|---|---|
| Domains | Monitoring, Metrics, Infrastructure |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Utility functions for downloading, configuring, and managing Prometheus and Grafana monitoring services within the NeMo Curator metrics subsystem.
Description
The utils module encapsulates all platform-specific logic for monitoring service lifecycle management. It provides nine functions that handle the complete workflow of downloading, extracting, configuring, running, and integrating Prometheus and Grafana with the NeMo Curator pipeline. Key capabilities include:
- Prometheus management: Download and extract the Prometheus binary using Ray's built-in helpers, start it as a background subprocess with custom configuration, check whether it is running via process iteration with psutil, and detect its active port from process arguments.
- Grafana management: Download and extract Grafana Enterprise for Linux x86_64, write provisioning configuration files (INI, datasource YAML, dashboard YAML), copy the bundled Xenna dashboard JSON, and launch the server as a background subprocess.
- Ray integration: Dynamically add Ray service discovery paths to the Prometheus configuration and trigger a hot-reload via a POST request to the Prometheus lifecycle API.
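As an illustration of the port-detection idea in the Prometheus bullet above, the following is a minimal sketch of extracting a port from a process argument list. It assumes the server was started with Prometheus's `--web.listen-address` flag; the helper name `parse_listen_port` and the example path are hypothetical, and the actual logic in `get_prometheus_port` may differ.

```python
from typing import Optional, Sequence


def parse_listen_port(cmdline: Sequence[str]) -> Optional[int]:
    """Return the port from a --web.listen-address argument, or None if absent."""
    flag = "--web.listen-address"
    for i, arg in enumerate(cmdline):
        if arg.startswith(flag + "="):
            # Form: --web.listen-address=:9090
            address = arg.split("=", 1)[1]
        elif arg == flag and i + 1 < len(cmdline):
            # Form: --web.listen-address :9090
            address = cmdline[i + 1]
        else:
            continue
        # The address looks like ":9090" or "0.0.0.0:9090"; the port follows the last colon.
        port_text = address.rsplit(":", 1)[-1]
        if port_text.isdigit():
            return int(port_text)
    return None


# Hypothetical argument list, shaped like what a process inspector would report.
example = [
    "/opt/prometheus/prometheus",
    "--config.file=prometheus.yml",
    "--web.listen-address=:9091",
]
print(parse_listen_port(example))
```

In practice the argument list would come from a process inspector such as psutil, which the module uses for its running checks.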
Usage
These utility functions are primarily consumed by the start_prometheus_grafana.py launcher script. They can also be called directly when finer control over the monitoring infrastructure is needed, for example to register Ray metric endpoints dynamically.
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/metrics/utils.py
- Lines: 1-259
Key Functions
def download_and_extract_prometheus(os_type=None, architecture=None, prometheus_version=None) -> str:
    """Download the Prometheus tarball and extract it."""

def is_prometheus_running() -> bool:
    """Check if Prometheus is currently running."""

def is_grafana_running() -> bool:
    """Check if Grafana is currently running."""

def get_prometheus_port() -> int:
    """Get the port number that Prometheus is running on."""

def run_prometheus(prometheus_dir: str, prometheus_web_port: int) -> None:
    """Run the Prometheus server as a background process."""

def download_grafana() -> str:
    """Download the Grafana tarball and extract it."""

def launch_grafana(grafana_dir: str, grafana_ini_path: str) -> None:
    """Launch the Grafana server as a background process."""

def write_grafana_configs(grafana_web_port: int, prometheus_web_port: int) -> str:
    """Write Grafana configuration files (INI, datasource, dashboard)."""

def add_ray_prometheus_metrics_service_discovery(ray_temp_dir: str) -> None:
    """Add Ray Prometheus metrics service discovery to the config."""
Import
from nemo_curator.metrics.utils import (
    download_and_extract_prometheus,
    is_prometheus_running,
    is_grafana_running,
    get_prometheus_port,
    run_prometheus,
    download_grafana,
    launch_grafana,
    write_grafana_configs,
    add_ray_prometheus_metrics_service_discovery,
)
I/O Contract
download_and_extract_prometheus
| Name | Type | Required | Description |
|---|---|---|---|
| os_type | str or None | No | Override OS type for download URL |
| architecture | str or None | No | Override architecture for download URL |
| prometheus_version | str or None | No | Override Prometheus version |
Returns str -- path to the extracted Prometheus directory.
run_prometheus
| Name | Type | Required | Description |
|---|---|---|---|
| prometheus_dir | str | Yes | Path to the extracted Prometheus directory |
| prometheus_web_port | int | Yes | Port to bind Prometheus web interface to |
Returns None -- starts Prometheus as a background subprocess.
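To make "starts Prometheus as a background subprocess" concrete, here is a hedged sketch of how such a launch could be assembled with the standard library. The helper names are hypothetical, and the config file location inside the extracted directory is an assumption; the flags `--config.file`, `--web.listen-address`, and `--web.enable-lifecycle` are standard Prometheus options, but the exact set used by `run_prometheus` is not confirmed here.

```python
import os
import subprocess
from typing import List


def build_prometheus_command(prometheus_dir: str, web_port: int) -> List[str]:
    """Assemble the argument list for a Prometheus server process."""
    return [
        os.path.join(prometheus_dir, "prometheus"),
        # Assumption: the config file sits next to the binary.
        "--config.file=" + os.path.join(prometheus_dir, "prometheus.yml"),
        f"--web.listen-address=:{web_port}",
        # Needed for the POST /-/reload endpoint used for hot-reloads.
        "--web.enable-lifecycle",
    ]


def start_in_background(command: List[str]) -> subprocess.Popen:
    """Start the process without tying it to the caller's stdout/stderr."""
    return subprocess.Popen(
        command,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # POSIX: run the child in its own session
    )
```

Keeping command construction separate from process creation makes the argument list easy to inspect or log before anything is launched.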
write_grafana_configs
| Name | Type | Required | Description |
|---|---|---|---|
| grafana_web_port | int | Yes | Port for the Grafana web interface |
| prometheus_web_port | int | Yes | Port of the running Prometheus instance (used in datasource config) |
Returns str -- path to the generated grafana.ini file.
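The sketch below shows roughly what the generated INI and datasource files could contain. It is an illustration under assumptions, not the module's actual output: the function names are hypothetical, and only the `[server] http_port` and `[paths] provisioning` INI keys and Grafana's documented datasource provisioning schema are relied upon.

```python
import configparser
import io


def render_grafana_ini(grafana_web_port: int, provisioning_dir: str) -> str:
    """Render a minimal grafana.ini with the HTTP port and provisioning path."""
    config = configparser.ConfigParser()
    config["server"] = {"http_port": str(grafana_web_port)}
    config["paths"] = {"provisioning": provisioning_dir}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


def render_prometheus_datasource(prometheus_web_port: int) -> str:
    """Render a datasource provisioning file pointing at a local Prometheus."""
    return (
        "apiVersion: 1\n"
        "datasources:\n"
        "  - name: Prometheus\n"
        "    type: prometheus\n"
        "    access: proxy\n"
        f"    url: http://localhost:{prometheus_web_port}\n"
        "    isDefault: true\n"
    )


# Hypothetical provisioning directory, used only for illustration.
print(render_grafana_ini(3000, "/tmp/grafana/provisioning"))
print(render_prometheus_datasource(9090))
```

The `prometheus_web_port` argument matters because the datasource URL is what connects Grafana's dashboards to the running Prometheus instance.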
add_ray_prometheus_metrics_service_discovery
| Name | Type | Required | Description |
|---|---|---|---|
| ray_temp_dir | str | Yes | Path to the Ray temporary directory containing prom_metrics_service_discovery.json |
Returns None -- modifies the Prometheus config and sends a hot-reload request.
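The two steps described above (modify the config, then request a reload) can be sketched as follows. This is a minimal illustration that assumes the Prometheus config has already been parsed into a dictionary; the helper names and the `ray` job name are hypothetical. `file_sd_configs` and the `POST /-/reload` endpoint are standard Prometheus features, and the reload endpoint is only available when Prometheus runs with `--web.enable-lifecycle`.

```python
import os
import urllib.request
from typing import Any, Dict


def add_file_sd_job(config: Dict[str, Any], ray_temp_dir: str, job_name: str = "ray") -> Dict[str, Any]:
    """Append a file-based service discovery job unless the file is already registered."""
    sd_file = os.path.join(ray_temp_dir, "prom_metrics_service_discovery.json")
    scrape_configs = config.setdefault("scrape_configs", [])
    for job in scrape_configs:
        for sd in job.get("file_sd_configs", []):
            if sd_file in sd.get("files", []):
                return config  # already registered; leave the config unchanged
    scrape_configs.append({"job_name": job_name, "file_sd_configs": [{"files": [sd_file]}]})
    return config


def build_reload_request(prometheus_port: int) -> urllib.request.Request:
    """Build (but do not send) the POST request that asks Prometheus to reload."""
    url = f"http://localhost:{prometheus_port}/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


config = add_file_sd_job({}, "/tmp/ray/session_latest")
request = build_reload_request(9090)
print(config["scrape_configs"][0]["file_sd_configs"])
print(request.get_method(), request.full_url)
```

Sending the request would be a call to `urllib.request.urlopen(request)` once Prometheus is up. Job names must be unique within a Prometheus config, so a different `job_name` is needed if `ray` is already taken by another job.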
Usage Examples
Basic Usage
from nemo_curator.metrics.utils import (
    download_and_extract_prometheus,
    run_prometheus,
    is_prometheus_running,
)

# Download and start Prometheus
prometheus_dir = download_and_extract_prometheus()
run_prometheus(prometheus_dir, prometheus_web_port=9090)

# Check if running
if is_prometheus_running():
    print("Prometheus is active")
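Because Prometheus is started as a background subprocess, its web port may not accept connections the instant the launch call returns. The following standard-library sketch polls a TCP port until it is reachable; `wait_for_port` is a hypothetical helper and not part of the utils module.

```python
import socket
import time


def wait_for_port(port: int, host: str = "127.0.0.1", timeout: float = 10.0, interval: float = 0.2) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

A caller could run `wait_for_port(9090)` after starting Prometheus and before issuing requests against its HTTP API.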
Add Ray Service Discovery
from nemo_curator.metrics.utils import add_ray_prometheus_metrics_service_discovery
# Register Ray metrics endpoint with running Prometheus
add_ray_prometheus_metrics_service_discovery("/tmp/ray/session_latest")
Related Pages
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
- NVIDIA_NeMo_Curator_Start_Prometheus_Grafana -- Launcher script that uses these utilities