
Implementation:NVIDIA NeMo Curator Metrics Utils

Knowledge Sources
Domains: Monitoring, Metrics, Infrastructure
Last Updated: 2026-02-14 00:00 GMT

Overview

Utility functions for downloading, configuring, and managing Prometheus and Grafana monitoring services within the NeMo Curator metrics subsystem.

Description

The utils module encapsulates all platform-specific logic for monitoring service lifecycle management. It provides nine functions that handle the complete workflow of downloading, extracting, configuring, running, and integrating Prometheus and Grafana with the NeMo Curator pipeline. Key capabilities include:

  • Prometheus management: Download and extract the Prometheus binary using Ray's built-in helpers, start it as a background subprocess with custom configuration, check whether it is running via process iteration with psutil, and detect its active port from process arguments.
  • Grafana management: Download and extract Grafana Enterprise for Linux x86_64, write provisioning configuration files (INI, datasource YAML, dashboard YAML), copy the bundled Xenna dashboard JSON, and launch the server as a background subprocess.
  • Ray integration: Dynamically add Ray service discovery paths to the Prometheus configuration and trigger a hot-reload via a POST request to the Prometheus lifecycle API.
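The port detection described above can be illustrated with a small, dependency-free sketch. It assumes the port is read from Prometheus's `--web.listen-address` flag; the helper name `parse_listen_port` is hypothetical, and the actual parsing in utils.py may differ. In practice the argument list would come from psutil (for example `proc.cmdline()`).

```python
from typing import Optional, Sequence


def parse_listen_port(cmdline: Sequence[str], default: int = 9090) -> Optional[int]:
    """Extract the port from a --web.listen-address flag (hypothetical helper).

    Accepts "--web.listen-address=HOST:PORT" and the two-token form
    "--web.listen-address HOST:PORT". Returns `default` when the flag is
    absent and None when the value has no numeric port.
    """
    flag = "--web.listen-address"
    value = None
    for i, arg in enumerate(cmdline):
        if arg.startswith(flag + "="):
            value = arg.split("=", 1)[1]
        elif arg == flag and i + 1 < len(cmdline):
            value = cmdline[i + 1]
    if value is None:
        return default
    # The port is whatever follows the last colon, e.g. ":9090" or "0.0.0.0:9090".
    _, _, port = value.rpartition(":")
    return int(port) if port.isdigit() else None
```

Keeping the parsing separate from process iteration makes it testable without a running Prometheus instance.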

Usage

These utility functions are primarily consumed by the start_prometheus_grafana.py launcher script, but can also be used independently when more granular control over the monitoring infrastructure is needed, such as dynamically registering Ray metric endpoints.

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/metrics/utils.py
  • Lines: 1-259

Key Functions

def download_and_extract_prometheus(os_type=None, architecture=None, prometheus_version=None) -> str:
    """Download the Prometheus tarball and extract it."""

def is_prometheus_running() -> bool:
    """Check if Prometheus is currently running."""

def is_grafana_running() -> bool:
    """Check if Grafana is currently running."""

def get_prometheus_port() -> int:
    """Get the port number that Prometheus is running on."""

def run_prometheus(prometheus_dir: str, prometheus_web_port: int) -> None:
    """Run the Prometheus server as a background process."""

def download_grafana() -> str:
    """Download the Grafana tarball and extract it."""

def launch_grafana(grafana_dir: str, grafana_ini_path: str) -> None:
    """Launch the Grafana server as a background process."""

def write_grafana_configs(grafana_web_port: int, prometheus_web_port: int) -> str:
    """Write Grafana configuration files (INI, datasource, dashboard)."""

def add_ray_prometheus_metrics_service_discovery(ray_temp_dir: str) -> None:
    """Add Ray Prometheus metrics service discovery to the config."""

Import

from nemo_curator.metrics.utils import (
    download_and_extract_prometheus,
    is_prometheus_running,
    is_grafana_running,
    get_prometheus_port,
    run_prometheus,
    download_grafana,
    launch_grafana,
    write_grafana_configs,
    add_ray_prometheus_metrics_service_discovery,
)

I/O Contract

download_and_extract_prometheus

Name | Type | Required | Description
os_type | str or None | No | Override OS type for download URL
architecture | str or None | No | Override architecture for download URL
prometheus_version | str or None | No | Override Prometheus version

Returns str -- path to the extracted Prometheus directory.
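To show how the three overrides could feed into a download URL, here is a minimal sketch that follows the naming pattern of the public Prometheus release tarballs on GitHub. The function `prometheus_download_url` and the alias table are hypothetical; the module itself relies on Ray's built-in helpers, whose defaults are not reproduced here. The sketch only covers `.tar.gz` artifacts.

```python
import platform
from typing import Optional

# Map common machine names to the architecture labels used in release file names.
_ARCH_ALIASES = {"x86_64": "amd64", "amd64": "amd64", "aarch64": "arm64", "arm64": "arm64"}


def prometheus_download_url(
    version: str,
    os_type: Optional[str] = None,
    architecture: Optional[str] = None,
) -> str:
    """Build a release tarball URL (hypothetical helper, not the module's code)."""
    # Fall back to the local platform when no override is given.
    os_name = (os_type or platform.system()).lower()
    machine = (architecture or platform.machine()).lower()
    arch = _ARCH_ALIASES.get(machine, machine)
    file_name = f"prometheus-{version}.{os_name}-{arch}.tar.gz"
    return f"https://github.com/prometheus/prometheus/releases/download/v{version}/{file_name}"
```

For example, `prometheus_download_url("2.48.1", "linux", "amd64")` yields a URL ending in `prometheus-2.48.1.linux-amd64.tar.gz`; the version number here is only illustrative.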

run_prometheus

Name | Type | Required | Description
prometheus_dir | str | Yes | Path to the extracted Prometheus directory
prometheus_web_port | int | Yes | Port to bind Prometheus web interface to

Returns None -- starts Prometheus as a background subprocess.
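A background launch of this kind can be sketched with the standard library. The flags `--config.file`, `--web.listen-address`, and `--web.enable-lifecycle` are real Prometheus options, but the config file location, the log handling, and both helper names below are assumptions rather than the module's actual behavior.

```python
import subprocess
from pathlib import Path
from typing import List


def build_prometheus_command(prometheus_dir: str, web_port: int) -> List[str]:
    """Assemble the argv for a Prometheus server (hypothetical helper)."""
    base = Path(prometheus_dir)
    return [
        str(base / "prometheus"),
        # Assumption: the config file sits next to the binary.
        f"--config.file={base / 'prometheus.yml'}",
        f"--web.listen-address=:{web_port}",
        # Needed for hot-reloads through the HTTP lifecycle API.
        "--web.enable-lifecycle",
    ]


def start_in_background(command: List[str], log_path: str) -> subprocess.Popen:
    """Start the command detached from the caller, appending output to a log."""
    with open(log_path, "ab") as log_file:
        # The child keeps its own copy of the log handle after this block exits.
        return subprocess.Popen(
            command,
            stdout=log_file,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # POSIX: do not die with the parent's session
        )
```

Separating command construction from process creation lets the arguments be inspected or logged before anything is started.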

write_grafana_configs

Name | Type | Required | Description
grafana_web_port | int | Yes | Port for the Grafana web interface
prometheus_web_port | int | Yes | Port of the running Prometheus instance (used in datasource config)

Returns str -- path to the generated grafana.ini file.
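The two ports typically end up in two different files: the Grafana port in grafana.ini and the Prometheus port in a provisioned datasource. The sketch below renders both as strings using Grafana's documented INI and provisioning formats. The helper names, the datasource name, and the `localhost` host are assumptions; the files actually written by `write_grafana_configs` may contain more settings.

```python
import configparser
import io


def render_grafana_ini(web_port: int, provisioning_dir: str) -> str:
    """Render a minimal grafana.ini (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser["server"] = {"http_port": str(web_port)}
    # Grafana reads datasource and dashboard providers from this directory.
    parser["paths"] = {"provisioning": provisioning_dir}
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


def render_datasource_yaml(prometheus_port: int) -> str:
    """Render a datasource provisioning file that points at Prometheus."""
    return (
        "apiVersion: 1\n"
        "datasources:\n"
        "  - name: Prometheus\n"
        "    type: prometheus\n"
        "    access: proxy\n"
        f"    url: http://localhost:{prometheus_port}\n"
        "    isDefault: true\n"
    )
```

In a Grafana provisioning layout, the datasource file would be placed under `provisioning/datasources/` inside the directory named in the `[paths]` section.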

add_ray_prometheus_metrics_service_discovery

Name | Type | Required | Description
ray_temp_dir | str | Yes | Path to the Ray temporary directory containing prom_metrics_service_discovery.json

Returns None -- modifies the Prometheus config and sends a hot-reload request.
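The config modification can be pictured as adding one path to a `file_sd_configs` entry in the parsed Prometheus configuration. The sketch below works on a plain dictionary, as produced by a YAML loader, and is idempotent. The helper name `add_file_sd_path` and the job name `"ray"` are assumptions; the scrape job layout that the module actually edits may differ.

```python
import os

SD_FILE_NAME = "prom_metrics_service_discovery.json"


def add_file_sd_path(config: dict, ray_temp_dir: str, job_name: str = "ray") -> dict:
    """Register Ray's service discovery file in a parsed Prometheus config.

    Hypothetical helper: mutates and returns `config`.
    """
    sd_path = os.path.join(ray_temp_dir, SD_FILE_NAME)
    scrape_configs = config.setdefault("scrape_configs", [])
    # Reuse the scrape job if it already exists, otherwise create it.
    job = next((j for j in scrape_configs if j.get("job_name") == job_name), None)
    if job is None:
        job = {"job_name": job_name, "file_sd_configs": []}
        scrape_configs.append(job)
    file_sd = job.setdefault("file_sd_configs", [])
    if not file_sd:
        file_sd.append({"files": []})
    files = file_sd[0].setdefault("files", [])
    if sd_path not in files:  # avoid duplicate entries on repeated calls
        files.append(sd_path)
    return config
```

After the dictionary is written back to the config file, Prometheus still needs a reload before it starts watching the new path.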

Usage Examples

Basic Usage

from nemo_curator.metrics.utils import (
    download_and_extract_prometheus,
    run_prometheus,
    is_prometheus_running,
)

# Download and start Prometheus
prometheus_dir = download_and_extract_prometheus()
run_prometheus(prometheus_dir, prometheus_web_port=9090)

# Check if running
if is_prometheus_running():
    print("Prometheus is active")
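Because Prometheus is started as a background subprocess, a check made immediately after launch may run before the process is visible; whether that matters depends on how `run_prometheus` is implemented. A small polling helper is one way to guard against it. `wait_until` is a hypothetical utility, not part of the module.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool], timeout: float = 10.0, interval: float = 0.25) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With the functions imported above, this could be used as `wait_until(is_prometheus_running, timeout=15)`.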

Add Ray Service Discovery

from nemo_curator.metrics.utils import add_ray_prometheus_metrics_service_discovery

# Register Ray's service discovery file with the running Prometheus instance.
# The directory passed in must contain prom_metrics_service_discovery.json.
add_ray_prometheus_metrics_service_discovery("/tmp/ray/session_latest")
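The hot-reload step is a POST to Prometheus's `/-/reload` endpoint, which is only available when the server was started with `--web.enable-lifecycle`. The following is a standard-library sketch of such a request; the helper names and the `localhost` default are assumptions, and the module may use a different HTTP client.

```python
import urllib.request


def build_reload_request(port: int, host: str = "localhost") -> urllib.request.Request:
    """Create the POST request for the Prometheus lifecycle reload endpoint."""
    return urllib.request.Request(f"http://{host}:{port}/-/reload", data=b"", method="POST")


def reload_prometheus(port: int, timeout: float = 5.0) -> bool:
    """Ask Prometheus to re-read its config; return True on HTTP 200."""
    try:
        with urllib.request.urlopen(build_reload_request(port), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection failures as well as HTTP error responses.
        return False
```

Returning a boolean instead of raising keeps the caller's control flow simple when Prometheus is not reachable.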
