# Heuristic: NVIDIA NeMo Curator GPU Memory Resource Allocation
| Knowledge Sources | |
|---|---|
| Domains | GPU_Computing, Optimization, Resource_Management |
| Last Updated | 2026-02-14 16:45 GMT |
## Overview
Configure `gpu_memory_gb` per processing stage to enable fractional GPU sharing; the framework detects total GPU memory via pynvml and falls back to an assumed 24GB if detection fails.
## Description
NeMo Curator uses a `Resources` dataclass to declare per-stage GPU memory requirements. The framework automatically detects actual GPU memory via pynvml and calculates the fractional GPU allocation (e.g., a stage needing 10GB on a 40GB A100 gets `gpus=0.25`). This enables multiple stages to share a single GPU. If GPU detection fails, it defaults to 24GB, which is conservative for consumer GPUs but may cause under-utilization on data center GPUs (A100 80GB, H100 80GB).
## Usage
Apply when configuring pipeline stages that require GPU resources. Set `gpu_memory_gb` for single-GPU stages or `gpus` for multi-GPU stages. Never set both — this raises a `ValueError`. For video processing stages, typical values are 10GB (TransNetV2 scene detection) and 20GB (Cosmos-Embed1 embeddings).
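As a sketch of how these declarations look in practice (the `Resources` class below is a minimal stand-in written for illustration, not the actual class from `nemo_curator/stages/resources.py`; field names mirror the ones the document describes):

```python
from dataclasses import dataclass


@dataclass
class Resources:
    """Minimal stand-in for NeMo Curator's Resources dataclass."""
    cpus: float = 1.0
    gpus: float = 0.0
    gpu_memory_gb: float = 0.0

    def __post_init__(self) -> None:
        # Setting both is ambiguous, so the framework rejects it.
        if self.gpu_memory_gb > 0 and self.gpus > 0:
            raise ValueError("Set either gpu_memory_gb or gpus, not both.")


# Typical per-stage declarations for a video pipeline:
scene_detection = Resources(gpu_memory_gb=10.0)   # TransNetV2
embeddings = Resources(gpu_memory_gb=20.0)        # Cosmos-Embed1

# Setting both raises:
try:
    Resources(gpus=0.5, gpu_memory_gb=10.0)
except ValueError as e:
    print(e)
```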
## The Insight (Rule of Thumb)
- Action: Set `gpu_memory_gb` on stage `Resources` to the actual memory needed by that stage's model/computation.
- Values:
  - TransNetV2 (scene detection): ~10GB
  - Cosmos-Embed1 (video embeddings): ~20GB
  - CLIP (image embeddings): 0.25 GPUs (fractional)
  - Image filters (aesthetic/NSFW): 0.25 GPUs (fractional)
- Trade-off: Over-allocating wastes GPU resources. Under-allocating causes OOM.
- Fallback: If pynvml is unavailable, the framework assumes 24GB total GPU memory. On an 80GB GPU, this means stages will be allocated more GPU fraction than needed, limiting concurrency.
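The cost of the fallback can be checked with simple arithmetic. The helper below is a hypothetical mirror of the rounding described in the Reasoning section, not the library's code:

```python
import math


def fractional_gpus(gpu_memory_gb: float, detected_total_gb: float) -> float:
    """Fractional GPU allocation, mirroring Resources.__post_init__ rounding."""
    return round(gpu_memory_gb / detected_total_gb, 1)


# pynvml unavailable: the framework assumes 24GB total, so a 10GB stage
# is allocated 0.4 of a GPU.
fallback = fractional_gpus(10, 24)
print(fallback)                   # 0.4

# On a real 80GB GPU that fraction admits only floor(1 / 0.4) = 2
# concurrent 10GB stages, even though 8 would fit in memory.
print(math.floor(1 / fallback))   # 2
```

This is the under-utilization the fallback causes on data center GPUs: the fraction is computed against the assumed 24GB, not the actual 80GB.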
## Reasoning
The resource calculation logic in `Resources.__post_init__()` divides the requested `gpu_memory_gb` by the detected total GPU memory to produce a fractional GPU allocation:
```python
# From nemo_curator/stages/resources.py:18-29, 53-71
def _get_gpu_memory_gb() -> float:
    """Get GPU memory in GB for the current device."""
    try:
        import pynvml

        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return float(info.total) / (1024**3)
    except Exception:
        return 24.0  # Fallback to 24GB if detection fails


# In Resources.__post_init__:
if self.gpu_memory_gb > 0:
    gpu_memory_per_device = _get_gpu_memory_gb()
    required_gpus = self.gpu_memory_gb / gpu_memory_per_device
    self.gpus = round(required_gpus, 1)
    if self.gpus > 1:
        raise ValueError("gpu_memory_gb is too large for a single GPU.")
```
The concurrency calculation in the Ray backend then uses this to determine how many actors can run concurrently:
```python
# From nemo_curator/backends/experimental/ray_data/utils.py:30-53
max_cpu_actors = available_cpus / stage.resources.cpus
max_gpu_actors = available_gpus / stage.resources.gpus
return min(max_cpu_actors, max_gpu_actors)  # bottleneck constraint
```
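The bottleneck calculation can be exercised as a standalone sketch (the function signature and the flooring to a whole actor count are illustrative assumptions; the real logic lives in the Ray backend shown above):

```python
import math


def max_concurrent_actors(available_cpus: float, available_gpus: float,
                          stage_cpus: float, stage_gpus: float) -> int:
    """Whichever resource runs out first caps the actor count."""
    max_cpu_actors = available_cpus / stage_cpus
    max_gpu_actors = (available_gpus / stage_gpus
                      if stage_gpus > 0 else float("inf"))
    return math.floor(min(max_cpu_actors, max_gpu_actors))


# 64 CPUs and 4 GPUs; a stage needing 1 CPU and 0.5 GPU
# (e.g. 20GB on a 40GB A100) is GPU-bound at 8 concurrent actors.
print(max_concurrent_actors(64, 4, 1.0, 0.5))  # 8
```

Note how over-allocated `gpus` fractions (for example from the 24GB fallback) feed directly into this formula and shrink the actor count.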