# Heuristic: NVIDIA DALI NVJPEG Memory Preallocation
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Computer_Vision, GPU_Computing |
| Last Updated | 2026-02-08 16:00 GMT |
## Overview
Memory preallocation technique for NVJPEG GPU image decoding that prevents runtime reallocations by pre-sizing buffers to the maximum image dimensions in the dataset.
## Description
When using DALI's `fn.decoders.image_random_crop` or `fn.decoders.image` with `device="mixed"` (GPU-accelerated JPEG decoding via NVJPEG), the decoder must allocate GPU memory for each decoded image. Without preallocation hints, DALI reallocates GPU memory whenever it encounters an image larger than the current buffer, causing performance degradation from repeated `cudaMalloc`/`cudaFree` cycles. By setting `preallocate_width_hint` and `preallocate_height_hint` to the maximum image dimensions in the dataset, the decoder preallocates buffers once upfront.
## Usage
Use this heuristic when you are using GPU-accelerated JPEG decoding (`device="mixed"`) and observe throughput instability or GPU memory allocation overhead. The canonical use case is ImageNet training, where the dataset's maximum dimensions are known: a maximum width of 5980 and a maximum height of 6430 pixels. For CPU decoding (`device="cpu"`), the hints have no effect, so set both to 0.
## The Insight (Rule of Thumb)
- Action: Set `preallocate_width_hint` and `preallocate_height_hint` in `fn.decoders.image_random_crop` or `fn.decoders.image` when using `device="mixed"`.
- Value: For ImageNet: `preallocate_width_hint=5980`, `preallocate_height_hint=6430`. For other datasets, set to the maximum width and height across all images.
- Trade-off: Slightly higher initial GPU memory usage (buffers are preallocated for the largest image) in exchange for stable throughput without runtime reallocation stalls.
- Complementary: Combine with `num_attempts=100` for random crop to ensure valid crops are found within reasonable time.
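The rules above can be sketched as a minimal training pipeline. This is a hedged illustration, not the DALI examples' exact code: `build_pipeline`, `decoder_hints`, and `data_dir` are hypothetical names, and the resize step is added only to make the sketch complete. The hint values and decoder arguments follow the Code Evidence section below.

```python
# Sketch: DALI pipeline with NVJPEG preallocation hints (assumed setup).
# ImageNet maxima (5980 wide, 6430 tall) come from the DALI examples;
# substitute your own dataset's maxima otherwise.

def decoder_hints(decoder_device):
    """Return (width_hint, height_hint) for fn.decoders.image_random_crop.

    GPU ("mixed") decoding benefits from preallocating for the largest
    image; CPU decoding ignores the hints, so pass 0.
    """
    if decoder_device == "mixed":
        return 5980, 6430  # max width / max height across ImageNet train
    return 0, 0


def build_pipeline(data_dir, batch_size=256, decoder_device="mixed"):
    # Imported lazily so decoder_hints() is usable without DALI installed.
    from nvidia.dali import fn, types
    from nvidia.dali.pipeline import pipeline_def

    width_hint, height_hint = decoder_hints(decoder_device)

    @pipeline_def(batch_size=batch_size, num_threads=4, device_id=0)
    def pipe():
        jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True)
        images = fn.decoders.image_random_crop(
            jpegs,
            device=decoder_device,
            output_type=types.RGB,
            preallocate_width_hint=width_hint,
            preallocate_height_hint=height_hint,
            random_aspect_ratio=[0.8, 1.25],
            random_area=[0.1, 1.0],
            num_attempts=100,
        )
        # Decoded images land on GPU when device="mixed".
        images = fn.resize(images, resize_x=224, resize_y=224)
        return images, labels

    return pipe()
```

Because the hints are plain keyword arguments, switching between CPU and GPU decoding is a one-variable change: the `0, 0` fallback keeps the same call site valid for `device="cpu"`.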
## Reasoning
NVJPEG hardware JPEG decoding on GPU requires contiguous device memory for the decoded output. If the decoder encounters an image larger than the currently allocated buffer, it must reallocate, which involves synchronous `cudaMalloc`/`cudaFree` calls that stall the GPU pipeline. In ImageNet (1.2M images), image dimensions vary from a few hundred pixels up to a maximum width of 5980 and a maximum height of 6430. Without preallocation, the first few thousand images may trigger hundreds of reallocations before the buffer stabilizes at the maximum size. With preallocation hints, the buffer is allocated once at the correct size during pipeline build.
The values `5980` and `6430` come from empirical analysis of the ImageNet dataset, representing the maximum width and height of any image in the training set. These exact values appear consistently across all DALI example implementations (PyTorch, PaddlePaddle, TensorFlow).
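For a custom dataset, the equivalent maxima can be derived with a one-time scan. The helper below is a hypothetical sketch (the function name and Pillow dependency are assumptions, not part of DALI); note that the maximum width and maximum height need not come from the same image.

```python
# Hypothetical helper: derive preallocation hints for a custom dataset
# by scanning every image once and recording the largest width and height.

from pathlib import Path

from PIL import Image  # Pillow; assumed available


def max_dimensions(image_root, extensions=(".jpg", ".jpeg", ".png")):
    """Return (max_width, max_height) across all images under image_root."""
    max_w = max_h = 0
    for path in Path(image_root).rglob("*"):
        if path.suffix.lower() in extensions:
            with Image.open(path) as img:
                w, h = img.size  # reads the header, not the full pixel data
                max_w = max(max_w, w)
                max_h = max(max_h, h)
    return max_w, max_h
```

The resulting pair maps directly to `preallocate_width_hint` and `preallocate_height_hint`. Since `Image.open` only parses headers until pixel data is requested, the scan is cheap even for large datasets.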
## Code Evidence
From `docs/examples/use_cases/pytorch/resnet50/main.py:120-132`:
```python
preallocate_width_hint = 5980 if decoder_device == "mixed" else 0
preallocate_height_hint = 6430 if decoder_device == "mixed" else 0
images = fn.decoders.image_random_crop(
    images,
    device=decoder_device,
    output_type=types.RGB,
    preallocate_width_hint=preallocate_width_hint,
    preallocate_height_hint=preallocate_height_hint,
    random_aspect_ratio=[0.8, 1.25],
    random_area=[0.1, 1.0],
    num_attempts=100,
)
```
From `docs/examples/use_cases/paddle/resnet50/dali.py:56-68`:
```python
preallocate_width_hint = 5980 if decoder_device == "mixed" else 0
preallocate_height_hint = 6430 if decoder_device == "mixed" else 0
images = fn.decoders.image_random_crop(
    images,
    device=decoder_device,
    output_type=types.RGB,
    preallocate_width_hint=preallocate_width_hint,
    preallocate_height_hint=preallocate_height_hint,
    random_aspect_ratio=[0.75, 1.25],
    random_area=[0.05, 1.0],
    num_attempts=100)
```