Implementation:NVIDIA DALI HW Decoder Benchmark
| Knowledge Sources | |
|---|---|
| Domains | Benchmarking, Image_Decoding |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Benchmarking tool that measures hardware-accelerated image decoding throughput across multiple DALI pipeline configurations including ResNet-50, EfficientNet, and Vision Transformer workloads.
Description
This script (internal_tools/hw_decoder_bench.py) benchmarks NVIDIA DALI's hardware decoder. It defines six pipeline configurations that represent common deep learning data loading patterns: a bare decoder pipeline, a ResNet-50 (RN50) training pipeline with random crop and normalization, an NDD (NVIDIA DALI Dynamic) variant of RN50, EfficientNet inference and training pipelines (the training pipeline supports automatic augmentation), and a Vision Transformer (ViT) pipeline with webdataset reading and color jitter augmentation.
The benchmark supports parameterized sweeps over CPU thread counts and HW decoder load fractions, each given either as a single value or as a range in "start:end:step" format, so the parameter space can be explored systematically. For each configuration, it runs warmup iterations followed by timed iterations and reports per-iteration timing statistics: mean, median, standard deviation, minimum, and maximum. Multi-GPU runs are supported, spanning multiple devices starting from a specified device ID.
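The "start:end:step" expansion and the resulting sweep grid can be sketched as follows. This is an illustrative sketch, not the script's actual `parse_range_arg`: the names `parse_range_sketch` and `sweep_grid` are hypothetical, and including the end point in the range is an assumption that the source does not confirm.

```python
from itertools import product
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T", int, float)


def parse_range_sketch(arg_str: str, parse_fn: Callable[[str], T] = int) -> List[T]:
    """Expand "value" or "start:end:step" into a list of values.

    Assumption: the end point is included when the step lands on it.
    """
    parts = arg_str.split(":")
    if len(parts) == 1:
        # A single value yields a one-element sweep.
        return [parse_fn(parts[0])]
    if len(parts) != 3:
        raise ValueError(f"expected 'value' or 'start:end:step', got {arg_str!r}")
    start, end, step = (parse_fn(p) for p in parts)
    if step <= 0:
        raise ValueError("step must be positive")
    if end < start:
        raise ValueError("end must not be smaller than start")
    # Index-based stepping avoids accumulating floating-point error.
    count = int((end - start) / step + 1e-9) + 1
    return [start + i * step for i in range(count)]


def sweep_grid(threads_arg: str, hw_load_arg: str) -> List[Tuple[int, float]]:
    """Return every (num_threads, hw_load) pair covered by a sweep."""
    threads = parse_range_sketch(threads_arg, int)
    loads = parse_range_sketch(hw_load_arg, float)
    return list(product(threads, loads))
```

Under the inclusive-end assumption, `sweep_grid("1:8:1", "0.0:1.0:0.25")` yields 8 thread counts times 5 load fractions, or 40 configurations, each of which would get its own warmup and timed run.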
Key features include support for both the standard and experimental DALI decoders, configurable batch and minibatch sizes, preallocated decoder hints for width and height, feed-input mode for the EfficientNet inference pipeline (which uses external source with padded input tensors), and automatic selection of the best throughput configuration from a parameter sweep.
Usage
Use this script to measure DALI image decoding performance on specific GPU hardware, to tune the HW decoder load parameter, or to compare pipeline configurations. It requires a directory of images (or an explicit image list for inference mode) and a CUDA-capable GPU.
Code Reference
Source Location
- Repository: NVIDIA_DALI
- File: internal_tools/hw_decoder_bench.py
- Lines: 1-655
Signature
# Pipeline definitions
@pipeline_def(...)
def DecoderPipeline(decoders_module=fn.decoders, hw_load=0): ...
@pipeline_def(...)
def RN50Pipeline(minibatch_size, decoders_module=fn.decoders, hw_load=0): ...
class NDDRN50Pipeline:
    def __init__(self, minibatch_size, batch_size, device_id, num_threads, ...): ...
    def build(self): ...
    def share_outputs(self): ...
@pipeline_def(...)
def EfficientnetTrainingPipeline(minibatch_size, automatic_augmentation, ...): ...
@pipeline_def(...)
def EfficientnetInferencePipeline(decoders_module=fn.decoders, hw_load=0): ...
@pipeline_def(...)
def vit_pipeline(is_training, image_shape, num_classes, ...): ...
# Utility functions
def parse_range_arg(arg_str, parse_fn=int): ...
def feed_input(dali_pipeline, data): ...
def create_input_tensor(batch_size, file_list): ...
Import
# Run as a standalone script
# python internal_tools/hw_decoder_bench.py [options]
# Key imports used internally
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali.pipeline import pipeline_def
import nvidia.dali.experimental.dynamic as ndd
from nvidia.dali.auto_aug import auto_augment, trivial_augment
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| -i / images_dir | str | Yes* | Directory containing input images for benchmarking (*required unless --image_list is given) |
| --image_list | list of str | Yes* | Explicit list of image file paths (*required unless -i is given; mutually exclusive with -i) |
| -b | int | No | Batch size (default: 1) |
| -p | str | No | Pipeline type: decoder, rn50, ndd_rn50, efficientnet_inference, vit, efficientnet_training (default: decoder) |
| -g | str | No | Device to use: gpu or cpu (default: gpu) |
| -d | int | No | Starting device ID (default: 0) |
| -n | int | No | Number of GPUs to use (default: 1) |
| -j | str | No | CPU thread count, single value or range start:end:step (default: 4) |
| --hw_load | str | No | HW decoder load fraction, single value or range start:end:step (default: 0.75) |
| -t | int | No | Total number of images to process (default: 100) |
| -w | int | No | Number of warmup iterations (default: 0) |
| --experimental_decoder | flag | No | Use experimental decoder instead of default |
Outputs
| Name | Type | Description |
|---|---|---|
| Throughput report | stdout | Per-configuration timing statistics: total time, throughput (frames/sec), mean/median/stddev/min/max iteration times |
| Best configuration | stdout | The thread count and HW load combination that achieved highest throughput |
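The reported statistics and the best-configuration selection can be illustrated with the standard library alone. This is a sketch under stated assumptions rather than the script's own code: `summarize` and `best_config` are hypothetical names, throughput is assumed to be total images divided by total timed duration, and the sample standard deviation is assumed (the source does not say which variant is used).

```python
import statistics
from typing import Dict, Sequence, Tuple

Config = Tuple[int, float]  # (num_threads, hw_load); hypothetical key type


def summarize(iter_times: Sequence[float], images_per_iter: int) -> Dict[str, float]:
    """Summarize per-iteration wall-clock times given in seconds."""
    total = sum(iter_times)
    return {
        "total_s": total,
        # Assumption: throughput = images processed / total timed duration.
        "throughput_fps": images_per_iter * len(iter_times) / total,
        "mean_s": statistics.mean(iter_times),
        "median_s": statistics.median(iter_times),
        # Sample standard deviation needs at least two data points.
        "stddev_s": statistics.stdev(iter_times) if len(iter_times) > 1 else 0.0,
        "min_s": min(iter_times),
        "max_s": max(iter_times),
    }


def best_config(results: Dict[Config, Dict[str, float]]) -> Config:
    """Return the configuration with the highest throughput."""
    return max(results, key=lambda cfg: results[cfg]["throughput_fps"])
```

For example, feeding `summarize` the timed iterations of each (thread count, HW load) pair and passing the resulting dictionary to `best_config` returns the pair that a sweep would report as its winner.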
Usage Examples
Basic decoder benchmark
# Benchmark basic image decoding on GPU 0
python internal_tools/hw_decoder_bench.py -i /data/imagenet/val -b 64 -t 1000 -w 10 -p decoder
RN50 pipeline sweep over thread counts and HW load
python internal_tools/hw_decoder_bench.py \
-i /data/imagenet/train \
-b 128 \
-p rn50 \
-j 1:8:1 \
--hw_load 0.0:1.0:0.25 \
-t 5000 \
-w 50 \
-n 4
EfficientNet training with TrivialAugment
python internal_tools/hw_decoder_bench.py \
-i /data/imagenet/train \
-b 256 \
-p efficientnet_training \
--aug-strategy trivialaugment \
--hw_load 0.75 \
-t 2000