Implementation:NVIDIA DALI HW Decoder Benchmark

Knowledge Sources	NVIDIA_DALI
Domains	Benchmarking, Image_Decoding
Last Updated	2026-02-08 16:00 GMT

Overview

Benchmarking tool that measures hardware-accelerated image decoding throughput across multiple DALI pipeline configurations including ResNet-50, EfficientNet, and Vision Transformer workloads.

Description

This script (internal_tools/hw_decoder_bench.py) benchmarks the performance of NVIDIA DALI's hardware decoder. It defines six pipeline configurations that represent common deep learning data loading patterns: a bare decoder pipeline; a ResNet-50 (RN50) training pipeline with random crop and normalization; an NDD (NVIDIA DALI Dynamic) variant of RN50; EfficientNet inference and training pipelines with automatic augmentation support; and a Vision Transformer (ViT) pipeline with webdataset reading and color jitter augmentation.

The benchmark supports parameterized sweeps over CPU thread counts and HW decoder load fractions (both specified as single values or ranges in "start:end:step" format), enabling systematic exploration of the performance parameter space. For each configuration, it runs warmup iterations followed by timed iterations, collecting per-iteration timing statistics including mean, median, standard deviation, minimum, and maximum execution times. Multi-GPU support is built in, allowing benchmarks to span multiple devices starting from a specified device ID.
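
The "start:end:step" format can be illustrated with a small parser. This is a minimal sketch, not the script's actual parse_range_arg: the name parse_range_sketch is hypothetical, and treating the end point as inclusive is an assumption that the real implementation may not share.

```python
def parse_range_sketch(arg_str, parse_fn=int):
    """Parse 'value' or 'start:end:step' into a list of values.

    Assumption: the end point is inclusive. The real script may differ.
    """
    parts = arg_str.split(":")
    if len(parts) == 1:
        # A single value yields a one-element sweep.
        return [parse_fn(parts[0])]
    if len(parts) != 3:
        raise ValueError(f"expected 'value' or 'start:end:step', got {arg_str!r}")
    start, end, step = (parse_fn(p) for p in parts)
    if step <= 0 or end < start:
        raise ValueError("step must be positive and end must not precede start")
    # Index-based stepping avoids accumulating floating-point error.
    count = int((end - start) / step + 1e-9) + 1
    return [start + i * step for i in range(count)]


thread_counts = parse_range_sketch("1:8:1")             # [1, 2, 3, 4, 5, 6, 7, 8]
hw_loads = parse_range_sketch("0.0:1.0:0.25", float)    # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Under the inclusive-end assumption, the two ranges above would give 8 x 5 = 40 configurations to benchmark.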

Key features include support for both the standard and experimental DALI decoders, configurable batch and minibatch sizes, preallocated decoder hints for width and height, feed-input mode for the EfficientNet inference pipeline (which uses external source with padded input tensors), and automatic selection of the best throughput configuration from a parameter sweep.
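
Selecting the best configuration from a sweep amounts to measuring every (thread count, HW load) pair and keeping the one with the highest throughput. The sketch below shows one way to do that. The names sweep_sketch and measure are hypothetical, and measure stands in for building and timing a real pipeline, which the script does internally.

```python
from itertools import product


def sweep_sketch(thread_counts, hw_loads, measure):
    """Run measure(num_threads, hw_load) for every pair and pick the fastest.

    measure is a hypothetical callable returning throughput in images/sec.
    Returns (best_key, results), where best_key is (num_threads, hw_load).
    """
    results = {}
    for num_threads, hw_load in product(thread_counts, hw_loads):
        results[(num_threads, hw_load)] = measure(num_threads, hw_load)
    if not results:
        raise ValueError("empty sweep: no configurations to compare")
    best_key = max(results, key=results.get)
    return best_key, results


# Stand-in for a real benchmark run: a made-up throughput model.
def fake_measure(num_threads, hw_load):
    return num_threads * 100 - abs(hw_load - 0.75) * 50


best, all_results = sweep_sketch([1, 2, 4], [0.5, 0.75, 1.0], fake_measure)
print(f"best: threads={best[0]}, hw_load={best[1]}, {all_results[best]:.1f} img/s")
```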

Usage

Use this script to measure DALI image decoding performance on specific GPU hardware, to tune the HW decoder load parameter, or to compare pipeline configurations. It requires a directory of images (or an explicit image list for inference mode) and a CUDA-capable GPU.

Code Reference

Source Location

Repository: NVIDIA_DALI
File: internal_tools/hw_decoder_bench.py
Lines: 1-655

Signature

# Pipeline definitions
@pipeline_def(...)
def DecoderPipeline(decoders_module=fn.decoders, hw_load=0): ...

@pipeline_def(...)
def RN50Pipeline(minibatch_size, decoders_module=fn.decoders, hw_load=0): ...

class NDDRN50Pipeline:
    def __init__(self, minibatch_size, batch_size, device_id, num_threads, ...): ...
    def build(self): ...
    def share_outputs(self): ...

@pipeline_def(...)
def EfficientnetTrainingPipeline(minibatch_size, automatic_augmentation, ...): ...

@pipeline_def(...)
def EfficientnetInferencePipeline(decoders_module=fn.decoders, hw_load=0): ...

@pipeline_def(...)
def vit_pipeline(is_training, image_shape, num_classes, ...): ...

# Utility functions
def parse_range_arg(arg_str, parse_fn=int): ...
def feed_input(dali_pipeline, data): ...
def create_input_tensor(batch_size, file_list): ...

Import

# Run as a standalone script
# python internal_tools/hw_decoder_bench.py [options]

# Key imports used internally
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali.pipeline import pipeline_def
import nvidia.dali.experimental.dynamic as ndd
from nvidia.dali.auto_aug import auto_augment, trivial_augment

I/O Contract

Inputs

Name	Type	Required	Description
-i / images_dir	str	Yes*	Directory containing input images for benchmarking
--image_list	list of str	Yes*	Explicit list of image file paths (mutually exclusive with -i)
-b	int	No	Batch size (default: 1)
-p	str	No	Pipeline type: decoder, rn50, ndd_rn50, efficientnet_inference, vit, efficientnet_training (default: decoder)
-g	str	No	Device to use: gpu or cpu (default: gpu)
-d	int	No	Starting device ID (default: 0)
-n	int	No	Number of GPUs to use (default: 1)
-j	str	No	CPU thread count, single value or range start:end:step (default: 4)
--hw_load	str	No	HW decoder load fraction, single value or range start:end:step (default: 0.75)
-t	int	No	Total number of images to process (default: 100)
-w	int	No	Number of warmup iterations (default: 0)
--experimental_decoder	flag	No	Use experimental decoder instead of default
* Provide exactly one of -i or --image_list; the two options are mutually exclusive.

Outputs

Name	Type	Description
Throughput report	stdout	Per-configuration timing statistics: total time, throughput (frames/sec), mean/median/stddev/min/max iteration times
Best configuration	stdout	The thread count and HW load combination that achieved highest throughput
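
The statistics in the throughput report can be derived from a list of per-iteration wall times. The following is a minimal sketch under stated assumptions: time_iterations and summarize are hypothetical helpers, run_once stands in for one pipeline iteration, and throughput is taken to be images processed divided by total timed seconds, which may not match the script's exact definition.

```python
import statistics
import time


def time_iterations(run_once, warmup, iterations):
    """Run untimed warmup iterations, then return wall times of timed ones."""
    for _ in range(warmup):
        run_once()
    times = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        run_once()
        times.append(time.perf_counter() - t0)
    return times


def summarize(iter_times, batch_size):
    """Summarize per-iteration times (seconds) for one configuration."""
    if not iter_times:
        raise ValueError("no timed iterations to summarize")
    total = sum(iter_times)
    return {
        "total_s": total,
        # Assumed definition: every iteration processes one full batch.
        "fps": len(iter_times) * batch_size / total,
        "mean_s": statistics.mean(iter_times),
        "median_s": statistics.median(iter_times),
        # Sample standard deviation needs at least two data points.
        "stddev_s": statistics.stdev(iter_times) if len(iter_times) > 1 else 0.0,
        "min_s": min(iter_times),
        "max_s": max(iter_times),
    }


# Example with a dummy workload in place of a pipeline run.
times = time_iterations(lambda: sum(range(1000)), warmup=2, iterations=5)
report = summarize(times, batch_size=64)
```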

Usage Examples

Basic decoder benchmark

# Benchmark basic image decoding on GPU 0
python internal_tools/hw_decoder_bench.py -i /data/imagenet/val -b 64 -t 1000 -w 10 -p decoder

RN50 pipeline sweep over thread counts and HW load

python internal_tools/hw_decoder_bench.py \
    -i /data/imagenet/train \
    -b 128 \
    -p rn50 \
    -j 1:8:1 \
    --hw_load 0.0:1.0:0.25 \
    -t 5000 \
    -w 50 \
    -n 4

EfficientNet training with TrivialAugment

python internal_tools/hw_decoder_bench.py \
    -i /data/imagenet/train \
    -b 256 \
    -p efficientnet_training \
    --aug-strategy trivialaugment \
    --hw_load 0.75 \
    -t 2000

Related Pages

Environment:NVIDIA_DALI_CUDA_GPU_Environment
