
Implementation:NVIDIA DALI HW Decoder Benchmark

From Leeroopedia


Knowledge Sources
Domains Benchmarking, Image_Decoding
Last Updated 2026-02-08 16:00 GMT

Overview

Benchmarking tool that measures hardware-accelerated image decoding throughput across multiple DALI pipeline configurations including ResNet-50, EfficientNet, and Vision Transformer workloads.

Description

This script (internal_tools/hw_decoder_bench.py) benchmarks the performance of NVIDIA DALI's hardware image decoder. It defines six pipeline configurations that represent common deep learning data loading patterns: a bare decoder pipeline; a ResNet-50 (RN50) training pipeline with random crop and normalization; an NDD (NVIDIA DALI Dynamic) variant of RN50; EfficientNet inference and training pipelines with automatic augmentation support; and a Vision Transformer (ViT) pipeline with webdataset reading and color jitter augmentation.

The benchmark supports parameterized sweeps over CPU thread counts and HW decoder load fractions (both specified as single values or ranges in "start:end:step" format), enabling systematic exploration of the performance parameter space. For each configuration, it runs warmup iterations followed by timed iterations, collecting per-iteration timing statistics including mean, median, standard deviation, minimum, and maximum execution times. Multi-GPU support is built in, allowing benchmarks to span multiple devices starting from a specified device ID.
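The "start:end:step" expansion described above can be sketched as follows. This is a minimal illustration, not the script's own parse_range_arg: the name parse_range_sketch is hypothetical, and treating the end point as inclusive is an assumption that the real implementation may not share.

```python
from decimal import Decimal


def parse_range_sketch(arg_str, parse_fn=int):
    """Expand "value" or "start:end:step" into a list of parsed values.

    Assumption: the end point is inclusive; the real script may differ.
    """
    parts = arg_str.split(":")
    if len(parts) == 1:
        return [parse_fn(parts[0])]
    if len(parts) != 3:
        raise ValueError(f"expected 'value' or 'start:end:step', got {arg_str!r}")
    # Decimal arithmetic avoids drift when stepping through fractions like 0.25.
    start, end, step = (Decimal(p) for p in parts)
    if step <= 0:
        raise ValueError("step must be positive")
    values = []
    current = start
    while current <= end:
        values.append(parse_fn(str(current)))
        current += step
    return values


# Sweep grid for thread counts and HW decoder load fractions.
thread_counts = parse_range_sketch("1:8:1")
hw_loads = parse_range_sketch("0.0:1.0:0.25", float)
sweep = [(j, load) for j in thread_counts for load in hw_loads]
```

Under the inclusive-end assumption, a sweep such as -j 1:8:1 with --hw_load 0.0:1.0:0.25 would cover 8 thread counts and 5 load fractions, or 40 configurations in total.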

Key features include support for both the standard and experimental DALI decoders, configurable batch and minibatch sizes, preallocated decoder hints for width and height, feed-input mode for the EfficientNet inference pipeline (which uses external source with padded input tensors), and automatic selection of the best throughput configuration from a parameter sweep.
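One way variable-length encoded images could be padded to a common length before being fed to an external source is sketched below. This is an illustration only: pad_encoded_buffers is a hypothetical helper, zero-padding to the longest buffer is an assumption, and the script's own create_input_tensor may work differently (for example, by building a NumPy array).

```python
def pad_encoded_buffers(buffers):
    """Zero-pad encoded image buffers to a common length.

    Returns (padded, lengths), where lengths records each original size so
    the padding can be ignored later. Hypothetical helper for illustration.
    """
    if not buffers:
        raise ValueError("at least one buffer is required")
    lengths = [len(b) for b in buffers]
    max_len = max(lengths)
    # Append zero bytes so every buffer has the same length.
    padded = [bytes(b) + b"\x00" * (max_len - len(b)) for b in buffers]
    return padded, lengths


# Stand-ins for encoded JPEG files of different sizes.
fake_files = [b"\xff\xd8abc", b"\xff\xd8abcdefg", b"\xff\xd8"]
padded, lengths = pad_encoded_buffers(fake_files)
```

Uniform buffer lengths let a whole batch be stored as one rectangular tensor, which is the usual reason for padding encoded inputs.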

Usage

Use this script when measuring DALI image decoding performance on specific GPU hardware, tuning the HW decoder load parameter, or comparing pipeline configurations. It requires a directory of images (or explicit image list for inference mode) and a CUDA-capable GPU.

Code Reference

Source Location

Signature

# Pipeline definitions
@pipeline_def(...)
def DecoderPipeline(decoders_module=fn.decoders, hw_load=0): ...

@pipeline_def(...)
def RN50Pipeline(minibatch_size, decoders_module=fn.decoders, hw_load=0): ...

class NDDRN50Pipeline:
    def __init__(self, minibatch_size, batch_size, device_id, num_threads, ...): ...
    def build(self): ...
    def share_outputs(self): ...

@pipeline_def(...)
def EfficientnetTrainingPipeline(minibatch_size, automatic_augmentation, ...): ...

@pipeline_def(...)
def EfficientnetInferencePipeline(decoders_module=fn.decoders, hw_load=0): ...

@pipeline_def(...)
def vit_pipeline(is_training, image_shape, num_classes, ...): ...

# Utility functions
def parse_range_arg(arg_str, parse_fn=int): ...
def feed_input(dali_pipeline, data): ...
def create_input_tensor(batch_size, file_list): ...

Import

# Run as a standalone script
# python internal_tools/hw_decoder_bench.py [options]

# Key imports used internally
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali.pipeline import pipeline_def
import nvidia.dali.experimental.dynamic as ndd
from nvidia.dali.auto_aug import auto_augment, trivial_augment

I/O Contract

Inputs

Name | Type | Required | Description
-i / images_dir | str | Yes* | Directory containing input images for benchmarking
--image_list | list of str | Yes* | Explicit list of image file paths (mutually exclusive with -i)
-b | int | No | Batch size (default: 1)
-p | str | No | Pipeline type: decoder, rn50, ndd_rn50, efficientnet_inference, vit, efficientnet_training (default: decoder)
-g | str | No | Device to use: gpu or cpu (default: gpu)
-d | int | No | Starting device ID (default: 0)
-n | int | No | Number of GPUs to use (default: 1)
-j | str | No | CPU thread count, single value or range start:end:step (default: 4)
--hw_load | str | No | HW decoder load fraction, single value or range start:end:step (default: 0.75)
-t | int | No | Total number of images to process (default: 100)
-w | int | No | Number of warmup iterations (default: 0)
--experimental_decoder | flag | No | Use experimental decoder instead of default

* -i and --image_list are mutually exclusive; provide one of them.

Outputs

Name | Type | Description
Throughput report | stdout | Per-configuration timing statistics: total time, throughput (frames/sec), mean/median/stddev/min/max iteration times
Best configuration | stdout | The thread count and HW load combination that achieved highest throughput
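The reported statistics and the best-configuration selection can be approximated with the standard library. The sketch below is not the script's code: time_iterations, summarize, and best_config are hypothetical names, and computing throughput as timed images divided by the sum of timed iteration durations is an assumption.

```python
import statistics
import time


def time_iterations(run_iteration, warmup_iters, timed_iters):
    """Run untimed warmup iterations, then return per-iteration times in seconds."""
    for _ in range(warmup_iters):
        run_iteration()
    times = []
    for _ in range(timed_iters):
        start = time.perf_counter()
        run_iteration()
        times.append(time.perf_counter() - start)
    return times


def summarize(times, batch_size):
    """Reduce per-iteration times to the statistics listed in the report."""
    total = sum(times)
    return {
        "total_time": total,
        # Assumed definition: images processed per second of timed work.
        "throughput": len(times) * batch_size / total,
        "mean": statistics.mean(times),
        "median": statistics.median(times),
        "stddev": statistics.stdev(times) if len(times) > 1 else 0.0,
        "min": min(times),
        "max": max(times),
    }


def best_config(results):
    """Pick the (num_threads, hw_load) key with the highest throughput."""
    return max(results, key=lambda cfg: results[cfg]["throughput"])
```

Here results is assumed to map each (num_threads, hw_load) pair from the sweep to its summarize output, so best_config returns the pair to report.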

Usage Examples

Basic decoder benchmark

# Benchmark basic image decoding on GPU 0
python internal_tools/hw_decoder_bench.py -i /data/imagenet/val -b 64 -t 1000 -w 10 -p decoder

RN50 pipeline sweep over thread counts and HW load

python internal_tools/hw_decoder_bench.py \
    -i /data/imagenet/train \
    -b 128 \
    -p rn50 \
    -j 1:8:1 \
    --hw_load 0.0:1.0:0.25 \
    -t 5000 \
    -w 50 \
    -n 4

EfficientNet training with TrivialAugment

python internal_tools/hw_decoder_bench.py \
    -i /data/imagenet/train \
    -b 256 \
    -p efficientnet_training \
    --aug-strategy trivialaugment \
    --hw_load 0.75 \
    -t 2000
