Principle: NVIDIA DALI CUDA Kernel Implementation
| Knowledge Sources | |
|---|---|
| Domains | Custom_Operators, GPU_Computing, CUDA_Programming |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A CUDA kernel implementation in a DALI custom operator involves specializing the RunImpl() method for GPUBackend, iterating over samples in the batch, launching a __global__ CUDA kernel on the workspace's CUDA stream, and writing results to pre-allocated GPU output tensors.
Description
CUDA kernel implementation is the pattern by which a custom DALI operator performs GPU-accelerated computation within the RunImpl() method. The key elements of this pattern are:
- Backend Specialization: The RunImpl() method is explicitly specialized for GPUBackend in a .cu file (e.g., template<> void NaiveHistogram<GPUBackend>::RunImpl(Workspace &ws)). This keeps CUDA code isolated from the header and .cc files.
- Workspace I/O Access: Input and output tensors are obtained from the Workspace via ws.Input<GPUBackend>(index) and ws.Output<GPUBackend>(index). These return TensorList references that provide per-sample access through the subscript operator.
- Per-Sample Kernel Launch: DALI operators process batches, but individual samples within a batch may have different sizes. The RunImpl() method iterates over shape.num_samples() and launches a separate CUDA kernel invocation for each sample, computing grid and block dimensions based on the sample's tensor volume.
- Stream Integration: All kernel launches use ws.stream() as the CUDA stream argument to the <<<grid, block, smem, stream>>> launch syntax. This ensures that the operator's GPU work is properly sequenced within DALI's asynchronous execution model and avoids implicit synchronization.
- Atomic Operations for Aggregation: For reduction-style operations (such as histogram computation), CUDA's atomicAdd() is used to safely accumulate results from multiple threads into common output bins in global memory.
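The elements above can be combined into a condensed sketch of a GPU specialization, using the NaiveHistogram operator mentioned earlier. This is illustrative rather than a verbatim DALI implementation: the header name, the `num_bins_` member, the element types, and the exact accessor signatures (`tensor<T>(i)`, `mutable_tensor<T>(i)`, `volume()`) are assumptions for the sake of the example.

```cuda
#include "naive_histogram.h"  // hypothetical header declaring NaiveHistogram

namespace other_ns {

// One thread per input element; each thread atomically increments its bin.
__global__ void HistogramKernel(const uint8_t *in, int64_t size,
                                unsigned long long *bins, int num_bins) {
  int64_t tid = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
  if (tid < size)  // guard: the last block may have idle threads
    atomicAdd(&bins[in[tid] % num_bins], 1ULL);
}

template <>
void NaiveHistogram<GPUBackend>::RunImpl(Workspace &ws) {
  const auto &input = ws.Input<GPUBackend>(0);   // batch of input tensors
  auto &output = ws.Output<GPUBackend>(0);       // pre-allocated GPU outputs
  const auto &shape = input.shape();

  // Samples may differ in size, so each gets its own launch configuration.
  for (int i = 0; i < shape.num_samples(); i++) {
    int64_t size = volume(shape[i]);              // flattened element count
    const int block = 256;
    const int grid = (size + block - 1) / block;  // ceil-divide
    HistogramKernel<<<grid, block, 0, ws.stream()>>>(
        input.tensor<uint8_t>(i), size,
        output.mutable_tensor<unsigned long long>(i), num_bins_);
  }
}

}  // namespace other_ns
```

Note that the launch passes `ws.stream()` as the fourth configuration argument, so the work is ordered on DALI's stream rather than the default stream.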
Usage
Use this pattern when implementing the GPU path of any custom DALI operator that requires CUDA kernel execution. This is the standard approach for operations such as image transformations, signal processing, statistical computations, and any data-parallel task that benefits from GPU acceleration.
Theoretical Basis
The DALI CUDA kernel pattern builds on the SIMT (Single Instruction, Multiple Threads) execution model of NVIDIA GPUs. Each kernel launch distributes work across a grid of thread blocks, where each thread processes one or more data elements. The canonical thread indexing formula tid = blockIdx.x * blockDim.x + threadIdx.x maps each thread to a unique element in the flattened input tensor.
The per-sample iteration pattern (rather than launching a single kernel over the entire batch) is necessary because DALI batches may contain variable-size samples. Each sample's data pointer and size are different, requiring separate kernel configurations. While this introduces multiple kernel launches per batch, CUDA's stream-based asynchronous execution ensures these launches are pipelined efficiently.
The use of ws.stream() implements stream-ordered execution, a fundamental CUDA programming principle where operations on the same stream execute in order, while operations on different streams can overlap. DALI's executor manages stream assignment to maximize pipeline throughput across prefetch, compute, and output stages.
Atomic operations (atomicAdd) provide thread-safe accumulation at the cost of serialization when multiple threads target the same memory address. For histogram-like workloads, this is acceptable because the number of bin collisions is typically low relative to the input size, and the alternative (per-thread local histograms with a reduction step) adds complexity without significant benefit for moderate bin counts.