Implementation: NVIDIA DALI RunImpl CUDA Kernel
| Knowledge Sources | |
|---|---|
| Domains | Custom_Operators, GPU_Computing, CUDA_Programming |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
The GPU-specialized RunImpl() method and the naive_histogram_kernel CUDA kernel from the NaiveHistogram example operator in NVIDIA DALI, demonstrating per-sample kernel launches, CUDA stream integration via the workspace, and atomic histogram accumulation.
Description
The NaiveHistogram<GPUBackend>::RunImpl() method (defined in naive_histogram.cu) is the explicit specialization of RunImpl() for the GPU backend. It:
- Retrieves the input batch via ws.Input<GPUBackend>(0), which returns a const TensorList<GPUBackend> &.
- Retrieves the output batch via ws.Output<GPUBackend>(0), which returns a mutable TensorList<GPUBackend> & with buffers pre-allocated by SetupImpl().
- Iterates over each sample in the batch using shape.num_samples().
- For each sample, computes the 1-D grid and block dimensions based on volume(input.tensor_shape(sample_idx)).
- Launches the naive_histogram_kernel on ws.stream(), passing per-sample data pointers obtained via input[sample_idx].data<uint8_t>() and output[sample_idx].mutable_data<int32_t>().
The naive_histogram_kernel is a __global__ CUDA function that:
- Computes a global thread ID from blockIdx.x * blockDim.x + threadIdx.x.
- Performs a bounds check against input_size.
- Reads the pixel value and computes bin = value % n_bins.
- Atomically increments histogram[bin] using atomicAdd.
Usage
This pattern is used whenever a custom DALI operator needs to execute a CUDA kernel. The .cu file contains both the kernel definition and the RunImpl() specialization for GPUBackend.
Code Reference
Source Location
- Repository: NVIDIA DALI
- File:
docs/examples/custom_operations/custom_operator/naive_histogram/naive_histogram.cu (lines 30-58)
Signature
// CUDA kernel
__global__ void naive_histogram_kernel(
    const uint8_t *input, const int input_size, const int n_bins,
    int32_t *histogram);

// RunImpl specialization for GPUBackend
template <>
void NaiveHistogram<GPUBackend>::RunImpl(Workspace &ws);
Import
#include "naive_histogram.h"
// CUDA runtime is implicitly available in .cu files
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| ws | dali::Workspace & | Yes | Workspace providing input tensors, output tensors, and the CUDA stream. |
| ws.Input<GPUBackend>(0) | const TensorList<GPUBackend> & | Yes | Batch of single-channel uint8 images residing on the GPU. Each sample may have a different spatial size. |
| input (kernel) | const uint8_t * | Yes | Pointer to the raw pixel data for a single sample. |
| input_size (kernel) | int | Yes | Total number of pixels in the sample (product of all spatial dimensions). |
| n_bins (kernel) | int | Yes | Number of histogram bins, passed from the operator's n_histogram_bins_ member. |
Outputs
| Name | Type | Description |
|---|---|---|
| ws.Output<GPUBackend>(0) | TensorList<GPUBackend> & | Batch of 1-D int32 histogram vectors, shape [n_bins] per sample. Pre-allocated by SetupImpl(). |
| histogram (kernel) | int32_t * | Pointer to the output histogram array for a single sample; bins are incremented via atomicAdd. |
Usage Examples
Example: Full RunImpl with per-sample kernel launch
template <>
void NaiveHistogram<GPUBackend>::RunImpl(Workspace &ws) {
  const auto &input = ws.Input<GPUBackend>(0);
  const auto &shape = input.shape();
  auto &output = ws.Output<GPUBackend>(0);
  for (int sample_idx = 0; sample_idx < shape.num_samples(); sample_idx++) {
    dim3 block_size(32);
    auto input_size = volume(input.tensor_shape(sample_idx));
    dim3 grid_size((input_size + block_size.x - 1) / block_size.x);
    naive_histogram_kernel<<<grid_size, block_size, 0, ws.stream()>>>(
        input[sample_idx].data<uint8_t>(),
        input_size,
        n_histogram_bins_,
        output[sample_idx].mutable_data<int32_t>());
  }
}
Example: CUDA kernel for naive histogram
__global__ void naive_histogram_kernel(
    const uint8_t *input, const int input_size, const int n_bins,
    int32_t *histogram) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= input_size) return;
  auto value = input[tid];
  int bin = value % n_bins;
  atomicAdd(&histogram[bin], 1);
}