Implementation: NVIDIA DALI RunImpl CUDA Kernel
| Knowledge Sources | |
|---|---|
| Domains | Custom_Operators, GPU_Computing, CUDA_Programming |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
The GPU-specialized RunImpl() method and the naive_histogram_kernel CUDA kernel from the NaiveHistogram example operator in NVIDIA DALI, demonstrating per-sample kernel launches, CUDA stream integration via the workspace, and atomic histogram accumulation.
Description
The NaiveHistogram<GPUBackend>::RunImpl() method (defined in naive_histogram.cu) is the explicit specialization of RunImpl() for the GPU backend. It:
- Retrieves the input batch via ws.Input<GPUBackend>(0), which returns a const TensorList<GPUBackend> &.
- Retrieves the output batch via ws.Output<GPUBackend>(0), which returns a mutable TensorList<GPUBackend> & with buffers pre-allocated by SetupImpl().
- Iterates over each sample in the batch using shape.num_samples().
- For each sample, computes the 1-D grid and block dimensions based on volume(input.tensor_shape(sample_idx)).
- Launches the naive_histogram_kernel on ws.stream(), passing per-sample data pointers obtained via input[sample_idx].data<uint8_t>() and output[sample_idx].mutable_data<int32_t>().
The naive_histogram_kernel is a __global__ CUDA function that:
- Computes a global thread ID from blockIdx.x * blockDim.x + threadIdx.x.
- Performs a bounds check against input_size.
- Reads the pixel value and computes bin = value % n_bins.
- Atomically increments histogram[bin] using atomicAdd.
Usage
This pattern is used whenever a custom DALI operator needs to execute a CUDA kernel. The .cu file contains both the kernel definition and the RunImpl() specialization for GPUBackend.
Code Reference
Source Location
- Repository: NVIDIA DALI
- File:
docs/examples/custom_operations/custom_operator/naive_histogram/naive_histogram.cu (lines 30-58)
Signature
// CUDA kernel
__global__ void naive_histogram_kernel(
    const uint8_t *input, const int input_size, const int n_bins,
    int32_t *histogram);

// RunImpl specialization for GPUBackend
template <>
void NaiveHistogram<GPUBackend>::RunImpl(Workspace &ws);
Import
#include "naive_histogram.h"
// CUDA runtime is implicitly available in .cu files
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| ws | dali::Workspace & | Yes | Workspace providing input tensors, output tensors, and the CUDA stream. |
| ws.Input<GPUBackend>(0) | const TensorList<GPUBackend> & | Yes | Batch of single-channel uint8 images residing on the GPU. Each sample may have a different spatial size. |
| input (kernel) | const uint8_t * | Yes | Pointer to the raw pixel data for a single sample. |
| input_size (kernel) | int | Yes | Total number of pixels in the sample (product of all spatial dimensions). |
| n_bins (kernel) | int | Yes | Number of histogram bins, passed from the operator's n_histogram_bins_ member. |
Outputs
| Name | Type | Description |
|---|---|---|
| ws.Output<GPUBackend>(0) | TensorList<GPUBackend> & | Batch of 1-D int32 histogram vectors, shape [n_bins] per sample. Pre-allocated by SetupImpl(). |
| histogram (kernel) | int32_t * | Pointer to the output histogram array for a single sample; bins are incremented via atomicAdd. |
Usage Examples
Example: Full RunImpl with per-sample kernel launch
template <>
void NaiveHistogram<GPUBackend>::RunImpl(Workspace &ws) {
  const auto &input = ws.Input<GPUBackend>(0);
  const auto &shape = input.shape();
  auto &output = ws.Output<GPUBackend>(0);
  for (int sample_idx = 0; sample_idx < shape.num_samples(); sample_idx++) {
    dim3 block_size(32);
    auto input_size = volume(input.tensor_shape(sample_idx));
    dim3 grid_size((input_size + block_size.x - 1) / block_size.x);
    naive_histogram_kernel<<<grid_size, block_size, 0, ws.stream()>>>(
        input[sample_idx].data<uint8_t>(),
        input_size,
        n_histogram_bins_,
        output[sample_idx].mutable_data<int32_t>());
  }
}
Example: CUDA kernel for naive histogram
__global__ void naive_histogram_kernel(
    const uint8_t *input, const int input_size, const int n_bins,
    int32_t *histogram) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= input_size) return;
  auto value = input[tid];
  int bin = value % n_bins;
  atomicAdd(&histogram[bin], 1);
}