
Implementation:NVIDIA DALI RunImpl CUDA Kernel

From Leeroopedia


Knowledge Sources
Domains Custom_Operators, GPU_Computing, CUDA_Programming
Last Updated 2026-02-08 00:00 GMT

Overview

The concrete GPU-specialized RunImpl() method and the naive_histogram_kernel CUDA kernel from the NaiveHistogram example in NVIDIA DALI, demonstrating per-sample kernel launches, workspace stream integration, and atomic histogram accumulation.

Description

The NaiveHistogram<GPUBackend>::RunImpl() method (defined in naive_histogram.cu) is the explicit specialization of RunImpl() for the GPU backend. It:

  1. Retrieves the input batch via ws.Input<GPUBackend>(0), which returns a const TensorList<GPUBackend> &.
  2. Retrieves the output batch via ws.Output<GPUBackend>(0), which returns a mutable TensorList<GPUBackend> & with buffers pre-allocated by SetupImpl().
  3. Iterates over each sample in the batch using shape.num_samples().
  4. For each sample, computes the 1-D grid and block dimensions based on volume(input.tensor_shape(sample_idx)).
  5. Launches the naive_histogram_kernel on ws.stream(), passing per-sample data pointers obtained via input[sample_idx].data<uint8_t>() and output[sample_idx].mutable_data<int32_t>().

The naive_histogram_kernel is a __global__ CUDA function that:

  • Computes a global thread ID from blockIdx.x * blockDim.x + threadIdx.x.
  • Performs a bounds check against input_size.
  • Reads the pixel value and computes bin = value % n_bins.
  • Atomically increments histogram[bin] using atomicAdd.

Usage

This pattern is used whenever a custom DALI operator needs to execute a CUDA kernel. The .cu file contains both the kernel definition and the RunImpl() specialization for GPUBackend.

Code Reference

Source Location

  • Repository: NVIDIA DALI
  • File: docs/examples/custom_operations/custom_operator/naive_histogram/naive_histogram.cu (lines 30-58)

Signature

// CUDA kernel
__global__ void naive_histogram_kernel(
        const uint8_t *input, const int input_size, const int n_bins,
        int32_t *histogram);

// RunImpl specialization for GPUBackend
template<>
void NaiveHistogram<GPUBackend>::RunImpl(Workspace &ws);

Import

#include "naive_histogram.h"
// CUDA runtime is implicitly available in .cu files

I/O Contract

Inputs

  • ws (dali::Workspace &, required): Workspace providing input tensors, output tensors, and the CUDA stream.
  • ws.Input<GPUBackend>(0) (const TensorList<GPUBackend> &, required): Batch of single-channel uint8 images residing on the GPU. Each sample may have a different spatial size.
  • input (kernel argument, const uint8_t *, required): Pointer to the raw pixel data for a single sample.
  • input_size (kernel argument, int, required): Total number of pixels in the sample (product of all spatial dimensions).
  • n_bins (kernel argument, int, required): Number of histogram bins, passed from the operator's n_histogram_bins_ member.

Outputs

  • ws.Output<GPUBackend>(0) (TensorList<GPUBackend> &): Batch of 1-D int32 histogram vectors, shape [n_bins] per sample. Pre-allocated by SetupImpl().
  • histogram (kernel argument, int32_t *): Pointer to the output histogram array for a single sample; bins are incremented via atomicAdd.

Usage Examples

Example: Full RunImpl with per-sample kernel launch

template<>
void NaiveHistogram<GPUBackend>::RunImpl(Workspace &ws) {
  const auto &input = ws.Input<GPUBackend>(0);
  const auto &shape = input.shape();
  auto &output = ws.Output<GPUBackend>(0);

  for (int sample_idx = 0; sample_idx < shape.num_samples(); sample_idx++) {
    dim3 block_size(32);
    auto input_size = volume(input.tensor_shape(sample_idx));
    dim3 grid_size((input_size + block_size.x - 1) / block_size.x);

    naive_histogram_kernel<<<grid_size, block_size, 0, ws.stream()>>>(
            input[sample_idx].data<uint8_t>(),
            input_size,
            n_histogram_bins_,
            output[sample_idx].mutable_data<int32_t>()
    );
  }
}

Example: CUDA kernel for naive histogram

__global__ void naive_histogram_kernel(
        const uint8_t *input, const int input_size, const int n_bins,
        int32_t *histogram) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= input_size) return;
  auto value = input[tid];
  int bin = value % n_bins;
  atomicAdd(&histogram[bin], 1);
}
