Workflow:NVIDIA DALI Custom Operator Development

Knowledge Sources	NVIDIA DALI, DALI Documentation, DALI Custom Operators
Domains	Data_Loading, Plugin_Development, GPU_Computing, CUDA
Last Updated	2026-02-08 17:00 GMT

Overview

End-to-end process for developing, building, and integrating a custom C++ GPU operator into an NVIDIA DALI data pipeline.

Description

This workflow covers the full lifecycle of extending DALI with a custom operator. When the built-in operator library does not cover a specific data transformation, users can implement custom operators in C++ (with optional CUDA kernels for GPU execution) and load them as plugins at runtime. The process involves defining the operator class, declaring its schema and registering it with DALI's operator registry, building it as a shared library, and loading it into a Python pipeline, where it becomes available through the standard nvidia.dali.fn API.

Usage

Execute this workflow when you need a data preprocessing operation that is not available in DALI's built-in operator library. This is appropriate for domain-specific transformations (e.g., custom histogram computation, specialized augmentations, proprietary data format parsing) that must run within the DALI pipeline to maintain GPU-accelerated throughput.

Execution Steps

Step 1: Define the Operator Class

Create a C++ class that inherits from dali::Operator<Backend>, where Backend is either CPUBackend or GPUBackend. Implement the constructor to extract parameters from the OpSpec, SetupImpl to declare output tensor shapes and types, and RunImpl to execute the actual computation.

Key considerations:

Template the class on Backend to support both CPU and GPU execution
The constructor receives an OpSpec containing user-provided parameters
SetupImpl fills one OutputDesc entry per output with its shape (a TensorListShape) and data type, and returns true so that DALI allocates the outputs before RunImpl is called
RunImpl receives the Workspace with input and output tensor lists

Step 2: Register the Operator Schema

Use the DALI_REGISTER_OPERATOR macro to register the operator with DALI's operator registry, and define its schema (parameter names, types, defaults, documentation) using DALI_SCHEMA. This makes the operator discoverable and callable from Python.

Key considerations:

The schema name becomes the Python-callable operator name (e.g., NaiveHistogram becomes fn.naive_histogram)
Declare all parameters with types, defaults, and documentation strings
Specify the number of inputs the operator accepts
Register each backend the operator supports by passing CPU or GPU as the last argument of DALI_REGISTER_OPERATOR
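The name mapping mentioned above can be sketched in a few lines of Python. This is a simplified approximation of a CamelCase to snake_case conversion, not DALI's own implementation, which may treat acronyms and digits differently; the function name to_snake_case is chosen here for illustration only.

```python
import re


def to_snake_case(schema_name):
    """Approximate how a CamelCase schema name maps to a snake_case fn name.

    Simplified sketch: DALI's real conversion may apply extra rules.
    """
    # Insert an underscore before a capitalized word that follows any character.
    partial = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", schema_name)
    # Insert an underscore between a lowercase letter or digit and a capital.
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", partial).lower()


# The example from the text: NaiveHistogram is exposed as fn.naive_histogram.
print(to_snake_case("NaiveHistogram"))
```

Checking the expected Python name this way before building the plugin can help catch a schema name that would produce an awkward call site.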

Step 3: Implement the Computation Kernel

Write the actual data processing logic inside RunImpl. For GPU operators, implement CUDA kernels that operate on device memory. Access input tensors from the workspace and write results to output tensors.

Key considerations:

For GPU execution, use CUDA kernels launched within RunImpl
Access input data via ws.Input<Backend>(idx) and output via ws.Output<Backend>(idx)
Handle variable-length batches by iterating over samples in the tensor list
Use the workspace's CUDA stream for GPU operations to ensure correct synchronization

Step 4: Build as a Shared Library

Configure a CMake build that compiles the operator source into a shared library (.so). Add the DALI headers to the include path, and link against the DALI library and any required CUDA libraries.

Key considerations:

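Use find_package(DALI) if a CMake package configuration is available, or obtain the include and library directories from the installed Python package through nvidia.dali.sysconfig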
Enable CUDA compilation for GPU operator kernels
The output is a single .so file that can be loaded at runtime
Set the C++ standard to match DALI's requirements (C++17 or later)
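The build needs to know where the installed DALI package keeps its headers and library. One way to find out is to query the nvidia.dali.sysconfig module, sketched below. The helper names dali_build_info and cmake_definitions, and the CMake variable names DALI_INCLUDE_DIR and DALI_LIB_DIR, are hypothetical; a project's CMakeLists.txt may expect different names.

```python
def dali_build_info(sysconfig=None):
    """Collect the paths and flags needed to compile against DALI.

    A module-like object can be passed in for testing; by default the real
    nvidia.dali.sysconfig module is imported, which requires DALI to be
    installed.
    """
    if sysconfig is None:
        # Deferred import so this file can be loaded without DALI present.
        from nvidia.dali import sysconfig
    return {
        "include_dir": sysconfig.get_include_dir(),
        "lib_dir": sysconfig.get_lib_dir(),
        "compile_flags": list(sysconfig.get_compile_flags()),
        "link_flags": list(sysconfig.get_link_flags()),
    }


def cmake_definitions(info):
    """Format the paths as -D arguments (variable names are hypothetical)."""
    return [
        f"-DDALI_INCLUDE_DIR={info['include_dir']}",
        f"-DDALI_LIB_DIR={info['lib_dir']}",
    ]


# Possible usage on a machine with DALI installed:
#   args = cmake_definitions(dali_build_info())
#   subprocess.run(["cmake", "-S", ".", "-B", "build", *args], check=True)
```

Passing the paths in explicitly keeps the CMake configuration independent of how DALI was installed, at the cost of one extra step before configuring.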

Step 5: Load and Use in a Pipeline

In Python, load the compiled shared library using nvidia.dali.plugin_manager.load_library. The custom operator becomes immediately available through the fn namespace. Define a pipeline that calls the custom operator alongside built-in operators.

Key considerations:

Call plugin_manager.load_library("path/to/libcustom_op.so") before defining the pipeline
The operator is callable as fn.operator_name(input, param=value)
Parameters defined in the schema are passed as keyword arguments
The custom operator participates in DALI's graph optimization and execution scheduling
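The loading step might look like the sketch below. It assumes a plugin that registers the NaiveHistogram operator named earlier, with a GPU backend; the n_bins parameter, the random input, and the helper name build_histogram_pipeline are assumptions for illustration and should be replaced with whatever the operator's schema actually declares.

```python
def build_histogram_pipeline(plugin_path, batch_size=8, n_bins=16):
    """Load a custom-operator plugin and build a pipeline that calls it.

    Assumes the plugin registers NaiveHistogram for the GPU backend and that
    its schema declares an integer argument named n_bins (hypothetical).
    """
    # Deferred imports so this file can be loaded on machines without DALI.
    import nvidia.dali.fn as fn
    import nvidia.dali.plugin_manager as plugin_manager
    from nvidia.dali import pipeline_def

    # Load the shared library first: this registers the operator with DALI
    # and makes it available as fn.naive_histogram.
    plugin_manager.load_library(plugin_path)

    @pipeline_def(batch_size=batch_size, num_threads=2, device_id=0)
    def histogram_pipeline():
        # Stand-in input: random values generated on the CPU.
        data = fn.random.uniform(range=[0.0, 255.0], shape=[64, 64])
        # Move the data to the GPU and call the custom operator; schema
        # parameters are passed as keyword arguments.
        return fn.naive_histogram(data.gpu(), n_bins=n_bins)

    pipe = histogram_pipeline()
    pipe.build()
    return pipe


# Possible usage on a machine with DALI, a GPU, and the compiled plugin:
#   pipe = build_histogram_pipeline("path/to/libcustom_op.so")
#   (histograms,) = pipe.run()
```

The library is loaded inside the helper, before the pipeline definition is instantiated, because the fn.naive_histogram attribute does not exist until the plugin has been registered.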

Step 6: Validate the Operator

Test the custom operator by running the pipeline and verifying outputs against reference implementations. Check correctness across different batch sizes, input shapes, and parameter configurations.

Key considerations:

Compare operator output against a known-correct reference (e.g., NumPy or OpenCV equivalent)
Test edge cases such as empty inputs, maximum batch sizes, and boundary parameter values
Verify that GPU and CPU backends produce consistent results if both are implemented
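A reference check for a histogram-style operator could be sketched as follows. It uses plain Python lists so that it runs without extra dependencies; NumPy would be a natural alternative. The binning convention (equal-width bins over the half-open range [lo, hi)) is an assumption and has to match what the operator under test actually implements. Both function names are hypothetical.

```python
def reference_histogram(sample, n_bins, lo=0, hi=256):
    """Count values of one sample into n_bins equal-width bins over [lo, hi).

    Assumed convention: values outside the range are ignored.
    """
    counts = [0] * n_bins
    for value in sample:
        if lo <= value < hi:
            index = int((value - lo) * n_bins / (hi - lo))
            # Guard against floating-point rounding at the upper edge.
            counts[min(index, n_bins - 1)] += 1
    return counts


def check_batch(batch, operator_output, n_bins):
    """Compare per-sample operator output against the reference.

    batch: list of samples, each a flat sequence of numbers.
    operator_output: list of per-sample histograms copied back to the host.
    Raises AssertionError that names the first mismatching sample.
    """
    if len(batch) != len(operator_output):
        raise AssertionError("batch size mismatch")
    for i, (sample, got) in enumerate(zip(batch, operator_output)):
        expected = reference_histogram(sample, n_bins)
        if list(got) != expected:
            raise AssertionError(f"sample {i}: expected {expected}, got {list(got)}")


# Samples of different lengths, including an empty one, exercise the
# variable-length and edge-case handling mentioned above.
batch = [[0, 1, 128, 255], [], [64] * 10]
outputs = [reference_histogram(s, 4) for s in batch]
check_batch(batch, outputs, 4)
```

In a real test, outputs would come from running the pipeline and copying the results to the host, rather than from the reference itself.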
