Workflow:NVIDIA DALI Custom Operator Development
| Knowledge Sources | |
|---|---|
| Domains | Data_Loading, Plugin_Development, GPU_Computing, CUDA |
| Last Updated | 2026-02-08 17:00 GMT |
Overview
End-to-end process for developing, building, and integrating a custom C++ GPU operator into an NVIDIA DALI data pipeline.
Description
This workflow covers the full lifecycle of extending DALI with a custom operator. When the built-in operator library does not cover a specific data transformation, users can implement custom operators in C++ (with optional CUDA kernels for GPU execution) and load them as plugins at runtime. The process involves defining the operator class, registering it with DALI's operator schema, building it as a shared library, and loading it into a Python pipeline where it becomes available through the standard fn functional API.
Usage
Execute this workflow when you need a data preprocessing operation that is not available in DALI's built-in operator library. This is appropriate for domain-specific transformations (e.g., custom histogram computation, specialized augmentations, proprietary data format parsing) that must run within the DALI pipeline to maintain GPU-accelerated throughput.
Execution Steps
Step 1: Define the Operator Class
Create a C++ class that inherits from dali::Operator<Backend>, where Backend is either CPUBackend or GPUBackend. Implement the constructor to extract parameters from the OpSpec, SetupImpl to declare output tensor shapes and types, and RunImpl to execute the actual computation.
Key considerations:
- Template the class on Backend to support both CPU and GPU execution
- The constructor receives an OpSpec containing user-provided parameters
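- SetupImpl must fill in the shape (a TensorListShape) and data type of every output in the output descriptor vector (std::vector<OutputDesc>) it receives, and return true so that DALI allocates the outputs before RunImpl is called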
- RunImpl receives the Workspace with input and output tensor lists
Step 2: Register the Operator Schema
Use the DALI_REGISTER_OPERATOR macro to register the operator with DALI's operator registry, and define its schema (parameter names, types, defaults, documentation) using DALI_SCHEMA. This makes the operator discoverable and callable from Python.
Key considerations:
- The schema name becomes the Python-callable operator name (e.g., NaiveHistogram becomes fn.naive_histogram)
- Declare all parameters with types, defaults, and documentation strings
- Specify the number of inputs the operator accepts
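- Register the operator once for each backend it supports; the device argument of DALI_REGISTER_OPERATOR (CPU or GPU) determines whether the operator is available for GPU execution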
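The naming rule in the first bullet can be illustrated with a short sketch. This is a simplified approximation for plain CamelCase names such as NaiveHistogram, not DALI's own conversion routine, which may treat acronyms and module-qualified schema names differently. The helper name approximate_fn_name is hypothetical.

```python
import re


def approximate_fn_name(schema_name: str) -> str:
    """Approximate the CamelCase -> snake_case mapping for simple schema names.

    Assumption: this only mirrors the simple case described in the text;
    it is not DALI's actual conversion code.
    """
    # Insert "_" at every lowercase-or-digit -> uppercase boundary, then lowercase.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", schema_name).lower()


print(approximate_fn_name("NaiveHistogram"))  # naive_histogram
```

When in doubt about the generated name, inspecting the fn module after loading the plugin is more reliable than predicting it.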
Step 3: Implement the Computation Kernel
Write the actual data processing logic inside RunImpl. For GPU operators, implement CUDA kernels that operate on device memory. Access input tensors from the workspace and write results to output tensors.
Key considerations:
- For GPU execution, use CUDA kernels launched within RunImpl
- Access input data via ws.Input<Backend>(idx) and output via ws.Output<Backend>(idx)
- Handle variable-length batches by iterating over samples in the tensor list
- Use the workspace's CUDA stream for GPU operations to ensure correct synchronization
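The split between SetupImpl (declare output shapes) and RunImpl (process each sample) can be modeled in plain Python before writing any C++ or CUDA. The sketch below is only a model of that control flow for a histogram-style operator, not DALI code. The function names, the n_bins parameter, and the binning rule (value * n_bins // max_value) are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Sample = List[List[int]]  # one 2-D sample: rows of pixel values in [0, max_value)


def setup_output_shapes(batch: Sequence[Sample], n_bins: int) -> List[Tuple[int]]:
    """Model of SetupImpl: one output shape per sample, decided before processing."""
    return [(n_bins,) for _ in batch]


def run_histogram(batch: Sequence[Sample], n_bins: int, max_value: int = 256) -> List[List[int]]:
    """Model of RunImpl: loop over samples, since each may have a different shape."""
    outputs = []
    for sample in batch:
        hist = [0] * n_bins
        for row in sample:
            for value in row:
                # Assumed binning rule; a real kernel must define its own.
                hist[value * n_bins // max_value] += 1
        outputs.append(hist)
    return outputs


# Two samples with different shapes (2x2 and 1x3) in one batch.
batch = [
    [[0, 255], [128, 127]],
    [[10, 20, 30]],
]
shapes = setup_output_shapes(batch, n_bins=2)
hists = run_histogram(batch, n_bins=2)
```

In the real operator, the outer loop over samples typically stays on the host while the inner loops become the body of a CUDA kernel launched on the workspace's stream.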
Step 4: Build the Shared Library
Configure a CMake build that compiles the operator source into a shared library (.so). Add DALI's headers to the include path, and link against the DALI library and any required CUDA libraries.
Key considerations:
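- Locate DALI's headers and libraries, for example by querying the installed Python package (nvidia.dali.sysconfig provides get_include_dir(), get_lib_dir(), and get_compile_flags()), or point the build at the DALI headers directly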
- Enable CUDA compilation for GPU operator kernels
- The output is a single .so file that can be loaded at runtime
- Set the C++ standard to match DALI's requirements (C++17 or later)
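One way to find the include path, library path, and compiler flags for the build is to ask the installed DALI Python package, which exposes them through nvidia.dali.sysconfig. The helper below is a minimal sketch: the function name dali_build_settings and the dictionary keys are hypothetical, and the optional sysconfig argument exists only so the helper can be exercised without DALI installed.

```python
def dali_build_settings(sysconfig=None):
    """Collect the paths and flags a plugin build needs from DALI's sysconfig module."""
    if sysconfig is None:
        # Imported lazily so this helper can be defined on machines without DALI.
        import nvidia.dali.sysconfig as sysconfig
    return {
        "include_dir": sysconfig.get_include_dir(),
        "lib_dir": sysconfig.get_lib_dir(),
        # get_compile_flags() returns a list of flags; join them for the command line.
        "compile_flags": " ".join(sysconfig.get_compile_flags()),
    }
```

The returned values can be passed to CMake as cache variables or queried from within CMakeLists.txt by running the Python interpreter. Which variable names the build consumes is up to the project.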
Step 5: Load and Use in a Pipeline
In Python, load the compiled shared library using nvidia.dali.plugin_manager.load_library. The custom operator becomes immediately available through the fn namespace. Define a pipeline that calls the custom operator alongside built-in operators.
Key considerations:
- Call plugin_manager.load_library("path/to/libcustom_op.so") before defining the pipeline
- The operator is callable as fn.operator_name(input, param=value)
- Parameters defined in the schema are passed as keyword arguments
- Once loaded, the custom operator is a regular node in the pipeline graph and is scheduled by DALI's executor in the same way as built-in operators
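Put together, loading and using the plugin might look like the sketch below. It assumes a plugin that registers a GPU operator under the schema name NaiveHistogram with an n_bins parameter. The parameter name, the function name build_histogram_pipeline, and the synthetic input data are illustrative assumptions rather than part of the workflow above.

```python
def build_histogram_pipeline(plugin_path, batch_size=4):
    """Load the plugin and return a pipeline that calls the custom operator."""
    # DALI and NumPy are imported lazily so this function can be defined without them.
    import numpy as np
    import nvidia.dali.fn as fn
    import nvidia.dali.plugin_manager as plugin_manager
    from nvidia.dali import pipeline_def

    # Load the shared library first so that fn.naive_histogram exists
    # when the pipeline graph is defined.
    plugin_manager.load_library(plugin_path)

    def random_batch():
        # Synthetic grayscale samples with different heights in one batch.
        return [
            np.random.randint(0, 256, size=(32 + 8 * i, 48), dtype=np.uint8)
            for i in range(batch_size)
        ]

    @pipeline_def(batch_size=batch_size, num_threads=2, device_id=0)
    def pipeline():
        images = fn.external_source(source=random_batch)
        # n_bins is an assumed schema parameter, passed as a keyword argument.
        return fn.naive_histogram(images.gpu(), n_bins=24)

    return pipeline()
```

Calling build_histogram_pipeline("path/to/libcustom_op.so"), then build() and run() on the returned pipeline, executes one batch through the custom operator.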
Step 6: Validate the Operator
Test the custom operator by running the pipeline and verifying outputs against reference implementations. Check correctness across different batch sizes, input shapes, and parameter configurations.
Key considerations:
- Compare operator output against a known-correct reference (e.g., NumPy or OpenCV equivalent)
- Test edge cases such as empty inputs, maximum batch sizes, and boundary parameter values
- Verify that GPU and CPU backends produce consistent results if both are implemented
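A small comparison helper keeps these checks uniform across batch sizes and backends. The sketch below works on nested Python lists, so operator outputs would first need to be copied to host memory and converted. The function name compare_batches and the mismatch tuple layout are hypothetical.

```python
def compare_batches(actual, expected, tolerance=0):
    """Return mismatches as (sample_idx, element_idx, actual, expected) tuples.

    An element_idx of None marks a sample whose length differs from the reference.
    """
    if len(actual) != len(expected):
        raise ValueError(
            f"batch size mismatch: {len(actual)} vs {len(expected)}"
        )
    mismatches = []
    for sample_idx, (got, ref) in enumerate(zip(actual, expected)):
        if len(got) != len(ref):
            mismatches.append((sample_idx, None, len(got), len(ref)))
            continue
        for element_idx, (x, y) in enumerate(zip(got, ref)):
            # tolerance=0 demands exact equality, which suits integer outputs;
            # floating-point outputs usually need a small positive tolerance.
            if abs(x - y) > tolerance:
                mismatches.append((sample_idx, element_idx, x, y))
    return mismatches
```

An empty result means the batch matches the reference. The same helper can compare CPU output against GPU output when both backends are implemented.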