Implementation:NVIDIA DALI Legacy TF Op
| Knowledge Sources | |
|---|---|
| Domains | TensorFlow_Integration, Data_Pipeline |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Implements the legacy "Dali" TensorFlow op kernel that runs a serialized DALI pipeline and produces output tensors, with support for both dense and sparse tensor representations.
Description
This file contains the original DALI TensorFlow integration op, registered with REGISTER_OP("Dali"). Unlike the newer DALIDataset op, which integrates with tf.data.Dataset, this legacy op is a standard TensorFlow OpKernel that runs a DALI pipeline from its Compute method. The DaliOp constructor deserializes the pipeline string, configures execution parameters (batch size, threads, device, prefetch queue depths, separated/dynamic execution), and performs the initial prefetch warmup.
The Compute / ComputeImpl pair implements the core data production loop: it takes the next set of outputs from the pipeline, allocates TensorFlow output tensors, copies data from DALI buffers to the TF tensors (with appropriate stream synchronization for GPU operations), releases the pipeline outputs, and triggers the next pipeline run. The op supports sparse tensor output through the sparse attribute, which generates three tensors per sparse output (indices, values, and dense shape) by enumerating multi-dimensional indices with the recursive EnumerateIndices and EnumerateIndicesWithinSample helper functions.
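The ordering of the steps in that loop can be illustrated with a small, pure-Python sketch. It is not the op's C++ code: FakePipeline and its run, share_outputs, and release_outputs methods are hypothetical stand-ins for a pipeline with a prefetch queue, and the list copy stands in for allocating a TF tensor and copying into it.

```python
from collections import deque


class FakePipeline:
    """Hypothetical stand-in for a DALI pipeline with a prefetch queue."""

    def __init__(self, prefetch_depth):
        self._queue = deque()
        self._next_batch = 0
        # Warmup: fill the prefetch queue before the first compute() call.
        for _ in range(prefetch_depth):
            self.run()

    def run(self):
        """Schedule one more batch (here: enqueue a small list)."""
        self._queue.append([self._next_batch] * 3)
        self._next_batch += 1

    def share_outputs(self):
        """Expose the oldest ready batch without removing it."""
        return self._queue[0]

    def release_outputs(self):
        """Hand the oldest batch's buffers back to the pipeline."""
        self._queue.popleft()


def compute(pipeline):
    """One op invocation: fetch, copy out, release, schedule the next run."""
    outputs = pipeline.share_outputs()
    tensor = list(outputs)       # stands in for allocate + copy into a TF tensor
    pipeline.release_outputs()   # only after the copy has finished
    pipeline.run()               # keep the prefetch queue full
    return tensor


pipe = FakePipeline(prefetch_depth=2)
batches = [compute(pipe) for _ in range(3)]
print(batches)  # [[0, 0, 0], [1, 1, 1], [2, 2, 2]]
```

Releasing the outputs only after the copy is what makes the copy safe: until then the pipeline may not reuse those buffers. Scheduling the next run at the end of each call keeps the queue at its warmup depth.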
The op registers kernels for both CPU and GPU devices. It performs cross-device copy when the pipeline output device differs from the TensorFlow device placement, which is necessary because TensorFlow may run constant propagation on CPU regardless of op placement. Timing instrumentation is included for profiling output retrieval, memory allocation/copy, and pipeline execution phases.
Usage
This op is used via the lower-level dali_tf function call in the DALI TF plugin, primarily for TensorFlow 1.x Session-based workflows. For TensorFlow 2.x, the DALIDataset op is the recommended interface, but this legacy op remains available for backward compatibility and for use cases requiring sparse tensor output.
Code Reference
Source Location
- Repository: NVIDIA_DALI
- File: dali_tf_plugin/daliop.cc
- Lines: 1-501
Signature
namespace dali_tf_impl {

class DaliOp : public tf::OpKernel {
 public:
  explicit DaliOp(tf::OpKernelConstruction* context);
  ~DaliOp() override;
  void Compute(tf::OpKernelContext* context) override;
  void ComputeImpl(tf::OpKernelContext* context);
};

REGISTER_OP("Dali")
    .Attr("serialized_pipeline: string")
    .Attr("shapes: list(shape) >= 1")
    .Attr("num_threads: int = -1")
    .Attr("device_id: int = -1")
    .Attr("exec_separated: bool = false")
    .Attr("exec_dynamic: bool = false")
    .Attr("gpu_prefetch_queue_depth: int = 2")
    .Attr("cpu_prefetch_queue_depth: int = 2")
    .Attr("sparse: list(bool) = []")
    .Attr("batch_size: int = -1")
    .Attr("enable_memory_stats: bool = false")
    .Output("data: dtypes")
    .Attr("dtypes: list({half, float, uint8, int16, int32, int64}) >= 1");

REGISTER_KERNEL_BUILDER(Name("Dali").Device(tf::DEVICE_GPU), DaliOp);
REGISTER_KERNEL_BUILDER(Name("Dali").Device(tf::DEVICE_CPU), DaliOp);

}  // namespace dali_tf_impl
Import
#include "dali/dali.h"
#include "dali/dali_cpp_wrappers.h"
#include "dali/core/common.h"
#include "dali/core/small_vector.h"
#include "dali_tf_plugin/dali_helper.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| serialized_pipeline | string | Yes | Serialized DALI pipeline definition string |
| shapes | list(shape) | Yes | Expected output tensor shapes (one per output) |
| dtypes | list(type) | Yes | Expected output tensor data types (half, float, uint8, int16, int32, int64) |
| num_threads | int | No | Number of CPU threads for DALI (default -1, auto-detect) |
| device_id | int | No | GPU device ID (default -1 for CPU) |
| exec_separated | bool | No | Use separated executor (default false) |
| exec_dynamic | bool | No | Use dynamic executor (default false) |
| gpu_prefetch_queue_depth | int | No | GPU prefetch queue depth (default 2) |
| cpu_prefetch_queue_depth | int | No | CPU prefetch queue depth (default 2) |
| sparse | list(bool) | No | Whether each output should be sparse (default empty = all dense) |
| batch_size | int | No | Override batch size (default -1 = infer from shapes) |
| enable_memory_stats | bool | No | Enable DALI memory statistics (default false) |
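The defaulting rules in the table for batch_size and sparse can be sketched in Python as follows. Both helper names are hypothetical and this is not the op's code. Taking the batch size from the leading dimension of shapes[0] is an assumed reading of "infer from shapes", and treating missing sparse entries as dense is likewise an assumption (the table only covers the fully empty case).

```python
def resolve_batch_size(batch_size, shapes):
    """Return the explicit batch size, or infer it from the first output shape."""
    if batch_size >= 0:
        return batch_size
    if not shapes or len(shapes[0]) == 0:
        raise ValueError("cannot infer batch size: shapes[0] has no leading dimension")
    # Assumption: the leading dimension of the first output is the batch size.
    return shapes[0][0]


def normalize_sparse(sparse, num_outputs):
    """Expand the sparse flags to one bool per output; missing entries mean dense."""
    flags = list(sparse)
    if len(flags) > num_outputs:
        raise ValueError("more sparse flags than outputs")
    return flags + [False] * (num_outputs - len(flags))


shapes = [(32, 224, 224, 3), (32, 1)]
print(resolve_batch_size(-1, shapes))       # 32
print(resolve_batch_size(16, shapes))       # 16
print(normalize_sparse([], len(shapes)))    # [False, False]
```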
Outputs
| Name | Type | Description |
|---|---|---|
| data | list(Tensor) | Output tensors from the DALI pipeline. Each dense output is a single tensor matching shapes[i]. Each sparse output produces 3 tensors: indices (N x ndim, int64), values (N elements of the output dtype), and dense_shape (ndim elements, int64). |
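To make the sparse layout concrete, the following Python sketch builds the three tensors from a batch of variable-length samples. It illustrates the idea of enumerating indices within each sample; it is not a port of the C++ helpers. The function names are hypothetical, and placing the sample index as the first coordinate of each index row is an assumption.

```python
from itertools import product


def enumerate_indices_within_sample(sample_idx, sample_shape):
    """Yield the full coordinate (sample index first) of every element of one sample."""
    for coord in product(*(range(extent) for extent in sample_shape)):
        yield (sample_idx, *coord)


def to_sparse(samples, sample_shapes, dense_shape):
    """Build (indices, values, dense_shape) from flat, row-major samples."""
    indices, values = [], []
    for i, (flat, shape) in enumerate(zip(samples, sample_shapes)):
        coords = list(enumerate_indices_within_sample(i, shape))
        if len(coords) != len(flat):
            raise ValueError(f"sample {i}: shape {shape} does not match {len(flat)} values")
        indices.extend(coords)
        values.extend(flat)
    return indices, values, list(dense_shape)


# Two samples of different length packed into a logical 2 x 3 batch.
indices, values, dense_shape = to_sparse(
    samples=[[10, 11], [20, 21, 22]],
    sample_shapes=[(2,), (3,)],
    dense_shape=(2, 3),
)
print(indices)      # [(0, 0), (0, 1), (1, 0), (1, 1), (1, 2)]
print(values)       # [10, 11, 20, 21, 22]
print(dense_shape)  # [2, 3]
```

Here N is 5 and ndim is 2, so indices has 5 rows of 2 coordinates, values has 5 elements, and dense_shape has 2 elements. The position (0, 2) has no entry, which is how the shorter first sample is represented without padding.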
Usage Examples
Legacy Session-based Usage
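import tensorflow as tf
from nvidia.dali.plugin.tf import DALIRawIterator

# On TensorFlow 2.x, Session-based code needs graph mode.
tf.compat.v1.disable_eager_execution()

# Assumed to be defined elsewhere: `pipe` (a DALI pipeline), `batch_size`, `num_steps`.

# Load the DALI TF plugin
dali_tf_module = tf.load_op_library("libdali_tf_current.so")

# Create the legacy op
with tf.device("/gpu:0"):
    data = dali_tf_module.dali(
        serialized_pipeline=pipe.serialize(),
        shapes=[(batch_size, 224, 224, 3), (batch_size, 1)],
        dtypes=[tf.float32, tf.int32],
        device_id=0,
        batch_size=batch_size,
        num_threads=4,
    )

# Each sess.run call triggers one Compute invocation and returns one batch.
with tf.compat.v1.Session() as sess:
    for step in range(num_steps):
        images, labels = sess.run(data)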