Implementation:NVIDIA DALI Legacy TF Op

Knowledge Sources	NVIDIA_DALI
Domains	TensorFlow_Integration, Data_Pipeline
Last Updated	2026-02-08 16:00 GMT

Overview

Implements the legacy "Dali" TensorFlow op kernel, which runs a serialized DALI pipeline and returns its outputs as dense or sparse TensorFlow tensors.

Description

This file contains the original DALI TensorFlow integration op, registered as REGISTER_OP("Dali"). Unlike the newer DALIDataset op, which integrates with tf.data.Dataset, this legacy op is a standard TensorFlow OpKernel that runs a DALI pipeline from within its Compute method. The DaliOp constructor builds the pipeline: it deserializes the pipeline string, configures execution parameters (batch size, threads, device, prefetch queue depths, separated/dynamic execution), and performs an initial prefetch warmup.

The Compute / ComputeImpl methods implement the core data production loop: they pop the next set of outputs from the pipeline, allocate TensorFlow output tensors, copy data from DALI buffers into the TF tensors (with appropriate stream synchronization for GPU operations), release the pipeline outputs, and trigger the next pipeline run. Sparse output is selected per output through the sparse attribute. Each sparse output yields three tensors (indices, values, and dense shape), and the indices are produced by the recursive EnumerateIndices and EnumerateIndicesWithinSample helper functions, which enumerate multi-dimensional indices.
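
The index enumeration for a sparse output can be illustrated with a minimal pure-Python sketch. It is not the C++ code from daliop.cc: the function names only mirror EnumerateIndices and EnumerateIndicesWithinSample, and dense_shape is a hypothetical helper. The sketch assumes that every index row starts with the sample's position in the batch and that the dense shape is the batch size followed by the per-dimension maximum over all samples.

```python
from typing import List, Sequence


def enumerate_indices_within_sample(
    shape: Sequence[int], prefix: List[int], out: List[List[int]]
) -> None:
    """Append every index of one sample to `out`, recursing over dimensions."""
    if not shape:
        # All dimensions are fixed: `prefix` is one complete index row.
        out.append(list(prefix))
        return
    for i in range(shape[0]):
        prefix.append(i)
        enumerate_indices_within_sample(shape[1:], prefix, out)
        prefix.pop()


def enumerate_indices(sample_shapes: Sequence[Sequence[int]]) -> List[List[int]]:
    """Enumerate index rows for a whole batch of variable-shape samples."""
    out: List[List[int]] = []
    for sample_idx, shape in enumerate(sample_shapes):
        # Assumption: the leading coordinate is the sample index in the batch.
        enumerate_indices_within_sample(shape, [sample_idx], out)
    return out


def dense_shape(sample_shapes: Sequence[Sequence[int]]) -> List[int]:
    """Hypothetical helper: batch size plus per-dimension maxima."""
    ndim = len(sample_shapes[0])
    maxima = [max(s[d] for s in sample_shapes) for d in range(ndim)]
    return [len(sample_shapes)] + maxima


shapes = [(2,), (3,)]  # two 1-D samples of different length
print(enumerate_indices(shapes))  # [[0, 0], [0, 1], [1, 0], [1, 1], [1, 2]]
print(dense_shape(shapes))        # [2, 3]
```

The values tensor would then hold the sample data flattened in the same order as the index rows.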

The op registers kernels for both CPU and GPU devices. It performs cross-device copy when the pipeline output device differs from the TensorFlow device placement, which is necessary because TensorFlow may run constant propagation on CPU regardless of op placement. Timing instrumentation is included for profiling output retrieval, memory allocation/copy, and pipeline execution phases.

Usage

This op is used via the lower-level dali_tf function call in the DALI TF plugin, primarily for TensorFlow 1.x Session-based workflows. For TensorFlow 2.x, the DALIDataset op is the recommended interface, but this legacy op remains available for backward compatibility and for use cases requiring sparse tensor output.

Code Reference

Source Location

Repository: NVIDIA_DALI
File: dali_tf_plugin/daliop.cc
Lines: 1-501

Signature

namespace dali_tf_impl {

class DaliOp : public tf::OpKernel {
 public:
  explicit DaliOp(tf::OpKernelConstruction* context);
  ~DaliOp() override;
  void Compute(tf::OpKernelContext* context) override;
  void ComputeImpl(tf::OpKernelContext* context);
};

REGISTER_OP("Dali")
    .Attr("serialized_pipeline: string")
    .Attr("shapes: list(shape) >= 1")
    .Attr("num_threads: int = -1")
    .Attr("device_id: int = -1")
    .Attr("exec_separated: bool = false")
    .Attr("exec_dynamic: bool = false")
    .Attr("gpu_prefetch_queue_depth: int = 2")
    .Attr("cpu_prefetch_queue_depth: int = 2")
    .Attr("sparse: list(bool) = []")
    .Attr("batch_size: int = -1")
    .Attr("enable_memory_stats: bool = false")
    .Output("data: dtypes")
    .Attr("dtypes: list({half, float, uint8, int16, int32, int64}) >= 1");

REGISTER_KERNEL_BUILDER(Name("Dali").Device(tf::DEVICE_GPU), DaliOp);
REGISTER_KERNEL_BUILDER(Name("Dali").Device(tf::DEVICE_CPU), DaliOp);

}  // namespace dali_tf_impl

Import

#include "dali/dali.h"
#include "dali/dali_cpp_wrappers.h"
#include "dali/core/common.h"
#include "dali/core/small_vector.h"
#include "dali_tf_plugin/dali_helper.h"

I/O Contract

Inputs

Name	Type	Required	Description
serialized_pipeline	string	Yes	Serialized DALI pipeline definition string
shapes	list(shape)	Yes	Expected output tensor shapes (one per output)
dtypes	list(type)	Yes	Expected output tensor data types (half, float, uint8, int16, int32, int64)
num_threads	int	No	Number of CPU threads for DALI (default -1, auto-detect)
device_id	int	No	GPU device ID (default -1 for CPU)
exec_separated	bool	No	Use separated executor (default false)
exec_dynamic	bool	No	Use dynamic executor (default false)
gpu_prefetch_queue_depth	int	No	GPU prefetch queue depth (default 2)
cpu_prefetch_queue_depth	int	No	CPU prefetch queue depth (default 2)
sparse	list(bool)	No	Whether each output should be sparse (default empty = all dense)
batch_size	int	No	Override batch size (default -1 = infer from shapes)
enable_memory_stats	bool	No	Enable DALI memory statistics (default false)
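
As a reading aid, the attribute table can be mirrored in a small Python dataclass. DaliOpAttrs and its methods are hypothetical and not part of DALI. The defaults are copied from the REGISTER_OP declaration, while the batch-size fallback (leading dimension of the first entry in shapes) and the treatment of a short sparse list are assumptions about how the kernel interprets them.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DaliOpAttrs:  # hypothetical helper, not part of DALI
    serialized_pipeline: str
    shapes: List[Tuple[int, ...]]
    dtypes: List[str]
    num_threads: int = -1
    device_id: int = -1
    exec_separated: bool = False
    exec_dynamic: bool = False
    gpu_prefetch_queue_depth: int = 2
    cpu_prefetch_queue_depth: int = 2
    sparse: List[bool] = field(default_factory=list)
    batch_size: int = -1
    enable_memory_stats: bool = False

    def resolved_batch_size(self) -> int:
        """Assumption: -1 falls back to the leading dim of the first shape."""
        if self.batch_size >= 0:
            return self.batch_size
        return self.shapes[0][0]

    def is_sparse(self, i: int) -> bool:
        """Assumption: outputs beyond the end of `sparse` are dense."""
        return i < len(self.sparse) and self.sparse[i]


attrs = DaliOpAttrs(
    serialized_pipeline="<serialized pipeline>",
    shapes=[(32, 224, 224, 3), (32, 1)],
    dtypes=["float", "int32"],
)
print(attrs.resolved_batch_size())  # 32
```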

Outputs

Name	Type	Description
data	list(Tensor)	Output tensors from the DALI pipeline. Dense outputs match shapes[i]. Each sparse output produces 3 tensors: indices (N x ndim, int64), values (N elements of the output dtype), and dense_shape (ndim elements, int64), where N is the number of elements in the output.
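
Because each sparse output expands into three tensors, the flat output list can be longer than the list of pipeline outputs. The sketch below shows one way to compute the resulting dtype list. expand_output_dtypes is a hypothetical helper, dtypes are plain strings rather than TensorFlow types, and the slot order (indices, values, dense_shape placed consecutively where the sparse output would be) is an assumption.

```python
from typing import List, Sequence


def expand_output_dtypes(dtypes: Sequence[str], sparse: Sequence[bool]) -> List[str]:
    """Return the dtype of every output slot after sparse expansion."""
    expanded: List[str] = []
    for i, dtype in enumerate(dtypes):
        if i < len(sparse) and sparse[i]:
            # Assumed slot order: indices (int64), values, dense_shape (int64).
            expanded += ["int64", dtype, "int64"]
        else:
            expanded.append(dtype)
    return expanded


print(expand_output_dtypes(["float", "int32"], [False, True]))
# ['float', 'int64', 'int32', 'int64']
```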

Usage Examples

Legacy Session-based Usage

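import tensorflow as tf

# Session-based execution needs graph mode; this call is only required under TensorFlow 2.x.
tf.compat.v1.disable_eager_execution()

# Load the DALI TF plugin (the library file name may differ between installations)
dali_tf_module = tf.load_op_library("libdali_tf_current.so")

# Assumed to be defined elsewhere: `pipe` (a DALI pipeline that outputs
# images and labels), `batch_size`, and `num_steps`.

# Create the legacy op
with tf.device("/gpu:0"):
    data = dali_tf_module.dali(
        serialized_pipeline=pipe.serialize(),
        shapes=[(batch_size, 224, 224, 3), (batch_size, 1)],
        dtypes=[tf.float32, tf.int32],
        device_id=0,
        batch_size=batch_size,
        num_threads=4,
    )

# Each sess.run call runs the op once and returns one batch per output
with tf.compat.v1.Session() as sess:
    for step in range(num_steps):
        images, labels = sess.run(data)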
