

Implementation:NVIDIA DALI Legacy TF Op



Knowledge Sources
Domains TensorFlow_Integration, Data_Pipeline
Last Updated 2026-02-08 16:00 GMT

Overview

Implements the legacy "Dali" TensorFlow op kernel that runs a serialized DALI pipeline and produces output tensors, with support for both dense and sparse tensor representations.

Description

This file contains the original DALI TensorFlow integration op, registered with REGISTER_OP("Dali"). Unlike the newer DALIDataset op, which integrates with tf.data.Dataset, this legacy op is a standard TensorFlow OpKernel that runs a DALI pipeline from its Compute method. The DaliOp constructor builds the pipeline by deserializing the pipeline string, configuring execution parameters (batch size, threads, device, prefetch queue depths, separated/dynamic execution), and performing an initial prefetch warm-up.

The Compute and ComputeImpl methods implement the core data production loop. Each call pops the next set of outputs from the pipeline, allocates TensorFlow output tensors, copies data from the DALI buffers into those tensors (with stream synchronization for GPU operations), releases the pipeline outputs, and triggers the next pipeline run. The op supports sparse tensor output through the sparse attribute. Each sparse output yields three tensors (indices, values, and dense shape), with the indices produced by enumerating multi-dimensional indices in the recursive EnumerateIndices and EnumerateIndicesWithinSample helper functions.
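The index enumeration behind sparse output can be illustrated with a small, self-contained sketch. This is not the op's C++ code: the function names are hypothetical Python analogues of the helpers mentioned above, samples are plain lists in row-major order, and taking the dense shape as the per-dimension maximum over the batch is an assumption of this sketch rather than something the source confirms.

```python
from typing import List, Sequence, Tuple


def enumerate_indices_within_sample(shape: Sequence[int]) -> List[Tuple[int, ...]]:
    """Recursively list every multi-dimensional index of one sample, row-major."""
    if not shape:
        return [()]  # a scalar has exactly one (empty) index
    tails = enumerate_indices_within_sample(shape[1:])
    return [(i,) + tail for i in range(shape[0]) for tail in tails]


def batch_to_sparse(samples, sample_shapes):
    """Build (indices, values, dense_shape) for a batch of variable-shape samples.

    samples[k] holds the flat, row-major data of sample k and
    sample_shapes[k] is its shape.
    """
    indices, values = [], []
    for k, (data, shape) in enumerate(zip(samples, sample_shapes)):
        for idx, value in zip(enumerate_indices_within_sample(shape), data):
            indices.append((k,) + idx)  # prepend the batch index
            values.append(value)
    # Assumption of this sketch: dense shape = batch size + per-dimension maximum
    ndim = len(sample_shapes[0])
    max_extent = [max(s[d] for s in sample_shapes) for d in range(ndim)]
    dense_shape = [len(samples)] + max_extent
    return indices, values, dense_shape
```

For a batch of two 1-D samples of lengths 3 and 2, this yields five index rows of the form (sample, position), five values, and a dense shape of [2, 3].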

The op registers kernels for both CPU and GPU devices. It performs a cross-device copy when the pipeline output device differs from the TensorFlow device placement, which is necessary because TensorFlow may run constant propagation on the CPU regardless of op placement. Timing instrumentation is included for profiling the output retrieval, memory allocation/copy, and pipeline execution phases.
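The order of operations in one Compute call, together with the three timed phases, can be sketched with a toy model. Everything here is hypothetical: FakePipeline is a stand-in written for this example and does not reflect the DALI API, and copying Python lists stands in for allocating TensorFlow tensors and copying device buffers.

```python
import time
from collections import deque


class FakePipeline:
    """Hypothetical stand-in for a prefetching pipeline (not the DALI API)."""

    def __init__(self, batches, prefetch_depth=2):
        self._source = iter(batches)
        self._ready = deque()
        for _ in range(prefetch_depth):  # warm-up, done once at construction
            self.run()

    def run(self):
        """Schedule one more batch, if the source has any left."""
        batch = next(self._source, None)
        if batch is not None:
            self._ready.append(batch)

    def pop_outputs(self):
        return self._ready.popleft()

    def release_outputs(self):
        pass  # a real pipeline would recycle its output buffers here


def compute_step(pipeline, timings):
    """One Compute call: pop, allocate + copy, release, schedule the next run."""
    t0 = time.perf_counter()
    outputs = pipeline.pop_outputs()
    t1 = time.perf_counter()
    copied = [list(o) for o in outputs]  # stands in for allocate + copy
    t2 = time.perf_counter()
    pipeline.release_outputs()
    pipeline.run()
    t3 = time.perf_counter()
    timings.append({"outputs": t1 - t0, "alloc_copy": t2 - t1, "run": t3 - t2})
    return copied
```

Because the constructor fills the queue up front, the first compute_step call returns immediately available data, and each later call consumes a batch that was scheduled by an earlier one.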

Usage

This op is used via the lower-level dali_tf function call in the DALI TF plugin, primarily for TensorFlow 1.x Session-based workflows. For TensorFlow 2.x, the DALIDataset op is the recommended interface, but this legacy op remains available for backward compatibility and for use cases requiring sparse tensor output.

Code Reference

Source Location

Signature

namespace dali_tf_impl {

class DaliOp : public tf::OpKernel {
 public:
  explicit DaliOp(tf::OpKernelConstruction* context);
  ~DaliOp() override;
  void Compute(tf::OpKernelContext* context) override;
  void ComputeImpl(tf::OpKernelContext* context);
};

REGISTER_OP("Dali")
    .Attr("serialized_pipeline: string")
    .Attr("shapes: list(shape) >= 1")
    .Attr("num_threads: int = -1")
    .Attr("device_id: int = -1")
    .Attr("exec_separated: bool = false")
    .Attr("exec_dynamic: bool = false")
    .Attr("gpu_prefetch_queue_depth: int = 2")
    .Attr("cpu_prefetch_queue_depth: int = 2")
    .Attr("sparse: list(bool) = []")
    .Attr("batch_size: int = -1")
    .Attr("enable_memory_stats: bool = false")
    .Output("data: dtypes")
    .Attr("dtypes: list({half, float, uint8, int16, int32, int64}) >= 1");

REGISTER_KERNEL_BUILDER(Name("Dali").Device(tf::DEVICE_GPU), DaliOp);
REGISTER_KERNEL_BUILDER(Name("Dali").Device(tf::DEVICE_CPU), DaliOp);

}  // namespace dali_tf_impl

Import

#include "dali/dali.h"
#include "dali/dali_cpp_wrappers.h"
#include "dali/core/common.h"
#include "dali/core/small_vector.h"
#include "dali_tf_plugin/dali_helper.h"

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| serialized_pipeline | string | Yes | Serialized DALI pipeline definition string |
| shapes | list(shape) | Yes | Expected output tensor shapes (one per output) |
| dtypes | list(type) | Yes | Expected output tensor data types (half, float, uint8, int16, int32, int64) |
| num_threads | int | No | Number of CPU threads for DALI (default -1, auto-detect) |
| device_id | int | No | GPU device ID (default -1 for CPU) |
| exec_separated | bool | No | Use separated executor (default false) |
| exec_dynamic | bool | No | Use dynamic executor (default false) |
| gpu_prefetch_queue_depth | int | No | GPU prefetch queue depth (default 2) |
| cpu_prefetch_queue_depth | int | No | CPU prefetch queue depth (default 2) |
| sparse | list(bool) | No | Whether each output should be sparse (default empty = all dense) |
| batch_size | int | No | Override batch size (default -1 = infer from shapes) |
| enable_memory_stats | bool | No | Enable DALI memory statistics (default false) |
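The batch_size fallback can be pictured with a short sketch. It assumes that "infer from shapes" means taking the leading dimension of the first entry in shapes; the source does not spell this out, and resolve_batch_size is a hypothetical helper, not part of the plugin.

```python
def resolve_batch_size(batch_size, shapes):
    """Return the explicit batch size, or fall back to the leading dim of shapes[0]."""
    if batch_size >= 0:
        return batch_size
    # Assumption: the batch dimension is the first dimension of the first shape
    if not shapes or not shapes[0]:
        raise ValueError("cannot infer batch size: shapes[0] has no leading dimension")
    return shapes[0][0]
```

With shapes=[(32, 224, 224, 3), (32, 1)] and the default batch_size=-1, the sketch resolves the batch size to 32; an explicit non-negative value is returned unchanged.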

Outputs

| Name | Type | Description |
|---|---|---|
| data | list(Tensor) | Output tensors from the DALI pipeline. Dense outputs match shapes[i]. Each sparse output produces 3 tensors: indices (N x ndim, int64), values (N elements of the output dtype), and dense_shape (ndim elements, int64). |
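Since a sparse output occupies three consecutive slots in the flat data list, a caller has to regroup the list before use. The helper below is a hypothetical sketch of that bookkeeping. It assumes the order (indices, values, dense_shape) given above and treats any output without a corresponding sparse flag as dense, in line with the "empty = all dense" default.

```python
def group_outputs(flat_outputs, sparse=()):
    """Regroup the op's flat output list into one entry per pipeline output.

    Dense outputs are returned as-is; sparse outputs become an
    (indices, values, dense_shape) tuple.
    """
    grouped, pos, i = [], 0, 0
    while pos < len(flat_outputs):
        # Assumption: outputs beyond the end of `sparse` are dense
        is_sparse = i < len(sparse) and sparse[i]
        width = 3 if is_sparse else 1
        chunk = flat_outputs[pos:pos + width]
        if len(chunk) != width:
            raise ValueError("flat output count does not match the sparse flags")
        grouped.append(tuple(chunk) if is_sparse else chunk[0])
        pos += width
        i += 1
    return grouped
```

In a TensorFlow program, each resulting triple could then be passed to tf.sparse.SparseTensor(indices, values, dense_shape).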

Usage Examples

Legacy Session-based Usage

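import tensorflow as tf
from nvidia.dali.plugin.tf import DALIRawIterator

# Session-based execution needs graph mode when running under TensorFlow 2.x
tf.compat.v1.disable_eager_execution()

# `pipe` is assumed to be an already-defined DALI pipeline with two outputs
# (images, labels); the values below are illustrative placeholders
batch_size = 32
num_steps = 100

# Load the DALI TF plugin (adjust the path to where the library is installed)
dali_tf_module = tf.load_op_library("libdali_tf_current.so")

# Create the legacy op
with tf.device("/gpu:0"):
    data = dali_tf_module.dali(
        serialized_pipeline=pipe.serialize(),
        shapes=[(batch_size, 224, 224, 3), (batch_size, 1)],
        dtypes=[tf.float32, tf.int32],
        device_id=0,
        batch_size=batch_size,
        num_threads=4,
    )

# Each sess.run call executes the op once and returns one batch
with tf.compat.v1.Session() as sess:
    for step in range(num_steps):
        images, labels = sess.run(data)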
