
Principle:NVIDIA DALI External Data Source

From Leeroopedia


Knowledge Sources
Domains: Image_Processing, GPU_Computing, Data_Ingestion
Last Updated: 2026-02-08 00:00 GMT

Overview

An external data source is a mechanism that allows user-controlled data -- loaded and managed outside the DALI pipeline -- to be injected into the pipeline graph at runtime as a named input node.

Description

External data sourcing addresses the fundamental problem of bridging arbitrary host-side data with a GPU-accelerated preprocessing pipeline. In many real-world workflows, data does not reside in a format or location that DALI's built-in readers (file readers, TFRecord readers, etc.) can consume directly. Examples include:

  • Images loaded via custom I/O libraries or received over a network socket.
  • Data produced by another framework (PyTorch Dataset, custom C++ loader).
  • Synthetically generated arrays or data already resident in CPU memory from a prior processing stage.

The external source pattern solves this by defining a named entry point in the pipeline graph. At each iteration, the caller supplies a batch of data (as numpy arrays or DALI tensors) to this named entry point. The pipeline then processes the data through the remainder of the graph just as if it had been read by an internal reader.

Key design considerations include:

  • Zero-copy transfer (no_copy=True) avoids duplicating data when the caller can guarantee the source buffer will not be modified until the pipeline has consumed it.
  • Blocking behavior (blocking=True) ensures the external source waits for data before proceeding, which is essential in dynamic execution mode.
  • Data type specification (dtype) enables the pipeline to validate and correctly interpret the incoming byte stream.

Usage

Use an external data source when data originates outside DALI's built-in reader operators -- for example, when integrating with a PyTorch Dataset, reading from a custom file format, or feeding data from an in-memory buffer. Prefer built-in readers (fn.readers.file, fn.readers.tfrecord) when data resides in standard on-disk formats for better performance.

Theoretical Basis

External data sourcing implements the data injection pattern from dataflow programming. In a dataflow graph, every node receives its inputs from upstream edges. An external source acts as a source node with zero upstream edges, whose values are bound imperatively by the host program at each execution step. This is analogous to feed_dict in TensorFlow 1.x or placeholder nodes in other graph-based frameworks. The correctness constraint is that the external source must provide exactly one batch of correctly shaped and typed data per pipeline iteration; violating this contract causes the pipeline to stall (blocking mode) or raise an error.
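The injection pattern is independent of DALI and can be illustrated with a toy dataflow graph in plain Python (no DALI involved; all class and method names here are invented for the sketch):

```python
class ExternalSource:
    """A source node with zero upstream edges; the host binds its
    value imperatively before each execution step, like a TF1
    feed_dict entry or a placeholder node."""
    def __init__(self, name):
        self.name = name
        self._batch = None

    def bind(self, batch):
        # Host-side injection: supply exactly one batch per iteration.
        self._batch = batch

    def evaluate(self):
        if self._batch is None:
            # Contract violation: no data bound for this iteration.
            raise RuntimeError(f"no data bound to '{self.name}'")
        batch, self._batch = self._batch, None  # consume the batch
        return batch

class MapNode:
    """An interior node that receives input from its upstream edge."""
    def __init__(self, fn, upstream):
        self.fn, self.upstream = fn, upstream

    def evaluate(self):
        return [self.fn(x) for x in self.upstream.evaluate()]

src = ExternalSource("images")
graph = MapNode(lambda x: x * 2, src)
src.bind([1, 2, 3])
print(graph.evaluate())  # → [2, 4, 6]
```

Evaluating the graph a second time without calling `bind` again raises an error, which mirrors the contract violation described above (in DALI, a stall in blocking mode or a raised error).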

Related Pages

Implemented By
