Implementation:NVIDIA DALI C API Legacy

Knowledge Sources	NVIDIA_DALI
Domains	Data_Pipeline, C_API
Last Updated	2026-02-08 16:00 GMT

Overview

The legacy DALI C API (v1) provides a C-callable interface for creating, configuring, running, and managing NVIDIA DALI data processing pipelines from non-Python environments.

⚠️ DEPRECATION WARNING: Several functions in this API are deprecated. See Heuristic:NVIDIA_DALI_Warning_Deprecated_C_API_V1_Functions for migration guidance. For new integrations, prefer C API v2 (dali/dali.h).

Description

This file implements the original (v1) C API for NVIDIA DALI (Data Loading Library). Its C functions wrap the C++ dali::Pipeline class and cover pipeline lifecycle management (creation, serialization, deserialization, deletion), external input feeding (from either contiguous or per-tensor allocations), output retrieval and copying, operator metadata and trace inspection, checkpointing and restoration, memory management, and plugin loading.

The implementation centers around the DALIPipeline struct, which aggregates a dali::Pipeline instance along with auxiliary objects such as batch size maps, data ID maps, a workspace, and a CUDA copy stream. The C API functions operate through opaque daliPipelineHandle pointers that reference these aggregated objects. Template helper functions are used internally to dispatch operations to the appropriate CPU or GPU backend.
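The opaque-handle pattern described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: `ExamplePipeline`, `exampleHandle`, `exampleCreate`, `exampleMaxBatchSize`, and `exampleDelete` are hypothetical names, and the two integer fields stand in for the real aggregate (pipeline, maps, workspace, copy stream). Unlike the DALI functions listed under Signature, which return void, the sketch returns a status code to keep it testable.

```c
#include <stdlib.h>

/* Hypothetical aggregate standing in for the real DALIPipeline struct.
 * In a real library this definition would live in the implementation file. */
struct ExamplePipeline {
  int max_batch_size;
  int device_id;
};

/* In a real header only this typedef would be visible to callers. */
typedef struct ExamplePipeline *exampleHandle;

/* Allocates the aggregate and stores it through the out-parameter.
 * Returns 0 on success, -1 on failure. */
int exampleCreate(exampleHandle *out, int max_batch_size, int device_id) {
  if (out == NULL) return -1;
  struct ExamplePipeline *p = malloc(sizeof *p);
  if (p == NULL) return -1;
  p->max_batch_size = max_batch_size;
  p->device_id = device_id;
  *out = p;
  return 0;
}

/* Accessors receive a pointer to the handle and dereference it internally. */
int exampleMaxBatchSize(const exampleHandle *handle) {
  return (*handle)->max_batch_size;
}

/* Frees the aggregate and resets the caller's handle to NULL,
 * so deleting twice is harmless. */
void exampleDelete(exampleHandle *handle) {
  if (handle == NULL || *handle == NULL) return;
  free(*handle);
  *handle = NULL;
}
```

Resetting the handle in the delete function is a design choice of this sketch; it makes accidental reuse of a deleted handle easier to detect.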

The API offers three pipeline creation variants (daliCreatePipeline, daliCreatePipeline2, daliCreatePipeline3) with increasing levels of configuration control; daliCreatePipeline3 takes the executor type as a dali_exec_flags_t value. It also provides memory preallocation, checkpointing for fault-tolerant training, and deprecated compatibility shims for older function signatures.
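One way to picture the step from separate execution switches to a single flags argument is a bitmask. The sketch below is an assumption-laden illustration, not the library's definition: the `EXAMPLE_EXEC_*` constants and `example_exec_flags` are hypothetical, and the actual flag names and values are those of dali_exec_flags_t in dali/c_api.h.

```c
/* Hypothetical stand-ins for executor flags; the real names and values
 * are defined by dali_exec_flags_t in dali/c_api.h. */
enum {
  EXAMPLE_EXEC_PIPELINED = 1 << 0,
  EXAMPLE_EXEC_ASYNC     = 1 << 1,
  EXAMPLE_EXEC_SEPARATED = 1 << 2
};

/* Folds three boolean-style int arguments into one bitmask. */
unsigned example_exec_flags(int pipelined_exec, int async_exec,
                            int separated_exec) {
  unsigned flags = 0;
  if (pipelined_exec) flags |= EXAMPLE_EXEC_PIPELINED;
  if (async_exec)     flags |= EXAMPLE_EXEC_ASYNC;
  if (separated_exec) flags |= EXAMPLE_EXEC_SEPARATED;
  return flags;
}
```

A single flags parameter lets new executor options be added without changing the function signature again, which is a plausible reason for the third creation variant.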

Usage

Use this API when integrating DALI pipelines into C or C++ applications, or when building language bindings (e.g., for TensorFlow, PyTorch, or Triton Inference Server plugins). Typical scenarios include: deserializing a Python-defined pipeline for inference serving, feeding external data into pipeline inputs from a custom data source, copying pipeline outputs to framework-specific tensor buffers, and implementing checkpointing in distributed training systems.

Code Reference

Source Location

Repository: NVIDIA_DALI
File: dali/c_api/c_api.cc
Lines: 1-908

Signature

void daliInitialize();
void daliCreatePipeline(daliPipelineHandle *pipe_handle, const char *serialized_pipeline,
                        int length, int max_batch_size, int num_threads, int device_id,
                        int separated_execution, int prefetch_queue_depth,
                        int cpu_prefetch_queue_depth, int gpu_prefetch_queue_depth,
                        int enable_memory_stats);
void daliCreatePipeline2(daliPipelineHandle *pipe_handle, const char *serialized_pipeline,
                         int length, int max_batch_size, int num_threads, int device_id,
                         int pipelined_execution, int async_execution, int separated_execution,
                         int prefetch_queue_depth, int cpu_prefetch_queue_depth,
                         int gpu_prefetch_queue_depth, int enable_memory_stats);
void daliCreatePipeline3(daliPipelineHandle *pipe_handle, const char *serialized_pipeline,
                         int length, int max_batch_size, int num_threads, int device_id,
                         dali_exec_flags_t exec_flags, int prefetch_queue_depth,
                         int cpu_prefetch_queue_depth, int gpu_prefetch_queue_depth,
                         int enable_memory_stats);
void daliDeserializeDefault(daliPipelineHandle *pipe_handle, const char *serialized_pipeline,
                            int length);
void daliRun(daliPipelineHandle_t pipe_handle);
void daliOutput(daliPipelineHandle_t pipe_handle);
void daliShareOutput(daliPipelineHandle_t pipe_handle);
void daliOutputRelease(daliPipelineHandle_t pipe_handle);
void daliOutputCopy(daliPipelineHandle_t pipe_handle, void *dst, int output_idx,
                    device_type_t dst_type, cudaStream_t stream, unsigned int flags);
void daliDeletePipeline(daliPipelineHandle_t pipe_handle);
void daliSetExternalInput(daliPipelineHandle_t pipe_handle, const char *name, device_type_t device,
                          const void *data_ptr, dali_data_type_t data_type, const int64_t *shapes,
                          int sample_dim, const char *layout_str, unsigned int flags);

Import

#include "dali/c_api.h"

I/O Contract

Inputs

Name	Type	Required	Description
serialized_pipeline	const char *	Yes	Protobuf-serialized DALI pipeline definition
length	int	Yes	Length in bytes of the serialized pipeline
max_batch_size	int	Yes	Maximum number of samples per batch
num_threads	int	Yes	Number of CPU threads for pipeline execution
device_id	int	Yes	CUDA device ID (-1 for CPU-only)
data_ptr	const void *	Yes	Pointer to external input data (CPU or GPU memory)
shapes	const int64_t *	Yes	Flat array of sample shapes, stored sample after sample (sample_dim extents for each sample in the batch)
sample_dim	int	Yes	Number of dimensions per sample
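The shapes and sample_dim rows can be illustrated with a small self-contained sketch. It assumes the flat layout used in the external-input example on this page (the extents of sample 0, then sample 1, and so on) and computes how many bytes a contiguous batch buffer would occupy. `batch_bytes` and its parameters are hypothetical helper names, not part of the DALI API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of DALI: total bytes occupied by a
 * contiguous batch described by a flat shapes array.
 * shapes holds num_samples * sample_dim extents, sample after sample. */
size_t batch_bytes(const int64_t *shapes, int num_samples, int sample_dim,
                   size_t bytes_per_element) {
  size_t total = 0;
  for (int s = 0; s < num_samples; ++s) {
    size_t volume = 1;  /* number of elements in sample s */
    for (int d = 0; d < sample_dim; ++d) {
      volume *= (size_t)shapes[s * sample_dim + d];
    }
    total += volume * bytes_per_element;
  }
  return total;
}
```

For two HWC uint8 images of 480x640x3 and 600x800x3, the helper returns 921600 + 1440000 = 2361600 bytes.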

Outputs

Name	Type	Description
pipe_handle	daliPipelineHandle *	Opaque handle to the created pipeline
dst	void *	Destination buffer for output data copy
shape	int64_t *	0-terminated shape array (caller must free with daliFree)
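Because the returned shape array is 0-terminated, its length has to be recovered by scanning for the terminator. The following sketch shows one way to do that; `shape_ndim` and `shape_volume` are hypothetical helpers, and the array is treated as an ordinary caller-visible buffer rather than the result of any specific DALI call.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of dimensions in a 0-terminated shape array. */
size_t shape_ndim(const int64_t *shape) {
  size_t n = 0;
  while (shape[n] != 0) {
    ++n;
  }
  return n;
}

/* Hypothetical helper: number of elements described by the shape.
 * An array holding only the terminator yields 1 (a scalar). */
int64_t shape_volume(const int64_t *shape) {
  int64_t volume = 1;
  for (size_t i = 0; shape[i] != 0; ++i) {
    volume *= shape[i];
  }
  return volume;
}
```

Note that a 0 terminator cannot be told apart from a genuine zero-sized extent, so this scan stops at the first zero it meets.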

Usage Examples

Create and Run a Pipeline

#include <stdlib.h>  // malloc, free
#include "dali/c_api.h"

// serialized_data, serialized_len, batch_size, num_threads, device_id and
// prefetch_depth are assumed to be supplied by the caller.

// Initialize DALI
daliInitialize();

// Deserialize and create a pipeline from protobuf
daliPipelineHandle handle;
daliCreatePipeline(&handle, serialized_data, serialized_len,
                   batch_size, num_threads, device_id,
                   /*separated=*/0, prefetch_depth,
                   prefetch_depth, prefetch_depth,
                   /*enable_memory_stats=*/0);

// Prefetch initial batches
daliPrefetch(&handle);

// Get outputs
daliOutput(&handle);

// Query output info
unsigned num_outputs = daliGetNumOutput(&handle);
size_t num_tensors = daliNumTensors(&handle, 0);

// Copy output 0 to a user-owned host buffer
size_t total_size = daliTensorSize(&handle, 0);
void *output_buf = malloc(total_size);
if (output_buf != NULL) {
  daliOutputCopy(&handle, output_buf, 0, CPU, 0, DALI_ext_default);
}

// Cleanup
free(output_buf);
daliDeletePipeline(&handle);

Feed External Input

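#include "dali/c_api.h"

// handle refers to a pipeline created as in the previous example;
// data_ptr and cuda_stream are assumed to be supplied by the caller.

// Set batch size for the input operator
// (the shapes array below describes two samples, so this should be 2)
daliSetExternalInputBatchSize(&handle, "input_name", current_batch_size);

// Feed contiguous data: two HWC samples, 480x640x3 and 600x800x3,
// with their shapes listed one after the other
int64_t shapes[] = {480, 640, 3, 600, 800, 3};
daliSetExternalInputAsync(&handle, "input_name", CPU,
                          data_ptr, DALI_UINT8, shapes,
                          /*sample_dim=*/3, "HWC",
                          cuda_stream, DALI_ext_default);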
