Principle: NVIDIA DALI Annotation Reading
| Knowledge Sources | |
|---|---|
| Domains | Object_Detection, GPU_Computing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Annotation reading is the process of extracting images and their associated object detection labels (bounding boxes and class IDs) from structured dataset formats into a normalized in-memory representation.
Description
Annotation Reading for object detection involves parsing serialized dataset files -- such as TFRecord shards or COCO JSON annotations paired with image directories -- and producing a uniform tuple of tensors: decoded images, bounding box coordinates, class labels, and original image dimensions.
The challenge lies in the diversity of storage formats:
- TFRecord format: Bounding boxes are stored as separate per-coordinate variable-length features (xmin, ymin, xmax, ymax). These must be stacked and transposed into an [N, 4] tensor. Class labels are stored as variable-length int64 features. Image dimensions are stored as scalar int64 features.
- COCO format: A single JSON file maps image IDs to file paths and annotation records. The reader must join images with their annotations, decode bounding box coordinates (often in [x, y, w, h] format), and convert to the desired coordinate convention.
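The TFRecord stacking step described above can be sketched in NumPy (illustrative only; in a real pipeline the reader performs this internally):

```python
import numpy as np

def stack_bboxes(xmin, ymin, xmax, ymax):
    """Stack four per-coordinate variable-length features into an [N, 4] ltrb tensor."""
    # Each input is a 1-D array of length N (one entry per object in the image).
    coords = np.stack([xmin, ymin, xmax, ymax], axis=0)  # shape [4, N]
    return coords.T                                      # transpose to [N, 4]

boxes = stack_bboxes(
    np.array([0.1, 0.5]), np.array([0.2, 0.6]),
    np.array([0.3, 0.7]), np.array([0.4, 0.8]),
)
# boxes has shape (2, 4); row i holds (xmin, ymin, xmax, ymax) for object i
```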
Regardless of the source format, the output must be a consistent 5-tuple of (images, bboxes, classes, widths, heights) so that downstream pipeline stages (augmentation, encoding) operate identically on data from any source.
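One way to pin down this uniform 5-tuple is a typed record; the names below are illustrative, not part of any DALI API:

```python
from typing import NamedTuple
import numpy as np

class DetectionSample(NamedTuple):
    """Hypothetical normalized output of an annotation reader."""
    image: np.ndarray    # decoded pixels, [H, W, 3]
    bboxes: np.ndarray   # [N, 4] boxes in a fixed coordinate convention
    classes: np.ndarray  # [N] integer class IDs
    width: int           # original image width in pixels
    height: int          # original image height in pixels

sample = DetectionSample(
    image=np.zeros((4, 6, 3), dtype=np.uint8),
    bboxes=np.zeros((2, 4), dtype=np.float32),
    classes=np.array([1, 3], dtype=np.int64),
    width=6, height=4,
)
```

Downstream augmentation and encoding stages can then be written once against this shape, regardless of whether the sample came from TFRecord or COCO.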
Additional concerns include:
- Sharding: Splitting the dataset across multiple workers for data-parallel training.
- Random shuffling: Randomizing sample order each epoch so that mini-batches are decorrelated from on-disk ordering, which improves the convergence of stochastic gradient training.
- Image decoding: Converting compressed image bytes (JPEG, PNG) to raw pixel tensors, optionally on GPU via mixed-device decoding.
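The format-detection step that precedes decoding can be illustrated with a magic-byte check (a toy sketch; DALI's image decoders handle this natively, on CPU or GPU):

```python
def detect_image_format(data: bytes) -> str:
    """Identify a compressed image container from its leading magic bytes."""
    if data[:2] == b"\xff\xd8":           # JPEG streams start with the SOI marker
        return "jpeg"
    if data[:8] == b"\x89PNG\r\n\x1a\n":  # fixed 8-byte PNG signature
        return "png"
    return "unknown"
```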
Usage
Use this principle when building data pipelines that must support multiple annotation formats (TFRecord, COCO, VOC) while presenting a unified interface to the rest of the preprocessing pipeline.
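A unified front end might dispatch on format while always returning the same 5-tuple. The parser names and sample-dict keys below are hypothetical placeholders, not real reader APIs:

```python
import numpy as np

def _read_coco(sample):
    # Hypothetical parser: fields already extracted from the COCO JSON record.
    return sample["image"], sample["bboxes"], sample["classes"], sample["w"], sample["h"]

def _read_tfrecord(sample):
    # Hypothetical parser: stack per-coordinate features into an [N, 4] tensor.
    boxes = np.stack([sample[k] for k in ("xmin", "ymin", "xmax", "ymax")]).T
    return sample["image"], boxes, sample["classes"], sample["w"], sample["h"]

READERS = {"coco": _read_coco, "tfrecord": _read_tfrecord}

def read_sample(fmt, sample):
    """Return (image, bboxes, classes, width, height) regardless of source format."""
    return READERS[fmt](sample)
```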
Theoretical Basis
The annotation reading stage can be formalized as a mapping function:
R: Dataset_Format -> (Image[H,W,3], BBox[N,4], Class[N], Width, Height)
where N is the variable number of object annotations per image. The bounding box convention must be standardized. Common conventions include:
- ltrb (xyXY): (x_min, y_min, x_max, y_max) -- used by COCO reader with ltrb=True
- xywh: (x_min, y_min, width, height), where (x_min, y_min) is the box's top-left corner -- native COCO format
- tlbr: (y_min, x_min, y_max, x_max) -- used internally by some detection frameworks
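Converting from native COCO xywh to ltrb is a per-coordinate addition; a NumPy sketch:

```python
import numpy as np

def xywh_to_ltrb(boxes: np.ndarray) -> np.ndarray:
    """Convert [N, 4] boxes from (x_min, y_min, width, height) to (x_min, y_min, x_max, y_max)."""
    x, y, w, h = boxes.T                     # unpack the four coordinate columns
    return np.stack([x, y, x + w, y + h], axis=1)

xywh_to_ltrb(np.array([[10.0, 20.0, 30.0, 40.0]]))
# -> [[10., 20., 40., 60.]]
```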
When ratio=True, coordinates are normalized to [0, 1] relative to image dimensions, making them invariant to image resolution. This normalization is essential for subsequent spatial transforms that operate in normalized coordinate space.
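The normalization applied by ratio=True amounts to dividing each coordinate by the corresponding image dimension; a sketch of that arithmetic:

```python
import numpy as np

def normalize_ltrb(boxes: np.ndarray, width: int, height: int) -> np.ndarray:
    """Scale [N, 4] ltrb pixel boxes into [0, 1] coordinates relative to image size."""
    scale = np.array([width, height, width, height], dtype=np.float64)
    return boxes / scale

normalize_ltrb(np.array([[32.0, 24.0, 64.0, 48.0]]), width=128, height=96)
# -> [[0.25, 0.25, 0.5, 0.5]]
```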
For sharded reading, the dataset of S samples is partitioned into K shards such that shard k reads samples at indices { i : i mod K == k }, ensuring each worker processes a disjoint subset.
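The modulo partition above can be written directly; the shards are pairwise disjoint and their union covers the dataset exactly once:

```python
def shard_indices(num_samples: int, num_shards: int, shard_id: int) -> list:
    """Indices { i : i mod num_shards == shard_id } read by one worker's shard."""
    return list(range(shard_id, num_samples, num_shards))

shards = [shard_indices(10, 3, k) for k in range(3)]
# shards[0] == [0, 3, 6, 9]; together the shards cover indices 0..9 exactly once
```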