Principle: NVIDIA DALI Annotation Reading
| Knowledge Sources | |
|---|---|
| Domains | Object_Detection, GPU_Computing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Annotation reading is the process of extracting images and their associated object detection labels (bounding boxes and class IDs) from structured dataset formats into a normalized in-memory representation.
Description
Annotation Reading for object detection involves parsing serialized dataset files -- such as TFRecord shards or COCO JSON annotations paired with image directories -- and producing a uniform tuple of tensors: decoded images, bounding box coordinates, class labels, and original image dimensions.
The challenge lies in the diversity of storage formats:
- TFRecord format: Bounding boxes are stored as separate per-coordinate variable-length features (xmin, ymin, xmax, ymax). These must be stacked and transposed into an [N, 4] tensor. Class labels are stored as variable-length int64 features. Image dimensions are stored as scalar int64 features.
- COCO format: A single JSON file maps image IDs to file paths and annotation records. The reader must join images with their annotations, decode bounding box coordinates (often in [x, y, w, h] format), and convert to the desired coordinate convention.
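The TFRecord stacking step described above can be sketched in NumPy (illustrative only; in a real pipeline the reader performs this internally):

```python
import numpy as np

def stack_bboxes(xmin, ymin, xmax, ymax):
    """Stack four per-coordinate variable-length features into an [N, 4] ltrb tensor."""
    # Each input is a 1-D array of length N (one entry per object in the image).
    coords = np.stack([xmin, ymin, xmax, ymax], axis=0)  # shape [4, N]
    return coords.T                                      # transpose to [N, 4]

boxes = stack_bboxes(
    np.array([0.1, 0.5]), np.array([0.2, 0.6]),
    np.array([0.3, 0.7]), np.array([0.4, 0.8]),
)
# boxes has shape (2, 4); row i holds (xmin, ymin, xmax, ymax) for object i
```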
Regardless of the source format, the output must be a consistent 5-tuple of (images, bboxes, classes, widths, heights) so that downstream pipeline stages (augmentation, encoding) operate identically on data from any source.
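One way to pin down this uniform 5-tuple is a typed record; the names below are illustrative, not part of any DALI API:

```python
from typing import NamedTuple
import numpy as np

class DetectionSample(NamedTuple):
    """Hypothetical normalized output of an annotation reader."""
    image: np.ndarray    # decoded pixels, [H, W, 3]
    bboxes: np.ndarray   # [N, 4] boxes in a fixed coordinate convention
    classes: np.ndarray  # [N] integer class IDs
    width: int           # original image width in pixels
    height: int          # original image height in pixels

sample = DetectionSample(
    image=np.zeros((4, 6, 3), dtype=np.uint8),
    bboxes=np.zeros((2, 4), dtype=np.float32),
    classes=np.array([1, 3], dtype=np.int64),
    width=6, height=4,
)
```

Downstream augmentation and encoding stages can then be written once against this shape, regardless of whether the sample came from TFRecord or COCO.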
Additional concerns include:
- Sharding: Splitting the dataset across multiple workers for data-parallel training.
- Random shuffling: Randomizing sample order each epoch so that mini-batches are decorrelated from on-disk ordering, which improves the convergence of stochastic gradient training.
- Image decoding: Converting compressed image bytes (JPEG, PNG) to raw pixel tensors, optionally on GPU via mixed-device decoding.
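The format-detection step that precedes decoding can be illustrated with a magic-byte check (a toy sketch; DALI's image decoders handle this natively, on CPU or GPU):

```python
def detect_image_format(data: bytes) -> str:
    """Identify a compressed image container from its leading magic bytes."""
    if data[:2] == b"\xff\xd8":           # JPEG streams start with the SOI marker
        return "jpeg"
    if data[:8] == b"\x89PNG\r\n\x1a\n":  # fixed 8-byte PNG signature
        return "png"
    return "unknown"
```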
Usage
Use this principle when building data pipelines that must support multiple annotation formats (TFRecord, COCO, VOC) while presenting a unified interface to the rest of the preprocessing pipeline.
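A unified front end might dispatch on format while always returning the same 5-tuple. The parser names and sample-dict keys below are hypothetical placeholders, not real reader APIs:

```python
import numpy as np

def _read_coco(sample):
    # Hypothetical parser: fields already extracted from the COCO JSON record.
    return sample["image"], sample["bboxes"], sample["classes"], sample["w"], sample["h"]

def _read_tfrecord(sample):
    # Hypothetical parser: stack per-coordinate features into an [N, 4] tensor.
    boxes = np.stack([sample[k] for k in ("xmin", "ymin", "xmax", "ymax")]).T
    return sample["image"], boxes, sample["classes"], sample["w"], sample["h"]

READERS = {"coco": _read_coco, "tfrecord": _read_tfrecord}

def read_sample(fmt, sample):
    """Return (image, bboxes, classes, width, height) regardless of source format."""
    return READERS[fmt](sample)
```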
Theoretical Basis
The annotation reading stage can be formalized as a mapping function:
R: Dataset_Format -> (Image[H,W,3], BBox[N,4], Class[N], Width, Height)
where N is the variable number of object annotations per image. The bounding box convention must be standardized. Common conventions include:
- ltrb (xyXY): (x_min, y_min, x_max, y_max) -- used by COCO reader with ltrb=True
- xywh: (x_min, y_min, width, height), where (x_min, y_min) is the box's top-left corner -- native COCO format
- tlbr: (y_min, x_min, y_max, x_max) -- used internally by some detection frameworks
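Converting from native COCO xywh to ltrb is a per-coordinate addition; a NumPy sketch:

```python
import numpy as np

def xywh_to_ltrb(boxes: np.ndarray) -> np.ndarray:
    """Convert [N, 4] boxes from (x_min, y_min, width, height) to (x_min, y_min, x_max, y_max)."""
    x, y, w, h = boxes.T                     # unpack the four coordinate columns
    return np.stack([x, y, x + w, y + h], axis=1)

xywh_to_ltrb(np.array([[10.0, 20.0, 30.0, 40.0]]))
# -> [[10., 20., 40., 60.]]
```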
When ratio=True, coordinates are normalized to [0, 1] relative to image dimensions, making them invariant to image resolution. This normalization is essential for subsequent spatial transforms that operate in normalized coordinate space.
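The normalization applied by ratio=True amounts to dividing each coordinate by the corresponding image dimension; a sketch of that arithmetic:

```python
import numpy as np

def normalize_ltrb(boxes: np.ndarray, width: int, height: int) -> np.ndarray:
    """Scale [N, 4] ltrb pixel boxes into [0, 1] coordinates relative to image size."""
    scale = np.array([width, height, width, height], dtype=np.float64)
    return boxes / scale

normalize_ltrb(np.array([[32.0, 24.0, 64.0, 48.0]]), width=128, height=96)
# -> [[0.25, 0.25, 0.5, 0.5]]
```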
For sharded reading, the dataset of S samples is partitioned into K shards such that shard k reads samples at indices { i : i mod K == k }, ensuring each worker processes a disjoint subset.
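The modulo partition above can be written directly; the shards are pairwise disjoint and their union covers the dataset exactly once:

```python
def shard_indices(num_samples: int, num_shards: int, shard_id: int) -> list:
    """Indices { i : i mod num_shards == shard_id } read by one worker's shard."""
    return list(range(shard_id, num_samples, num_shards))

shards = [shard_indices(10, 3, k) for k in range(3)]
# shards[0] == [0, 3, 6, 9]; together the shards cover indices 0..9 exactly once
```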