Implementation:NVIDIA DALI Fn Box Encoder

Knowledge Sources	NVIDIA DALI
Domains	Object_Detection, GPU_Computing
Last Updated	2026-02-08 00:00 GMT

Overview

A concrete anchor box encoding stage from the DALI EfficientDet example pipeline. It uses dali.fn.box_encoder, dali.fn.coord_transform, dali.fn.reshape, and dali.fn.pad to produce per-level detection training targets.

Description

The anchor box encoding stage in the EfficientDet DALI pipeline converts variable-length ground-truth bounding boxes and class labels into the fixed-size, per-anchor, per-level targets required by the EfficientDet detection head. This is implemented as a sequence of DALI operations within the _define_pipeline and _unpack_labels methods of EfficientDetPipeline.

The encoding process consists of:

Box encoding (dali.fn.box_encoder): Matches ground-truth boxes to the pre-computed anchors by IoU, then encodes the matched boxes as regression offsets relative to their anchors. The anchors parameter is a flat list of normalized anchor coordinates in ltrb format. Setting offset=True makes the operator return offsets rather than box coordinates. It returns (enc_bboxes, enc_classes), where enc_bboxes has shape [A, 4], enc_classes has shape [A], and A is the total number of anchors. Anchors with no matched box carry class 0 (background).

Positive count: The number of positive (non-background) anchors is computed by casting enc_classes != 0 to float and summing the result. Class labels are then decremented by 1 (enc_classes -= 1), so background becomes -1 and class indices start at 0.

Coordinate conversion (dali.fn.coord_transform): The encoded boxes are transformed from ltrb to tlbr order using a 4x4 permutation matrix M that swaps the x and y coordinates.

Per-level unpacking (_unpack_labels): The flat [A, 4] and [A] tensors are split by feature pyramid level (levels 3-7) and reshaped into [feat_h, feat_w, anchors_per_loc * 4] for boxes and [feat_h, feat_w, anchors_per_loc] for classes.

Padding (dali.fn.pad): The raw (unencoded) ground-truth boxes and classes are padded to (max_instances_per_image, 4) and (max_instances_per_image,) respectively, with fill value -1, for use in auxiliary computations.
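
The IoU matching in the first step can be illustrated without DALI. The following is a minimal pure-Python sketch, assuming a simplified rule in which each anchor takes the class of its best-overlapping ground-truth box when the IoU exceeds a threshold, and is otherwise labeled background (0). The helper names iou and match_anchors, the toy boxes, and the 0.5 threshold are illustrative assumptions; dali.fn.box_encoder performs its own matching and offset encoding internally, and this sketch does not claim to reproduce either exactly.

```python
def iou(a, b):
    """Intersection over union of two boxes in ltrb format."""
    left, top = max(a[0], b[0]), max(a[1], b[1])
    right, bottom = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, right - left) * max(0.0, bottom - top)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_anchors(anchors, gt_boxes, gt_classes, threshold=0.5):
    """Return one class label per anchor; 0 marks background (assumed rule)."""
    labels = []
    for anchor in anchors:
        best_iou, best_class = 0.0, 0
        for box, cls in zip(gt_boxes, gt_classes):
            overlap = iou(anchor, box)
            if overlap > best_iou:
                best_iou, best_class = overlap, cls
        labels.append(best_class if best_iou > threshold else 0)
    return labels


# Three toy anchors and one ground-truth box, all in normalized ltrb.
anchors = [
    [0.0, 0.0, 0.5, 0.5],
    [0.5, 0.5, 1.0, 1.0],
    [0.0, 0.5, 0.5, 1.0],
]
gt_boxes = [[0.05, 0.05, 0.5, 0.5]]
gt_classes = [7]

# Only the first anchor overlaps the box enough to become positive.
labels = match_anchors(anchors, gt_boxes, gt_classes)
```

The output has one entry per anchor regardless of how many ground-truth boxes the image contains, which is what makes the encoded targets fixed-size.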

The pre-computed anchors are generated by the Anchors class with min_level=3, max_level=7, num_scales=3, aspect_ratios=[1.0, 2.0, 0.5], and anchor_scale=4.0.
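
With num_scales=3 and three aspect ratios, each feature-map location carries 3 * 3 = 9 anchors. As a rough sketch of how the flat anchor dimension A divides across levels 3-7, the snippet below assumes a 512x512 input, a feature map of side image_size / 2^level at each level, and anchors stored level by level starting at level 3. These layout assumptions and the helper name level_slices are illustrative and are not taken from the pipeline source.

```python
def level_slices(image_size, min_level=3, max_level=7, anchors_per_loc=9):
    """Yield (level, feat_h, feat_w, start, stop) for each pyramid level."""
    height, width = image_size
    start = 0
    for level in range(min_level, max_level + 1):
        # Assumed: each level halves the spatial resolution of the previous one.
        feat_h, feat_w = height // 2**level, width // 2**level
        steps = feat_h * feat_w * anchors_per_loc
        yield level, feat_h, feat_w, start, start + steps
        start += steps


slices = list(level_slices((512, 512)))
total_anchors = slices[-1][4]  # stop index of the last level equals A

for level, feat_h, feat_w, start, stop in slices:
    # Rows start:stop of the flat [A, 4] tensor would be reshaped to
    # [feat_h, feat_w, anchors_per_loc * 4] for this level.
    print(level, (feat_h, feat_w), start, stop)
```

Under these assumptions the five levels contribute 36864, 9216, 2304, 576, and 144 anchors, for a total of A = 49104.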

Usage

This encoding is performed automatically within the EfficientDetPipeline._define_pipeline method. It is not called directly by users but can be understood as the label-encoding stage that runs after all image augmentations.

Code Reference

Source Location

Repository: NVIDIA DALI
File: docs/examples/use_cases/tensorflow/efficientdet/pipeline/dali/efficientdet_pipeline.py (lines 128-171)

Signature

# Box encoding within _define_pipeline:
enc_bboxes, enc_classes = dali.fn.box_encoder(
    bboxes, classes, anchors=self._boxes, offset=True
)

# Coordinate transform (ltrb -> tlbr):
enc_bboxes = dali.fn.coord_transform(
    enc_bboxes,
    M=[0, 1, 0, 0,
       1, 0, 0, 0,
       0, 0, 0, 1,
       0, 0, 1, 0]
)

# Per-level reshape:
dali.fn.reshape(
    enc_bboxes[count:count + steps, 0:4],
    shape=[feat_h, feat_w, -1],
)

# Padding:
dali.fn.pad(bboxes, fill_value=-1, shape=(max_instances, 4))
dali.fn.pad(classes, fill_value=-1, shape=(max_instances,))
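
To see what the matrix M passed to dali.fn.coord_transform does, it helps to apply it to a single 4-vector by hand. The sketch below assumes the flat list is read as a row-major 4x4 matrix and applied as out = M * in; apply_matrix is an illustrative helper, not a DALI function. Because this particular matrix is symmetric, the row-major assumption does not affect the result.

```python
M = [0, 1, 0, 0,
     1, 0, 0, 0,
     0, 0, 0, 1,
     0, 0, 1, 0]


def apply_matrix(flat_matrix, vector):
    """Multiply a flat row-major square matrix by a vector."""
    n = len(vector)
    return [
        sum(flat_matrix[row * n + col] * vector[col] for col in range(n))
        for row in range(n)
    ]


box_ltrb = [0.1, 0.2, 0.6, 0.9]        # left, top, right, bottom
box_tlbr = apply_matrix(M, box_ltrb)   # top, left, bottom, right

# The permutation is its own inverse: applying it twice restores ltrb.
round_trip = apply_matrix(M, box_tlbr)
```

Each row of M contains a single 1, so the operation only reorders components: it swaps the first with the second and the third with the fourth.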

Import

import nvidia.dali as dali
from pipeline import anchors

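# Anchor pre-computation. Positional arguments, in order:
# min_level, max_level, num_scales, aspect_ratios, anchor_scale, image_size.
anchor_obj = anchors.Anchors(3, 7, 3, [1.0, 2.0, 0.5], 4.0, image_size)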
boxes = anchor_obj.boxes  # shape [A, 4]

I/O Contract

Inputs

Name	Type	Required	Description
bboxes	DALI TensorList	Yes	Ground-truth bounding boxes in normalized ltrb format, shape [N, 4] per sample.
classes	DALI TensorList	Yes	Ground-truth class labels, shape [N] per sample (1-indexed, where 0 is reserved for background).
anchors	list[float]	Yes	Flat list of pre-computed anchor box coordinates in normalized ltrb format, length A * 4.
offset	bool	No (default False)	When True, encodes regression offsets rather than raw coordinates. The pipeline passes offset=True.
fill_value	int/float	Yes (pad)	Value used to pad absent ground-truth entries. Typically -1.
shape	tuple	Yes (pad)	Target shape for padded output: (max_instances_per_image, 4) for boxes, (max_instances_per_image,) for classes.
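
The effect of fill_value and shape on the raw ground truth can be sketched in plain Python. pad_boxes and pad_classes are hypothetical helpers that mimic the result described in the table (a fixed number of rows, with -1 in the unused slots); they are not DALI calls, and max_instances = 4 is chosen only for illustration. Raising an error when there are too many boxes is a choice made for this sketch, not a statement about how dali.fn.pad behaves.

```python
def pad_boxes(boxes, max_instances, fill_value=-1.0):
    """Pad a list of [l, t, r, b] rows to exactly max_instances rows."""
    if len(boxes) > max_instances:
        raise ValueError("more boxes than max_instances")
    padding = [[fill_value] * 4 for _ in range(max_instances - len(boxes))]
    return [list(box) for box in boxes] + padding


def pad_classes(classes, max_instances, fill_value=-1):
    """Pad a list of class labels to exactly max_instances entries."""
    if len(classes) > max_instances:
        raise ValueError("more classes than max_instances")
    return list(classes) + [fill_value] * (max_instances - len(classes))


# One image with two annotated objects, padded to four slots.
gt_boxes = [[0.1, 0.2, 0.6, 0.9], [0.3, 0.3, 0.5, 0.7]]
gt_classes = [5, 12]

padded_boxes = pad_boxes(gt_boxes, max_instances=4)
padded_classes = pad_classes(gt_classes, max_instances=4)
```

Because every sample ends up with the same shape, the padded tensors can be batched, and rows equal to -1 can be masked out downstream.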

Outputs

Name	Type	Description
enc_bboxes_layers	list[DALI TensorList]	Per-level encoded bbox regression targets, each of shape [feat_h, feat_w, anchors_per_loc * 4].
enc_classes_layers	list[DALI TensorList]	Per-level encoded class targets, each of shape [feat_h, feat_w, anchors_per_loc].
num_positives	DALI TensorList	Scalar float32 count of positive (non-background) anchors per sample.
bboxes (padded)	DALI TensorList	Padded ground-truth boxes, shape (max_instances_per_image, 4), with -1 fill.
classes (padded)	DALI TensorList	Padded ground-truth classes, shape (max_instances_per_image,), with -1 fill.

Usage Examples

Anchor Encoding Within the Pipeline

import nvidia.dali as dali
from pipeline import anchors

# Pre-compute anchors
anchor_obj = anchors.Anchors(3, 7, 3, [1.0, 2.0, 0.5], 4.0, (512, 512))
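# normalize_and_flatten is a placeholder for the pipeline's own conversion of the
# anchors into a flat list of normalized ltrb values; it is not a DALI function.
boxes_flat = normalize_and_flatten(anchor_obj.boxes)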

# Inside pipeline definition:
enc_bboxes, enc_classes = dali.fn.box_encoder(
    bboxes, classes, anchors=boxes_flat, offset=True
)

# Count positives for loss normalization
num_positives = dali.fn.reductions.sum(
    dali.fn.cast(enc_classes != 0, dtype=dali.types.FLOAT)
)

# Adjust class indexing (background becomes -1)
enc_classes -= 1

# Convert ltrb -> tlbr
enc_bboxes = dali.fn.coord_transform(
    enc_bboxes,
    M=[0, 1, 0, 0,  1, 0, 0, 0,  0, 0, 0, 1,  0, 0, 1, 0]
)
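
The positive count and the label shift in this example are simple elementwise operations. A plain-Python equivalent on a toy enc_classes vector, with made-up values, illustrates what they compute; it stands in for the DALI operators and is not part of the pipeline.

```python
enc_classes = [0, 3, 0, 1, 0, 0]  # toy per-anchor labels; 0 = background

# Count anchors matched to an object: cast (c != 0) to float, then sum.
num_positives = sum(float(c != 0) for c in enc_classes)

# Shift labels so background becomes -1 and object classes start at 0.
shifted_classes = [c - 1 for c in enc_classes]
```

The order matters: the count has to be taken before the decrement, because after the shift class 0 denotes a real object class rather than background.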

Related Pages

Implements Principle

Principle:NVIDIA_DALI_Anchor_Box_Encoding

Requires Environment

Environment:NVIDIA_DALI_CUDA_GPU_Environment
