Implementation:NVIDIA DALI Fn Box Encoder
| Knowledge Sources | |
|---|---|
| Domains | Object_Detection, GPU_Computing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A concrete anchor-box encoding pipeline stage that uses dali.fn.box_encoder, dali.fn.coord_transform, dali.fn.reshape, and dali.fn.pad to produce per-level detection training targets. It is provided by the DALI EfficientDet pipeline.
Description
The anchor box encoding stage in the EfficientDet DALI pipeline converts variable-length ground-truth bounding boxes and class labels into the fixed-size, per-anchor, per-level targets required by the EfficientDet detection head. This is implemented as a sequence of DALI operations within the _define_pipeline and _unpack_labels methods of EfficientDetPipeline.
The encoding process consists of:
- Box encoding (dali.fn.box_encoder): Matches ground-truth boxes to the pre-computed anchors by IoU overlap, then encodes the matched boxes as regression offsets. The anchors parameter is a flat list of normalized anchor coordinates in ltrb format (length A * 4). Setting offset=True makes the operator return offsets rather than box coordinates. It returns (enc_bboxes, enc_classes), where enc_bboxes has shape [A, 4], enc_classes has shape [A], and A is the total number of anchors. Anchors without a matched box keep class 0 (background).
- Positive count: The number of positive (non-background) anchors is computed by casting enc_classes != 0 to float and summing the result. Class labels are then decremented by 1 (enc_classes -= 1), so background becomes -1 and object class indices start at 0.
- Coordinate conversion (dali.fn.coord_transform): The encoded boxes are reordered from ltrb to tlbr using a 4x4 permutation matrix M that swaps the x and y components of each corner.
- Per-level unpacking (_unpack_labels): The flat [A, 4] and [A] tensors are split by feature pyramid level (levels 3-7) and reshaped into [feat_h, feat_w, anchors_per_loc * 4] for boxes and [feat_h, feat_w, anchors_per_loc] for classes.
- Padding (dali.fn.pad): The raw (unencoded) ground-truth boxes and classes are padded to (max_instances_per_image, 4) and (max_instances_per_image,) respectively, with fill value -1, for use in auxiliary computations.
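The matching and positive-count steps above can be illustrated in plain Python. This is a simplified sketch, not DALI's implementation: the function names, the 0.5 threshold, and the two-pass order (threshold match first, then each ground-truth box claiming its best anchor) are assumptions chosen for illustration, and the offset encoding of the boxes is left out.

```python
# Illustrative sketch only: the threshold and the two-pass order are assumptions,
# not a description of DALI's internal implementation.

def iou(a, b):
    """Intersection over union of two boxes in ltrb format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_anchors(gt_boxes, gt_classes, anchors, threshold=0.5):
    """Return one class label per anchor; 0 marks background."""
    enc_classes = [0] * len(anchors)
    # Pass 1: an anchor takes the class of its best-overlapping box
    # when that overlap exceeds the threshold.
    for i, anchor in enumerate(anchors):
        overlaps = [iou(anchor, gt) for gt in gt_boxes]
        if overlaps and max(overlaps) > threshold:
            enc_classes[i] = gt_classes[overlaps.index(max(overlaps))]
    # Pass 2: every ground-truth box claims its single best anchor,
    # so boxes with only weak overlaps are still assigned.
    for gt, cls in zip(gt_boxes, gt_classes):
        best = max(range(len(anchors)), key=lambda i: iou(anchors[i], gt))
        enc_classes[best] = cls
    return enc_classes


# Four anchors tiling the unit square, two ground-truth boxes.
anchors = [(0.0, 0.0, 0.5, 0.5), (0.5, 0.0, 1.0, 0.5),
           (0.0, 0.5, 0.5, 1.0), (0.5, 0.5, 1.0, 1.0)]
gt_boxes = [(0.05, 0.05, 0.5, 0.5), (0.6, 0.6, 0.8, 0.8)]
gt_classes = [3, 1]  # 1-indexed; 0 is reserved for background

enc_classes = match_anchors(gt_boxes, gt_classes, anchors)
num_positives = float(sum(c != 0 for c in enc_classes))  # count of non-background anchors
shifted_classes = [c - 1 for c in enc_classes]           # background becomes -1
```

In this toy input the first box overlaps its anchor strongly, while the second box only reaches its anchor through the second pass, so two anchors end up positive and the rest stay background.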
The pre-computed anchors are generated by the Anchors class with min_level=3, max_level=7, num_scales=3, aspect_ratios=[1.0, 2.0, 0.5], and anchor_scale=4.0.
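To make the anchor count A concrete, the following sketch derives the per-level feature-map sizes and slice boundaries for these settings. It assumes each level's feature map is the image size divided by 2 ** level (exact division) and that anchors are stored level by level; the helper name anchor_layout is hypothetical, and the real Anchors class may round sizes differently.

```python
def anchor_layout(image_size, min_level=3, max_level=7,
                  num_scales=3, num_aspect_ratios=3):
    """Return per-level (level, feat_h, feat_w, start, end) and the total anchor count.

    Assumption of this sketch: feature-map size is image size // 2**level.
    """
    height, width = image_size
    anchors_per_loc = num_scales * num_aspect_ratios
    layout = []
    start = 0
    for level in range(min_level, max_level + 1):
        stride = 2 ** level
        feat_h, feat_w = height // stride, width // stride
        steps = feat_h * feat_w * anchors_per_loc  # anchors on this level
        layout.append((level, feat_h, feat_w, start, start + steps))
        start += steps
    return layout, start


layout, total = anchor_layout((512, 512))
for level, feat_h, feat_w, start, end in layout:
    print(f"level {level}: {feat_h}x{feat_w}, anchors [{start}:{end}]")
print("total anchors A =", total)
```

Under these assumptions a 512 x 512 input gives 9 anchors per location and 49104 anchors in total, and the (start, end) pairs are the slice boundaries used when the flat tensors are split by level.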
Usage
This encoding is performed automatically within the EfficientDetPipeline._define_pipeline method. Users do not call it directly; it is the label-encoding stage that runs after all image augmentations.
Code Reference
Source Location
- Repository: NVIDIA DALI
- File: docs/examples/use_cases/tensorflow/efficientdet/pipeline/dali/efficientdet_pipeline.py (lines 128-171)
Signature
# Box encoding within _define_pipeline:
enc_bboxes, enc_classes = dali.fn.box_encoder(
    bboxes, classes, anchors=self._boxes, offset=True
)
# Coordinate transform (ltrb -> tlbr):
enc_bboxes = dali.fn.coord_transform(
    enc_bboxes,
    M=[0, 1, 0, 0,
       1, 0, 0, 0,
       0, 0, 0, 1,
       0, 0, 1, 0]
)
# Per-level reshape:
dali.fn.reshape(
    enc_bboxes[count:count + steps, 0:4],
    shape=[feat_h, feat_w, -1],
)
# Padding:
dali.fn.pad(bboxes, fill_value=-1, shape=(max_instances, 4))
dali.fn.pad(classes, fill_value=-1, shape=(max_instances,))
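The effect of the matrix M in the coordinate transform can be checked by hand. The sketch below reads the flat 16-element list as a row-major 4x4 matrix and multiplies it with a single ltrb row; the helper apply_matrix is hypothetical and only mimics the matrix-vector product, not the DALI operator itself.

```python
# The permutation matrix from the pipeline, written as rows.
M = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]


def apply_matrix(matrix, vec):
    """Plain matrix-vector product: out[i] = sum_j matrix[i][j] * vec[j]."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


ltrb = [0.1, 0.2, 0.3, 0.4]      # left, top, right, bottom
tlbr = apply_matrix(M, ltrb)     # top, left, bottom, right
print(tlbr)
```

Because M is a permutation that swaps two pairs of positions, applying it twice returns the original ordering.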
Import
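import nvidia.dali as dali
from pipeline import anchors
# Anchor pre-computation. Positional arguments are min_level, max_level,
# num_scales, aspect_ratios, anchor_scale, image_size.
# image_size comes from the pipeline configuration, e.g. (512, 512).
anchor_obj = anchors.Anchors(3, 7, 3, [1.0, 2.0, 0.5], 4.0, image_size)
boxes = anchor_obj.boxes  # shape [A, 4]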
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| bboxes | DALI TensorList | Yes | Ground-truth bounding boxes in normalized ltrb format, shape [N, 4] per sample. |
| classes | DALI TensorList | Yes | Ground-truth class labels, shape [N] per sample (1-indexed, where 0 is reserved for background). |
| anchors | list[float] | Yes | Flat list of pre-computed anchor box coordinates in normalized ltrb format, length A * 4. |
| offset | bool | No (defaults to False; this pipeline passes True) | When True, encodes regression offsets rather than raw coordinates. |
| fill_value | int/float | Yes (pad) | Value used to pad absent ground-truth entries. Typically -1. |
| shape | tuple | Yes (pad) | Target shape for padded output: (max_instances_per_image, 4) for boxes, (max_instances_per_image,) for classes. |
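The fill_value and shape inputs describe a fixed-size padding of the raw ground truth. A plain-Python sketch of the resulting layout follows; the helper names are hypothetical, and the sketch only covers samples with at most max_instances boxes (it raises otherwise rather than guessing what the operator does in that case).

```python
def pad_boxes(boxes, max_instances, fill_value=-1.0):
    """Pad an [N, 4] list of boxes to [max_instances, 4] with fill_value rows."""
    if len(boxes) > max_instances:
        raise ValueError("more boxes than max_instances; not covered by this sketch")
    missing = max_instances - len(boxes)
    return [list(b) for b in boxes] + [[fill_value] * 4 for _ in range(missing)]


def pad_classes(classes, max_instances, fill_value=-1):
    """Pad an [N] list of class labels to [max_instances] with fill_value."""
    if len(classes) > max_instances:
        raise ValueError("more classes than max_instances; not covered by this sketch")
    return list(classes) + [fill_value] * (max_instances - len(classes))


padded_boxes = pad_boxes([(0.1, 0.2, 0.3, 0.4)], max_instances=3)
padded_classes = pad_classes([5], max_instances=3)
print(padded_boxes)
print(padded_classes)
```

Using -1 as the fill value keeps padded entries distinguishable from real data, since normalized coordinates and 1-indexed class labels are never negative.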
Outputs
| Name | Type | Description |
|---|---|---|
| enc_bboxes_layers | list[DALI TensorList] | Per-level encoded bbox regression targets, each of shape [feat_h, feat_w, anchors_per_loc * 4]. |
| enc_classes_layers | list[DALI TensorList] | Per-level encoded class targets, each of shape [feat_h, feat_w, anchors_per_loc]. |
| num_positives | DALI TensorList | Scalar float32 count of positive (non-background) anchors per sample. |
| bboxes (padded) | DALI TensorList | Padded ground-truth boxes, shape (max_instances_per_image, 4), with -1 fill. |
| classes (padded) | DALI TensorList | Padded ground-truth classes, shape (max_instances_per_image,), with -1 fill. |
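The per-level output shapes follow from slicing the flat [A, 4] tensor and reshaping each slice in row-major order. The sketch below reproduces that for one level with nested lists; unpack_level is a hypothetical helper, and it assumes the anchors of a level are ordered by row, then column, then anchor index, which is what a row-major reshape to [feat_h, feat_w, -1] implies.

```python
def unpack_level(enc_bboxes, start, feat_h, feat_w, anchors_per_loc):
    """Slice one level out of a flat [A, 4] list and reshape it.

    Returns a [feat_h][feat_w][anchors_per_loc * 4] nested list and the
    start index of the next level.
    """
    steps = feat_h * feat_w * anchors_per_loc
    # Flatten the [steps, 4] slice to a single list of values.
    flat = [v for box in enc_bboxes[start:start + steps] for v in box]
    depth = anchors_per_loc * 4
    grid = [
        [flat[(y * feat_w + x) * depth:(y * feat_w + x + 1) * depth]
         for x in range(feat_w)]
        for y in range(feat_h)
    ]
    return grid, start + steps


# Toy level: 2x2 feature map, 2 anchors per location, so 8 boxes.
toy_boxes = [[4 * k, 4 * k + 1, 4 * k + 2, 4 * k + 3] for k in range(8)]
grid, next_start = unpack_level(toy_boxes, 0, feat_h=2, feat_w=2, anchors_per_loc=2)
print(grid[0][0])  # the two boxes anchored at location (0, 0), concatenated
```

Each grid cell holds the concatenated targets of all anchors at that location, which is why the last dimension is anchors_per_loc * 4 for boxes and anchors_per_loc for classes.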
Usage Examples
Anchor Encoding Within the Pipeline
import nvidia.dali as dali
from pipeline import anchors

# Pre-compute anchors
anchor_obj = anchors.Anchors(3, 7, 3, [1.0, 2.0, 0.5], 4.0, (512, 512))
# normalize_and_flatten is a placeholder (not defined here) for scaling the
# anchors to [0, 1] and flattening them into a list of length A * 4.
boxes_flat = normalize_and_flatten(anchor_obj.boxes)

# Inside the pipeline definition; bboxes and classes come from earlier stages.
enc_bboxes, enc_classes = dali.fn.box_encoder(
    bboxes, classes, anchors=boxes_flat, offset=True
)

# Count positives for loss normalization
num_positives = dali.fn.reductions.sum(
    dali.fn.cast(enc_classes != 0, dtype=dali.types.FLOAT)
)

# Adjust class indexing (background becomes -1)
enc_classes -= 1

# Convert ltrb -> tlbr
enc_bboxes = dali.fn.coord_transform(
    enc_bboxes,
    M=[0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0]
)
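The normalize_and_flatten call in the example is not defined on this page. One possible shape for such a helper is sketched below, assuming the anchors are given as pixel (left, top, right, bottom) rows and the image size as (height, width). Both conventions are assumptions of this sketch rather than facts about the Anchors class, and unlike the call above this version takes the image size as an explicit second argument.

```python
def normalize_and_flatten(boxes, image_size):
    """Scale pixel ltrb boxes to [0, 1] and flatten them to a list of length A * 4.

    Assumes boxes are (left, top, right, bottom) in pixels and image_size is
    (height, width); both are assumptions of this sketch.
    """
    height, width = image_size
    flat = []
    for left, top, right, bottom in boxes:
        # x coordinates are divided by width, y coordinates by height.
        flat.extend([left / width, top / height, right / width, bottom / height])
    return flat


pixel_anchors = [(0, 0, 256, 128), (256, 128, 512, 256)]
boxes_flat = normalize_and_flatten(pixel_anchors, (256, 512))
print(boxes_flat)
```

The result is a plain list of floats, matching the flat, normalized ltrb format that the anchors input expects.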