Principle: NVIDIA DALI Anchor Box Encoding
| Knowledge Sources | |
|---|---|
| Domains | Object_Detection, GPU_Computing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Anchor box encoding is the process of converting variable-length ground-truth bounding boxes and class labels into fixed-size, per-anchor regression targets and classification labels that a detection network can learn from.
Description
Anchor Box Encoding bridges the gap between raw annotations and the format expected by anchor-based object detectors such as EfficientDet, SSD, and RetinaNet. These detectors predict offsets relative to a fixed set of pre-defined anchor boxes (also called default boxes or priors) placed at regular positions across multiple feature map levels.
The encoding process involves several stages:
- Anchor generation: Multi-scale anchor boxes are pre-computed based on the feature pyramid levels (e.g., levels 3 through 7), the number of scales per level, aspect ratios, and an anchor scale factor. Each anchor is defined by its center position and dimensions, then normalized to [0, 1] relative to the input image size.
- Matching: Each ground-truth box is first matched to the anchor with which it has the highest IoU (Intersection over Union), guaranteeing every ground-truth box at least one positive anchor. Each remaining anchor is then matched to the ground-truth box with which it has the highest IoU, provided that IoU exceeds the matching threshold. Anchors left unmatched are labeled as background (negative).
- Offset encoding: For each matched anchor-ground truth pair, the regression target is computed as the offset between the ground-truth box center/size and the anchor box center/size:
- dx = (gt_cx - anchor_cx) / anchor_w
- dy = (gt_cy - anchor_cy) / anchor_h
- dw = log(gt_w / anchor_w)
- dh = log(gt_h / anchor_h)
- Per-level reshaping: The flat vector of encoded targets is split and reshaped into per-level tensors matching the spatial dimensions of each feature map level, so that the loss function can be computed per level.
- Coordinate conversion: The encoded boxes may need to be converted between coordinate orderings (e.g., from ltrb, the x-first left/top/right/bottom convention, to tlbr, the y-first top/left/bottom/right convention) to match the format expected by the detection head.
- Padding: The variable-length raw ground-truth boxes and classes are padded to a fixed size (max_instances_per_image) with a fill value of -1 for use in auxiliary loss computations.
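The padding stage above can be sketched in a few lines of NumPy. `pad_to_fixed_size` is an illustrative helper name, not a DALI operator, and the default of 100 instances is an assumption:

```python
import numpy as np

def pad_to_fixed_size(boxes, classes, max_instances_per_image=100, fill=-1):
    """Pad variable-length [N, 4] boxes and [N] classes to a fixed size with -1."""
    n = boxes.shape[0]
    padded_boxes = np.full((max_instances_per_image, 4), fill, dtype=np.float32)
    padded_classes = np.full((max_instances_per_image,), fill, dtype=np.float32)
    padded_boxes[:n] = boxes
    padded_classes[:n] = classes
    return padded_boxes, padded_classes
```

The -1 fill value lets downstream auxiliary losses mask out padded rows with a simple comparison.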
Usage
Use this principle when implementing the label encoding stage of any anchor-based object detector. It applies whenever ground-truth annotations must be converted to per-anchor targets for training.
Theoretical Basis
Given a set of A anchor boxes and N ground-truth boxes, the encoding proceeds as follows:
IoU Matching:
    for each anchor a_i (i = 1..A):
        matched_gt = argmax_j IoU(a_i, gt_j), j = 1..N
        if IoU(a_i, gt_{matched_gt}) >= threshold:
            class_target[i] = class[matched_gt]
            bbox_target[i]  = encode(gt_{matched_gt}, a_i)
        else:
            class_target[i] = background (0 or -1)
            bbox_target[i]  = 0
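The per-anchor argmax match above can be vectorized in NumPy. This is a minimal sketch with illustrative function names, assuming ltrb boxes; it omits the extra step that forces each ground-truth box onto its single best anchor:

```python
import numpy as np

def iou_matrix(anchors, gt):
    """Pairwise IoU between [A, 4] anchors and [N, 4] boxes, both in ltrb."""
    lt = np.maximum(anchors[:, None, :2], gt[None, :, :2])   # [A, N, 2]
    rb = np.minimum(anchors[:, None, 2:], gt[None, :, 2:])
    wh = np.clip(rb - lt, 0, None)                           # zero if disjoint
    inter = wh[..., 0] * wh[..., 1]
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def match_anchors(anchors, gt_boxes, gt_classes, threshold=0.5):
    """Per-anchor argmax matching; unmatched anchors get background class -1."""
    iou = iou_matrix(anchors, gt_boxes)     # [A, N]
    best_gt = iou.argmax(axis=1)            # index of best ground truth per anchor
    best_iou = iou.max(axis=1)
    positive = best_iou >= threshold
    class_target = np.where(positive, gt_classes[best_gt], -1)
    matched_boxes = gt_boxes[best_gt]       # box targets only valid where positive
    return class_target, matched_boxes, positive
```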
Offset Encoding:
    encode(gt, anchor):
        tx = (gt_cx - anchor_cx) / anchor_w
        ty = (gt_cy - anchor_cy) / anchor_h
        tw = log(gt_w / anchor_w)
        th = log(gt_h / anchor_h)
        return (tx, ty, tw, th)
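A vectorized version of `encode` for ltrb inputs, converting to center/size form first (a sketch, not the DALI implementation):

```python
import numpy as np

def encode_boxes(gt, anchors):
    """Encode [K, 4] ltrb ground-truth boxes as offsets from [K, 4] ltrb anchors."""
    gt_wh = gt[:, 2:] - gt[:, :2]            # (w, h)
    gt_c = gt[:, :2] + 0.5 * gt_wh           # (cx, cy)
    an_wh = anchors[:, 2:] - anchors[:, :2]
    an_c = anchors[:, :2] + 0.5 * an_wh
    txy = (gt_c - an_c) / an_wh              # (tx, ty)
    twh = np.log(gt_wh / an_wh)              # (tw, th)
    return np.concatenate([txy, twh], axis=1)
```

Encoding a box against itself yields all zeros, which is a useful sanity check.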
Per-Level Reshaping:
    offset = 0
    for level L from min_level to max_level:
        feat_h = feat_sizes[L]["height"]
        feat_w = feat_sizes[L]["width"]
        anchors_per_loc = num_scales * len(aspect_ratios)
        steps = feat_h * feat_w * anchors_per_loc
        bbox_targets_L  = reshape(bbox_targets[offset:offset+steps],  [feat_h, feat_w, anchors_per_loc * 4])
        class_targets_L = reshape(class_targets[offset:offset+steps], [feat_h, feat_w, anchors_per_loc])
        offset += steps
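The per-level split can be sketched directly in NumPy. `unpack_targets` is an illustrative name; 9 anchors per location corresponds to 3 scales × 3 aspect ratios:

```python
import numpy as np

def unpack_targets(bbox_targets, class_targets, feat_sizes,
                   min_level=3, max_level=7, anchors_per_loc=9):
    """Split flat [A, 4] box targets and [A] class targets into per-level tensors."""
    bbox_out, class_out, offset = {}, {}, 0
    for level in range(min_level, max_level + 1):
        h, w = feat_sizes[level]["height"], feat_sizes[level]["width"]
        steps = h * w * anchors_per_loc
        bbox_out[level] = bbox_targets[offset:offset + steps].reshape(
            h, w, anchors_per_loc * 4)
        class_out[level] = class_targets[offset:offset + steps].reshape(
            h, w, anchors_per_loc)
        offset += steps
    return bbox_out, class_out
```

The reshape is only valid if the flat targets were produced by anchors laid out level by level, row-major within each level, with all anchors at one location contiguous.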
The number of positive anchors (num_positives) is computed as the count of anchors with non-background class assignments and is used to normalize the detection loss.
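For completeness, the multi-scale anchor set assumed throughout ("A anchor boxes") can be generated roughly as follows. This is a hedged NumPy sketch modeled on EfficientDet-style anchors, not the DALI implementation; all parameter names and the base-size formula are illustrative:

```python
import itertools
import numpy as np

def generate_anchors(image_size, min_level=3, max_level=7,
                     num_scales=3, aspect_ratios=(1.0, 2.0, 0.5),
                     anchor_scale=4.0):
    """Generate [A, 4] anchors in (cy, cx, h, w) form, normalized to [0, 1]."""
    boxes_all = []
    for level in range(min_level, max_level + 1):
        stride = 2 ** level
        boxes_level = []
        for scale_octave, ratio in itertools.product(range(num_scales), aspect_ratios):
            base = anchor_scale * stride * 2 ** (scale_octave / num_scales)
            anchor_w = base * np.sqrt(ratio)
            anchor_h = base / np.sqrt(ratio)
            # anchor centers at every stride step across the image
            cx, cy = np.meshgrid(np.arange(stride / 2, image_size, stride),
                                 np.arange(stride / 2, image_size, stride))
            boxes = np.stack([cy.ravel(), cx.ravel(),
                              np.full(cx.size, anchor_h),
                              np.full(cx.size, anchor_w)], axis=1)
            boxes_level.append(boxes)
        # interleave so all anchors at the same location are contiguous
        boxes_all.append(np.stack(boxes_level, axis=1).reshape(-1, 4))
    return np.concatenate(boxes_all, axis=0) / image_size
```

With a 512×512 input and these defaults, this yields 49,104 anchors across levels 3–7, the count the per-level reshaping above must account for exactly.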