Principle: NVIDIA DALI Anchor Box Encoding
| Knowledge Sources | |
|---|---|
| Domains | Object_Detection, GPU_Computing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Anchor box encoding is the process of converting variable-length ground-truth bounding boxes and class labels into fixed-size, per-anchor regression targets and classification labels that a detection network can learn from.
Description
Anchor Box Encoding bridges the gap between raw annotations and the format expected by anchor-based object detectors such as EfficientDet, SSD, and RetinaNet. These detectors predict offsets relative to a fixed set of pre-defined anchor boxes (also called default boxes or priors) placed at regular positions across multiple feature map levels.
The encoding process involves several stages:
- Anchor generation: Multi-scale anchor boxes are pre-computed based on the feature pyramid levels (e.g., levels 3 through 7), the number of scales per level, aspect ratios, and an anchor scale factor. Each anchor is defined by its center position and dimensions, then normalized to [0, 1] relative to the input image size.
- Matching: Each ground-truth box is first matched to the anchor with which it has the highest IoU (Intersection over Union), guaranteeing every ground-truth box at least one positive anchor. Each remaining anchor is then matched to the ground-truth box with which it has the highest IoU, provided that IoU exceeds the matching threshold. Anchors left unmatched are labeled as background (negative).
- Offset encoding: For each matched anchor-ground truth pair, the regression target is computed as the offset between the ground-truth box center/size and the anchor box center/size:
- dx = (gt_cx - anchor_cx) / anchor_w
- dy = (gt_cy - anchor_cy) / anchor_h
- dw = log(gt_w / anchor_w)
- dh = log(gt_h / anchor_h)
- Per-level reshaping: The flat vector of encoded targets is split and reshaped into per-level tensors matching the spatial dimensions of each feature map level, so that the loss function can be computed per level.
- Coordinate conversion: The encoded boxes may need to be converted between coordinate orderings (e.g., from ltrb, the x-first left/top/right/bottom convention, to tlbr, the y-first top/left/bottom/right convention) to match the format expected by the detection head.
- Padding: The variable-length raw ground-truth boxes and classes are padded to a fixed size (max_instances_per_image) with a fill value of -1 for use in auxiliary loss computations.
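The padding stage above can be sketched in a few lines of NumPy. `pad_to_fixed_size` is an illustrative helper name, not a DALI operator, and the default of 100 instances is an assumption:

```python
import numpy as np

def pad_to_fixed_size(boxes, classes, max_instances_per_image=100, fill=-1):
    """Pad variable-length [N, 4] boxes and [N] classes to a fixed size with -1."""
    n = boxes.shape[0]
    padded_boxes = np.full((max_instances_per_image, 4), fill, dtype=np.float32)
    padded_classes = np.full((max_instances_per_image,), fill, dtype=np.float32)
    padded_boxes[:n] = boxes
    padded_classes[:n] = classes
    return padded_boxes, padded_classes
```

The -1 fill value lets downstream auxiliary losses mask out padded rows with a simple comparison.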
Usage
Use this principle when implementing the label encoding stage of any anchor-based object detector. It applies whenever ground-truth annotations must be converted to per-anchor targets for training.
Theoretical Basis
Given a set of A anchor boxes and N ground-truth boxes, the encoding proceeds as follows:
IoU Matching:
    for each anchor a_i (i = 1..A):
        matched_gt = argmax_j IoU(a_i, gt_j), j = 1..N
        if IoU(a_i, gt_{matched_gt}) >= threshold:
            class_target[i] = class[matched_gt]
            bbox_target[i]  = encode(gt_{matched_gt}, a_i)
        else:
            class_target[i] = background (0 or -1)
            bbox_target[i]  = 0
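The per-anchor argmax match above can be vectorized in NumPy. This is a minimal sketch with illustrative function names, assuming ltrb boxes; it omits the extra step that forces each ground-truth box onto its single best anchor:

```python
import numpy as np

def iou_matrix(anchors, gt):
    """Pairwise IoU between [A, 4] anchors and [N, 4] boxes, both in ltrb."""
    lt = np.maximum(anchors[:, None, :2], gt[None, :, :2])   # [A, N, 2]
    rb = np.minimum(anchors[:, None, 2:], gt[None, :, 2:])
    wh = np.clip(rb - lt, 0, None)                           # zero if disjoint
    inter = wh[..., 0] * wh[..., 1]
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def match_anchors(anchors, gt_boxes, gt_classes, threshold=0.5):
    """Per-anchor argmax matching; unmatched anchors get background class -1."""
    iou = iou_matrix(anchors, gt_boxes)     # [A, N]
    best_gt = iou.argmax(axis=1)            # index of best ground truth per anchor
    best_iou = iou.max(axis=1)
    positive = best_iou >= threshold
    class_target = np.where(positive, gt_classes[best_gt], -1)
    matched_boxes = gt_boxes[best_gt]       # box targets only valid where positive
    return class_target, matched_boxes, positive
```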
Offset Encoding:
    encode(gt, anchor):
        tx = (gt_cx - anchor_cx) / anchor_w
        ty = (gt_cy - anchor_cy) / anchor_h
        tw = log(gt_w / anchor_w)
        th = log(gt_h / anchor_h)
        return (tx, ty, tw, th)
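A vectorized version of `encode` for ltrb inputs, converting to center/size form first (a sketch, not the DALI implementation):

```python
import numpy as np

def encode_boxes(gt, anchors):
    """Encode [K, 4] ltrb ground-truth boxes as offsets from [K, 4] ltrb anchors."""
    gt_wh = gt[:, 2:] - gt[:, :2]            # (w, h)
    gt_c = gt[:, :2] + 0.5 * gt_wh           # (cx, cy)
    an_wh = anchors[:, 2:] - anchors[:, :2]
    an_c = anchors[:, :2] + 0.5 * an_wh
    txy = (gt_c - an_c) / an_wh              # (tx, ty)
    twh = np.log(gt_wh / an_wh)              # (tw, th)
    return np.concatenate([txy, twh], axis=1)
```

Encoding a box against itself yields all zeros, which is a useful sanity check.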
Per-Level Reshaping:
    offset = 0
    for level L from min_level to max_level:
        feat_h = feat_sizes[L]["height"]
        feat_w = feat_sizes[L]["width"]
        anchors_per_loc = num_scales * len(aspect_ratios)
        steps = feat_h * feat_w * anchors_per_loc
        bbox_targets_L  = reshape(bbox_targets[offset:offset+steps],  [feat_h, feat_w, anchors_per_loc * 4])
        class_targets_L = reshape(class_targets[offset:offset+steps], [feat_h, feat_w, anchors_per_loc])
        offset += steps
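The per-level split can be sketched directly in NumPy. `unpack_targets` is an illustrative name; 9 anchors per location corresponds to 3 scales × 3 aspect ratios:

```python
import numpy as np

def unpack_targets(bbox_targets, class_targets, feat_sizes,
                   min_level=3, max_level=7, anchors_per_loc=9):
    """Split flat [A, 4] box targets and [A] class targets into per-level tensors."""
    bbox_out, class_out, offset = {}, {}, 0
    for level in range(min_level, max_level + 1):
        h, w = feat_sizes[level]["height"], feat_sizes[level]["width"]
        steps = h * w * anchors_per_loc
        bbox_out[level] = bbox_targets[offset:offset + steps].reshape(
            h, w, anchors_per_loc * 4)
        class_out[level] = class_targets[offset:offset + steps].reshape(
            h, w, anchors_per_loc)
        offset += steps
    return bbox_out, class_out
```

The reshape is only valid if the flat targets were produced by anchors laid out level by level, row-major within each level, with all anchors at one location contiguous.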
The number of positive anchors (num_positives) is computed as the count of anchors with non-background class assignments and is used to normalize the detection loss.
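For completeness, the multi-scale anchor set assumed throughout ("A anchor boxes") can be generated roughly as follows. This is a hedged NumPy sketch modeled on EfficientDet-style anchors, not the DALI implementation; all parameter names and the base-size formula are illustrative:

```python
import itertools
import numpy as np

def generate_anchors(image_size, min_level=3, max_level=7,
                     num_scales=3, aspect_ratios=(1.0, 2.0, 0.5),
                     anchor_scale=4.0):
    """Generate [A, 4] anchors in (cy, cx, h, w) form, normalized to [0, 1]."""
    boxes_all = []
    for level in range(min_level, max_level + 1):
        stride = 2 ** level
        boxes_level = []
        for scale_octave, ratio in itertools.product(range(num_scales), aspect_ratios):
            base = anchor_scale * stride * 2 ** (scale_octave / num_scales)
            anchor_w = base * np.sqrt(ratio)
            anchor_h = base / np.sqrt(ratio)
            # anchor centers at every stride step across the image
            cx, cy = np.meshgrid(np.arange(stride / 2, image_size, stride),
                                 np.arange(stride / 2, image_size, stride))
            boxes = np.stack([cy.ravel(), cx.ravel(),
                              np.full(cx.size, anchor_h),
                              np.full(cx.size, anchor_w)], axis=1)
            boxes_level.append(boxes)
        # interleave so all anchors at the same location are contiguous
        boxes_all.append(np.stack(boxes_level, axis=1).reshape(-1, 4))
    return np.concatenate(boxes_all, axis=0) / image_size
```

With a 512×512 input and these defaults, this yields 49,104 anchors across levels 3–7, the count the per-level reshaping above must account for exactly.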