Principle:Tencent Ncnn SSD Object Detection
| Knowledge Sources | |
|---|---|
| Domains | Computer Vision, Object Detection |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
A single-shot, anchor-based object detection framework that predicts class scores and bounding box offsets relative to a predefined set of anchor boxes at multiple feature map scales, unified through a DetectionOutput layer that applies confidence thresholding and non-maximum suppression.
Description
SSD (Single Shot MultiBox Detector) is a foundational object detection paradigm that performs both object localization and classification in a single forward pass through the network, without requiring a separate region proposal stage. This single-shot design makes SSD inherently faster than two-stage detectors while maintaining competitive accuracy.
The architecture begins with a backbone network (such as MobileNetV2, MobileNetV3, SqueezeNet, or VGG) that extracts hierarchical feature maps at progressively lower spatial resolutions. Additional convolutional layers may be appended to produce feature maps at even smaller scales. Detection is performed simultaneously on multiple feature map scales, allowing the network to detect objects of different sizes: large feature maps (early in the network) detect small objects, while small feature maps (deep in the network) detect large objects.
At each spatial location on each feature map, a set of anchor boxes (also called default boxes or priors) with predefined aspect ratios and scales is placed. For each anchor, the network predicts two outputs: (1) class confidence scores for each object category (plus a background class), and (2) bounding box offsets (center x, center y, width, height) relative to the anchor box. The total number of predictions is the sum over all feature maps of (spatial_locations x anchors_per_location).
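As a concrete instance of that sum, the canonical SSD300 configuration from the original SSD paper uses six feature maps (38x38, 19x19, 10x10, 5x5, 3x3, 1x1) with 4 or 6 anchors per location, which works out to 8732 default boxes:

```python
# SSD300 configuration: (feature map side length, anchors per location)
feature_maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

# Total predictions = sum over feature maps of (spatial_locations * anchors)
total = sum(side * side * anchors for side, anchors in feature_maps)
print(total)  # 8732 default boxes
```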
The DetectionOutput layer performs the critical post-processing step. It first applies confidence thresholding to filter out low-scoring predictions, then applies non-maximum suppression (NMS) per class to eliminate duplicate detections of the same object. NMS works by iteratively selecting the highest-scoring detection and removing all other detections with Intersection-over-Union (IoU) above a threshold (typically 0.45). The surviving detections constitute the final output.
This same principle underpins not only the original SSD architecture but also many subsequent detector designs. Single-shot detectors such as most YOLO variants share the full pipeline of anchor-based prediction, multi-scale detection, and NMS-based post-processing, while even two-stage detectors such as Faster R-CNN and R-FCN rely on the same core concepts of anchor-based prediction and NMS-based post-processing.
Usage
This principle applies in real-time object detection scenarios:
- Mobile object detection: Running lightweight detection models on smartphones and embedded devices.
- Surveillance systems: Detecting people, vehicles, and objects in video streams.
- Autonomous driving: Identifying pedestrians, cars, signs, and obstacles.
- Robotics: Enabling robots to perceive and interact with objects in their environment.
- General-purpose detection: Detecting objects from standard benchmarks such as PASCAL VOC (20 classes) or COCO (80 classes).
Theoretical Basis
The anchor box generation and prediction pipeline:
// Multi-scale feature extraction
feature_maps = Backbone(image)
// e.g., feature map scales: 19x19, 10x10, 5x5, 3x3, 2x2, 1x1

all_predictions = []
for fm in feature_maps:
    H, W = fm.spatial_size
    anchors = generate_anchors(H, W, aspect_ratios, scale)
    // anchors shape: (H * W * num_anchors, 4)

    // Predict class scores and box offsets
    class_scores = ClassHead(fm)  // shape: (H * W * num_anchors, num_classes + 1)
    box_offsets = BoxHead(fm)     // shape: (H * W * num_anchors, 4)

    all_predictions.append((anchors, class_scores, box_offsets))
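The anchor-placement step can be sketched in Python. This is a minimal illustration, not ncnn's prior-box implementation: it places one anchor per aspect ratio at each cell center, with boxes in normalized (cx, cy, w, h) coordinates; real SSD priors also add an extra intermediate scale and optional clipping.

```python
import math

def generate_anchors(H, W, aspect_ratios, scale):
    """Place one anchor per aspect ratio at each cell center.

    Returns (cx, cy, w, h) boxes in normalized [0, 1] image coordinates.
    """
    anchors = []
    for i in range(H):
        for j in range(W):
            cx = (j + 0.5) / W  # cell center, normalized to image width
            cy = (i + 0.5) / H
            for ar in aspect_ratios:
                # Keep the anchor area fixed while varying aspect ratio
                w = scale * math.sqrt(ar)
                h = scale / math.sqrt(ar)
                anchors.append((cx, cy, w, h))
    return anchors

# 3x3 feature map, 3 aspect ratios -> 3 * 3 * 3 = 27 anchors
anchors = generate_anchors(3, 3, [1.0, 2.0, 0.5], scale=0.2)
print(len(anchors))  # 27
```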
Decoding predicted boxes from anchor offsets:
// Offset encoding (used during training)
tx = (gt_cx - anchor_cx) / anchor_w
ty = (gt_cy - anchor_cy) / anchor_h
tw = log(gt_w / anchor_w)
th = log(gt_h / anchor_h)
// Offset decoding (used during inference)
pred_cx = tx * anchor_w + anchor_cx
pred_cy = ty * anchor_h + anchor_cy
pred_w = exp(tw) * anchor_w
pred_h = exp(th) * anchor_h
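These two transforms are exact inverses of each other, which the following sketch demonstrates with a round trip. Note that real SSD implementations additionally divide the offsets by "variance" terms (commonly 0.1 and 0.2); that scaling is omitted here for clarity.

```python
import math

def encode(gt, anchor):
    """Encode a ground-truth box as offsets relative to an anchor.
    Boxes are (cx, cy, w, h) tuples."""
    gcx, gcy, gw, gh = gt
    acx, acy, aw, ah = anchor
    return ((gcx - acx) / aw,
            (gcy - acy) / ah,
            math.log(gw / aw),
            math.log(gh / ah))

def decode(offsets, anchor):
    """Invert encode(): recover an absolute box from predicted offsets."""
    tx, ty, tw, th = offsets
    acx, acy, aw, ah = anchor
    return (tx * aw + acx,
            ty * ah + acy,
            math.exp(tw) * aw,
            math.exp(th) * ah)

anchor = (0.5, 0.5, 0.2, 0.2)
gt = (0.55, 0.48, 0.3, 0.25)
decoded = decode(encode(gt, anchor), anchor)
# decoding the encoded offsets recovers the ground-truth box
```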
The Non-Maximum Suppression (NMS) algorithm:
function NMS(detections, iou_threshold):
    // Sort detections by confidence score (descending)
    detections = sort_by_score(detections, descending=True)
    keep = []
    while detections is not empty:
        best = detections.pop_first()  // highest-scoring detection
        keep.append(best)
        // Remove all detections with high overlap with `best`
        remaining = []
        for det in detections:
            if IoU(best.box, det.box) < iou_threshold:
                remaining.append(det)  // low overlap: keep for later rounds
            // else: suppress (discard)
        detections = remaining
    return keep
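The greedy loop above can be sketched in Python. The `iou` helper here operates on (x1, y1, x2, y2) corner boxes and is part of this sketch, not ncnn's API:

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) corner boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(detections, iou_threshold=0.45):
    """Greedy NMS over (box, score) pairs; returns the survivors."""
    detections = sorted(detections, key=lambda d: d[1], reverse=True)
    keep = []
    while detections:
        best = detections.pop(0)       # highest-scoring remaining detection
        keep.append(best)
        detections = [d for d in detections
                      if iou(best[0], d[0]) < iou_threshold]
    return keep

dets = [((0.0, 0.0, 1.0, 1.0), 0.9),   # best box
        ((0.1, 0.1, 1.1, 1.1), 0.8),   # heavy overlap with best: suppressed
        ((2.0, 2.0, 3.0, 3.0), 0.7)]   # disjoint: kept
print([score for _, score in nms(dets)])  # [0.9, 0.7]
```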
The Intersection over Union (IoU) metric measures the overlap between two boxes A and B:
IoU(A, B) = area(A ∩ B) / area(A ∪ B)
          = area(A ∩ B) / (area(A) + area(B) - area(A ∩ B))
The complete DetectionOutput post-processing:
function DetectionOutput(all_anchors, all_scores, all_offsets,
                         confidence_threshold, nms_threshold, top_k):
    // Concatenate predictions from all scales
    anchors = concatenate(all_anchors)
    scores = concatenate(all_scores)
    offsets = concatenate(all_offsets)

    // Decode boxes
    boxes = decode_boxes(anchors, offsets)

    // Per-class processing
    final_detections = []
    for class_id in range(1, num_classes):  // skip background class 0
        class_scores = scores[:, class_id]
        mask = class_scores > confidence_threshold
        filtered_boxes = boxes[mask]
        filtered_scores = class_scores[mask]

        // Apply NMS per class
        kept = NMS(zip(filtered_boxes, filtered_scores), nms_threshold)
        final_detections.extend(kept)

    // Keep only the top_k highest-scoring detections overall
    final_detections = top_k_by_score(final_detections, top_k)
    return final_detections
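The post-processing pipeline above, starting from already-decoded corner boxes, can be sketched in Python. The helper names, thresholds, and data layout here are illustrative assumptions of this sketch, not ncnn's DetectionOutput API:

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) corner boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def nms(dets, thresh):
    """Greedy NMS over (score, class_id, box) tuples."""
    dets = sorted(dets, key=lambda d: d[0], reverse=True)
    keep = []
    while dets:
        best = dets.pop(0)
        keep.append(best)
        dets = [d for d in dets if iou(best[2], d[2]) < thresh]
    return keep

def detection_output(boxes, scores, conf_thresh=0.5,
                     nms_thresh=0.45, top_k=100):
    """boxes: decoded (x1, y1, x2, y2) boxes;
    scores: per-box list of per-class scores, index 0 = background."""
    detections = []
    num_classes = len(scores[0])
    for class_id in range(1, num_classes):          # skip background class 0
        # Confidence thresholding, then per-class NMS
        cands = [(scores[i][class_id], class_id, boxes[i])
                 for i in range(len(boxes))
                 if scores[i][class_id] > conf_thresh]
        detections.extend(nms(cands, nms_thresh))
    # Keep only the top_k highest-scoring detections overall
    detections.sort(key=lambda d: d[0], reverse=True)
    return detections[:top_k]

boxes = [(0.0, 0.0, 1.0, 1.0), (0.05, 0.05, 1.05, 1.05), (2.0, 2.0, 3.0, 3.0)]
scores = [[0.1, 0.9, 0.0],    # per-box scores: [background, class 1, class 2]
          [0.2, 0.8, 0.0],    # duplicate of box 0 for class 1: suppressed
          [0.05, 0.0, 0.7]]   # disjoint box for class 2: kept
result = detection_output(boxes, scores)
# two detections survive: box 0 for class 1 (0.9) and box 2 for class 2 (0.7)
```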
Related Pages
- Implementation:Tencent_Ncnn_MobileNetV2_SSDLite_Example
- Implementation:Tencent_Ncnn_MobileNetV3_SSDLite_Example
- Implementation:Tencent_Ncnn_SqueezeNetSSD_Example
- Implementation:Tencent_Ncnn_Fasterrcnn_Example
- Implementation:Tencent_Ncnn_RFCN_Example
- Implementation:Tencent_Ncnn_YOLOv2_Example
- Implementation:Tencent_Ncnn_YOLOv3_Example
- Implementation:Tencent_Ncnn_YOLOv4_Example
- Implementation:Tencent_Ncnn_YOLOX_Example
- Implementation:Tencent_Ncnn_YOLO_World_Example