Principle:Tencent Ncnn SSD Object Detection
| Knowledge Sources | |
|---|---|
| Domains | Computer Vision, Object Detection |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
A single-shot, anchor-based object detection framework that predicts class scores and bounding box offsets relative to a predefined set of anchor boxes at multiple feature map scales, unified through a DetectionOutput layer that applies confidence thresholding and non-maximum suppression.
Description
SSD (Single Shot MultiBox Detector) is a foundational object detection paradigm that performs both object localization and classification in a single forward pass through the network, without requiring a separate region proposal stage. This single-shot design makes SSD inherently faster than two-stage detectors while maintaining competitive accuracy.
The architecture begins with a backbone network (such as MobileNetV2, MobileNetV3, SqueezeNet, or VGG) that extracts hierarchical feature maps at progressively lower spatial resolutions. Additional convolutional layers may be appended to produce feature maps at even smaller scales. Detection is performed simultaneously on multiple feature map scales, allowing the network to detect objects of different sizes: large feature maps (early in the network) detect small objects, while small feature maps (deep in the network) detect large objects.
At each spatial location on each feature map, a set of anchor boxes (also called default boxes or priors) with predefined aspect ratios and scales is placed. For each anchor, the network predicts two outputs: (1) class confidence scores for each object category (plus a background class), and (2) bounding box offsets (center x, center y, width, height) relative to the anchor box. The total number of predictions is the sum over all feature maps of (spatial_locations x anchors_per_location).
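As a concrete instance of that sum, the canonical SSD300 configuration from the original SSD paper uses six feature maps (38x38, 19x19, 10x10, 5x5, 3x3, 1x1) with 4 or 6 anchors per location, which works out to 8732 default boxes:

```python
# SSD300 configuration: (feature map side length, anchors per location)
feature_maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

# Total predictions = sum over feature maps of (spatial_locations * anchors)
total = sum(side * side * anchors for side, anchors in feature_maps)
print(total)  # 8732 default boxes
```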
The DetectionOutput layer performs the critical post-processing step. It first applies confidence thresholding to filter out low-scoring predictions, then applies non-maximum suppression (NMS) per class to eliminate duplicate detections of the same object. NMS works by iteratively selecting the highest-scoring detection and removing all other detections with Intersection-over-Union (IoU) above a threshold (typically 0.45). The surviving detections constitute the final output.
This same principle underpins not only the original SSD architecture but also many subsequent detector designs. Single-shot detectors such as most YOLO variants share the full pipeline of anchor-based prediction, multi-scale detection, and NMS-based post-processing, while even two-stage detectors such as Faster R-CNN and R-FCN rely on the same core concepts of anchor-based prediction and NMS-based post-processing.
Usage
This principle applies in real-time object detection scenarios:
- Mobile object detection: Running lightweight detection models on smartphones and embedded devices.
- Surveillance systems: Detecting people, vehicles, and objects in video streams.
- Autonomous driving: Identifying pedestrians, cars, signs, and obstacles.
- Robotics: Enabling robots to perceive and interact with objects in their environment.
- General-purpose detection: Detecting objects from standard benchmarks such as PASCAL VOC (20 classes) or COCO (80 classes).
Theoretical Basis
The anchor box generation and prediction pipeline:
// Multi-scale feature extraction
feature_maps = Backbone(image)
// e.g., feature map scales: 19x19, 10x10, 5x5, 3x3, 2x2, 1x1

all_predictions = []
for fm in feature_maps:
    H, W = fm.spatial_size
    anchors = generate_anchors(H, W, aspect_ratios, scale)
    // anchors shape: (H * W * num_anchors, 4)

    // Predict class scores and box offsets
    class_scores = ClassHead(fm)  // shape: (H * W * num_anchors, num_classes + 1)
    box_offsets = BoxHead(fm)     // shape: (H * W * num_anchors, 4)

    all_predictions.append((anchors, class_scores, box_offsets))
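The anchor-placement step can be sketched in Python. This is a minimal illustration, not ncnn's prior-box implementation: it places one anchor per aspect ratio at each cell center, with boxes in normalized (cx, cy, w, h) coordinates; real SSD priors also add an extra intermediate scale and optional clipping.

```python
import math

def generate_anchors(H, W, aspect_ratios, scale):
    """Place one anchor per aspect ratio at each cell center.

    Returns (cx, cy, w, h) boxes in normalized [0, 1] image coordinates.
    """
    anchors = []
    for i in range(H):
        for j in range(W):
            cx = (j + 0.5) / W  # cell center, normalized to image width
            cy = (i + 0.5) / H
            for ar in aspect_ratios:
                # Keep the anchor area fixed while varying aspect ratio
                w = scale * math.sqrt(ar)
                h = scale / math.sqrt(ar)
                anchors.append((cx, cy, w, h))
    return anchors

# 3x3 feature map, 3 aspect ratios -> 3 * 3 * 3 = 27 anchors
anchors = generate_anchors(3, 3, [1.0, 2.0, 0.5], scale=0.2)
print(len(anchors))  # 27
```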
Decoding predicted boxes from anchor offsets:
// Offset encoding (used during training)
tx = (gt_cx - anchor_cx) / anchor_w
ty = (gt_cy - anchor_cy) / anchor_h
tw = log(gt_w / anchor_w)
th = log(gt_h / anchor_h)
// Offset decoding (used during inference)
pred_cx = tx * anchor_w + anchor_cx
pred_cy = ty * anchor_h + anchor_cy
pred_w = exp(tw) * anchor_w
pred_h = exp(th) * anchor_h
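These two transforms are exact inverses of each other, which the following sketch demonstrates with a round trip. Note that real SSD implementations additionally divide the offsets by "variance" terms (commonly 0.1 and 0.2); that scaling is omitted here for clarity.

```python
import math

def encode(gt, anchor):
    """Encode a ground-truth box as offsets relative to an anchor.
    Boxes are (cx, cy, w, h) tuples."""
    gcx, gcy, gw, gh = gt
    acx, acy, aw, ah = anchor
    return ((gcx - acx) / aw,
            (gcy - acy) / ah,
            math.log(gw / aw),
            math.log(gh / ah))

def decode(offsets, anchor):
    """Invert encode(): recover an absolute box from predicted offsets."""
    tx, ty, tw, th = offsets
    acx, acy, aw, ah = anchor
    return (tx * aw + acx,
            ty * ah + acy,
            math.exp(tw) * aw,
            math.exp(th) * ah)

anchor = (0.5, 0.5, 0.2, 0.2)
gt = (0.55, 0.48, 0.3, 0.25)
decoded = decode(encode(gt, anchor), anchor)
# decoding the encoded offsets recovers the ground-truth box
```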
The Non-Maximum Suppression (NMS) algorithm:
function NMS(detections, iou_threshold):
    // Sort detections by confidence score (descending)
    detections = sort_by_score(detections, descending=True)
    keep = []
    while detections is not empty:
        best = detections.pop_first()  // highest-scoring detection
        keep.append(best)
        // Remove all detections with high overlap with `best`
        remaining = []
        for det in detections:
            if IoU(best.box, det.box) < iou_threshold:
                remaining.append(det)  // low overlap: keep for later rounds
            // else: suppress (discard)
        detections = remaining
    return keep
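The greedy loop above can be sketched in Python. The `iou` helper here operates on (x1, y1, x2, y2) corner boxes and is part of this sketch, not ncnn's API:

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) corner boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(detections, iou_threshold=0.45):
    """Greedy NMS over (box, score) pairs; returns the survivors."""
    detections = sorted(detections, key=lambda d: d[1], reverse=True)
    keep = []
    while detections:
        best = detections.pop(0)       # highest-scoring remaining detection
        keep.append(best)
        detections = [d for d in detections
                      if iou(best[0], d[0]) < iou_threshold]
    return keep

dets = [((0.0, 0.0, 1.0, 1.0), 0.9),   # best box
        ((0.1, 0.1, 1.1, 1.1), 0.8),   # heavy overlap with best: suppressed
        ((2.0, 2.0, 3.0, 3.0), 0.7)]   # disjoint: kept
print([score for _, score in nms(dets)])  # [0.9, 0.7]
```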
The Intersection over Union (IoU) metric measures the overlap between two boxes A and B:
IoU(A, B) = area(A ∩ B) / area(A ∪ B)
          = area(A ∩ B) / (area(A) + area(B) - area(A ∩ B))
The complete DetectionOutput post-processing:
function DetectionOutput(all_anchors, all_scores, all_offsets,
                         confidence_threshold, nms_threshold, top_k):
    // Concatenate predictions from all scales
    anchors = concatenate(all_anchors)
    scores = concatenate(all_scores)
    offsets = concatenate(all_offsets)

    // Decode boxes
    boxes = decode_boxes(anchors, offsets)

    // Per-class processing
    final_detections = []
    for class_id in range(1, num_classes):  // skip background class 0
        class_scores = scores[:, class_id]
        mask = class_scores > confidence_threshold
        filtered_boxes = boxes[mask]
        filtered_scores = class_scores[mask]

        // Apply NMS per class
        kept = NMS(zip(filtered_boxes, filtered_scores), nms_threshold)
        final_detections.extend(kept)

    // Keep only the top_k highest-scoring detections overall
    final_detections = top_k_by_score(final_detections, top_k)
    return final_detections
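The post-processing pipeline above, starting from already-decoded corner boxes, can be sketched in Python. The helper names, thresholds, and data layout here are illustrative assumptions of this sketch, not ncnn's DetectionOutput API:

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) corner boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def nms(dets, thresh):
    """Greedy NMS over (score, class_id, box) tuples."""
    dets = sorted(dets, key=lambda d: d[0], reverse=True)
    keep = []
    while dets:
        best = dets.pop(0)
        keep.append(best)
        dets = [d for d in dets if iou(best[2], d[2]) < thresh]
    return keep

def detection_output(boxes, scores, conf_thresh=0.5,
                     nms_thresh=0.45, top_k=100):
    """boxes: decoded (x1, y1, x2, y2) boxes;
    scores: per-box list of per-class scores, index 0 = background."""
    detections = []
    num_classes = len(scores[0])
    for class_id in range(1, num_classes):          # skip background class 0
        # Confidence thresholding, then per-class NMS
        cands = [(scores[i][class_id], class_id, boxes[i])
                 for i in range(len(boxes))
                 if scores[i][class_id] > conf_thresh]
        detections.extend(nms(cands, nms_thresh))
    # Keep only the top_k highest-scoring detections overall
    detections.sort(key=lambda d: d[0], reverse=True)
    return detections[:top_k]

boxes = [(0.0, 0.0, 1.0, 1.0), (0.05, 0.05, 1.05, 1.05), (2.0, 2.0, 3.0, 3.0)]
scores = [[0.1, 0.9, 0.0],    # per-box scores: [background, class 1, class 2]
          [0.2, 0.8, 0.0],    # duplicate of box 0 for class 1: suppressed
          [0.05, 0.0, 0.7]]   # disjoint box for class 2: kept
result = detection_output(boxes, scores)
# two detections survive: box 0 for class 1 (0.9) and box 2 for class 2 (0.7)
```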
Related Pages
- Implementation:Tencent_Ncnn_MobileNetV2_SSDLite_Example
- Implementation:Tencent_Ncnn_MobileNetV3_SSDLite_Example
- Implementation:Tencent_Ncnn_SqueezeNetSSD_Example
- Implementation:Tencent_Ncnn_Fasterrcnn_Example
- Implementation:Tencent_Ncnn_RFCN_Example
- Implementation:Tencent_Ncnn_YOLOv2_Example
- Implementation:Tencent_Ncnn_YOLOv3_Example
- Implementation:Tencent_Ncnn_YOLOv4_Example
- Implementation:Tencent_Ncnn_YOLOX_Example
- Implementation:Tencent_Ncnn_YOLO_World_Example