Principle:Tencent Ncnn Instance Segmentation
| Knowledge Sources | |
|---|---|
| Domains | Computer_Vision, Instance_Segmentation |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
Post-processing pipeline that combines object detection outputs with per-instance mask coefficients and shared prototype masks to produce a pixel-level segmentation mask for each detected object.
Description
Instance segmentation extends object detection by producing not only a bounding box and class label for each detected object but also a binary mask that delineates the precise pixel boundary of that object within the image. The post-processing pipeline typically operates in two stages:
Stage 1: Detection decoding. The network outputs a tensor of shape [N, D], where each of the N anchor/grid positions produces D values: bounding box regression parameters (typically 4 x 16 Distribution Focal Loss (DFL) bins), per-class confidence scores, and mask coefficients (typically 32 floats). Bounding box decoding reduces each side's 16-bin distribution to a single distance (a softmax over the bins followed by the expected bin index), then combines the four distances with the grid-cell position and the stride to produce absolute bounding box coordinates. Confidence thresholding and Non-Maximum Suppression (NMS) then filter the detections.
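The decoding step can be sketched in plain Python as follows. This is a minimal illustration, not the ncnn implementation: the function names (`softmax`, `dfl_expected_offset`, `decode_box`) are hypothetical, and it assumes the 64 DFL values are ordered as left, top, right, bottom with 16 raw logits per side.

```python
import math

REG_MAX = 16  # bins per box side, matching the 4 x 16 DFL layout (assumed)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dfl_expected_offset(bin_logits):
    """Collapse one side's bin logits into a distance in grid units."""
    probs = softmax(bin_logits)
    return sum(index * p for index, p in enumerate(probs))


def decode_box(dfl_logits, cx, cy, stride):
    """Decode 4 * REG_MAX logits at grid cell (cx, cy) into
    (x_min, y_min, x_max, y_max) in network-input pixels."""
    if len(dfl_logits) != 4 * REG_MAX:
        raise ValueError("expected 4 * REG_MAX DFL values")
    # Assumed side order: left, top, right, bottom.
    sides = [dfl_logits[i * REG_MAX:(i + 1) * REG_MAX] for i in range(4)]
    x0, y0, x1, y1 = (dfl_expected_offset(side) for side in sides)
    return (
        (cx + 0.5 - x0) * stride,
        (cy + 0.5 - y0) * stride,
        (cx + 0.5 + x1) * stride,
        (cy + 0.5 + y1) * stride,
    )
```

With uniform logits every side decodes to the mean bin index 7.5, so `decode_box([0.0] * 64, 10, 10, 8)` gives a box centered on the cell center and spanning 7.5 grid units in each direction.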
Stage 2: Mask generation. The network also outputs a prototype mask tensor of shape [32, H_mask, W_mask] (e.g., 32 x 160 x 160). For each surviving detection, its 32 mask coefficients are used as linear combination weights over the 32 prototype masks. The resulting single-channel mask is passed through a sigmoid, cropped to the detection's bounding box region, resized to the bounding box dimensions, and thresholded at 0.5 to produce the final per-instance binary mask.
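This two-stage approach (prototypes + coefficients) was introduced by YOLACT and adopted by YOLOv8-seg and YOLO11-seg. Because the prototypes are computed once per image and each instance adds only a 32-term weighted sum per mask pixel, it offers a good balance between mask quality and computational cost.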
Usage
Apply this principle when the application requires not just object locations but precise object boundaries -- for example, image editing (background removal), augmented reality (object occlusion), robotics (grasp planning), or medical imaging (lesion delineation).
Theoretical Basis
Mask generation from prototypes and coefficients:
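$$M_i = \sigma\left(\sum_{k=1}^{32} c_{i,k} \, P_k\right)$$
where $c_{i,1}, \dots, c_{i,32}$ are the 32 mask coefficients for detection $i$, $P_k$ is the $k$-th prototype mask, and $\sigma$ is the sigmoid function.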
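As a rough illustration of this linear combination, the sketch below computes the sigmoid of the weighted prototype sum for one detection using nested Python lists. The names `sigmoid` and `combine_prototypes` are hypothetical, and the prototypes are assumed to be indexed as `[k][y][x]`. A real implementation would use a matrix multiply over flattened prototypes rather than explicit loops.

```python
import math


def sigmoid(value):
    """Sigmoid that avoids overflow for large negative inputs."""
    if value >= 0:
        return 1.0 / (1.0 + math.exp(-value))
    e = math.exp(value)
    return e / (1.0 + e)


def combine_prototypes(coeffs, protos):
    """Return sigmoid(sum_k coeffs[k] * protos[k]) as an H x W list of lists.

    protos is indexed [k][y][x]; coeffs holds one weight per prototype.
    """
    if len(coeffs) != len(protos):
        raise ValueError("one coefficient per prototype is required")
    height, width = len(protos[0]), len(protos[0][0])
    mask = [[0.0] * width for _ in range(height)]
    for weight, proto in zip(coeffs, protos):
        for y in range(height):
            for x in range(width):
                mask[y][x] += weight * proto[y][x]
    return [[sigmoid(v) for v in row] for row in mask]
```

For example, with two 2 x 2 prototypes and coefficients `[2.0, -2.0]`, pixels covered only by the first prototype end up above 0.5, pixels covered only by the second end up below 0.5, and untouched pixels sit at exactly 0.5.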
Detection output tensor layout (YOLO11-seg):
Tensor shape: [N_boxes, 176]
Columns 0..63 : bbox DFL regression (4 sides x 16 bins)
Columns 64..143 : per-class scores (80 classes)
Columns 144..175: mask coefficients (32 values)
Prototype tensor: [32, 160, 160]
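The column layout above can be made concrete with a small slicing helper. This is a sketch under assumptions: the names `split_row` and `best_class` are hypothetical, and the class scores are assumed to already be probabilities (if the exported model emits logits, a sigmoid would be needed before thresholding).

```python
NUM_DFL = 64      # columns 0..63
NUM_CLASSES = 80  # columns 64..143
NUM_COEFFS = 32   # columns 144..175
ROW_WIDTH = NUM_DFL + NUM_CLASSES + NUM_COEFFS  # 176


def split_row(row):
    """Split one 176-value detection row into (dfl, class_scores, mask_coeffs)."""
    if len(row) != ROW_WIDTH:
        raise ValueError(f"expected {ROW_WIDTH} values, got {len(row)}")
    dfl = row[:NUM_DFL]
    scores = row[NUM_DFL:NUM_DFL + NUM_CLASSES]
    coeffs = row[NUM_DFL + NUM_CLASSES:]
    return dfl, scores, coeffs


def best_class(scores, threshold):
    """Return (class_index, score) for the top class, or None if below threshold."""
    index = max(range(len(scores)), key=scores.__getitem__)
    return (index, scores[index]) if scores[index] >= threshold else None
```

Keeping the three widths as named constants makes it easier to adapt the slicing to a model with a different class count, since only `NUM_CLASSES` would change.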
Post-processing algorithm:
1. For each grid cell i at stride s with grid coordinates (cx, cy):
   a. Decode DFL bins -> (x0, y0, x1, y1), the distances in grid units to the left, top, right, and bottom edges
   b. box = (cx + 0.5 - x0) * s, (cy + 0.5 - y0) * s,
            (cx + 0.5 + x1) * s, (cy + 0.5 + y1) * s
   c. Extract class scores, apply threshold
   d. Extract 32 mask coefficients
2. Apply NMS to filtered detections
3. For each surviving detection:
   a. mask = sigmoid(coefficients @ prototypes) -- [1, 32] @ [32, 160*160], reshaped to [160, 160]
   b. Crop mask to bounding box region (box scaled to mask resolution)
   c. Resize cropped mask to box dimensions
   d. Threshold at 0.5 for binary mask
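Steps 2 and 3b-3d can be sketched as below. This is a simplified illustration rather than the ncnn code: the function names are hypothetical, the NMS is class-agnostic with an assumed IoU threshold of 0.45, the resize uses nearest-neighbour sampling where a production pipeline would more likely interpolate bilinearly, and the box passed to `crop_resize_threshold` is assumed to be already scaled to mask coordinates.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


def crop_resize_threshold(mask, box, out_w, out_h, threshold=0.5):
    """Crop a probability mask to box (mask coordinates), resize to
    out_w x out_h with nearest-neighbour sampling, and binarize."""
    height, width = len(mask), len(mask[0])
    # Clamp the crop window to the mask and keep it at least 1 pixel wide/tall.
    x0 = min(max(int(box[0]), 0), width - 1)
    y0 = min(max(int(box[1]), 0), height - 1)
    x1 = min(max(int(box[2]), x0 + 1), width)
    y1 = min(max(int(box[3]), y0 + 1), height)
    crop_w, crop_h = x1 - x0, y1 - y0
    result = []
    for oy in range(out_h):
        sy = y0 + oy * crop_h // out_h  # source row for this output row
        row = []
        for ox in range(out_w):
            sx = x0 + ox * crop_w // out_w  # source column for this output column
            row.append(1 if mask[sy][sx] > threshold else 0)
        result.append(row)
    return result
```

Thresholding after the resize, as in the step order above, matters more with bilinear interpolation, where it yields smoother mask edges than resizing an already binarized mask.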