Principle:Tencent Ncnn Instance Segmentation

Knowledge Sources	Tencent_Ncnn
Domains	Computer_Vision, Instance_Segmentation
Last Updated	2026-02-09 19:00 GMT

Overview

A post-processing pipeline that combines decoded object detections with per-instance mask coefficients and a shared set of prototype masks to produce a pixel-level mask for each detected object.

Description

Instance segmentation extends object detection by producing not only a bounding box and class label for each detected object but also a binary mask that delineates the precise pixel boundary of that object within the image. The post-processing pipeline typically operates in two stages:

Stage 1: Detection decoding. The network outputs a tensor of shape [N, D], where each of the N anchor/grid positions yields D values: bounding box regression logits (typically 4 sides x 16 DFL bins), per-class confidence scores, and mask coefficients (typically 32 floats). Bounding box decoding converts each side's DFL distribution into an expected distance (a softmax-weighted average of the bin indices), then applies the grid-cell center and stride scaling to obtain absolute bounding box coordinates. Confidence thresholding and Non-Maximum Suppression (NMS) then filter the detections.
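The decoding step in Stage 1 can be sketched in plain Python as follows. This is a minimal illustration, not the ncnn implementation: the function names `softmax`, `decode_dfl_side`, and `decode_box` are hypothetical, and the sketch assumes the four sides are stored as consecutive groups of 16 logits in left, top, right, bottom order.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def decode_dfl_side(bin_logits):
    """Expected bin index under the softmax distribution (one box side)."""
    probs = softmax(bin_logits)
    return sum(i * p for i, p in enumerate(probs))

def decode_box(dfl_logits, cx, cy, stride, reg_max=16):
    """Turn 4 * reg_max DFL logits at grid cell (cx, cy) into an absolute box.

    Assumed layout (hypothetical): left, top, right, bottom, reg_max logits each.
    """
    assert len(dfl_logits) == 4 * reg_max
    left, top, right, bottom = (
        decode_dfl_side(dfl_logits[k * reg_max:(k + 1) * reg_max])
        for k in range(4)
    )
    # Distances are in grid units; the cell center is (cx + 0.5, cy + 0.5).
    return (
        (cx + 0.5 - left) * stride,
        (cy + 0.5 - top) * stride,
        (cx + 0.5 + right) * stride,
        (cy + 0.5 + bottom) * stride,
    )

# Example: every side peaks sharply at bin 2, so each distance is about 2 cells.
peaked = [-100.0] * 16
peaked[2] = 100.0
print(decode_box(peaked * 4, cx=10, cy=5, stride=8))
```

With a distribution concentrated on bin 2, each decoded distance is close to 2 grid cells, so the box for cell (10, 5) at stride 8 comes out close to (68, 28, 100, 60).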

Stage 2: Mask generation. The network also outputs a prototype mask tensor of shape [32, H_mask, W_mask] (e.g., 32 x 160 x 160). For each surviving detection, its 32 mask coefficients serve as linear-combination weights over the 32 prototype masks. The resulting single-channel mask is passed through a sigmoid, cropped to the detection's bounding box (scaled to the prototype resolution), resized to the bounding box dimensions in the image, and thresholded at 0.5 to produce the final per-instance binary mask.

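This prototypes-plus-coefficients approach was introduced by YOLACT and adopted by YOLOv8-seg and YOLO11-seg. Because the prototypes are computed once per image and each instance adds only a 32-term linear combination per pixel, mask generation stays cheap relative to the detection backbone while still producing instance-specific masks.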

Usage

Apply this principle when the application requires not just object locations but precise object boundaries -- for example, image editing (background removal), augmented reality (object occlusion), robotics (grasp planning), or medical imaging (lesion delineation).

Theoretical Basis

Mask generation from prototypes and coefficients: $\mathrm{mask}_{i}(x, y) = \sigma\left(\sum_{k=1}^{32} c_{i,k} \cdot P_{k}(x, y)\right)$

where $c_{i,k}$ are the 32 mask coefficients for detection $i$, $P_{k}$ is the $k$-th prototype mask, and $\sigma$ is the sigmoid function.
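To make the formula concrete, here is a small sketch that evaluates it on toy data. It assumes prototypes are given as nested lists (one H x W grid per prototype) and uses 2 prototypes instead of 32 to keep the example readable; `instance_mask` is an illustrative name, not an ncnn function.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def instance_mask(coeffs, prototypes):
    """Per-pixel sigmoid of the coefficient-weighted sum of prototypes.

    coeffs: K floats; prototypes: K grids, each H rows by W columns.
    Returns an H x W grid of probabilities in (0, 1).
    """
    assert len(coeffs) == len(prototypes)
    height, width = len(prototypes[0]), len(prototypes[0][0])
    return [
        [
            sigmoid(sum(c * proto[y][x] for c, proto in zip(coeffs, prototypes)))
            for x in range(width)
        ]
        for y in range(height)
    ]

# Toy example with K = 2 prototypes of size 2 x 2.
prototypes = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
mask = instance_mask([1.0, -1.0], prototypes)
print(mask)
```

In this toy case the weighted sums are 1, -1, 0, and 0, so the top-left pixel lands above 0.5, the top-right below it, and the bottom row sits exactly at 0.5.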

Detection output tensor layout (YOLO11-seg):

Tensor shape: [N_boxes, 176]
  Columns  0..63  : bbox DFL regression (4 sides x 16 bins)
  Columns 64..143 : per-class scores (80 classes)
  Columns 144..175: mask coefficients (32 values)

Prototype tensor: [32, 160, 160]
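As an illustration of the layout above, the sketch below splits one 176-value row into its three parts. The constant and function names are hypothetical. Whether the class scores arrive as raw logits or as probabilities depends on how the model was exported, so the sketch applies no activation and leaves that choice to the caller.

```python
REG_MAX = 16          # DFL bins per box side
NUM_CLASSES = 80
NUM_MASK_COEFFS = 32

BBOX_END = 4 * REG_MAX                    # 64
CLASS_END = BBOX_END + NUM_CLASSES        # 144
ROW_WIDTH = CLASS_END + NUM_MASK_COEFFS   # 176

def split_row(row):
    """Split one detection row into (dfl_logits, class_scores, mask_coeffs)."""
    assert len(row) == ROW_WIDTH
    return row[:BBOX_END], row[BBOX_END:CLASS_END], row[CLASS_END:]

def best_class(class_scores):
    """Return (label, score) for the highest-scoring class."""
    label = max(range(len(class_scores)), key=class_scores.__getitem__)
    return label, class_scores[label]

# Example with a dummy row whose values equal their column index.
row = [float(i) for i in range(ROW_WIDTH)]
dfl, scores, coeffs = split_row(row)
print(len(dfl), len(scores), len(coeffs))
print(best_class(scores))
```

Deriving the slice boundaries from the three counts, rather than hard-coding 64 and 144, keeps the split valid if the class count differs from 80.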

Post-processing algorithm:

1. For each grid cell i at stride s with grid coordinates (cx, cy):
   a. Decode DFL bins -> (x0, y0, x1, y1), the distances from the cell
      center to the left, top, right, and bottom edges, in grid units
   b. box = (cx + 0.5 - x0) * s, (cy + 0.5 - y0) * s,
            (cx + 0.5 + x1) * s, (cy + 0.5 + y1) * s
   c. Extract class scores, apply threshold
   d. Extract 32 mask coefficients
2. Apply NMS to filtered detections
3. For each surviving detection:
   a. mask = sigmoid(coefficients @ prototypes)  -- [32] @ [32, 160*160], reshaped to [160, 160]
   b. Crop mask to bounding box region (box scaled to mask resolution)
   c. Resize cropped mask to box dimensions
   d. Threshold at 0.5 for binary mask
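Steps 2 and 3b to 3d above can be sketched as follows. This is a simplified illustration under stated assumptions: NMS is the plain greedy variant with a hypothetical IoU threshold of 0.45, the box passed to `crop_resize_threshold` is assumed to be already scaled to mask-grid coordinates, rounded to integers, and clamped, and resizing uses nearest-neighbor sampling for brevity (real implementations commonly use bilinear interpolation).

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: keep a box unless it overlaps a higher-scoring kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

def crop_resize_threshold(mask, box, out_w, out_h, threshold=0.5):
    """Crop an H x W probability mask to box, resize, and binarize.

    box: integer (x0, y0, x1, y1) in mask pixels, non-empty and in bounds.
    Resizing is nearest-neighbor (an assumption made for brevity).
    """
    x0, y0, x1, y1 = box
    crop = [row[x0:x1] for row in mask[y0:y1]]
    crop_h, crop_w = len(crop), len(crop[0])
    return [
        [
            1 if crop[y * crop_h // out_h][x * crop_w // out_w] > threshold else 0
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]

# Example: the second box overlaps the first heavily and is suppressed.
boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
print(nms(boxes, [0.9, 0.8, 0.7]))

mask = [
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.9, 0.2, 0.1],
    [0.1, 0.8, 0.3, 0.1],
    [0.1, 0.1, 0.1, 0.1],
]
print(crop_resize_threshold(mask, (1, 1, 3, 3), out_w=4, out_h=4))
```

Thresholding after the resize, as in step 3d, matters more with bilinear interpolation, where it yields smoother mask edges than resizing an already binarized mask.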

