Implementation:Tencent Ncnn YOLO11 Seg Example
| Knowledge Sources | |
|---|---|
| Domains | Vision, Instance_Segmentation |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
Concrete tool for instance segmentation using YOLO11 with ncnn.
Description
This example implements YOLO11 instance segmentation with the ncnn inference framework, detecting objects with both bounding boxes and per-instance pixel masks. For a 640x640 input the model produces three output blobs: a detection blob (w=176, h=8400) whose rows hold the DFL bbox regression (16x4=64 values), the per-class scores (80 COCO classes), and 32 mask coefficients (64+80+32=176); a mask coefficient blob (w=32, h=8400) with 32 mask coefficients per candidate; and the prototype masks (32x160x160). Each instance mask is the linear combination of the 32 prototype masks weighted by that detection's coefficients, followed by a sigmoid activation and cropping to the bounding box region. Input images are resized with their aspect ratio preserved and letterbox padded to 640x640.
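The DFL part of a row (16x4=64 values) can be turned into a box roughly as follows. This is a minimal sketch of the usual DFL decoding, not the example's own code: the names `Box`, `dfl_expectation`, and `decode_dfl_box` are hypothetical, and the side order (left, top, right, bottom) and the half-cell anchor offset are assumptions based on the standard Ultralytics head.

```cpp
#include <algorithm>
#include <cmath>

// Assumption: 16 bins per box side, matching the 16x4 layout described above.
constexpr int kRegMax = 16;

struct Box { float x0, y0, x1, y1; };  // hypothetical plain box type

// Softmax over the 16 logits of one side, then the expected bin index sum(i * p_i).
inline float dfl_expectation(const float* logits)
{
    const float max_logit = *std::max_element(logits, logits + kRegMax);
    float sum = 0.f;
    float weighted = 0.f;
    for (int i = 0; i < kRegMax; i++)
    {
        const float e = std::exp(logits[i] - max_logit);  // shifted for numerical stability
        sum += e;
        weighted += e * i;
    }
    return weighted / sum;
}

// Decode 64 DFL logits for grid cell (gx, gy) into a box in padded-input pixels.
// Assumed side order: left, top, right, bottom.
inline Box decode_dfl_box(const float* dfl, int gx, int gy, int stride)
{
    const float cx = (gx + 0.5f) * stride;  // anchor point at the cell centre
    const float cy = (gy + 0.5f) * stride;
    const float l = dfl_expectation(dfl + 0 * kRegMax) * stride;
    const float t = dfl_expectation(dfl + 1 * kRegMax) * stride;
    const float r = dfl_expectation(dfl + 2 * kRegMax) * stride;
    const float b = dfl_expectation(dfl + 3 * kRegMax) * stride;
    return Box{cx - l, cy - t, cx + r, cy + b};
}
```

With all-zero logits every side has a uniform distribution, so each distance is 7.5 bins times the stride.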
Usage
Use this example when you need pixel-level object segmentation in addition to bounding box detection. YOLO11-Seg provides fast instance segmentation suitable for applications like autonomous driving scene understanding, robotic manipulation, or image editing on edge devices.
Code Reference
Source Location
- Repository: Tencent_Ncnn
- File: examples/yolo11_seg.cpp
- Lines: 1-644
Signature
```cpp
struct Object
{
    cv::Rect_<float> rect;
    int label;
    float prob;
    int gindex;
    cv::Mat mask;
};

static int detect_yolo11_seg(const cv::Mat& bgr, std::vector<Object>& objects);

static void generate_proposals(int stride, const ncnn::Mat& pred,
                               const ncnn::Mat& pred_mask,
                               float prob_threshold, std::vector<Object>& objects);

static void qsort_descent_inplace(std::vector<Object>& objects);

static void nms_sorted_bboxes(const std::vector<Object>& objects,
                              std::vector<int>& picked, float nms_threshold,
                              bool agnostic = false);
```
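The `nms_sorted_bboxes` signature suggests greedy non-maximum suppression over boxes already sorted by descending score. The sketch below illustrates that idea under stated assumptions: `Det` and `nms_sorted` are hypothetical stand-ins that avoid the OpenCV types, and the reading of `agnostic` (suppress across classes when true, within the same class only when false) is an assumption rather than something the signature confirms.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical detection type; the example itself stores a cv::Rect_<float>.
struct Det { float x, y, w, h; int label; float prob; };

inline float intersection_area(const Det& a, const Det& b)
{
    const float iw = std::min(a.x + a.w, b.x + b.w) - std::max(a.x, b.x);
    const float ih = std::min(a.y + a.h, b.y + b.h) - std::max(a.y, b.y);
    return (iw > 0.f && ih > 0.f) ? iw * ih : 0.f;
}

// Greedy NMS. `dets` must already be sorted by descending score.
// Returns the indices of the boxes that survive.
inline std::vector<int> nms_sorted(const std::vector<Det>& dets,
                                   float nms_threshold, bool agnostic = false)
{
    std::vector<int> picked;
    for (int i = 0; i < (int)dets.size(); i++)
    {
        const Det& a = dets[i];
        bool keep = true;
        for (int j : picked)
        {
            const Det& b = dets[j];
            // Assumed meaning of `agnostic`: when false, classes do not suppress each other.
            if (!agnostic && a.label != b.label)
                continue;
            const float inter = intersection_area(a, b);
            const float uni = a.w * a.h + b.w * b.h - inter;
            if (uni > 0.f && inter / uni > nms_threshold)
            {
                keep = false;
                break;
            }
        }
        if (keep)
            picked.push_back(i);
    }
    return picked;
}
```

Calling `nms_sorted(dets, 0.45f)` mirrors the `nms_threshold` value used by the example.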
Import
```cpp
#include "layer.h"
#include "net.h"
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| image_path | const char* | Yes | Path to input image file |
Outputs
| Name | Type | Description |
|---|---|---|
| objects | std::vector<Object> | Detected objects with bounding boxes, class labels, confidence scores, and per-instance binary masks (cv::Mat) |
Model Files
| File | Description |
|---|---|
| yolo11n_seg.ncnn.param | YOLO11-Seg nano model parameter file |
| yolo11n_seg.ncnn.bin | YOLO11-Seg nano model weight file |
Usage Examples
Running the Example
```shell
./yolo11_seg image.jpg
```
Key Code Pattern
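```cpp
ncnn::Net yolo11;
yolo11.opt.use_vulkan_compute = true;
yolo11.load_param("yolo11n_seg.ncnn.param");
yolo11.load_model("yolo11n_seg.ncnn.bin");

const int target_size = 640;
const int max_stride = 32;
const float prob_threshold = 0.25f;
const float nms_threshold = 0.45f;
const float mask_threshold = 0.5f;

// Resize so the longer side equals target_size, preserving aspect ratio
const int img_w = bgr.cols;
const int img_h = bgr.rows;
int w = img_w, h = img_h;
float scale = 1.f;
if (w > h) { scale = (float)target_size / w; w = target_size; h = (int)(h * scale); }
else       { scale = (float)target_size / h; h = target_size; w = (int)(w * scale); }
ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data,
    ncnn::Mat::PIXEL_BGR2RGB, img_w, img_h, w, h);

// Letterbox pad with the fill value 114 up to a multiple of max_stride
const int wpad = (w + max_stride - 1) / max_stride * max_stride - w;
const int hpad = (h + max_stride - 1) / max_stride * max_stride - h;
ncnn::Mat in_pad;
ncnn::copy_make_border(in, in_pad, hpad / 2, hpad - hpad / 2,
    wpad / 2, wpad - wpad / 2, ncnn::BORDER_CONSTANT, 114.f);

// Scale pixel values to [0, 1]
const float norm_vals[3] = {1 / 255.f, 1 / 255.f, 1 / 255.f};
in_pad.substract_mean_normalize(0, norm_vals);

ncnn::Extractor ex = yolo11.create_extractor();
ex.input("in0", in_pad);

// Shapes below are for a 640x640 padded input
ncnn::Mat out0; // bbox + class scores (w=176, h=8400)
ncnn::Mat out1; // mask coefficients (w=32, h=8400)
ncnn::Mat out2; // prototype masks (32x160x160)
ex.extract("out0", out0);
ex.extract("out1", out1);
ex.extract("out2", out2);
```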
Implementation Details
Preprocessing
Input images are resized while preserving aspect ratio and letterbox padded to 640x640 (a multiple of max_stride=32). Pixel values are converted from BGR to RGB and normalized by dividing by 255. The padding fill value is 114.
Output Tensor Layout
The model produces three output tensors:
- out0 (w=176, h=8400): DFL bbox regression (64 values) + 80 class scores + 32 mask coefficients per candidate
- out1 (w=32, h=8400): 32 mask coefficients per candidate box
- out2 (32x160x160): 32 prototype mask channels at 1/4 input resolution
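The row count h=8400 and the column count w=176 both follow from simple arithmetic, shown in the sketch below. The strides 8, 16, and 32 are an assumption based on the usual three-scale detection head (only max_stride=32 is stated here), and all identifiers are hypothetical.

```cpp
#include <vector>

// Column layout of one out0 row, as listed above.
constexpr int kBoxCols = 16 * 4;  // DFL bbox regression
constexpr int kClassCols = 80;    // COCO class scores
constexpr int kMaskCols = 32;     // mask coefficients
static_assert(kBoxCols + kClassCols + kMaskCols == 176, "out0 width");

// Number of candidate rows: one per grid cell at every stride.
// Assumption: strides 8, 16 and 32.
inline int count_candidates(int in_w, int in_h,
                            const std::vector<int>& strides = {8, 16, 32})
{
    int total = 0;
    for (int s : strides)
        total += (in_w / s) * (in_h / s);
    return total;
}
```

For a 640x640 input the grids are 80x80, 40x40, and 20x20, which sum to 6400 + 1600 + 400 = 8400 candidates. The prototype resolution 160x160 is likewise 640 / 4.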
Mask Generation Pipeline
- Generate detection proposals from out0 using DFL bbox decoding
- Apply NMS to filter overlapping detections
- For each surviving detection, compute mask = sigmoid(coefficients * prototype_masks)
- Crop mask to the detection's bounding box region
- Threshold at 0.5 to produce binary mask
- Overlay colored masks on the output image
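The mask steps of the pipeline above (coefficient-weighted sum of prototypes, sigmoid, crop, threshold) can be sketched on plain arrays. This is an illustration under assumptions, not the example's code: `BoxI` and `make_instance_mask` are hypothetical, the box is assumed to be already scaled to prototype resolution, and any resizing of the mask back to image resolution is left out.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical half-open box [x0, x1) x [y0, y1) in prototype-mask pixels.
struct BoxI { int x0, y0, x1, y1; };

// coeffs: one value per prototype channel (32 in the model described above).
// protos: channel-major data, channels * proto_h * proto_w floats.
// Returns a proto_h * proto_w mask with 255 inside the instance and 0 elsewhere.
inline std::vector<std::uint8_t> make_instance_mask(const std::vector<float>& coeffs,
                                                    const std::vector<float>& protos,
                                                    int proto_w, int proto_h,
                                                    BoxI box, float mask_threshold = 0.5f)
{
    const int channels = (int)coeffs.size();
    const int plane = proto_w * proto_h;
    std::vector<std::uint8_t> mask(plane, 0);

    // Only pixels inside the box are evaluated, which has the same effect as
    // computing the full mask and then cropping it to the box.
    for (int y = std::max(box.y0, 0); y < std::min(box.y1, proto_h); y++)
    {
        for (int x = std::max(box.x0, 0); x < std::min(box.x1, proto_w); x++)
        {
            float acc = 0.f;
            for (int c = 0; c < channels; c++)
                acc += coeffs[c] * protos[c * plane + y * proto_w + x];
            const float prob = 1.f / (1.f + std::exp(-acc));  // sigmoid
            mask[y * proto_w + x] = prob > mask_threshold ? 255 : 0;
        }
    }
    return mask;
}
```

In the model described here `coeffs` would be one 32-value row of out1 and `protos` the 32x160x160 out2 blob.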
Model Conversion
Models are converted from Ultralytics format using PNNX with modifications for dynamic shape inference, including reshaping output concatenation and area attention layers for variable input sizes.