Implementation:Tencent Ncnn YOLO11 Pose Example
| Knowledge Sources | |
|---|---|
| Domains | Vision, Pose_Estimation |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
An ncnn example program that runs YOLO11-Pose to detect people in an image and estimate 17 body keypoints for each of them.
Description
This example implements YOLO11 pose estimation using the ncnn inference framework, detecting persons and localizing 17 COCO body keypoints (nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles) in a single forward pass. The model produces two output blobs: a detection blob (w=65, h=8400) containing DFL bbox regression (16x4=64 values) and a person confidence score, and a keypoint blob (w=51, h=8400) containing 17 keypoints with 3 values each (x, y, confidence). Input images are preprocessed with letterbox padding to 640x640 resolution. The implementation draws skeleton connections between detected keypoints on the output visualization.
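The 8400 rows in both blobs can be checked by counting grid cells: for a 640x640 input, each of the three stride levels (8, 16, 32) contributes one candidate per cell. The following is a minimal sketch of that arithmetic; `count_candidates` is a hypothetical helper written for illustration, not a function from the example.

```cpp
#include <vector>

// Hypothetical helper: counts candidate boxes for a square input,
// assuming one candidate per grid cell at each stride level and an
// input size that is divisible by every stride.
int count_candidates(int input_size, const std::vector<int>& strides)
{
    int total = 0;
    for (int stride : strides)
    {
        const int cells_per_side = input_size / stride;
        total += cells_per_side * cells_per_side;
    }
    return total;
}
```

For `count_candidates(640, {8, 16, 32})` the cells are 80x80 + 40x40 + 20x20 = 6400 + 1600 + 400 = 8400.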
Usage
Use this example when you need to detect human body poses in images, such as for fitness tracking, gesture recognition, action recognition, or sports analysis applications running on mobile or edge devices.
Code Reference
Source Location
- Repository: Tencent_Ncnn
- File: examples/yolo11_pose.cpp
- Lines: 1-581
Signature
struct KeyPoint
{
    cv::Point2f p;
    float prob;
};
struct Object
{
    cv::Rect_<float> rect;
    int label;
    float prob;
    std::vector<KeyPoint> keypoints;
};
static int detect_yolo11_pose(const cv::Mat& bgr, std::vector<Object>& objects);
static void generate_proposals(int stride, const ncnn::Mat& pred,
                               const ncnn::Mat& pred_kps,
                               float prob_threshold, std::vector<Object>& objects);
static void qsort_descent_inplace(std::vector<Object>& objects);
static void nms_sorted_bboxes(const std::vector<Object>& objects,
                              std::vector<int>& picked, float nms_threshold,
                              bool agnostic = false);
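The sort-then-suppress pair in the signatures above follows the usual greedy non-maximum suppression pattern. The sketch below illustrates that pattern on a simplified box type; `ScoredBox`, `box_iou`, and `nms_sorted` are illustrative names, and the example's own implementation may differ in detail (for instance in how the `agnostic` flag is handled).

```cpp
#include <algorithm>
#include <vector>

// Simplified stand-in for Object: top-left corner, size, and score.
struct ScoredBox
{
    float x, y, w, h;
    float prob;
};

// Intersection over union of two axis-aligned boxes.
float box_iou(const ScoredBox& a, const ScoredBox& b)
{
    const float x0 = std::max(a.x, b.x);
    const float y0 = std::max(a.y, b.y);
    const float x1 = std::min(a.x + a.w, b.x + b.w);
    const float y1 = std::min(a.y + a.h, b.y + b.h);
    const float inter = std::max(0.f, x1 - x0) * std::max(0.f, y1 - y0);
    const float uni = a.w * a.h + b.w * b.h - inter;
    return uni > 0.f ? inter / uni : 0.f;
}

// Greedy NMS. Assumes `boxes` is already sorted by descending prob.
// Returns the indices of the boxes that survive suppression.
std::vector<int> nms_sorted(const std::vector<ScoredBox>& boxes, float nms_threshold)
{
    std::vector<int> picked;
    for (int i = 0; i < static_cast<int>(boxes.size()); i++)
    {
        bool keep = true;
        for (int j : picked)
        {
            if (box_iou(boxes[i], boxes[j]) > nms_threshold)
            {
                keep = false; // overlaps a higher-scoring box too much
                break;
            }
        }
        if (keep)
            picked.push_back(i);
    }
    return picked;
}
```

With the threshold of 0.45 used by the example, two boxes that overlap by about two thirds collapse into the higher-scoring one, while a distant box is kept.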
Import
#include "layer.h"
#include "net.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| image_path | const char* | Yes | Path to input image file |
Outputs
| Name | Type | Description |
|---|---|---|
| objects | std::vector<Object> | Detected persons with bounding boxes, confidence scores, and 17 keypoints each with (x, y, confidence) |
Model Files
| File | Description |
|---|---|
| yolo11n_pose.ncnn.param | YOLO11-Pose nano model parameter file |
| yolo11n_pose.ncnn.bin | YOLO11-Pose nano model weight file |
Usage Examples
Running the Example
./yolo11_pose image.jpg
Key Code Pattern
ncnn::Net yolo11;
yolo11.opt.use_vulkan_compute = true;
yolo11.load_param("yolo11n_pose.ncnn.param");
yolo11.load_model("yolo11n_pose.ncnn.bin");
const int target_size = 640;
const float prob_threshold = 0.25f;
const float nms_threshold = 0.45f;
// Resize to (w, h), the aspect-preserving size whose longer side is target_size
ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data,
    ncnn::Mat::PIXEL_BGR2RGB, img_w, img_h, w, h);
// Letterbox pad with fill value 114; wpad and hpad are the total padding per axis
ncnn::Mat in_pad;
ncnn::copy_make_border(in, in_pad, hpad / 2, hpad - hpad / 2,
    wpad / 2, wpad - wpad / 2, ncnn::BORDER_CONSTANT, 114.f);
// Scale pixel values to [0, 1]
const float norm_vals[3] = {1 / 255.f, 1 / 255.f, 1 / 255.f};
in_pad.substract_mean_normalize(0, norm_vals);
ncnn::Extractor ex = yolo11.create_extractor();
ex.input("in0", in_pad);
ncnn::Mat out0; // bbox + person score (w=65, h=8400)
ncnn::Mat out1; // keypoints (w=51, h=8400)
ex.extract("out0", out0);
ex.extract("out1", out1);
Implementation Details
Preprocessing
Input images are resized while preserving aspect ratio and letterbox padded to 640x640 (a multiple of max_stride=32). Pixel values are converted from BGR to RGB and normalized by dividing by 255. The padding fill value is 114.
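The letterbox geometry described above can be sketched as follows. This assumes the padding is split evenly between the two sides of each axis and that the padded input is a `target_size` x `target_size` square, as stated in this section; `LetterboxParams`, `compute_letterbox`, and `unletterbox` are hypothetical names used for illustration only.

```cpp
// Hypothetical container for the letterbox geometry.
struct LetterboxParams
{
    int w;       // resized width (before padding)
    int h;       // resized height (before padding)
    int wpad;    // total horizontal padding
    int hpad;    // total vertical padding
    float scale; // resized size / original size
};

// Resize so the longer side equals target_size, then pad to a square.
LetterboxParams compute_letterbox(int img_w, int img_h, int target_size)
{
    LetterboxParams lb;
    if (img_w > img_h)
    {
        lb.scale = static_cast<float>(target_size) / img_w;
        lb.w = target_size;
        lb.h = static_cast<int>(img_h * lb.scale);
    }
    else
    {
        lb.scale = static_cast<float>(target_size) / img_h;
        lb.h = target_size;
        lb.w = static_cast<int>(img_w * lb.scale);
    }
    lb.wpad = target_size - lb.w;
    lb.hpad = target_size - lb.h;
    return lb;
}

// Map a coordinate from the padded input back to the original image,
// assuming half of the padding sits on the left/top side.
void unletterbox(const LetterboxParams& lb, float& x, float& y)
{
    x = (x - lb.wpad / 2) / lb.scale;
    y = (y - lb.hpad / 2) / lb.scale;
}
```

For a 1280x720 image and `target_size` 640, this gives a scale of 0.5, a resized image of 640x360, and 280 rows of vertical padding (140 above and 140 below). Box and keypoint coordinates predicted in the padded space need the inverse mapping before they are drawn on the original image.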
Output Tensor Layout
The model produces two output tensors:
- out0 (w=65, h=8400): Contains DFL bbox regression (16x4=64 values) and 1 person confidence score for 8400 candidate boxes across three stride levels (8, 16, 32)
- out1 (w=51, h=8400): Contains 17 keypoints x 3 values (x, y, confidence) per candidate box
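To illustrate how the 64 DFL values in a row of out0 become a box, here is a minimal sketch of the decoding convention commonly used by Ultralytics-style heads: each of the four sides has 16 bin logits, a softmax over the bins gives a distribution, and its expected bin index times the stride is the distance from the cell center to that side. The side order (left, top, right, bottom), the cell-center offset of 0.5, and the names `dfl_expectation` and `decode_dfl_box` are assumptions of this sketch rather than details confirmed by the example source.

```cpp
#include <algorithm>
#include <cmath>

// Softmax over reg_max logits, then the expected bin index.
float dfl_expectation(const float* logits, int reg_max)
{
    float max_logit = logits[0];
    for (int i = 1; i < reg_max; i++)
        max_logit = std::max(max_logit, logits[i]);

    float sum = 0.f;
    float weighted = 0.f;
    for (int i = 0; i < reg_max; i++)
    {
        const float e = std::exp(logits[i] - max_logit); // numerically stable
        sum += e;
        weighted += e * i;
    }
    return weighted / sum;
}

struct DecodedBox
{
    float x0, y0, x1, y1;
};

// row points at 4 * reg_max logits; (gx, gy) is the grid cell index.
// Assumed side order: left, top, right, bottom.
DecodedBox decode_dfl_box(const float* row, int gx, int gy, int stride, int reg_max = 16)
{
    const float cx = (gx + 0.5f) * stride;
    const float cy = (gy + 0.5f) * stride;
    const float left = dfl_expectation(row, reg_max) * stride;
    const float top = dfl_expectation(row + reg_max, reg_max) * stride;
    const float right = dfl_expectation(row + 2 * reg_max, reg_max) * stride;
    const float bottom = dfl_expectation(row + 3 * reg_max, reg_max) * stride;
    return DecodedBox{cx - left, cy - top, cx + right, cy + bottom};
}
```

With all logits equal, every bin is equally likely, so each side distance is the mean bin index (7.5) times the stride.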
Keypoint Format
Each detected person has 17 COCO keypoints: nose (0), left eye (1), right eye (2), left ear (3), right ear (4), left shoulder (5), right shoulder (6), left elbow (7), right elbow (8), left wrist (9), right wrist (10), left hip (11), right hip (12), left knee (13), right knee (14), left ankle (15), right ankle (16). Skeleton connections are drawn between anatomically related keypoints in the visualization.
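The keypoint indices above can be captured in a lookup table, together with a limb list for drawing. The sketch below uses the 19-limb skeleton commonly associated with the COCO keypoint format (converted to 0-based indices); the exact pairs and drawing order in the example may differ. `kKeypointNames`, `kSkeleton`, and `visible_limbs` are illustrative names, and the confidence threshold is an assumed parameter.

```cpp
#include <string>
#include <utility>
#include <vector>

// Index-to-name table for the 17 COCO keypoints listed above.
static const char* const kKeypointNames[17] = {
    "nose", "left eye", "right eye", "left ear", "right ear",
    "left shoulder", "right shoulder", "left elbow", "right elbow",
    "left wrist", "right wrist", "left hip", "right hip",
    "left knee", "right knee", "left ankle", "right ankle"};

// Commonly used COCO limb pairs, 0-based (an assumption of this sketch).
static const int kSkeleton[19][2] = {
    {15, 13}, {13, 11}, {16, 14}, {14, 12}, {11, 12},
    {5, 11}, {6, 12}, {5, 6}, {5, 7}, {6, 8},
    {7, 9}, {8, 10}, {1, 2}, {0, 1}, {0, 2},
    {1, 3}, {2, 4}, {3, 5}, {4, 6}};

// Returns the limbs whose two endpoints both meet the confidence threshold.
// kp_prob must hold 17 per-keypoint confidence values.
std::vector<std::pair<int, int> > visible_limbs(const std::vector<float>& kp_prob,
                                                float threshold)
{
    std::vector<std::pair<int, int> > limbs;
    for (int i = 0; i < 19; i++)
    {
        const int a = kSkeleton[i][0];
        const int b = kSkeleton[i][1];
        if (kp_prob[a] >= threshold && kp_prob[b] >= threshold)
            limbs.push_back(std::make_pair(a, b));
    }
    return limbs;
}
```

Filtering limbs by endpoint confidence avoids drawing lines to keypoints that are occluded or outside the frame.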
Model Conversion
Models are converted from Ultralytics format using PNNX with modifications for dynamic shape inference. The conversion requires dual input shapes (640x640 and 320x320) and modifications to area attention layers.