Implementation:Tencent Ncnn YOLO11 Pose Example

Knowledge Sources	Tencent_Ncnn
Domains	Vision, Pose_Estimation
Last Updated	2026-02-09 19:00 GMT

Overview

A concrete example of human pose estimation with 17 keypoints, using YOLO11 on the ncnn inference framework.

Description

This example implements YOLO11 pose estimation using the ncnn inference framework, detecting persons and localizing 17 COCO body keypoints (nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles) in a single forward pass. The model produces two output blobs: a detection blob (w=65, h=8400) containing DFL bbox regression (16x4=64 values) and a person confidence score, and a keypoint blob (w=51, h=8400) containing 17 keypoints with 3 values each (x, y, confidence). Input images are preprocessed with letterbox padding to 640x640 resolution. The implementation draws skeleton connections between detected keypoints on the output visualization.

Usage

Use this example when you need to detect human body poses in images, such as for fitness tracking, gesture recognition, action recognition, or sports analysis applications running on mobile or edge devices.

Code Reference

Source Location

Repository: Tencent_Ncnn
File: examples/yolo11_pose.cpp
Lines: 1-581

Signature

struct KeyPoint
{
    cv::Point2f p;
    float prob;
};

struct Object
{
    cv::Rect_<float> rect;
    int label;
    float prob;
    std::vector<KeyPoint> keypoints;
};

static int detect_yolo11_pose(const cv::Mat& bgr, std::vector<Object>& objects);

static void generate_proposals(int stride, const ncnn::Mat& pred,
                               const ncnn::Mat& pred_kps,
                               float prob_threshold, std::vector<Object>& objects);
static void qsort_descent_inplace(std::vector<Object>& objects);
static void nms_sorted_bboxes(const std::vector<Object>& objects,
                               std::vector<int>& picked, float nms_threshold,
                               bool agnostic = false);

Import

#include "layer.h"
#include "net.h"

I/O Contract

Inputs

Name	Type	Required	Description
image_path	const char*	Yes	Path to input image file

Outputs

Name	Type	Description
objects	std::vector<Object>	Detected persons with bounding boxes, confidence scores, and 17 keypoints each with (x, y, confidence)

Model Files

File	Description
yolo11n_pose.ncnn.param	YOLO11-Pose nano model parameter file
yolo11n_pose.ncnn.bin	YOLO11-Pose nano model weight file

Usage Examples

Running the Example

./yolo11_pose image.jpg

Key Code Pattern

ncnn::Net yolo11;
yolo11.opt.use_vulkan_compute = true;

yolo11.load_param("yolo11n_pose.ncnn.param");
yolo11.load_model("yolo11n_pose.ncnn.bin");

const int target_size = 640;
const float prob_threshold = 0.25f;
const float nms_threshold = 0.45f;

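// bgr is the input cv::Mat; scale the longer side to target_size
const int img_w = bgr.cols;
const int img_h = bgr.rows;
const float scale = (float)target_size / (img_w > img_h ? img_w : img_h);
const int w = (int)(img_w * scale);
const int h = (int)(img_h * scale);

// Resize with aspect ratio preserved and convert BGR to RGB
ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data,
    ncnn::Mat::PIXEL_BGR2RGB, img_w, img_h, w, h);

// Letterbox pad to 640x640 with fill value 114
const int wpad = target_size - w;
const int hpad = target_size - h;
ncnn::Mat in_pad;
ncnn::copy_make_border(in, in_pad, hpad / 2, hpad - hpad / 2,
    wpad / 2, wpad - wpad / 2, ncnn::BORDER_CONSTANT, 114.f);

// Scale pixel values from [0, 255] to [0, 1]
const float norm_vals[3] = {1 / 255.f, 1 / 255.f, 1 / 255.f};
in_pad.substract_mean_normalize(0, norm_vals);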

ncnn::Extractor ex = yolo11.create_extractor();
ex.input("in0", in_pad);

ncnn::Mat out0;  // bbox + person score (w=65, h=8400)
ncnn::Mat out1;  // keypoints (w=51, h=8400)
ex.extract("out0", out0);
ex.extract("out1", out1);
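
The pattern above declares prob_threshold and nms_threshold without showing where they are applied. The following is a minimal sketch of score filtering followed by greedy non-maximum suppression, assuming plain IoU-based suppression. The Box struct and the filter_and_nms helper are hypothetical stand-ins written for illustration; they are not the example's Object type or its nms_sorted_bboxes function.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for the example's Object: a box plus a score.
struct Box
{
    float x, y, w, h; // top-left corner and size, in pixels
    float prob;       // confidence score
};

// Intersection-over-union of two axis-aligned boxes.
static float iou(const Box& a, const Box& b)
{
    const float x0 = std::max(a.x, b.x);
    const float y0 = std::max(a.y, b.y);
    const float x1 = std::min(a.x + a.w, b.x + b.w);
    const float y1 = std::min(a.y + a.h, b.y + b.h);
    const float inter = std::max(0.f, x1 - x0) * std::max(0.f, y1 - y0);
    const float uni = a.w * a.h + b.w * b.h - inter;
    return uni > 0.f ? inter / uni : 0.f;
}

// Drop boxes scoring below prob_threshold, sort the rest by descending score,
// then keep a box only if its IoU with every already-kept box is at most
// nms_threshold.
static std::vector<Box> filter_and_nms(std::vector<Box> boxes,
                                       float prob_threshold, float nms_threshold)
{
    assert(nms_threshold >= 0.f && nms_threshold <= 1.f);

    boxes.erase(std::remove_if(boxes.begin(), boxes.end(),
                               [=](const Box& b) { return b.prob < prob_threshold; }),
                boxes.end());
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.prob > b.prob; });

    std::vector<Box> kept;
    for (const Box& cand : boxes)
    {
        bool keep = true;
        for (const Box& k : kept)
        {
            if (iou(cand, k) > nms_threshold)
            {
                keep = false;
                break;
            }
        }
        if (keep)
            kept.push_back(cand);
    }
    return kept;
}
```

With the thresholds from the pattern (0.25 and 0.45), two heavily overlapping person boxes collapse to the higher-scoring one, while a distant box survives.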

Implementation Details

Preprocessing

Input images are resized while preserving aspect ratio and letterbox padded to 640x640 (a multiple of max_stride=32). Pixel values are converted from BGR to RGB and normalized by dividing by 255. The padding fill value is 114.
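
The letterbox geometry described above can be sketched without any ncnn dependency. The Letterbox struct and the compute_letterbox, unpad_x, and unpad_y helpers below are hypothetical names introduced for illustration, and the even split of padding between opposite edges is an assumption rather than something this page states.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper type: geometry of a letterbox transform.
struct Letterbox
{
    float scale;       // factor applied to the original image
    int w, h;          // resized image size before padding
    int left, right;   // horizontal padding in pixels
    int top, bottom;   // vertical padding in pixels
};

// Scale the longer side to target_size, then pad the shorter side so the
// result is target_size x target_size. Padding is split evenly (assumption).
static Letterbox compute_letterbox(int img_w, int img_h, int target_size)
{
    assert(img_w > 0 && img_h > 0 && target_size > 0);

    Letterbox lb;
    lb.scale = (float)target_size / (float)std::max(img_w, img_h);
    if (img_w >= img_h)
    {
        lb.w = target_size;
        lb.h = std::max(1, (int)(img_h * lb.scale));
    }
    else
    {
        lb.h = target_size;
        lb.w = std::max(1, (int)(img_w * lb.scale));
    }

    const int wpad = target_size - lb.w;
    const int hpad = target_size - lb.h;
    lb.left = wpad / 2;
    lb.right = wpad - wpad / 2;
    lb.top = hpad / 2;
    lb.bottom = hpad - hpad / 2;
    return lb;
}

// Map a coordinate in the padded image back to the original image.
static float unpad_x(const Letterbox& lb, float x)
{
    return (x - (float)lb.left) / lb.scale;
}

static float unpad_y(const Letterbox& lb, float y)
{
    return (y - (float)lb.top) / lb.scale;
}
```

For a 1280x720 input and a target size of 640, this gives a 640x360 resized image with 140 pixels of padding above and below. Box and keypoint coordinates predicted on the padded image would be passed through unpad_x and unpad_y before drawing on the original.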

Output Tensor Layout

The model produces two output tensors:

out0 (w=65, h=8400): Contains DFL bbox regression (16x4=64 values) and 1 person confidence score for 8400 candidate boxes across three stride levels (8, 16, 32)
out1 (w=51, h=8400): Contains 17 keypoints x 3 values (x, y, confidence) per candidate box
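
Two numbers in this layout can be checked with a short sketch: the 8400 candidate count, and how 16 DFL values become one box-side distance. The count assumes one candidate per grid cell at each stride for a square input. The DFL step assumes the common decoding, a softmax over the 16 bins followed by the expected bin index; converting that distance to pixels (typically by multiplying by the stride) and the order of the four sides are not stated on this page and are left out. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Number of candidate rows for a square input: one per grid cell at each
// of the three stride levels (8, 16, 32).
static int count_candidates(int input_size)
{
    const int strides[3] = {8, 16, 32};
    int total = 0;
    for (int s : strides)
    {
        assert(input_size % s == 0);
        const int grid = input_size / s;
        total += grid * grid;
    }
    return total;
}

// Decode one box side from its DFL logits: numerically stable softmax over
// the bins, then the expected bin index. The result is in grid units.
static float dfl_expectation(const float* logits, int num_bins)
{
    assert(logits != nullptr && num_bins > 0);

    float max_logit = logits[0];
    for (int i = 1; i < num_bins; i++)
    {
        if (logits[i] > max_logit)
            max_logit = logits[i];
    }

    float sum = 0.f;
    float weighted = 0.f;
    for (int i = 0; i < num_bins; i++)
    {
        const float e = std::exp(logits[i] - max_logit);
        sum += e;
        weighted += e * (float)i;
    }
    return weighted / sum;
}
```

For a 640x640 input the grids are 80x80, 40x40, and 20x20, which sum to 6400 + 1600 + 400 = 8400 rows. Each row of out0 would then hold four groups of 16 logits, one group per box side, followed by the person score.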

Keypoint Format

Each detected person has 17 COCO keypoints: nose (0), left eye (1), right eye (2), left ear (3), right ear (4), left shoulder (5), right shoulder (6), left elbow (7), right elbow (8), left wrist (9), right wrist (10), left hip (11), right hip (12), left knee (13), right knee (14), left ankle (15), right ankle (16). Skeleton connections are drawn between anatomically related keypoints in the visualization.
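
The index order above can be captured in a lookup table, together with a helper that splits one 51-value keypoint row into 17 triplets. This is a sketch under assumptions: the interleaved (x, y, confidence) ordering within a row is inferred from the description on this page, the values are copied as-is without any grid offset, stride scaling, or sigmoid that the real post-processing may apply, and the names kKeypointNames, RawKeypoint, find_keypoint_index, and unpack_keypoints are hypothetical.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// COCO keypoint names in the index order listed above.
static const char* const kKeypointNames[17] = {
    "nose",
    "left eye", "right eye",
    "left ear", "right ear",
    "left shoulder", "right shoulder",
    "left elbow", "right elbow",
    "left wrist", "right wrist",
    "left hip", "right hip",
    "left knee", "right knee",
    "left ankle", "right ankle"};

// Raw values for one keypoint, exactly as stored in the row.
struct RawKeypoint
{
    float x;
    float y;
    float conf;
};

// Return the index of a keypoint name, or -1 if it is not in the table.
static int find_keypoint_index(const char* name)
{
    for (int i = 0; i < 17; i++)
    {
        if (std::strcmp(kKeypointNames[i], name) == 0)
            return i;
    }
    return -1;
}

// Split one row of 17 * 3 = 51 floats into (x, y, confidence) triplets,
// assuming the three values of each keypoint are stored consecutively.
static std::vector<RawKeypoint> unpack_keypoints(const float* row)
{
    assert(row != nullptr);

    std::vector<RawKeypoint> kps(17);
    for (int k = 0; k < 17; k++)
    {
        kps[k].x = row[k * 3 + 0];
        kps[k].y = row[k * 3 + 1];
        kps[k].conf = row[k * 3 + 2];
    }
    return kps;
}
```

Looking up "left shoulder" returns index 5, so its raw values sit at offsets 15, 16, and 17 of the row under the assumed layout.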

Model Conversion

Models are converted from Ultralytics format using PNNX with modifications for dynamic shape inference. The conversion requires dual input shapes (640x640 and 320x320) and modifications to area attention layers.

Related Pages

Environment:Tencent_Ncnn_Build_Environment
