Principle:Tencent Ncnn Keypoint Pose Estimation
| Knowledge Sources | |
|---|---|
| Domains | Computer_Vision, Human_Pose_Estimation |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
Decoding per-person keypoint coordinates and visibility scores from neural network output tensors for human body pose estimation.
Description
Keypoint pose estimation is the task of localizing anatomically meaningful body joints (nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles) for each detected person in an image. Modern single-stage pose estimators produce a combined output tensor that contains both bounding box regression parameters and keypoint predictions in a single forward pass, eliminating the need for a separate top-down cropping stage.
The network output layout extends the standard detection format. For each of the N anchor positions, the output contains: bounding box DFL regression values (4 sides x 16 bins), a single-class (person) confidence score, and K keypoint triplets (x_offset, y_offset, visibility_score), where K is typically 17 for the COCO keypoint skeleton. The keypoint offsets are relative to the grid cell: each offset is added to the cell-center coordinates (cell index plus 0.5), and the sum is multiplied by the feature map stride to produce absolute image-space coordinates. The visibility score, after sigmoid activation, indicates whether the keypoint is visible and reliably localized.
Post-processing follows the standard detection pipeline: DFL bin decoding for bounding boxes, confidence thresholding, and NMS to eliminate duplicate person detections. For each surviving detection, the associated 17 keypoint triplets are decoded to produce the final pose skeleton, which can then be visualized by drawing line segments between anatomically connected joints.
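The DFL bin decoding step can be sketched as follows. This is a minimal illustration, not the ncnn implementation: it assumes the common convention that each box side is a softmax distribution over 16 bins whose expected value is a distance in stride units from the cell center. The function names (`softmax`, `dfl_expectation`, `decode_dfl_box`) and the (left, top, right, bottom) side order are assumptions made for this example.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dfl_expectation(bins):
    """Expected bin index under the softmax distribution of the bin logits."""
    probs = softmax(bins)
    return sum(i * p for i, p in enumerate(probs))


def decode_dfl_box(dfl, grid_x, grid_y, stride, reg_max=16):
    """Turn 4 * reg_max DFL logits into (x0, y0, x1, y1) in image pixels.

    Assumed side order: left, top, right, bottom (hypothetical for this sketch).
    """
    assert len(dfl) == 4 * reg_max
    left, top, right, bottom = (
        dfl_expectation(dfl[k * reg_max:(k + 1) * reg_max]) for k in range(4)
    )
    # Assumed anchor point: the center of the grid cell.
    cx = grid_x + 0.5
    cy = grid_y + 0.5
    return (
        (cx - left) * stride,
        (cy - top) * stride,
        (cx + right) * stride,
        (cy + bottom) * stride,
    )
```

Taking the expectation rather than the arg-max bin lets the decoded distance fall between integer bin positions.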
Usage
Apply this principle in applications that require human body pose understanding, such as fitness and exercise tracking, action recognition, gesture-based interfaces, sports analytics, ergonomic assessment, and animation retargeting. The single-stage approach is preferred when real-time performance on mobile or edge devices is required.
Theoretical Basis
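Keypoint decoding from grid-relative offsets:
$$x_j = (g_x + 0.5 + \Delta x_j)\, s, \qquad y_j = (g_y + 0.5 + \Delta y_j)\, s$$
where $(g_x, g_y)$ is the grid cell position, $s$ is the stride, and $(\Delta x_j, \Delta y_j)$ are the predicted offsets for keypoint $j$. Visibility is $v_j = \sigma(\hat{v}_j)$, the sigmoid of the raw visibility score $\hat{v}_j$.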
Output tensor layout (YOLO11-pose):
Tensor shape: [N_boxes, 65] (single-class person detection)
    Columns 0..63 : bbox DFL regression (4 sides x 16 bins)
    Column 64     : person confidence score
Keypoint tensor: [N_boxes, 51] (17 keypoints x 3 values)
    For keypoint j (j = 0..16):
        Column j*3 + 0 : x offset
        Column j*3 + 1 : y offset
        Column j*3 + 2 : visibility score
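The column layout above can be illustrated with a small slicing sketch. It assumes each anchor is available as two flat Python lists, a 65-value detection row and a 51-value keypoint row; the helper names `split_detection_row` and `split_keypoint_row` are hypothetical and exist only for this example.

```python
NUM_KEYPOINTS = 17   # COCO keypoint count
REG_MAX = 16         # DFL bins per box side


def split_detection_row(row):
    """Split a 65-value row into (64 DFL logits, person confidence score)."""
    assert len(row) == 4 * REG_MAX + 1
    dfl = row[:4 * REG_MAX]      # columns 0..63
    score = row[4 * REG_MAX]     # column 64
    return dfl, score


def split_keypoint_row(kp_row):
    """Split a 51-value row into 17 (x_offset, y_offset, visibility_score) triplets."""
    assert len(kp_row) == NUM_KEYPOINTS * 3
    return [
        (kp_row[j * 3 + 0], kp_row[j * 3 + 1], kp_row[j * 3 + 2])
        for j in range(NUM_KEYPOINTS)
    ]
```

For example, `split_keypoint_row(list(range(51)))[16]` picks columns 48, 49, and 50, the triplet for the right ankle.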
COCO 17-keypoint skeleton connectivity:
0: nose
1: left_eye 2: right_eye
3: left_ear 4: right_ear
5: left_shoulder 6: right_shoulder
7: left_elbow 8: right_elbow
9: left_wrist 10: right_wrist
11: left_hip 12: right_hip
13: left_knee 14: right_knee
15: left_ankle 16: right_ankle
Skeleton edges: (15,13)(13,11)(16,14)(14,12)(11,12)
(5,11)(6,12)(5,6)(5,7)(6,8)(7,9)(8,10)
(1,2)(0,1)(0,2)(1,3)(2,4)(3,5)(4,6)
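The connectivity above can be written as plain data, which makes it easy to select the edges worth drawing. The sketch below copies the keypoint names and edge list from this section; the `visible_edges` helper and its default threshold of 0.5 are assumptions for illustration, not values taken from the source.

```python
KEYPOINT_NAMES = [
    "nose",
    "left_eye", "right_eye",
    "left_ear", "right_ear",
    "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow",
    "left_wrist", "right_wrist",
    "left_hip", "right_hip",
    "left_knee", "right_knee",
    "left_ankle", "right_ankle",
]

SKELETON_EDGES = [
    (15, 13), (13, 11), (16, 14), (14, 12), (11, 12),
    (5, 11), (6, 12), (5, 6), (5, 7), (6, 8), (7, 9), (8, 10),
    (1, 2), (0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6),
]


def visible_edges(visibility, threshold=0.5):
    """Return the skeleton edges whose two endpoints both pass the threshold.

    `visibility` holds 17 post-sigmoid scores; the threshold is an assumed default.
    """
    return [
        (a, b)
        for a, b in SKELETON_EDGES
        if visibility[a] >= threshold and visibility[b] >= threshold
    ]
```

Requiring both endpoints to pass the threshold avoids drawing a limb toward a joint the network considers occluded or unreliable.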
Post-processing pipeline:
1. Decode DFL bins -> bounding boxes
2. Filter by confidence threshold
3. Apply NMS (IoU-based)
4. For each surviving detection:
    a. For each keypoint j in 0..16:
        kp_x = (grid_x + 0.5 + offset_x[j]) * stride
        kp_y = (grid_y + 0.5 + offset_y[j]) * stride
        kp_vis = sigmoid(vis_score[j])
    b. Store keypoints with visibility
5. Draw skeleton connections between visible keypoints
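Steps 3 and 4 of the pipeline can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than the ncnn code: boxes are (x0, y0, x1, y1) tuples, NMS is the usual greedy IoU-based variant, and the IoU threshold of 0.45 and all function names are hypothetical choices for this example.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_keypoints(kp_row, grid_x, grid_y, stride):
    """Decode flat (x_offset, y_offset, vis_score) triplets into image space."""
    keypoints = []
    for j in range(len(kp_row) // 3):
        off_x, off_y, vis = kp_row[j * 3:j * 3 + 3]
        kp_x = (grid_x + 0.5 + off_x) * stride
        kp_y = (grid_y + 0.5 + off_y) * stride
        keypoints.append((kp_x, kp_y, sigmoid(vis)))
    return keypoints


def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: keep boxes in score order, dropping heavy overlaps.

    Returns the indices of the kept boxes; the threshold is an assumed default.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

In this sketch, keypoints would be decoded only for the indices returned by `nms`, which avoids decoding triplets for detections that are discarded anyway.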