Principle:Tencent Ncnn Keypoint Pose Estimation

Knowledge Sources	Tencent_Ncnn
Domains	Computer_Vision, Human_Pose_Estimation
Last Updated	2026-02-09 19:00 GMT

Overview

This principle covers how per-person keypoint coordinates and visibility scores are decoded from neural network output tensors for human body pose estimation.

Description

Keypoint pose estimation is the task of localizing anatomically meaningful body joints (nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles) for each detected person in an image. Modern single-stage pose estimators produce a combined output tensor that contains both bounding box regression parameters and keypoint predictions in a single forward pass, eliminating the need for a separate top-down cropping stage.

The network output layout extends the standard detection format. For each of the N anchor positions, the output contains bounding box DFL (Distribution Focal Loss) regression values (4 sides x 16 bins), a single-class (person) confidence score, and K keypoint triplets (x_offset, y_offset, visibility_score), where K is 17 for the COCO keypoint skeleton. The keypoint offsets are relative to the grid cell position: each offset is added to the cell coordinates, and the sum is scaled by the feature map stride to produce absolute image-space coordinates. The visibility score, after sigmoid activation, lies between 0 and 1 and indicates whether the keypoint is visible and reliably localized.

Post-processing follows the standard detection pipeline: DFL bin decoding for bounding boxes, confidence thresholding, and NMS to eliminate duplicate person detections. For each surviving detection, the associated 17 keypoint triplets are decoded to produce the final pose skeleton, which can then be visualized by drawing line segments between anatomically connected joints.
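The DFL bin decoding step can be sketched as follows. This is a minimal illustration rather than the ncnn implementation: it assumes each side's 16 bins are logits that are softmaxed and reduced to an expected distance in grid units, that the side order is left, top, right, bottom, and that the anchor sits at the cell centre. The names `dfl_expectation` and `decode_box` are hypothetical.

```python
import math

REG_MAX = 16  # bins per box side, matching the 4 x 16 layout described above


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dfl_expectation(bins):
    """Collapse one side's bin logits into an expected distance (grid units)."""
    probs = softmax(bins)
    return sum(i * p for i, p in enumerate(probs))


def decode_box(dfl_values, grid_x, grid_y, stride):
    """Turn 64 DFL logits into (x0, y0, x1, y1) in image pixels.

    Assumptions of this sketch: side order is left, top, right, bottom,
    and distances are measured from the cell centre (grid + 0.5).
    """
    if len(dfl_values) != 4 * REG_MAX:
        raise ValueError("expected 64 DFL values")
    left, top, right, bottom = (
        dfl_expectation(dfl_values[k * REG_MAX:(k + 1) * REG_MAX])
        for k in range(4)
    )
    cx, cy = grid_x + 0.5, grid_y + 0.5
    return ((cx - left) * stride, (cy - top) * stride,
            (cx + right) * stride, (cy + bottom) * stride)
```

Taking the expectation over bins instead of the arg-max lets the box edge fall between integer bin positions, which is the usual motivation for DFL-style regression.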

Usage

Apply this principle for applications requiring human body pose understanding: fitness and exercise tracking, action recognition, gesture-based interfaces, sports analytics, ergonomic assessment, and animation retargeting. The single-stage approach is preferred when real-time performance on mobile or edge devices is required.

Theoretical Basis

Keypoint decoding from grid-relative offsets: $x_{j} = (g_{x} + 0.5 + \Delta x_{j}) \times s$, $y_{j} = (g_{y} + 0.5 + \Delta y_{j}) \times s$

where $(g_{x}, g_{y})$ is the grid cell position, $s$ is the stride, and $(\Delta x_{j}, \Delta y_{j})$ are the predicted offsets for keypoint $j$. Visibility is $v_{j} = \sigma(v_{j}^{\mathrm{raw}})$, where $\sigma$ is the sigmoid function.
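The decoding formulas above translate directly into code. The sketch below assumes the raw keypoint values for one anchor arrive as a flat list of (x offset, y offset, raw visibility) triplets; the function name `decode_keypoints` is hypothetical.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_keypoints(raw, grid_x, grid_y, stride):
    """Decode a flat list of 3*K raw values into (x, y, visibility) tuples.

    raw is laid out as dx_0, dy_0, vis_0, dx_1, dy_1, vis_1, ...
    """
    if len(raw) % 3 != 0:
        raise ValueError("raw keypoint values must come in triplets")
    keypoints = []
    for j in range(len(raw) // 3):
        dx, dy, vis_raw = raw[3 * j:3 * j + 3]
        x = (grid_x + 0.5 + dx) * stride  # x_j = (g_x + 0.5 + dx_j) * s
        y = (grid_y + 0.5 + dy) * stride  # y_j = (g_y + 0.5 + dy_j) * s
        keypoints.append((x, y, sigmoid(vis_raw)))
    return keypoints
```

For example, a zero offset at grid cell (3, 2) with stride 16 lands at the cell centre, (56, 40), and a raw visibility of 0 maps to 0.5.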

Output tensor layout (YOLO11-pose):

Tensor shape: [N_boxes, 65]   (single-class person detection)
  Columns  0..63 : bbox DFL regression (4 sides x 16 bins)
  Column   64    : person confidence score

Keypoint tensor: [N_boxes, 51]   (17 keypoints x 3 values)
  For keypoint j (j = 0..16):
    Column j*3 + 0 : x offset
    Column j*3 + 1 : y offset
    Column j*3 + 2 : visibility score
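As an illustration of the column layout, the sketch below splits one detection row and its matching keypoint row into named parts. It assumes the rows are already available as plain Python sequences; `split_rows` and the constant names are hypothetical, and whether the score column holds a raw logit or an activated probability is not specified here, so the value is passed through unchanged.

```python
NUM_BINS = 16
NUM_KEYPOINTS = 17
BOX_COLS = 4 * NUM_BINS       # columns 0..63
DET_COLS = BOX_COLS + 1       # 65 columns in total
KPT_COLS = NUM_KEYPOINTS * 3  # 51 columns in total


def split_rows(det_row, kpt_row):
    """Split one anchor's rows into (dfl_values, score, keypoint_triplets)."""
    if len(det_row) != DET_COLS:
        raise ValueError("detection row must have 65 columns")
    if len(kpt_row) != KPT_COLS:
        raise ValueError("keypoint row must have 51 columns")
    dfl_values = list(det_row[:BOX_COLS])  # 4 sides x 16 bins
    score = det_row[BOX_COLS]              # column 64, passed through as-is
    triplets = [
        tuple(kpt_row[3 * j:3 * j + 3])    # (x offset, y offset, visibility)
        for j in range(NUM_KEYPOINTS)
    ]
    return dfl_values, score, triplets
```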

COCO 17-keypoint skeleton connectivity:

 0: nose
 1: left_eye       2: right_eye
 3: left_ear       4: right_ear
 5: left_shoulder  6: right_shoulder
 7: left_elbow     8: right_elbow
 9: left_wrist    10: right_wrist
11: left_hip      12: right_hip
13: left_knee     14: right_knee
15: left_ankle    16: right_ankle

Skeleton edges: (15,13)(13,11)(16,14)(14,12)(11,12)
                (5,11)(6,12)(5,6)(5,7)(6,8)(7,9)(8,10)
                (1,2)(0,1)(0,2)(1,3)(2,4)(3,5)(4,6)
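The edge list above uses 0-based indices into the keypoint table. A minimal sketch of selecting which segments to draw is shown below; the 0.5 visibility cutoff and the name `visible_segments` are assumptions of this example, not values taken from the source.

```python
# Skeleton edges as (keypoint index, keypoint index) pairs, copied from the list above.
SKELETON_EDGES = [
    (15, 13), (13, 11), (16, 14), (14, 12), (11, 12),
    (5, 11), (6, 12), (5, 6), (5, 7), (6, 8), (7, 9), (8, 10),
    (1, 2), (0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6),
]


def visible_segments(keypoints, min_visibility=0.5):
    """Return ((xa, ya), (xb, yb)) for every edge whose two ends are visible.

    keypoints is a list of 17 (x, y, visibility) tuples; min_visibility is
    an assumed cutoff and should be tuned per application.
    """
    segments = []
    for a, b in SKELETON_EDGES:
        xa, ya, va = keypoints[a]
        xb, yb, vb = keypoints[b]
        if va >= min_visibility and vb >= min_visibility:
            segments.append(((xa, ya), (xb, yb)))
    return segments
```

The returned segments can be handed to any drawing routine. Requiring both endpoints to pass the cutoff avoids drawing limbs toward occluded joints.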

Post-processing pipeline:

1. Decode DFL bins -> bounding boxes
2. Filter by confidence threshold
3. Apply NMS (IoU-based)
4. For each surviving detection:
   a. For each keypoint j in 0..16:
      kp_x = (grid_x + 0.5 + offset_x[j]) * stride
      kp_y = (grid_y + 0.5 + offset_y[j]) * stride
      kp_vis = sigmoid(vis_score[j])
   b. Store keypoints with visibility
5. Draw skeleton connections between visible keypoints
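Steps 2 and 3 of the pipeline can be sketched as a greedy IoU-based NMS. This is an illustrative version under stated assumptions: each candidate is a dict with `"box"` as (x0, y0, x1, y1) and `"score"` as a probability, and the default thresholds (0.25 and 0.45) are placeholder values rather than settings taken from the source. Any other keys on a candidate, such as its grid position, stride, and raw keypoint values, are carried along untouched so that step 4 only decodes keypoints for the survivors.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_and_nms(detections, score_threshold=0.25, iou_threshold=0.45):
    """Drop low-score candidates, then keep the best box of each overlapping group."""
    candidates = sorted(
        (d for d in detections if d["score"] >= score_threshold),
        key=lambda d: d["score"],
        reverse=True,
    )
    kept = []
    for det in candidates:
        # Keep a candidate only if it does not overlap a higher-scoring kept box.
        if all(iou(det["box"], k["box"]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept
```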
