Principle:Tencent Ncnn Crowd Counting

Knowledge Sources	Tencent_Ncnn
Domains	Computer Vision, Scene Understanding
Last Updated	2026-02-09 19:00 GMT

Overview

A point-based crowd counting approach that predicts individual person locations by associating anchor grid positions with learned spatial offsets, producing both a total count and a set of point localizations for each person in the scene.

Description

Crowd counting via point-based prediction represents a departure from traditional density-map regression approaches. Instead of predicting a continuous density map whose integral approximates the person count, point-based methods directly predict a set of discrete point locations, one per person, along with confidence scores indicating whether each predicted point corresponds to a real person.

The method operates on a predefined anchor grid: a regular grid of reference positions overlaid on the feature map produced by a backbone network. Each anchor point is responsible for predicting whether a person exists near its location and, if so, the spatial offset from the anchor position to the precise person location. This anchor-offset formulation is analogous to anchor-based object detection but simplified to point prediction rather than bounding box regression.

The P2PNet (Point-to-Point Network) architecture implements this principle: each anchor yields one point proposal, and during training the proposals are matched one-to-one to ground-truth person annotations via the Hungarian algorithm for optimal bipartite matching. During inference, the network outputs one candidate point per anchor, so the number of candidates is determined by the input size and stride. Each candidate carries a classification score (person vs. background) and a 2D offset. Points with scores above a confidence threshold are retained as the final predictions.

This approach has several advantages over density-map methods: it provides individual localizations (not just a count), avoids the ambiguity of choosing a Gaussian kernel size when generating density maps, and can be trained end-to-end, with inference reduced to a single confidence threshold.

Usage

This principle applies in scenarios requiring estimation of the number and locations of people in images:

Public safety monitoring: Estimating crowd density at events, transportation hubs, or public spaces.
Retail analytics: Counting customers in stores for occupancy management.
Urban planning: Analyzing pedestrian flow patterns from surveillance or drone imagery.
Event management: Monitoring venue capacity in real time.

Theoretical Basis

The point-based prediction pipeline in pseudo-code:

// Feature extraction
features = Backbone(image)          // e.g., VGG-16 or ResNet features
enhanced = FPN(features)            // optional feature pyramid refinement

// Generate anchor grid
anchors = generate_grid(
    height = feature_map_height,
    width = feature_map_width,
    stride = downsample_stride       // e.g., 8 pixels
)   // shape: (H*W, 2) representing (x, y) positions

// Predict offsets and confidence for each anchor
offsets = OffsetHead(enhanced)       // shape: (H*W, 2) - dx, dy per anchor
scores = ClassificationHead(enhanced) // shape: (H*W, 2) - person vs background

// Decode predicted points
predicted_points = anchors + offsets  // apply offsets to anchor positions
confidence = softmax(scores)[:, 1]   // person class probability
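
The grid generation and decoding steps above can be sketched in plain Python. This is a minimal illustration, not the ncnn or P2PNet implementation: the function names mirror the pseudo-code, and placing each anchor at the center of its stride-sized cell is an assumption made here for concreteness.

```python
def generate_grid(height, width, stride):
    """Return one (x, y) anchor per feature-map cell, in row-major order.

    Assumption: each anchor sits at the center of its stride x stride cell.
    """
    return [
        (col * stride + stride / 2.0, row * stride + stride / 2.0)
        for row in range(height)
        for col in range(width)
    ]


def decode_points(anchors, offsets):
    """Add the predicted (dx, dy) offset to each anchor position."""
    return [(ax + dx, ay + dy) for (ax, ay), (dx, dy) in zip(anchors, offsets)]


# A 2 x 3 feature map with stride 8 gives 6 anchors in image coordinates.
anchors = generate_grid(height=2, width=3, stride=8)

# Hypothetical offsets: only the last anchor is shifted.
offsets = [(0.0, 0.0)] * 5 + [(-1.5, 2.0)]
predicted_points = decode_points(anchors, offsets)
```

With these inputs the last anchor at (20.0, 12.0) decodes to the point (18.5, 14.0), while the other five points coincide with their anchors.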

During training, the Hungarian matching finds the optimal assignment between predicted and ground-truth points:

// Compute cost matrix between predictions and ground truth
for each predicted point p_i:
    for each ground truth point g_j:
        cost[i][j] = lambda_cls * classification_cost(p_i, g_j)
                   + lambda_loc * ||p_i - g_j||_2

// Find optimal one-to-one assignment
assignment = hungarian_algorithm(cost)

// Compute loss on matched pairs
loss = 0
for (pred_idx, gt_idx) in assignment:
    loss += cross_entropy(scores[pred_idx], 1)    // matched: person class
    loss += L2_distance(predicted_points[pred_idx], ground_truth_points[gt_idx])

// Unmatched predictions penalized as background
for pred_idx not in assignment:
    loss += cross_entropy(scores[pred_idx], 0)     // background class
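
The matching and loss steps can be illustrated with a small, self-contained Python sketch. Two things here are assumptions rather than part of the source: the classification cost is taken to be the negative person probability, and the optimal assignment is found by brute force over permutations instead of the Hungarian algorithm. Brute force returns the same minimum-cost assignment but is only practical for a handful of points. The weights `lambda_cls` and `lambda_loc` are illustrative values.

```python
import math
from itertools import permutations


def match_points(pred_points, person_prob, gt_points,
                 lambda_cls=1.0, lambda_loc=0.05):
    """Return the minimum-cost one-to-one list of (pred_idx, gt_idx) pairs.

    Brute-force stand-in for the Hungarian algorithm (exponential time).
    """
    if len(gt_points) > len(pred_points):
        raise ValueError("need at least as many predictions as ground-truth points")

    def cost(i, j):
        cls_cost = -person_prob[i]  # assumed classification cost
        loc_cost = math.dist(pred_points[i], gt_points[j])
        return lambda_cls * cls_cost + lambda_loc * loc_cost

    best, best_cost = (), math.inf
    # perm[j] is the prediction index assigned to ground-truth point j.
    for perm in permutations(range(len(pred_points)), len(gt_points)):
        total = sum(cost(i, j) for j, i in enumerate(perm))
        if total < best_cost:
            best, best_cost = perm, total
    return [(i, j) for j, i in enumerate(best)]


def matching_loss(pred_points, person_prob, gt_points, assignment):
    """Cross-entropy plus L2 distance on matched pairs, background CE otherwise."""
    matched = {i for i, _ in assignment}
    loss = 0.0
    for i, j in assignment:
        loss += -math.log(person_prob[i])               # target: person
        loss += math.dist(pred_points[i], gt_points[j])
    for i in range(len(pred_points)):
        if i not in matched:
            loss += -math.log(1.0 - person_prob[i])     # target: background
    return loss


pred_points = [(10.0, 10.0), (50.0, 50.0), (90.0, 10.0)]
person_prob = [0.9, 0.2, 0.8]
gt_points = [(88.0, 12.0), (11.0, 9.0)]

assignment = match_points(pred_points, person_prob, gt_points)
loss = matching_loss(pred_points, person_prob, gt_points, assignment)
```

In this example prediction 2 is matched to the first ground-truth point and prediction 0 to the second, leaving prediction 1 to be penalized as background.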

At inference, the final count and locations are obtained by thresholding:

// Post-processing
mask = confidence > threshold        // e.g., threshold = 0.5
final_points = predicted_points[mask]
person_count = sum(mask)
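
The post-processing step is simple enough to write out in full. The sketch below assumes each score is a pair of raw logits ordered (background, person), as in the pseudo-code above; the helper names are hypothetical.

```python
import math


def person_probability(logits):
    """Two-class softmax over (background, person); returns the person probability."""
    background, person = logits
    peak = max(background, person)  # subtract the max for numerical stability
    e_background = math.exp(background - peak)
    e_person = math.exp(person - peak)
    return e_person / (e_background + e_person)


def count_people(predicted_points, scores, threshold=0.5):
    """Keep points whose person probability strictly exceeds the threshold."""
    kept = [
        point
        for point, logits in zip(predicted_points, scores)
        if person_probability(logits) > threshold
    ]
    return len(kept), kept


points = [(4.0, 4.0), (12.0, 4.0), (20.0, 4.0)]
scores = [(2.0, -1.0), (-1.0, 3.0), (0.0, 0.0)]
person_count, final_points = count_people(points, scores)
```

Only the second point is kept here: the first is confidently background, and the third has a person probability of exactly 0.5, which does not exceed the threshold.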

Related Pages

Implementation:Tencent_Ncnn_P2PNet_Example
