Principle:Tencent Ncnn Crowd Counting
| Knowledge Sources | |
|---|---|
| Domains | Computer Vision, Scene Understanding |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
A point-based crowd counting approach that predicts individual person locations by associating anchor grid positions with learned spatial offsets, producing both a total count and a set of point localizations for each person in the scene.
Description
Crowd counting via point-based prediction represents a departure from traditional density-map regression approaches. Instead of predicting a continuous density map whose integral approximates the person count, point-based methods directly predict a set of discrete point locations, one per person, along with confidence scores indicating whether each predicted point corresponds to a real person.
The method operates on a predefined anchor grid: a regular grid of reference positions overlaid on the feature map produced by a backbone network. Each anchor point is responsible for predicting whether a person exists near its location and, if so, the spatial offset from the anchor position to the precise person location. This anchor-offset formulation is analogous to anchor-based object detection but simplified to point prediction rather than bounding box regression.
The P2PNet (Point-to-Point Network) architecture implements this principle: it generates point proposals from the anchor grid and matches them one-to-one to ground-truth person annotations during training, using the Hungarian algorithm for optimal bipartite matching. During inference, the network outputs one candidate point per anchor (so the number of candidates is fixed for a given input size), each with a classification score (person vs. background) and a 2D offset. Points with scores above a confidence threshold are retained as the final predictions.
This approach has several advantages over density-map methods: it provides individual localizations (not just a count), avoids the Gaussian kernel size ambiguity inherent in density map generation, and can be trained end-to-end with no hand-crafted post-processing beyond a simple confidence threshold.
Usage
This principle applies in scenarios requiring estimation of the number and locations of people in images:
- Public safety monitoring: Estimating crowd density at events, transportation hubs, or public spaces.
- Retail analytics: Counting customers in stores for occupancy management.
- Urban planning: Analyzing pedestrian flow patterns from surveillance or drone imagery.
- Event management: Monitoring venue capacity in real time.
Theoretical Basis
The point-based prediction pipeline in pseudo-code:
// Feature extraction
features = Backbone(image) // e.g., VGG-16 or ResNet features
enhanced = FPN(features) // optional feature pyramid refinement
// Generate anchor grid
anchors = generate_grid(
    height = feature_map_height,
    width = feature_map_width,
    stride = downsample_stride // e.g., 8 pixels
) // shape: (H*W, 2) representing (x, y) positions
// Predict offsets and confidence for each anchor
offsets = OffsetHead(enhanced) // shape: (H*W, 2) - dx, dy per anchor
scores = ClassificationHead(enhanced) // shape: (H*W, 2) - person vs background
// Decode predicted points
predicted_points = anchors + offsets // apply offsets to anchor positions
confidence = softmax(scores)[:, 1] // person class probability
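The grid generation and decoding steps above can be sketched in plain Python. This is a minimal illustration, not the ncnn or P2PNet implementation: it assumes each anchor sits at the centre of its stride-sized cell, and the offsets and logits are made-up values standing in for the outputs of `OffsetHead` and `ClassificationHead`.

```python
import math

def generate_grid(height, width, stride):
    """Return anchor (x, y) positions in image pixels, one per feature-map cell.

    Assumption: each anchor sits at the centre of its stride x stride cell.
    """
    return [
        ((col + 0.5) * stride, (row + 0.5) * stride)
        for row in range(height)
        for col in range(width)
    ]

def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(anchors, offsets, scores):
    """Apply per-anchor offsets and convert logits to person probabilities."""
    points = [(ax + dx, ay + dy) for (ax, ay), (dx, dy) in zip(anchors, offsets)]
    confidence = [softmax(pair)[1] for pair in scores]  # index 1 = person class
    return points, confidence

# Toy 2x2 feature map with stride 8; offsets and logits are made-up values.
anchors = generate_grid(height=2, width=2, stride=8)
offsets = [(1.0, -2.0), (0.0, 0.0), (-3.0, 1.5), (2.0, 2.0)]
scores = [(0.0, 2.0), (3.0, -1.0), (0.0, 0.0), (-1.0, 1.0)]
predicted_points, confidence = decode(anchors, offsets, scores)
```

With these inputs the first anchor at (4, 4) is shifted to (5, 2), and equal logits such as (0, 0) decode to a person probability of 0.5.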
During training, Hungarian matching finds the optimal one-to-one assignment between predicted and ground-truth points:
// Compute cost matrix between predictions and ground truth
for each predicted point p_i:
    for each ground truth point g_j:
        cost[i][j] = lambda_cls * classification_cost(p_i, g_j)
                   + lambda_loc * ||p_i - g_j||_2
// Find optimal one-to-one assignment
assignment = hungarian_algorithm(cost)
// Compute loss on matched pairs
loss = 0
for (pred_idx, gt_idx) in assignment:
    loss += cross_entropy(scores[pred_idx], 1) // matched: person class
    loss += L2_distance(predicted_points[pred_idx], gt[gt_idx])
// Unmatched predictions penalized as background
for pred_idx not in assignment:
    loss += cross_entropy(scores[pred_idx], 0) // background class
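The matching and loss steps can be tried on a toy example. The sketch below makes several assumptions that are not taken from the source: the classification cost is the negative person probability, the `lambda_cls` and `lambda_loc` weights are illustrative, and an exhaustive search over assignments stands in for the Hungarian algorithm. Exhaustive search reaches the same optimum on tiny inputs but scales factorially, so a practical implementation would use a polynomial-time solver such as `scipy.optimize.linear_sum_assignment`.

```python
import math
from itertools import permutations

def match_cost(point, person_prob, gt_point, lambda_cls=1.0, lambda_loc=0.05):
    """Cost of pairing one prediction with one ground-truth point.

    Assumptions: classification cost is the negative person probability,
    and the lambda weights are illustrative values, not tuned settings.
    """
    return lambda_cls * (-person_prob) + lambda_loc * math.dist(point, gt_point)

def optimal_assignment(points, probs, gt_points):
    """Exhaustively find the min-cost one-to-one matching (tiny inputs only).

    Stands in for the Hungarian algorithm, which reaches the same optimum
    in polynomial time. Requires len(points) >= len(gt_points).
    """
    best_cost, best_pairs = math.inf, []
    for chosen in permutations(range(len(points)), len(gt_points)):
        cost = sum(
            match_cost(points[i], probs[i], gt_points[j])
            for j, i in enumerate(chosen)
        )
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(i, j) for j, i in enumerate(chosen)]
    return best_pairs

def matching_loss(points, probs, gt_points, assignment):
    """Cross-entropy on all predictions plus L2 distance on matched pairs."""
    matched = {i for i, _ in assignment}
    loss = 0.0
    for i, j in assignment:
        loss += -math.log(probs[i])            # matched: target is person
        loss += math.dist(points[i], gt_points[j])
    for i in range(len(points)):
        if i not in matched:
            loss += -math.log(1.0 - probs[i])  # unmatched: target is background
    return loss

# Four made-up predictions and two made-up ground-truth annotations.
points = [(5.0, 2.0), (12.0, 4.0), (1.0, 13.5), (14.0, 14.0)]
probs = [0.88, 0.02, 0.50, 0.88]
gt_points = [(6.0, 3.0), (13.0, 15.0)]
assignment = optimal_assignment(points, probs, gt_points)
loss = matching_loss(points, probs, gt_points, assignment)
```

Here predictions 0 and 3 are both confident and close to a ground-truth point, so they are matched; predictions 1 and 2 are left unmatched and penalized toward the background class.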
At inference, the final count and locations are obtained by thresholding:
// Post-processing
mask = confidence > threshold // e.g., threshold = 0.5
final_points = predicted_points[mask]
person_count = sum(mask)
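A dependency-free sketch of this thresholding step follows. The function name `threshold_points` and the sample values are hypothetical, and the strict `>` comparison mirrors the pseudo-code, so a point whose confidence equals the threshold is dropped.

```python
def threshold_points(predicted_points, confidence, threshold=0.5):
    """Keep points whose person probability strictly exceeds the threshold."""
    final_points = [
        point for point, conf in zip(predicted_points, confidence)
        if conf > threshold
    ]
    return final_points, len(final_points)

# Made-up decoded points and person probabilities.
predicted_points = [(5.0, 2.0), (12.0, 4.0), (1.0, 13.5), (14.0, 14.0)]
confidence = [0.88, 0.02, 0.50, 0.88]
final_points, person_count = threshold_points(predicted_points, confidence)
print(person_count, final_points)  # 2 [(5.0, 2.0), (14.0, 14.0)]
```

Raising the threshold trades recall for precision: fewer spurious points are counted, at the risk of missing people whose predictions are less confident.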