Principle:Tencent Ncnn Face Alignment And Recognition

Knowledge Sources	Tencent_Ncnn
Domains	Computer Vision, Biometrics
Last Updated	2026-02-09 19:00 GMT

Overview

A two-stage pipeline that first detects faces with landmark regression, then applies affine alignment and embedding extraction to measure face similarity.

Description

Face alignment and recognition is a compound vision task that decomposes face identification into two sequential stages. In the first stage, a face detector locates bounding boxes in an image and simultaneously regresses a set of facial landmarks (typically five key points: left eye, right eye, nose tip, left mouth corner, right mouth corner). These landmarks serve as geometric anchors for the second stage.

In the second stage, the detected landmarks are used to compute a 2D affine transformation that warps the face crop onto a canonical template of fixed size. This alignment step normalizes for in-plane rotation, scale, and translation, so the subsequent embedding network receives a consistently positioned face image; it does not correct out-of-plane rotation such as head yaw or pitch. The embedding network (e.g., a model trained with the ArcFace loss) maps the aligned face to a compact feature vector, and identity similarity is measured by the cosine similarity or Euclidean distance between two such vectors.

The principle relies on the observation that geometric normalization before feature extraction improves recognition accuracy, because the embedding network no longer has to learn invariance to in-plane rotation, scale, and translation.

Usage

This principle applies whenever an application must verify or identify individuals from images or video frames. Common scenarios include:

Face verification: Determining whether two face images belong to the same person (1:1 matching).
Face identification: Matching a probe face against a gallery of known identities (1:N search).
Access control systems: Unlocking devices or granting physical access based on face similarity.
Photo organization: Clustering and tagging photographs by detected identities.
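
The 1:N identification scenario can be sketched as a nearest-neighbor search over stored embeddings. This is a minimal illustration only: the `identify` function, the dictionary-based gallery, and the 0.4 threshold are hypothetical choices made for the example, not part of ncnn or of any specific model.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]


def identify(probe, gallery, threshold=0.4):
    """Return (name, score) of the best gallery match, or (None, score) if no
    gallery entry exceeds the threshold.

    gallery maps an identity name to its embedding (hypothetical layout).
    """
    p = normalize(probe)
    best_name, best_score = None, -1.0
    for name, emb in gallery.items():
        # Cosine similarity is the dot product of unit vectors.
        score = sum(x * y for x, y in zip(p, normalize(emb)))
        if score > best_score:
            best_name, best_score = name, score
    if best_score <= threshold:
        return None, best_score
    return best_name, best_score


# Toy 3-d embeddings; real models emit e.g. 128-d or 512-d vectors.
gallery = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
name, score = identify([0.9, 0.1, 0.0], gallery)
```

Face verification (1:1) is the special case of a gallery with a single entry. A linear scan like this is adequate for small galleries; large galleries usually call for an approximate nearest-neighbor index.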

Theoretical Basis

The affine alignment is computed from landmark correspondences:

Given:
  src_landmarks = detected 5-point landmarks in the original image
  dst_landmarks = canonical reference landmarks (e.g., 112x112 template)

1. Compute the 2x3 affine matrix M that minimizes, over all landmark pairs i:
     M = argmin_M sum_i || M * [x_i, y_i, 1]^T - dst_landmarks[i] ||^2
   where (x_i, y_i) = src_landmarks[i]. This is a linear least-squares
   problem in the six affine parameters.

2. Warp the face region:
     aligned_face = warpAffine(image, M, output_size=(112, 112))
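
The least-squares fit in step 1 can be sketched in plain Python by solving the normal equations once per output coordinate. This is an illustrative sketch, not the ncnn implementation: the helper names (`solve3`, `estimate_affine`, `apply_affine`) are hypothetical, the template coordinates are the five-point 112x112 values commonly circulated with ArcFace-style models and should be treated as an assumption, and the "detected" landmarks are synthetic.

```python
def solve3(a, b):
    """Solve the 3x3 system a * x = b by Gauss-Jordan elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate landmarks (collinear or too few points)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def estimate_affine(src, dst):
    """Least-squares 2x3 affine matrix M mapping src points onto dst points."""
    rows = [(x, y, 1.0) for x, y in src]  # homogeneous source coordinates
    # Normal equations (A^T A) p = A^T b; A^T A is shared by both output rows.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    m = []
    for k in range(2):  # k = 0 fits the x' row, k = 1 fits the y' row
        atb = [sum(r[i] * d[k] for r, d in zip(rows, dst)) for i in range(3)]
        m.append(solve3(ata, atb))
    return m


def apply_affine(m, pt):
    """Map one (x, y) point through the 2x3 matrix m."""
    x, y = pt
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])


# Assumed 112x112 reference template (left eye, right eye, nose, mouth corners).
TEMPLATE = [(38.2946, 51.6963), (73.5318, 51.5014), (56.0252, 71.7366),
            (41.5493, 92.3655), (70.7299, 92.2041)]

# Synthetic detection: the template scaled by 2 and shifted by (100, 50).
detected = [(2.0 * x + 100.0, 2.0 * y + 50.0) for x, y in TEMPLATE]

M = estimate_affine(detected, TEMPLATE)
aligned_points = [apply_affine(M, p) for p in detected]
```

With five landmarks the system is overdetermined (ten equations, six unknowns), so real detections leave a small residual. Many alignment pipelines fit a four-parameter similarity transform instead of a full affine one, so that the face is not sheared; the sketch follows the affine formulation given above.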

The embedding extraction and similarity computation follow:

3. Extract embedding:
     feature_vector = EmbeddingNetwork(aligned_face)   // e.g., 128-d or 512-d vector
     feature_vector = feature_vector / ||feature_vector||   // L2 normalize

4. Compute similarity between two faces:
     similarity = dot(feature_a, feature_b)   // cosine similarity
     is_same_person = similarity > threshold   // model-dependent; typically ~ 0.3-0.5, tuned on validation pairs
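
Steps 3 and 4 can be sketched as follows. The function names and the default threshold of 0.4 are illustrative assumptions; the embeddings themselves would come from whichever network is deployed.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit L2 length."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]


def cosine_similarity(a, b):
    """Cosine similarity, computed as the dot product of the normalized vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def is_same_person(a, b, threshold=0.4):
    """Verification decision; the threshold is a placeholder to be tuned per model."""
    return cosine_similarity(a, b) > threshold


# Toy 4-d embeddings standing in for real network outputs.
feature_a = [0.2, 0.9, -0.1, 0.3]
feature_b = [0.25, 0.85, -0.05, 0.35]
similarity = cosine_similarity(feature_a, feature_b)
```

For unit-length vectors the squared Euclidean distance equals 2 - 2 * similarity, so thresholding on cosine similarity and thresholding on Euclidean distance give equivalent decisions once the threshold is converted.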

The ArcFace loss used to train the embedding network adds an angular margin to the softmax:

$L = - \log \frac{e^{s \cdot \cos(\theta_{y_i} + m)}}{e^{s \cdot \cos(\theta_{y_i} + m)} + \sum_{j \neq y_i} e^{s \cdot \cos \theta_j}}$

where $s$ is a scaling factor, $m$ is the additive angular margin, $\theta_{y_i}$ is the angle between the normalized feature vector of sample $i$ and the weight vector of its correct class $y_i$, and $\theta_j$ is the angle between that feature vector and the weight vector of class $j$.
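
The loss for a single training sample can be evaluated directly from the class cosines, as in the sketch below. The function name is hypothetical, and the defaults s = 64 and m = 0.5 are the values reported in the ArcFace paper, used here as assumptions. The sketch omits the special handling that training code typically applies when the angle plus the margin exceeds pi.

```python
import math


def arcface_loss(cosines, target, s=64.0, m=0.5):
    """ArcFace loss for one sample.

    cosines[j] is cos(theta_j) between the normalized feature and the
    normalized weight vector of class j; target is the correct class index.
    """
    # Clamp before acos to guard against rounding slightly outside [-1, 1].
    c = max(-1.0, min(1.0, cosines[target]))
    theta = math.acos(c)

    logits = [s * cj for cj in cosines]
    logits[target] = s * math.cos(theta + m)  # margin applies to the true class only

    # Numerically stable log-sum-exp over all classes.
    mx = max(logits)
    lse = mx + math.log(sum(math.exp(z - mx) for z in logits))
    return lse - logits[target]  # equals -log(softmax probability of the target)


cosines = [0.8, 0.1, -0.2]  # toy values for a 3-class problem
loss_with_margin = arcface_loss(cosines, target=0, s=10.0, m=0.5)
loss_without_margin = arcface_loss(cosines, target=0, s=10.0, m=0.0)
```

Setting m = 0 recovers a scaled softmax cross-entropy on cosines. A positive margin lowers the target logit, so the same sample incurs a larger loss and the network is pushed to shrink the angle to the correct class further.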
