Principle:Tencent Ncnn Text Detection And Recognition
| Knowledge Sources | |
|---|---|
| Domains | Computer Vision, Optical Character Recognition |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
A two-stage optical character recognition pipeline where the first stage detects text regions through segmentation-based methods and the second stage recognizes character sequences within each detected region using CTC or attention-based decoders.
Description
Text detection and recognition (OCR) is a compound vision task that reads text from natural images or documents through two cooperating stages.
The text detection stage identifies where text appears in an image. Modern approaches frame this as a semantic segmentation problem, predicting probability maps that indicate which pixels belong to text regions. A common technique is the Differentiable Binarization (DB) method, which predicts both a probability map and a threshold map. The probability map indicates text likelihood per pixel, while the adaptive threshold map enables differentiable binarization during training. At inference time, a simple fixed threshold is applied to the probability map, and connected component analysis extracts text region contours. These contours are then converted into oriented bounding boxes or polygons that tightly enclose each text instance.
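The inference-time post-processing described above can be sketched in a few lines. This is a minimal illustration, not the DB reference implementation: the function name `extract_text_boxes` is hypothetical, the 0.3 threshold is only an example value, and axis-aligned boxes stand in for the oriented boxes or polygons a real detector would produce.

```python
from collections import deque


def extract_text_boxes(prob_map, threshold=0.3):
    """Threshold a probability map and return one box per connected region.

    Boxes are (x_min, y_min, x_max, y_max) in inclusive pixel coordinates.
    Assumption: axis-aligned boxes replace oriented boxes for simplicity.
    """
    height = len(prob_map)
    width = len(prob_map[0]) if height else 0
    binary = [[p > threshold for p in row] for row in prob_map]
    visited = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or visited[y][x]:
                continue
            # Breadth-first flood fill over 4-connected neighbours.
            queue = deque([(x, y)])
            visited[y][x] = True
            x_min = x_max = x
            y_min = y_max = y
            while queue:
                cx, cy = queue.popleft()
                x_min, x_max = min(x_min, cx), max(x_max, cx)
                y_min, y_max = min(y_min, cy), max(y_max, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and binary[ny][nx] and not visited[ny][nx]:
                        visited[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes


# Toy 4x5 probability map containing two separate text blobs.
prob_map = [
    [0.1, 0.8, 0.9, 0.1, 0.0],
    [0.0, 0.7, 0.6, 0.2, 0.0],
    [0.0, 0.1, 0.0, 0.0, 0.9],
    [0.0, 0.0, 0.0, 0.0, 0.8],
]
boxes = extract_text_boxes(prob_map)
print(boxes)  # [(1, 0, 2, 1), (4, 2, 4, 3)]
```

A production pipeline would typically rely on an image-processing library for contour extraction and minimum-area rectangles; the pure-Python flood fill here only makes the thresholding and grouping steps explicit.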
The text recognition stage takes each detected and cropped text region and decodes it into a character string. The cropped region is typically resized to a fixed height while preserving the aspect ratio, then fed through a sequence recognition network. This network consists of a visual feature extractor (a CNN backbone), an optional sequence modeling component (for example, a BiLSTM), and a decoder that produces the character sequence. Two primary decoding strategies exist:
- CTC (Connectionist Temporal Classification) decoding treats recognition as a sequence labeling problem. The network outputs a probability distribution over characters at each horizontal position, and the CTC algorithm collapses repeated characters and removes blanks to produce the final string.
- Attention-based decoding generates characters autoregressively, attending to relevant parts of the visual feature sequence at each step.
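As a rough illustration of the CTC strategy, the sketch below takes one probability row per horizontal position, picks the most likely class at each position, then collapses repeats and drops blanks. The function name `ctc_greedy_decode` and the convention that class index 0 is the blank are assumptions made for this example; some frameworks place the blank last.

```python
BLANK = 0  # Assumption: class index 0 is the CTC blank.


def ctc_greedy_decode(probs, charset):
    """Greedy CTC decoding.

    probs:   list of per-position probability rows, each of length len(charset) + 1.
    charset: string where charset[i - 1] is the character for class index i.
    """
    # Most likely class at each horizontal position.
    best = [max(range(len(row)), key=row.__getitem__) for row in probs]
    chars = []
    prev = BLANK
    for idx in best:
        # Emit a character only when it is not blank and differs from the previous class.
        if idx != BLANK and idx != prev:
            chars.append(charset[idx - 1])
        prev = idx
    return "".join(chars)


charset = "ab"
probs = [
    [0.10, 0.80, 0.10],  # a
    [0.20, 0.70, 0.10],  # a (repeat, collapsed)
    [0.90, 0.05, 0.05],  # blank (separates the two a's)
    [0.10, 0.60, 0.30],  # a
    [0.10, 0.20, 0.70],  # b
]
decoded = ctc_greedy_decode(probs, charset)
print(decoded)  # aab
```

The blank between the second and third positions is what allows the doubled "a" to survive the collapse; without it the two runs of "a" would merge into one.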
Usage
This principle applies wherever text must be extracted from images:
- Document digitization: Converting scanned documents, receipts, or invoices into editable text.
- Scene text reading: Reading signs, license plates, or product labels from photographs.
- Form processing: Extracting structured data from handwritten or printed forms.
- Translation aids: Reading foreign-language text from camera images for real-time translation.
Theoretical Basis
The text detection stage using the DB (Differentiable Binarization) approach:
// Text Detection
features = Backbone(image) // e.g., ResNet or MobileNet
features = FPN(features) // multi-scale feature fusion
probability_map = ProbabilityHead(features) // P(text) per pixel, shape (H, W)
threshold_map = ThresholdHead(features) // adaptive threshold, shape (H, W)
// During training: differentiable binarization
binary_map = sigmoid(k * (probability_map - threshold_map))
// k is a scaling factor (e.g., 50) that makes the sigmoid approximate a step function
// During inference: simple thresholding with a fixed value (e.g., 0.3)
binary_map = (probability_map > 0.3)
// Extract text regions
contours = find_connected_components(binary_map)
text_boxes = contours_to_oriented_boxes(contours)
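To see why a large scaling factor makes the training-time binarization behave almost like a hard threshold, the short sketch below evaluates the sigmoid formula for a single pixel. The function names are hypothetical, and k = 50 is the example value quoted in the pseudocode.

```python
import math


def differentiable_binarization(p, t, k=50.0):
    """Soft binarization of one pixel: sigmoid(k * (p - t))."""
    return 1.0 / (1.0 + math.exp(-k * (p - t)))


def hard_binarization(p, t):
    """Step function that the soft version approximates."""
    return 1.0 if p > t else 0.0


threshold = 0.3
for p in (0.1, 0.25, 0.3, 0.35, 0.5):
    soft = differentiable_binarization(p, threshold)
    hard = hard_binarization(p, threshold)
    print(f"p={p:.2f}  soft={soft:.4f}  hard={hard:.0f}")
```

Pixels well above or below the threshold map to values very close to 1 or 0, while the function remains smooth near the threshold, so gradients can flow to both the probability map and the threshold map during training.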
The text recognition stage with CTC decoding:
// Text Recognition
for each text_box in text_boxes:
    // Crop and resize text region
    text_crop = perspective_transform(image, text_box)
    text_crop = resize(text_crop, height=48, keep_aspect_ratio=True)
    // Feature extraction
    visual_features = CNN(text_crop) // shape: (W', C)
    sequence_features = BiLSTM(visual_features) // shape: (W', hidden)
    // CTC decoding
    logits = Linear(sequence_features) // shape: (W', num_classes + 1)
    // num_classes = character set size, +1 for CTC blank
    // Greedy CTC decode
    predictions = argmax(logits, dim=-1) // shape: (W',)
    text = ctc_collapse(predictions) // remove blanks and duplicates
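The resize step in the loop above fixes the height and lets the width follow from the aspect ratio. A minimal sketch of that size computation is shown here; the function name `target_size` is hypothetical, and the default height of 48 simply mirrors the value used in the pseudocode.

```python
def target_size(width, height, target_height=48):
    """Return (new_width, new_height) for a crop resized to a fixed height.

    The width is scaled by the same factor as the height so the aspect
    ratio is preserved (up to rounding to whole pixels).
    """
    if width <= 0 or height <= 0:
        raise ValueError("crop dimensions must be positive")
    scale = target_height / height
    new_width = max(1, round(width * scale))
    return new_width, target_height


print(target_size(200, 32))  # (300, 48)
print(target_size(96, 96))   # (48, 48)
```

Because the output width varies with the crop, the number of horizontal positions W' seen by the decoder also varies from one text region to the next.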
The CTC collapse operation:
function ctc_collapse(predictions):
    result = []
    prev = BLANK
    for p in predictions:
        if p != BLANK and p != prev:
            result.append(p)
        prev = p
    return decode_characters(result)
// Example: [H, H, BLANK, e, l, l, l, BLANK, l, o] -> "Hello"