

Principle:Tencent Ncnn Text Detection And Recognition

From Leeroopedia


Knowledge Sources
Domains: Computer Vision, Optical Character Recognition
Last Updated: 2026-02-09 19:00 GMT

Overview

A two-stage optical character recognition pipeline where the first stage detects text regions through segmentation-based methods and the second stage recognizes character sequences within each detected region using CTC or attention-based decoders.

Description

Text detection and recognition (OCR) is a compound vision task that reads text from natural images or documents through two cooperating stages.

The text detection stage identifies where text appears in an image. Many modern approaches frame this as a semantic segmentation problem, predicting probability maps that indicate which pixels belong to text regions. A common technique is the Differentiable Binarization (DB) method, which predicts both a probability map and a threshold map. The probability map gives the text likelihood of each pixel, while the adaptive threshold map makes the binarization step differentiable so that it can be optimized during training. At inference time the threshold map is not needed: a simple fixed threshold is applied to the probability map, and connected component analysis extracts text region contours. These contours are then converted into oriented bounding boxes or polygons that tightly enclose each text instance.
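The effect of the scaling factor in differentiable binarization can be checked numerically. The following is a minimal sketch rather than a reference DB implementation; the function names `soft_binarize` and `hard_binarize` and the sample values are assumptions chosen for illustration.

```python
import math

def soft_binarize(p, t, k=50.0):
    """Differentiable approximation of a step function: sigmoid(k * (p - t))."""
    return 1.0 / (1.0 + math.exp(-k * (p - t)))

def hard_binarize(p, threshold=0.3):
    """Fixed-threshold binarization, as used at inference time."""
    return 1 if p > threshold else 0

# Compare the soft and hard outputs for pixels above, near, and below the threshold
for p, t in [(0.9, 0.3), (0.35, 0.3), (0.1, 0.3)]:
    print(p, t, round(soft_binarize(p, t), 4), hard_binarize(p))
```

With a large k, pixels only slightly above or below their threshold are already pushed towards 1 or 0, while the function stays smooth enough to pass gradients to both maps.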

The text recognition stage takes each detected and cropped text region and decodes it into a character string. The cropped region is typically resized to a fixed height while preserving the aspect ratio, then fed through a sequence recognition network. This network consists of a visual feature extractor (a CNN backbone), an optional sequence modeling component (e.g., a BiLSTM), and a decoder that produces the character sequence. Two primary decoding strategies exist:

  • CTC (Connectionist Temporal Classification) decoding treats recognition as a sequence labeling problem. The network outputs a probability distribution over the character set plus a special blank symbol at each horizontal position, and decoding merges consecutive repeated symbols and then removes blanks to produce the final string.
  • Attention-based decoding generates characters autoregressively, attending to relevant parts of the visual feature sequence at each step.

Usage

This principle applies wherever text must be extracted from images:

  • Document digitization: Converting scanned documents, receipts, or invoices into editable text.
  • Scene text reading: Reading signs, license plates, or product labels from photographs.
  • Form processing: Extracting structured data from handwritten or printed forms.
  • Translation aids: Reading foreign-language text from camera images for real-time translation.

Theoretical Basis

The text detection stage using the DB (Differentiable Binarization) approach:

// Text Detection
features = Backbone(image)               // e.g., ResNet or MobileNet
features = FPN(features)                  // multi-scale feature fusion

probability_map = ProbabilityHead(features)   // P(text) per pixel, shape (H, W)
threshold_map = ThresholdHead(features)       // adaptive threshold, shape (H, W)

// During training: differentiable binarization
binary_map = sigmoid(k * (probability_map - threshold_map))
// k is a scaling factor (e.g., 50) that makes the sigmoid approximate a step function

// During inference: simple thresholding
binary_map = (probability_map > 0.3)

// Extract text regions
contours = find_connected_components(binary_map)
text_boxes = contours_to_oriented_boxes(contours)
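
The thresholding and region extraction steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions: `extract_boxes` is a hypothetical helper, regions are 4-connected, and it returns axis-aligned boxes instead of the oriented boxes or polygons described earlier. It also omits polygon fitting and any box expansion step that a real post-processing stage may apply.

```python
from collections import deque

def extract_boxes(prob_map, threshold=0.3):
    """Threshold a 2D probability map and return one axis-aligned box
    (x_min, y_min, x_max, y_max) per 4-connected text region."""
    h = len(prob_map)
    w = len(prob_map[0]) if h else 0
    binary = [[prob_map[y][x] > threshold for x in range(w)] for y in range(h)]
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not binary[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill over one connected component
            queue = deque([(x, y)])
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if 0 <= nx < w and 0 <= ny < h and binary[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes

# Toy 3x5 probability map with two separate text regions
prob_map = [
    [0.0, 0.9, 0.8, 0.0, 0.0],
    [0.0, 0.7, 0.6, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0, 0.4],
]
print(extract_boxes(prob_map))  # [(1, 0, 2, 1), (4, 1, 4, 2)]
```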

The text recognition stage with CTC decoding:

// Text Recognition
for each text_box in text_boxes:
    // Crop and resize text region
    text_crop = perspective_transform(image, text_box)
    text_crop = resize(text_crop, height=48, keep_aspect_ratio=True)

    // Feature extraction
    visual_features = CNN(text_crop)        // shape: (W', C)
    sequence_features = BiLSTM(visual_features)  // shape: (W', hidden)

    // CTC decoding
    logits = Linear(sequence_features)      // shape: (W', num_classes + 1)
    // num_classes = character set size, +1 for CTC blank

    // Greedy CTC decode
    predictions = argmax(logits, dim=-1)    // shape: (W',)
    text = ctc_collapse(predictions)        // merge consecutive repeats, then drop blanks
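
The aspect-preserving resize in the recognition loop comes down to a small size calculation. The sketch below shows one way to compute the target size; `target_size` and its `max_width` cap are hypothetical names introduced for illustration, and the rounding rule (round the width up) is an assumption, not something the pipeline above prescribes.

```python
def target_size(box_width, box_height, target_height=48, max_width=None):
    """Return (width, height) for a crop scaled to target_height
    while keeping its aspect ratio. Width is rounded up and is at least 1."""
    if box_width <= 0 or box_height <= 0:
        raise ValueError("box dimensions must be positive")
    # Integer ceiling division avoids floating-point rounding surprises
    new_width = (box_width * target_height + box_height - 1) // box_height
    new_width = max(1, new_width)
    if max_width is not None:
        # Optional cap so very long text lines do not produce huge inputs
        new_width = min(new_width, max_width)
    return new_width, target_height

print(target_size(200, 32))                   # (300, 48)
print(target_size(1000, 20, max_width=320))   # (320, 48)
```

Because the width varies per crop, the length W' of the feature sequence also varies, which is why the decoder must handle sequences of different lengths.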

The CTC collapse operation:

function ctc_collapse(predictions):
    result = []
    prev = BLANK
    for p in predictions:
        if p != BLANK and p != prev:
            result.append(p)
        prev = p
    return decode_characters(result)

// Example: [H, H, BLANK, e, l, l, l, BLANK, l, o] -> "Hello"
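
For reference, the greedy decode and collapse steps can be written as runnable Python. This is a minimal sketch: placing the blank at class index 0 and mapping index i to `charset[i - 1]` are assumptions made here, and some frameworks place the blank last instead. The `one_hot` helper exists only to build toy inputs.

```python
BLANK = 0  # assumed index of the CTC blank class

def ctc_greedy_decode(logits, charset):
    """Greedy CTC decoding.

    logits: list of per-timestep score lists, each of length len(charset) + 1.
    Index 0 is assumed to be the blank; index i maps to charset[i - 1].
    """
    # Pick the best-scoring class at every timestep
    predictions = [max(range(len(step)), key=step.__getitem__) for step in logits]
    chars = []
    prev = BLANK
    for p in predictions:
        # Emit a character only when it is not blank and differs from the previous step
        if p != BLANK and p != prev:
            chars.append(charset[p - 1])
        prev = p
    return "".join(chars)

def one_hot(index, size):
    """Toy score vector with all mass on one class."""
    return [1.0 if i == index else 0.0 for i in range(size)]

charset = "Helo"                       # H=1, e=2, l=3, o=4
path = [1, 1, 0, 2, 3, 3, 3, 0, 3, 4]  # H H _ e l l l _ l o
logits = [one_hot(i, len(charset) + 1) for i in path]
print(ctc_greedy_decode(logits, charset))  # Hello
```

The blank between the two runs of "l" is what allows the double letter to survive: without it, the repeated predictions would merge into a single character.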
