Principle:Openai Whisper Basic Text Normalization

From Leeroopedia
Revision as of 17:23, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Openai_Whisper_Basic_Text_Normalization.md)
Knowledge Sources
Domains: NLP, Text_Normalization, Evaluation
Last Updated: 2026-02-13 22:00 GMT

Overview

A language-agnostic text normalization technique that replaces symbols and punctuation with spaces, optionally strips diacritical marks, and standardizes whitespace to enable fair multilingual speech recognition evaluation.

Description

Basic Text Normalization addresses the need for a universal text preprocessing step that works across all languages supported by Whisper. Unlike English-specific normalization (which handles contractions and number spellings), this approach uses Unicode character properties to identify and remove non-linguistic content.

The key challenge is that raw transcripts may contain annotations (bracketed text, parenthetical notes), punctuation, and symbol characters that do not correspond to spoken words and should not count toward recognition accuracy. Additionally, diacritical marks may or may not be significant depending on the evaluation context. For instance, French accents carry semantic meaning ("ou" means "or", while "où" means "where"), but they may need to be ignored for lenient evaluation.
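The annotation-removal and whitespace steps described above can be sketched with regular expressions. This is a minimal illustration, not necessarily how Whisper implements it: the function name `strip_annotations` and the exact patterns are assumptions made for this example.

```python
import re


def strip_annotations(text: str) -> str:
    """Hypothetical helper: drop bracketed notes, then collapse whitespace."""
    # Remove spans enclosed in <...> or [...], e.g. "[laughter]".
    text = re.sub(r"[<\[][^>\]]*[>\]]", "", text)
    # Remove spans enclosed in (...), e.g. "(inaudible)".
    text = re.sub(r"\([^)]*\)", "", text)
    # Collapse runs of whitespace into one space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


print(strip_annotations("hello [laughter] world (inaudible)  again"))
# hello world again
```

Removing the annotations before collapsing whitespace matters: deleting a bracketed span usually leaves two adjacent spaces behind, which the final substitution cleans up.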

Usage

Use this principle when evaluating Whisper transcription accuracy on non-English languages, or when a lighter-weight normalization is preferred over the full English pipeline. It is appropriate for any language where the primary concern is removing non-speech annotations and standardizing formatting rather than handling language-specific constructs like contractions or number words.

Theoretical Basis

The normalization leverages the Unicode Standard's character categorization system:

  • NFKD (Compatibility Decomposition): Decomposes characters into base characters plus combining marks. For example, "é" becomes "e" + combining acute accent. Removing the resulting nonspacing marks (category Mn) strips diacritics.
  • NFKC (Compatibility Composition): Decomposes and then recomposes characters. This normalizes equivalent representations while preserving diacritics.
  • Unicode Categories: Characters are classified into major categories:
    • M (Mark): Nonspacing marks (Mn), spacing combining marks (Mc), enclosing marks (Me)
    • S (Symbol): Math, currency, modifier, and other symbols
    • P (Punctuation): Connectors, dashes, quotes, brackets, and other punctuation
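The normalization forms and categories listed above can be inspected directly with Python's standard `unicodedata` module. The snippet below is a small sketch for illustration; the sample strings are chosen for this example and do not come from the original page.

```python
import unicodedata

word = "caf\u00e9"  # "café" written with a precomposed é (U+00E9)

nfkd = unicodedata.normalize("NFKD", word)  # e + combining acute (U+0301)
nfkc = unicodedata.normalize("NFKC", word)  # stays precomposed

print(len(word), len(nfkd), len(nfkc))  # 4 5 4

# After decomposition the accent is a separate code point of category Mn.
print([unicodedata.category(c) for c in nfkd])
# ['Ll', 'Ll', 'Ll', 'Ll', 'Mn']

# Dropping Mn code points strips the diacritic.
stripped = "".join(c for c in nfkd if unicodedata.category(c) != "Mn")
print(stripped)  # cafe

# Compatibility forms also unify variants such as the "fi" ligature (U+FB01).
print(unicodedata.normalize("NFKC", "\ufb01"))  # fi

# The first letter of a category code is its major category.
for ch in "$+-_(":
    print(repr(ch), unicodedata.category(ch))
```

The last loop prints `Sc`, `Sm`, `Pd`, `Pc`, and `Ps`, so testing only the first letter (`S` or `P`) is enough to catch every symbol and punctuation subcategory.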

Algorithm:

# Abstract normalization algorithm (diacritic-stripping variant)
for character in NFKD(text):
    if character in ADDITIONAL_DIACRITICS:
        emit replacement (e.g., "oe" for "œ")
    elif category(character) is "Mn" (nonspacing mark):
        skip (strips diacritics)
    elif category(character) starts with M, S, or P:
        emit space (removes remaining marks, symbols, and punctuation)
    else:
        emit character as-is

The ADDITIONAL_DIACRITICS table handles special cases where NFKD decomposition does not produce a useful ASCII approximation, such as ligatures (œ → oe, æ → ae), special letters (ß → ss, ð → d, þ → th), and stroked letters (ł → l, đ → d).
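As a runnable sketch of the algorithm, the following Python uses only the standard library. It is an illustration under stated assumptions rather than the reference implementation: the table holds only the lowercase examples named above, and `remove_symbols` is an assumed diacritic-preserving counterpart built on NFKC.

```python
import unicodedata

# Assumption: a subset of the table, limited to the examples in the text.
ADDITIONAL_DIACRITICS = {
    "œ": "oe", "æ": "ae",             # ligatures
    "ß": "ss", "ð": "d", "þ": "th",   # special letters
    "ł": "l", "đ": "d",               # stroked letters
}


def remove_symbols_and_diacritics(text: str) -> str:
    """Strip diacritics and replace marks, symbols, punctuation with spaces."""
    out = []
    for ch in unicodedata.normalize("NFKD", text):
        category = unicodedata.category(ch)
        if ch in ADDITIONAL_DIACRITICS:
            out.append(ADDITIONAL_DIACRITICS[ch])  # table-driven replacement
        elif category == "Mn":
            continue                               # drop the diacritic
        elif category[0] in "MSP":
            out.append(" ")                        # symbol or punctuation
        else:
            out.append(ch)
    return "".join(out)


def remove_symbols(text: str) -> str:
    """Assumed variant that keeps diacritics by normalizing with NFKC."""
    return "".join(
        " " if unicodedata.category(ch)[0] in "MSP" else ch
        for ch in unicodedata.normalize("NFKC", text)
    )


print(remove_symbols_and_diacritics("straße"))  # strasse
print(remove_symbols_and_diacritics("łódź"))    # lodz
print(remove_symbols("où?"))                    # "où " (trailing space)
```

Both functions can leave runs of spaces where punctuation used to be, so a whitespace-collapsing pass would normally follow. The table lookup comes first because letters such as "ß" and "ł" have no Unicode decomposition, so NFKD alone leaves them unchanged.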
