Implementation:EvolvingLMMs Lab Lmms eval Open ASR Utils
Task utility functions for the Open ASR (Automatic Speech Recognition) benchmark, which evaluates speech recognition models using Word Error Rate (WER) metrics.
Location
/tmp/kapso_repo_sslb_59s/lmms_eval/tasks/open_asr/utils.py
Overview
Provides audio processing, result handling, and WER computation for ASR tasks. Supports multiple languages (English, Chinese, and Yue, i.e. Cantonese) with language-specific text normalization and tokenization.
Core Functions
Document Processing
openasr_doc_to_audio(doc) - Extracts the audio file path from a document
- Parameters: doc - Document dictionary
- Process: Tries keys in order: audio, file, path, audio_path
- Returns: List containing the audio file path
- Raises: KeyError if no audio field is found
openasr_doc_to_text(doc, lmms_eval_specific_kwargs) - Constructs a fixed ASR prompt
- Parameters: doc, lmms_eval_specific_kwargs (with prompts)
- Returns: "{pre_prompt}Please recognize the speech and only output the recognized content:{post_prompt}"
openasr_doc_to_target(doc) - Extracts the ground truth transcription
- Process: Tries keys in order: text, transcript, gt
- Returns: Ground truth string
- Raises: KeyError if no target field is found
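The three document-processing functions above can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper `first_present` and the `_sketch` function names are hypothetical, and the `pre_prompt` / `post_prompt` dictionary keys are assumed from the placeholders in the prompt template.

```python
def first_present(doc, keys, what):
    """Return the value of the first key in `keys` that exists in `doc` (hypothetical helper)."""
    for key in keys:
        if key in doc:
            return doc[key]
    raise KeyError(f"No {what} field found; tried {list(keys)}")


def doc_to_audio_sketch(doc):
    # Keys are tried in the documented order; the path is wrapped in a list.
    return [first_present(doc, ("audio", "file", "path", "audio_path"), "audio")]


def doc_to_text_sketch(doc, lmms_eval_specific_kwargs):
    # Assumption: prompts are stored under "pre_prompt" and "post_prompt".
    pre_prompt = lmms_eval_specific_kwargs.get("pre_prompt", "")
    post_prompt = lmms_eval_specific_kwargs.get("post_prompt", "")
    return (
        f"{pre_prompt}Please recognize the speech and only output "
        f"the recognized content:{post_prompt}"
    )


def doc_to_target_sketch(doc):
    # Keys are tried in the documented order: text, transcript, gt.
    return first_present(doc, ("text", "transcript", "gt"), "target")
```

Because the lookup order is fixed, a document that carries both `audio` and `path` resolves to `audio`.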
Result Processing
openasr_process_result(doc, result) - Packages the prediction with the ground truth for WER computation
- Parameters:
  - doc - Document
  - result - Model prediction list
- Returns: Dictionary with a wer entry containing gt and pred
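A minimal sketch of the returned structure is shown below. It assumes that the first element of the prediction list is the transcript and that the ground truth is looked up with the same key order as openasr_doc_to_target; the function name and the `target_keys` parameter are hypothetical.

```python
def process_result_sketch(doc, result, target_keys=("text", "transcript", "gt")):
    """Pair one prediction with its reference under the "wer" metric key."""
    # Assumption: the model returns a list and its first element is the transcript.
    pred = result[0] if result else ""
    for key in target_keys:
        if key in doc:
            return {"wer": {"gt": doc[key], "pred": pred}}
    raise KeyError(f"No target field found; tried {list(target_keys)}")
```

Keying the payload by metric name lets an aggregation function such as openasr_wer receive the list of `{"gt": ..., "pred": ...}` pairs for the whole dataset.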
Text Normalization
remove_sp(text, language) - Removes special tokens and normalizes spacing
- Parameters:
  - text - Input text
  - language - Language code ("zh", "en", etc.)
- Process:
  - Removes tokens matching <|.*|>
  - Collapses consecutive spaces to a single space
  - Removes the space before punctuation
  - Left-strips whitespace
  - For Chinese: removes all spaces
- Returns: Normalized text string
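The normalization steps listed above can be sketched with the standard `re` module. The exact regular expressions are assumptions: the special-token pattern is written here in escaped, non-greedy form, each matched token is replaced by a space, and `PUNCS` is taken from the Constants section of this page.

```python
import re

PUNCS = "!,.?;:"  # punctuation set listed under Constants


def remove_sp_sketch(text, language):
    # 1. Drop special tokens such as "<|en|>" (pattern assumed, non-greedy).
    out = re.sub(r"<\|.*?\|>", " ", text)
    # 2. Collapse runs of whitespace into a single space.
    out = re.sub(r"\s+", " ", out)
    # 3. Remove a space that directly precedes a punctuation character.
    out = re.sub(f" ?([{re.escape(PUNCS)}])", r"\1", out)
    # 4. Strip leading spaces.
    out = out.lstrip(" ")
    # 5. Chinese text is written without spaces, so remove them all.
    if language == "zh":
        out = re.sub(r"\s+", "", out)
    return out


print(remove_sp_sketch("<|en|> hello , world", "en"))  # hello, world
```

Note that the punctuation set is ASCII only, so full-width marks such as "。" are left in place by this step.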
EvaluationTokenizer Class
Language-aware tokenizer using sacreBLEU tokenizers.
Initialization
EvaluationTokenizer(
tokenizer_type="13a",
lowercase=False,
punctuation_removal=False,
character_tokenization=False
)
Parameters
- tokenizer_type: One of "none", "13a", "intl", "zh", "ja-mecab", "char"
- lowercase: Apply lowercasing
- punctuation_removal: Remove punctuation tokens
- character_tokenization: Tokenize to character level
Constants
- SPACE = chr(32)
- SPACE_ESCAPE = chr(9601)
Methods
remove_punctuation(sent) (classmethod) - Removes tokens that are purely punctuation
- Parameters: sent - Space-separated tokens
- Returns: String with punctuation-only tokens removed
tokenize(sent) - Applies the tokenization pipeline
- Process:
  - Apply sacreBLEU tokenizer
  - Optionally remove punctuation
  - Optionally tokenize to characters
  - Optionally lowercase
- Returns: Tokenized string
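The pipeline order described above can be illustrated without sacreBLEU by passing the base tokenizer in as a callable. This is a sketch under stated assumptions: `base_tokenizer` is a hypothetical stand-in for the selected sacreBLEU tokenizer (the identity function corresponds to the "none" type), punctuation is detected through Unicode categories starting with "P", and character tokenization is assumed to escape spaces with SPACE_ESCAPE before splitting.

```python
import unicodedata

SPACE = chr(32)
SPACE_ESCAPE = chr(9601)  # the "▁" character


def remove_punctuation_sketch(sent):
    """Drop tokens whose characters are all Unicode punctuation."""
    return SPACE.join(
        tok
        for tok in sent.split(SPACE)
        if not all(unicodedata.category(ch).startswith("P") for ch in tok)
    )


def tokenize_sketch(
    sent,
    base_tokenizer=lambda s: s,  # hypothetical stand-in for a sacreBLEU tokenizer
    lowercase=False,
    punctuation_removal=False,
    character_tokenization=False,
):
    tokenized = base_tokenizer(sent)
    if punctuation_removal:
        tokenized = remove_punctuation_sketch(tokenized)
    if character_tokenization:
        # Assumption: spaces are escaped so word boundaries survive the split.
        tokenized = SPACE.join(list(tokenized.replace(SPACE, SPACE_ESCAPE)))
    if lowercase:
        tokenized = tokenized.lower()
    return tokenized


print(tokenize_sketch("Hello , World !", lowercase=True, punctuation_removal=True))
# hello world
```

A token such as "don't" is kept because it contains letters; only tokens made up entirely of punctuation are removed.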
WER Computation
compute_wer(refs, hyps, language)
Computes Word Error Rate using edit distance.
Parameters:
- refs - List of reference transcriptions
- hyps - List of hypothesis transcriptions
- language - Language code
Process:
- For each ref-hyp pair:
  - Apply language-specific normalization:
    - yue: Convert to simplified Chinese via zhconv
    - en: Apply english_normalizer
    - zh: Apply chinese_normalizer
    - Other: Apply basic_normalizer
  - Tokenize with EvaluationTokenizer (none type, lowercase, punct removal)
  - For Chinese/Yue: character-level tokenization
  - Compute edit distance between token sequences
- Return total distance / total reference length
Returns: WER as a decimal fraction (normally between 0 and 1; it can exceed 1 when the hypotheses contain many insertions)
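The corpus-level computation above can be sketched in pure Python. This is an illustration only: the hand-written Levenshtein function stands in for the `editdistance` package, and lowercasing plus whitespace splitting stands in for the language-specific normalizers and EvaluationTokenizer, so the numbers will differ from the real function on text with punctuation or numerals.

```python
def edit_distance(ref_tokens, hyp_tokens):
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp_tokens) + 1))
    for i, ref_tok in enumerate(ref_tokens, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp_tokens, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def compute_wer_sketch(refs, hyps, language):
    total_distance = 0
    total_ref_len = 0
    for ref, hyp in zip(refs, hyps):
        # Placeholder normalization and tokenization (assumption, see lead-in).
        ref_tokens = ref.lower().split()
        hyp_tokens = hyp.lower().split()
        if language in ("zh", "yue"):
            # Character-level units for Chinese and Yue.
            ref_tokens = list("".join(ref_tokens))
            hyp_tokens = list("".join(hyp_tokens))
        total_distance += edit_distance(ref_tokens, hyp_tokens)
        total_ref_len += len(ref_tokens)
    # Errors are pooled over the corpus before dividing.
    return total_distance / total_ref_len
```

Pooling distances and reference lengths before dividing weights long utterances more heavily than averaging per-utterance WER would. The sketch raises ZeroDivisionError if every reference is empty.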
openasr_wer(results, args)
Aggregates WER across all results.
Parameters:
- results - List of result dictionaries with gt and pred
- args - Arguments (currently unused; language hardcoded to "en")
Process:
- Extract ground truth and predictions
- Apply remove_sp normalization
- Compute WER via compute_wer
- Return WER × 100
Returns: WER as a percentage (normally 0-100; it can exceed 100 when the hypotheses contain many insertions)
Note: Contains commented legacy code for multi-source dataset evaluation.
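The aggregation flow can be sketched as below. To keep the example self-contained, the normalization and WER functions are injected as parameters; the function name and the `remove_sp_fn` / `compute_wer_fn` parameters are hypothetical and do not match the real signature, which takes only `results` and `args`.

```python
def openasr_wer_sketch(results, remove_sp_fn, compute_wer_fn, language="en"):
    """Aggregate per-sample gt/pred pairs into one WER percentage."""
    # The language is fixed to "en", mirroring the hardcoded value noted above.
    refs = [remove_sp_fn(item["gt"], language) for item in results]
    hyps = [remove_sp_fn(item["pred"], language) for item in results]
    return compute_wer_fn(refs, hyps, language) * 100


# Toy stand-ins, for illustration only.
toy_results = [{"gt": "a b c d", "pred": "a b c x"}]
strip_only = lambda text, language: text.strip()
quarter_wer = lambda refs, hyps, language: 0.25
print(openasr_wer_sketch(toy_results, strip_only, quarter_wer))  # 25.0
```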
Global Normalizers
Initialized at module level:
- english_normalizer: EnglishTextNormalizer()
- chinese_normalizer: TextNorm(...) with custom config
- basic_normalizer: BasicTextNormalizer()
Chinese normalizer configuration:
- All flags set to False (no banjiao conversion, case changes, filler/erhua removal)
- Empty cc_mode
Dependencies
- os, re, unicodedata
- editdistance as ed
- zhconv - Chinese variant conversion
- lmms_eval.tasks.librispeech.cn_tn.TextNorm
- lmms_eval.tasks.librispeech.whisper_normalizer.basic.BasicTextNormalizer
- lmms_eval.tasks.librispeech.whisper_normalizer.english.EnglishTextNormalizer
- sacrebleu.tokenizers - Various tokenizer implementations
Constants
- PUNCS = "!,.?;:" - Punctuation characters for normalization
- dir_name - Absolute path to module directory
Related
- Task_Utility_Functions - General task utility pattern
- LibriSpeech_Utils - Related speech recognition utilities
- WER_Metric - Word Error Rate metric details