Implementation:EvolvingLMMs Lab Lmms eval Open ASR Utils

Task utility functions for the Open ASR (Automatic Speech Recognition) benchmark, which evaluates speech recognition models using Word Error Rate (WER) metrics.

Location

lmms_eval/tasks/open_asr/utils.py (relative to the repository root)

Overview

Provides audio path extraction, prompt construction, result handling, and WER computation for ASR tasks. The WER routine supports multiple languages (English "en", Chinese "zh", Cantonese "yue") with language-specific text normalization and tokenization.

Core Functions

Document Processing

openasr_doc_to_audio(doc)

Extracts the audio file path from a document

Parameters:

doc - Document dictionary

Process: Tries the keys audio, file, path, audio_path in that order

Returns: List containing the audio file path

Raises: KeyError if no audio field is found

openasr_doc_to_text(doc, lmms_eval_specific_kwargs)

Constructs a fixed ASR prompt

Parameters:

doc - Document dictionary
lmms_eval_specific_kwargs - Task-specific keyword arguments carrying the prompts

Returns: "{pre_prompt}Please recognize the speech and only output the recognized content:{post_prompt}"
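The prompt template above can be sketched as follows. This is a minimal illustration, not the module's code: the function name build_asr_prompt is hypothetical, and the assumption that lmms_eval_specific_kwargs is a dictionary with pre_prompt and post_prompt keys is inferred from the placeholders in the returned string.

```python
from typing import Any, Dict

# The fixed instruction quoted in the Returns line above.
ASR_INSTRUCTION = "Please recognize the speech and only output the recognized content:"


def build_asr_prompt(doc: Dict[str, Any], lmms_eval_specific_kwargs: Dict[str, str]) -> str:
    """Hypothetical stand-in for openasr_doc_to_text.

    Assumes the kwargs dictionary carries 'pre_prompt' and 'post_prompt'.
    The document itself is not consulted because the prompt is fixed.
    """
    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
    return f"{pre_prompt}{ASR_INSTRUCTION}{post_prompt}"


prompt = build_asr_prompt({}, {"pre_prompt": "", "post_prompt": ""})
```

With empty pre- and post-prompts the result is just the fixed instruction, so every sample in the task receives the same text prompt.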

openasr_doc_to_target(doc)

Extracts the ground truth transcription

Parameters:

doc - Document dictionary

Process: Tries the keys text, transcript, gt in that order

Returns: Ground truth string

Raises: KeyError if no target field is found
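The ordered key lookup used for both the audio path and the ground truth can be sketched with one shared helper. The names first_present, doc_to_audio_sketch, and doc_to_target_sketch are hypothetical; the key orders are the ones documented above, and the membership test with `in` is an assumption about how "tries keys in order" is implemented.

```python
from typing import Any, Dict, List, Sequence

AUDIO_KEYS = ("audio", "file", "path", "audio_path")
TARGET_KEYS = ("text", "transcript", "gt")


def first_present(doc: Dict[str, Any], keys: Sequence[str]) -> Any:
    """Return the value of the first key in `keys` that exists in `doc`."""
    for key in keys:
        if key in doc:
            return doc[key]
    raise KeyError(f"none of {tuple(keys)} found in document")


def doc_to_audio_sketch(doc: Dict[str, Any]) -> List[Any]:
    # The audio reference is wrapped in a list, as documented above.
    return [first_present(doc, AUDIO_KEYS)]


def doc_to_target_sketch(doc: Dict[str, Any]) -> str:
    return first_present(doc, TARGET_KEYS)


sample = {"file": "sample.wav", "transcript": "hello world"}
audio = doc_to_audio_sketch(sample)      # ["sample.wav"]
target = doc_to_target_sketch(sample)    # "hello world"
```

Because the keys are tried in a fixed order, a document that carries several candidate fields always resolves to the earliest one in the list.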

Result Processing

openasr_process_result(doc, result)

Packages prediction with ground truth for WER computation

Parameters:

doc - Document
result - Model prediction list

Returns: Dictionary with wer entry containing gt and pred
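A minimal sketch of the returned structure, under two assumptions that the description does not spell out: the prediction is taken as the first element of the result list, and the ground truth is found with the same key order as openasr_doc_to_target. The name process_result_sketch is hypothetical.

```python
from typing import Any, Dict, List

TARGET_KEYS = ("text", "transcript", "gt")


def process_result_sketch(doc: Dict[str, Any], result: List[str]) -> Dict[str, Dict[str, str]]:
    """Pair the ground truth with the prediction under the 'wer' metric key."""
    # Assumption: the ground truth lives under the first matching target key.
    gt = next(doc[key] for key in TARGET_KEYS if key in doc)
    # Assumption: the model returns a list whose first element is the transcript.
    pred = result[0]
    return {"wer": {"gt": gt, "pred": pred}}


packed = process_result_sketch({"text": "hello world"}, ["hello word"])
# packed == {"wer": {"gt": "hello world", "pred": "hello word"}}
```

Keying the entry by the metric name lets the aggregation step collect every gt/pred pair for WER without re-reading the documents.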

Text Normalization

remove_sp(text, language)

Removes special tokens and normalizes spacing

Parameters:

text - Input text
language - Language code ("zh", "en", etc.)

Process:

Removes tokens matching <|.*|>
Collapses consecutive spaces to single space
Removes space before punctuation
Left-strips whitespace
For Chinese: removes all spaces

Returns: Normalized text string
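The steps above can be sketched with the standard re module. This is an illustration rather than the module's code: the function name remove_sp_sketch is hypothetical, the special-token pattern is written non-greedily so that two tokens on one line do not swallow the text between them, and each token is replaced by a space before spaces are collapsed. Both choices are assumptions.

```python
import re

PUNCS = "!,.?;:"


def remove_sp_sketch(text: str, language: str) -> str:
    """Strip <|...|> tokens and tidy spacing, following the documented steps."""
    # 1. Remove special tokens such as <|en|> (non-greedy: an assumption).
    out = re.sub(r"<\|.*?\|>", " ", text)
    # 2. Collapse runs of whitespace into a single space.
    out = re.sub(r"\s+", " ", out)
    # 3. Drop the space that precedes a punctuation mark from PUNCS.
    out = re.sub(f" ?([{re.escape(PUNCS)}])", r"\1", out)
    # 4. Left-strip leading spaces.
    out = out.lstrip(" ")
    # 5. Chinese text carries no word-separating spaces, so remove them all.
    if language == "zh":
        out = re.sub(r"\s+", "", out)
    return out


english = remove_sp_sketch("<|en|> hello , world !", "en")   # "hello, world!"
chinese = remove_sp_sketch("<|zh|> 你 好", "zh")              # "你好"
```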

EvaluationTokenizer Class

Language-aware tokenizer using sacreBLEU tokenizers.

Initialization

EvaluationTokenizer(
    tokenizer_type="13a",
    lowercase=False,
    punctuation_removal=False,
    character_tokenization=False
)

Parameters

tokenizer_type: One of "none", "13a", "intl", "zh", "ja-mecab", "char"
lowercase: Apply lowercasing
punctuation_removal: Remove punctuation tokens
character_tokenization: Tokenize to character level

Constants

SPACE = chr(32) - the ordinary space character
SPACE_ESCAPE = chr(9601) - "▁" (U+2581), a visible stand-in for spaces

Methods

remove_punctuation(sent) (classmethod)

Removes tokens that consist purely of punctuation

Parameters:

sent - String of space-separated tokens

Returns: String with punctuation-only tokens removed

tokenize(sent)

Applies tokenization pipeline

Process:

Apply sacreBLEU tokenizer
Optionally remove punctuation
Optionally tokenize to characters
Optionally lowercase

Returns: Tokenized string
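The post-processing stages of the pipeline can be sketched without sacreBLEU by assuming the "none" tokenizer type, which is treated here as passing the sentence through unchanged. The function names are hypothetical. Two further assumptions: punctuation is identified by Unicode categories starting with "P", and character tokenization replaces SPACE with SPACE_ESCAPE before splitting into characters.

```python
import unicodedata

SPACE = chr(32)
SPACE_ESCAPE = chr(9601)  # "▁"


def remove_punctuation_sketch(sent: str) -> str:
    """Drop tokens whose characters are all Unicode punctuation."""
    return SPACE.join(
        tok
        for tok in sent.split(SPACE)
        if not all(unicodedata.category(ch)[0] == "P" for ch in tok)
    )


def tokenize_sketch(
    sent: str,
    lowercase: bool = False,
    punctuation_removal: bool = False,
    character_tokenization: bool = False,
) -> str:
    """Apply the documented stages in order; the base tokenizer is a no-op."""
    tokenized = sent  # stands in for the sacreBLEU "none" tokenizer
    if punctuation_removal:
        tokenized = remove_punctuation_sketch(tokenized)
    if character_tokenization:
        # Mark original spaces so they survive as explicit tokens.
        tokenized = SPACE.join(list(tokenized.replace(SPACE, SPACE_ESCAPE)))
    if lowercase:
        tokenized = tokenized.lower()
    return tokenized


words = tokenize_sketch("Hello , World !", lowercase=True, punctuation_removal=True)
chars = tokenize_sketch("ab c", character_tokenization=True)
```

A token such as "well-known" is kept because it mixes letters with punctuation; only tokens made entirely of punctuation are dropped.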

WER Computation

compute_wer(refs, hyps, language)

Computes Word Error Rate using edit distance.

Parameters:

refs - List of reference transcriptions
hyps - List of hypothesis transcriptions
language - Language code

Process:

For each ref-hyp pair:
1. Apply language-specific normalization:
  - yue: Convert to simplified Chinese via zhconv
  - en: Apply english_normalizer
  - zh: Apply chinese_normalizer
  - Other: Apply basic_normalizer
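2. Tokenize with EvaluationTokenizer (tokenizer_type="none", lowercase and punctuation removal enabled)
3. For Chinese/Yue: additionally enable character-level tokenization
4. Compute the edit distance between the reference and hypothesis token sequences
After all pairs are processed, return total edit distance / total number of reference tokens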

Returns: WER as a fraction (normally between 0 and 1; it can exceed 1 when hypotheses contain many inserted tokens)
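The corpus-level ratio can be illustrated with a self-contained sketch. It swaps in a plain dynamic-programming Levenshtein distance for the editdistance package and splits on whitespace instead of running the normalizers and EvaluationTokenizer, so it shows only the accumulation of distances and reference lengths. The names token_edit_distance and corpus_wer_sketch are hypothetical.

```python
from typing import List, Sequence


def token_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(
                min(
                    prev[j] + 1,             # deletion of a reference token
                    cur[j - 1] + 1,          # insertion of a hypothesis token
                    prev[j - 1] + (r != h),  # substitution, or match at no cost
                )
            )
        prev = cur
    return prev[-1]


def corpus_wer_sketch(refs: List[str], hyps: List[str]) -> float:
    """Sum distances and reference lengths over all pairs, then divide once."""
    total_distance = 0
    total_ref_tokens = 0
    for ref, hyp in zip(refs, hyps):
        ref_tokens = ref.split()
        hyp_tokens = hyp.split()
        total_distance += token_edit_distance(ref_tokens, hyp_tokens)
        total_ref_tokens += len(ref_tokens)
    # Raises ZeroDivisionError if every reference is empty.
    return total_distance / total_ref_tokens


wer = corpus_wer_sketch(
    ["the cat sat", "hello world"],
    ["the cat sat down", "hello word"],
)
# One insertion plus one substitution over five reference tokens: 2 / 5 = 0.4
```

Dividing the summed distance by the summed reference length weights each utterance by its length, which differs from averaging per-utterance WER values.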

openasr_wer(results, args)

Aggregates WER across all results.

Parameters:

results - List of result dictionaries with gt and pred
args - Arguments (currently unused; language hardcoded to "en")

Process:

Extract ground truth and predictions
Apply remove_sp normalization
Compute WER via compute_wer
Return WER × 100

Returns: WER as a percentage (normally between 0 and 100; it can exceed 100 when hypotheses contain many inserted tokens)
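The aggregation steps can be sketched with the normalization and WER routines passed in as parameters, which keeps the example independent of the module's dependencies. The name aggregate_wer_sketch and the parameters wer_fn and normalize_fn are hypothetical; in the module these roles are played by compute_wer and remove_sp, with the language fixed to "en".

```python
from typing import Callable, Dict, List


def aggregate_wer_sketch(
    results: List[Dict[str, str]],
    wer_fn: Callable[[List[str], List[str], str], float],
    normalize_fn: Callable[[str, str], str],
    language: str = "en",
) -> float:
    """Collect gt/pred pairs, normalize them, and scale the WER to a percentage."""
    refs = [normalize_fn(item["gt"], language) for item in results]
    hyps = [normalize_fn(item["pred"], language) for item in results]
    return wer_fn(refs, hyps, language) * 100


# Toy stand-ins used only to exercise the sketch.
def toy_normalize(text: str, language: str) -> str:
    return text.strip()


def toy_wer(refs: List[str], hyps: List[str], language: str) -> float:
    # Fraction of utterances that differ; not a real WER.
    return sum(r != h for r, h in zip(refs, hyps)) / len(refs)


score = aggregate_wer_sketch(
    [{"gt": "a", "pred": " a"}, {"gt": "b", "pred": "c"}],
    wer_fn=toy_wer,
    normalize_fn=toy_normalize,
)
```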

Note: Contains commented legacy code for multi-source dataset evaluation.

Global Normalizers

Initialized at module level:

english_normalizer: EnglishTextNormalizer()
chinese_normalizer: TextNorm(...) with custom config
basic_normalizer: BasicTextNormalizer()

Chinese normalizer configuration:

All flags set to False (no banjiao conversion, case changes, filler/erhua removal)
Empty cc_mode

Dependencies

os, re, unicodedata
editdistance as ed
zhconv - Chinese variant conversion
lmms_eval.tasks.librispeech.cn_tn.TextNorm
lmms_eval.tasks.librispeech.whisper_normalizer.basic.BasicTextNormalizer
lmms_eval.tasks.librispeech.whisper_normalizer.english.EnglishTextNormalizer
sacrebleu.tokenizers - Various tokenizer implementations

Module Constants

PUNCS = "!,.?;:" - Punctuation characters for normalization
dir_name - Absolute path to module directory
