Implementation:Openai Whisper Split To Word Tokens

From Leeroopedia

Overview

split_to_word_tokens() is a method on the Tokenizer class that segments a list of subword token IDs into word-level groups. It delegates to one of two strategies: unicode-based splitting for languages that are typically written without spaces between words (such as Chinese and Japanese), or space-based splitting for all other languages. It returns parallel lists of words and their corresponding token ID lists.

Source

Signatures

Main Method

def split_to_word_tokens(self, tokens: List[int]) -> Tuple[List[str], List[List[int]]]:

Unicode-Based Splitting

def split_tokens_on_unicode(self, tokens: List[int]) -> Tuple[List[str], List[List[int]]]:

Space-Based Splitting

def split_tokens_on_spaces(self, tokens: List[int]) -> Tuple[List[str], List[List[int]]]:

Parameters

Parameter | Type      | Description
tokens    | List[int] | Token IDs to split into word-level groups

Return Value

Returns a Tuple[List[str], List[List[int]]]:

  • First element: List of word strings
  • Second element: List of token ID lists, where each inner list contains the token IDs composing the corresponding word
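The relationship between the two returned lists can be illustrated with a small check. This is a minimal sketch on hand-written data: the helper name is_aligned and the token IDs are hypothetical, and it assumes every input token ends up in exactly one group.

```python
from typing import List


def is_aligned(tokens: List[int], words: List[str], word_tokens: List[List[int]]) -> bool:
    """Check that the lists are parallel and the groups cover the input in order."""
    flattened = [t for group in word_tokens for t in group]
    return len(words) == len(word_tokens) and flattened == list(tokens)


# Hand-written sample data; the token IDs are made up for illustration.
tokens = [10, 11, 12]
words = [" Hello", " world"]
word_tokens = [[10, 11], [12]]
print(is_aligned(tokens, words, word_tokens))  # True
```

Because the lists are parallel, zip(words, word_tokens) pairs each word with the token IDs that compose it.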

Behavior

split_to_word_tokens (dispatcher)

Source: whisper/tokenizer.py:L277-284

Selects the splitting strategy based on the tokenizer's language:

  • For languages typically written without spaces between words (zh, ja, th, lo, my, yue): delegates to split_tokens_on_unicode()
  • For all other languages: delegates to split_tokens_on_spaces()
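The selection rule can be sketched as a simple set-membership test. The function below is a hypothetical stand-in rather than the library's code; it only returns the name of the method that would be chosen for a given language code.

```python
from typing import Optional

# Language codes listed above as using unicode-based splitting.
UNICODE_SPLIT_LANGUAGES = {"zh", "ja", "th", "lo", "my", "yue"}


def choose_strategy(language: Optional[str]) -> str:
    """Return the name of the splitting method the dispatcher would select."""
    if language in UNICODE_SPLIT_LANGUAGES:
        return "split_tokens_on_unicode"
    return "split_tokens_on_spaces"


print(choose_strategy("ja"))  # split_tokens_on_unicode
print(choose_strategy("en"))  # split_tokens_on_spaces
```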

split_tokens_on_unicode

Source: whisper/tokenizer.py:L286-309

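  1. Accumulates tokens one at a time and decodes the accumulated group to text.
  2. Tracks unicode character boundaries across tokens, since a single character can be encoded across several tokens.
  3. Splits wherever the accumulated tokens decode to valid unicode, that is, with no incomplete character left over.
  4. Returns each decoded group as a separate "word" with its corresponding token IDs. A group is often a single character, but it can contain more when one token covers several characters.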

This approach is appropriate for languages without whitespace word delimiters.
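The idea of splitting at valid unicode boundaries can be sketched with a toy byte-level vocabulary. Everything here is an assumption for illustration: TOY_VOCAB, toy_decode, and split_on_unicode are hypothetical names, and the sketch simplifies by treating any U+FFFD in the decoded text as an incomplete character, so input that genuinely contains U+FFFD is not handled.

```python
from typing import Dict, List, Tuple

REPLACEMENT_CHAR = "\ufffd"  # produced when bytes do not form a complete character

NI_HAO = "你好".encode("utf-8")  # 6 bytes, 3 per character
# Hypothetical vocabulary in which "你" is spread over two tokens.
TOY_VOCAB: Dict[int, bytes] = {
    0: NI_HAO[:2],   # first two bytes of "你"
    1: NI_HAO[2:3],  # last byte of "你"
    2: NI_HAO[3:],   # all three bytes of "好"
}


def toy_decode(tokens: List[int]) -> str:
    """Join the token bytes and decode, marking incomplete characters with U+FFFD."""
    data = b"".join(TOY_VOCAB[t] for t in tokens)
    return data.decode("utf-8", errors="replace")


def split_on_unicode(tokens: List[int]) -> Tuple[List[str], List[List[int]]]:
    words: List[str] = []
    word_tokens: List[List[int]] = []
    current: List[int] = []
    for token in tokens:
        current.append(token)
        decoded = toy_decode(current)
        # Close the group only once the accumulated bytes decode cleanly.
        if REPLACEMENT_CHAR not in decoded:
            words.append(decoded)
            word_tokens.append(current)
            current = []
    return words, word_tokens


print(split_on_unicode([0, 1, 2]))  # (['你', '好'], [[0, 1], [2]])
```

Token 0 alone decodes to a replacement character, so it stays pending until token 1 completes the character.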

split_tokens_on_spaces

Source: whisper/tokenizer.py:L311-327

  1. First calls split_tokens_on_unicode() to get subword groups that each decode to valid unicode.
  2. Then merges each subword that does not start with a space character into the preceding word group.
  3. A subword that starts with a space character begins a new word group. In the upstream implementation, special tokens and standalone punctuation also begin a new group.
  4. Returns the merged words and their combined token ID lists.
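The merge step can be sketched on hand-written subwords. This is a simplified illustration, not the library's code: merge_on_spaces is a hypothetical name, the token IDs are made up, and only the leading-space rule is modeled, leaving out any additional rules the actual method applies (for example for special tokens or punctuation).

```python
from typing import List, Tuple


def merge_on_spaces(
    subwords: List[str], subword_tokens: List[List[int]]
) -> Tuple[List[str], List[List[int]]]:
    words: List[str] = []
    word_tokens: List[List[int]] = []
    for subword, toks in zip(subwords, subword_tokens):
        if subword.startswith(" ") or not words:
            # A leading space (or the very first subword) starts a new word.
            words.append(subword)
            word_tokens.append(list(toks))
        else:
            # Otherwise the subword continues the current word.
            words[-1] += subword
            word_tokens[-1].extend(toks)
    return words, word_tokens


print(merge_on_spaces([" Hel", "lo", " wor", "ld"], [[10], [11], [12], [13]]))
# ([' Hello', ' world'], [[10, 11], [12, 13]])
```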

Example Usage

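from whisper.tokenizer import get_tokenizer

tokenizer = get_tokenizer(True, num_languages=100, language="en")
# Token IDs normally come from a decoding result; encoding a string here
# keeps the example self-contained.
decoded_token_ids = tokenizer.encode(" Hello world")
words, word_tokens = tokenizer.split_to_word_tokens(decoded_token_ids)
for word, tokens in zip(words, word_tokens):
    print(f"'{word}' -> {tokens}")
# Illustrative output (actual token IDs depend on the vocabulary):
# ' Hello' -> [2425]
# ' world' -> [1002]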

CJK Example

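tokenizer_zh = get_tokenizer(True, num_languages=100, language="zh")
# Encode a short string to obtain sample token IDs.
chinese_token_ids = tokenizer_zh.encode("你好世界")
words, word_tokens = tokenizer_zh.split_to_word_tokens(chinese_token_ids)
for word, tokens in zip(words, word_tokens):
    print(f"'{word}' -> {tokens}")
# Each entry is a group of tokens that decodes to valid unicode,
# typically a single character or a short run of characters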

Language Classification

Strategy          | Languages                        | Rationale
Unicode splitting | zh, ja, th, lo, my, yue          | No whitespace between words; character-level grouping is appropriate
Space splitting   | All others (en, fr, de, es, ...) | Words are delimited by spaces in the tokenized text

Links

Principle:Openai_Whisper_Word_Boundary_Detection

Metadata

2025-06-25 00:00 GMT
