Implementation:Openai Whisper Split To Word Tokens
Overview
split_to_word_tokens() is a method on the Tokenizer class that segments a list of subword token IDs into word-level groups. It delegates to one of two strategies: unicode-based splitting for languages that are typically written without spaces between words (zh, ja, th, lo, my, yue), or space-based splitting for all other languages. It returns two parallel lists, the words and the token IDs that make up each word.
Source
- File: whisper/tokenizer.py, lines 277-327
- Repository: https://github.com/openai/whisper
- Import: from whisper.tokenizer import Tokenizer
Signatures
Main Method
def split_to_word_tokens(self, tokens: List[int]) -> Tuple[List[str], List[List[int]]]:
Unicode-Based Splitting
def split_tokens_on_unicode(self, tokens: List[int]) -> Tuple[List[str], List[List[int]]]:
Space-Based Splitting
def split_tokens_on_spaces(self, tokens: List[int]) -> Tuple[List[str], List[List[int]]]:
Parameters
| Parameter | Type | Description |
|---|---|---|
| tokens | List[int] | Token IDs to split into word-level groups |
Return Value
Returns a Tuple[List[str], List[List[int]]]:
- First element: List of word strings
- Second element: List of token ID lists, where each inner list contains the token IDs composing the corresponding word
Behavior
split_to_word_tokens (dispatcher)
Source: whisper/tokenizer.py:L277-284
Selects the splitting strategy based on the tokenizer's language:
- For languages typically written without spaces between words (zh, ja, th, lo, my, yue): delegates to split_tokens_on_unicode()
- For all other languages: delegates to split_tokens_on_spaces()
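The dispatch rule can be sketched on its own, without loading a tokenizer. This is a minimal illustration, not the library's code: choose_strategy and UNSPACED_LANGUAGES are hypothetical names introduced here, and the language codes are copied from the list above.

```python
from typing import Optional

# Language codes that the list above assigns to the unicode-based strategy.
# UNSPACED_LANGUAGES is a hypothetical name used only in this sketch.
UNSPACED_LANGUAGES = {"zh", "ja", "th", "lo", "my", "yue"}


def choose_strategy(language: Optional[str]) -> str:
    """Return the name of the method the dispatcher would delegate to."""
    if language in UNSPACED_LANGUAGES:
        return "split_tokens_on_unicode"
    # Every other language, including an unset one, falls through to spaces.
    return "split_tokens_on_spaces"


print(choose_strategy("ja"))  # split_tokens_on_unicode
print(choose_strategy("en"))  # split_tokens_on_spaces
```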
split_tokens_on_unicode
Source: whisper/tokenizer.py:L286-309
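- Decodes the full token sequence once for reference, then walks the tokens in order, accumulating them and decoding the accumulated run after each step.
- Treats a run as complete when its decoded text contains no unicode replacement character (U+FFFD), meaning the tokens end on a valid character boundary. A replacement character that also appears at the same position in the full decoded text is treated as genuine output, not as a sign of an incomplete character.
- Emits each complete run as a separate "word" together with its token IDs, then starts a new run.
- Returns groups that are often a single character, although a token that decodes to several characters stays in one group.
This approach is appropriate for languages without whitespace word delimiters.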
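The boundary check can be tried on a toy byte-level vocabulary. The sketch below is an assumption-laden simplification: TOY_VOCAB, toy_decode, and split_on_unicode are hypothetical stand-ins for the real tokenizer, and the sketch omits any handling of replacement characters that genuinely occur in the text.

```python
from typing import Dict, List, Tuple

REPLACEMENT_CHAR = "\ufffd"

# Hypothetical toy vocabulary mapping token IDs to raw UTF-8 bytes.
# "\u4f60" (U+4F60) is three bytes long and is deliberately split over two tokens.
TOY_VOCAB: Dict[int, bytes] = {
    0: "\u4f60".encode("utf-8")[:2],  # first two bytes of the character
    1: "\u4f60".encode("utf-8")[2:],  # final byte of the same character
    2: "\u597d".encode("utf-8"),      # a complete character (U+597D)
}


def toy_decode(tokens: List[int]) -> str:
    """Decode tokens to text, marking incomplete byte sequences with U+FFFD."""
    raw = b"".join(TOY_VOCAB[t] for t in tokens)
    return raw.decode("utf-8", errors="replace")


def split_on_unicode(tokens: List[int]) -> Tuple[List[str], List[List[int]]]:
    words: List[str] = []
    word_tokens: List[List[int]] = []
    current: List[int] = []
    for token in tokens:
        current.append(token)
        decoded = toy_decode(current)
        # Emit the run only once it decodes without a replacement character.
        if REPLACEMENT_CHAR not in decoded:
            words.append(decoded)
            word_tokens.append(current)
            current = []
    return words, word_tokens


words, groups = split_on_unicode([0, 1, 2])
print(groups)  # [[0, 1], [2]]
```

Tokens 0 and 1 end up in the same group because token 0 alone does not decode to a valid character.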
split_tokens_on_spaces
Source: whisper/tokenizer.py:L311-327
- First calls split_tokens_on_unicode() to get subword groups that each decode to valid unicode.
- Merges a subword that does not start with a space character into the preceding word group.
- Starts a new word group for a subword that begins with a space character. In the upstream source, the first subword, special tokens, and punctuation-only subwords also start a new group.
- Returns the merged words and their combined token ID lists.
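The merge step can be sketched on plain strings. This is a minimal illustration that covers only the leading-space rule; merge_on_spaces is a hypothetical helper, the token IDs are made up, and any further conditions in the real method are left out.

```python
from typing import List, Tuple


def merge_on_spaces(
    subwords: List[str], subword_tokens: List[List[int]]
) -> Tuple[List[str], List[List[int]]]:
    """Merge subwords into words, starting a new word at each leading space."""
    words: List[str] = []
    word_tokens: List[List[int]] = []
    for subword, tokens in zip(subwords, subword_tokens):
        if subword.startswith(" ") or not words:
            # A leading space (or the very first subword) opens a new word.
            words.append(subword)
            word_tokens.append(list(tokens))
        else:
            # Otherwise the subword continues the previous word.
            words[-1] += subword
            word_tokens[-1].extend(tokens)
    return words, word_tokens


# Made-up subwords and token IDs, for illustration only.
words, groups = merge_on_spaces(
    [" Hel", "lo", " wor", "ld"], [[10], [11], [12], [13]]
)
print(words)   # [' Hello', ' world']
print(groups)  # [[10, 11], [12, 13]]
```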
Example Usage
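from whisper.tokenizer import get_tokenizer

tokenizer = get_tokenizer(True, num_languages=100, language="en")
# Any list of token IDs works; a short string is encoded here for illustration.
decoded_token_ids = tokenizer.encode(" Hello world")
words, word_tokens = tokenizer.split_to_word_tokens(decoded_token_ids)
for word, tokens in zip(words, word_tokens):
    print(f"'{word}' -> {tokens}")
# Illustrative output (actual token IDs depend on the vocabulary):
# ' Hello' -> [2425]
# ' world' -> [1002]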
CJK Example
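# Continues from the example above (get_tokenizer is already imported).
tokenizer_zh = get_tokenizer(True, num_languages=100, language="zh")
chinese_token_ids = tokenizer_zh.encode("你好")
words, word_tokens = tokenizer_zh.split_to_word_tokens(chinese_token_ids)
for word, tokens in zip(words, word_tokens):
    print(f"'{word}' -> {tokens}")
# Each entry is a run of tokens that decodes to valid unicode, often a single character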
Language Classification
| Strategy | Languages | Rationale |
|---|---|---|
| Unicode splitting | zh, ja, th, lo, my, yue | No whitespace between words; character-level grouping is appropriate |
| Space splitting | All others (en, fr, de, es, ...) | Words are delimited by spaces in the tokenized text |
Links
Principle:Openai_Whisper_Word_Boundary_Detection