Implementation:Openai Whisper Split To Word Tokens

Overview

split_to_word_tokens() is a method on the Tokenizer class that segments a list of subword token IDs into word-level groups. It delegates to one of two strategies: unicode-based splitting for languages that are typically written without spaces between words (zh, ja, th, lo, my, yue), or space-based splitting for all other languages. It returns parallel lists of words and their corresponding token ID lists.

Source

File: whisper/tokenizer.py, lines 277-327
Repository: https://github.com/openai/whisper
Import: from whisper.tokenizer import Tokenizer

Signatures

Main Method

def split_to_word_tokens(self, tokens: List[int]) -> Tuple[List[str], List[List[int]]]:

Unicode-Based Splitting

def split_tokens_on_unicode(self, tokens: List[int]) -> Tuple[List[str], List[List[int]]]:

Space-Based Splitting

def split_tokens_on_spaces(self, tokens: List[int]) -> Tuple[List[str], List[List[int]]]:

Parameters

Parameter	Type	Description
tokens	List[int]	Token IDs to split into word-level groups

Return Value

Returns a Tuple[List[str], List[List[int]]]:

First element: List of word strings
Second element: List of token ID lists, where each inner list contains the token IDs composing the corresponding word
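Because the two lists are parallel, the position of each word inside the original token sequence can be recovered from the lengths of the inner lists. The following is a minimal sketch of that bookkeeping; the helper name word_token_spans and the sample token IDs are hypothetical and not part of Whisper.

```python
from typing import List, Tuple


def word_token_spans(word_tokens: List[List[int]]) -> List[Tuple[int, int]]:
    """Map each word's token group to a (start, end) slice of the flat token list."""
    spans = []
    start = 0
    for group in word_tokens:
        end = start + len(group)  # end index is exclusive
        spans.append((start, end))
        start = end
    return spans


# Hypothetical output of split_to_word_tokens(); the IDs are made up.
words = [" Hello", " world"]
word_tokens = [[101, 102], [103]]

spans = word_token_spans(word_tokens)
for word, span in zip(words, spans):
    print(f"{word!r} covers tokens[{span[0]}:{span[1]}]")
```

The spans assume that every input token ended up in exactly one group, which holds when the token sequence ends on a complete character.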

Behavior

split_to_word_tokens (dispatcher)

Source: whisper/tokenizer.py:L277-284

Selects the splitting strategy based on the tokenizer's language:

For languages typically written without spaces between words (zh, ja, th, lo, my, yue): delegates to split_tokens_on_unicode()
For all other languages: delegates to split_tokens_on_spaces()
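The dispatch rule above amounts to a set-membership test on the language code. A minimal standalone sketch follows; the function choose_strategy and the constant name are hypothetical, and only the language codes come from this page.

```python
# Language codes this page lists for the unicode-based strategy.
UNICODE_SPLIT_LANGUAGES = {"zh", "ja", "th", "lo", "my", "yue"}


def choose_strategy(language: str) -> str:
    """Return the name of the splitting method the dispatcher would delegate to."""
    if language in UNICODE_SPLIT_LANGUAGES:
        return "split_tokens_on_unicode"
    return "split_tokens_on_spaces"


print(choose_strategy("ja"))  # split_tokens_on_unicode
print(choose_strategy("en"))  # split_tokens_on_spaces
```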

split_tokens_on_unicode

Source: whisper/tokenizer.py:L286-309

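Appends tokens one at a time to a buffer and decodes the buffer after each append.
Tracks unicode character boundaries across tokens: a buffer that ends partway through a multi-byte character decodes to the replacement character (U+FFFD) and is kept open.
Splits as soon as the buffer decodes to valid unicode code points, or when the replacement character is genuinely present in the fully decoded text.
Returns each group as a separate "word" with its corresponding token IDs. A group is often a single character, but a token that encodes several characters yields a multi-character group.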

This approach is appropriate for languages without whitespace word delimiters.
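Boundary tracking across tokens can be illustrated without installing Whisper. The sketch below assumes a toy byte-level vocabulary (TOY_VOCAB and the helper names are hypothetical) in which one character is spread over two tokens. It buffers tokens until the buffer decodes without a spurious replacement character, which is our reading of the approach rather than a copy of the upstream code.

```python
from typing import Dict, List, Tuple

REPLACEMENT_CHAR = "\ufffd"

# Hypothetical toy vocabulary: token ID -> raw UTF-8 bytes.
TOY_VOCAB: Dict[int, bytes] = {
    1: "你".encode("utf-8")[:2],  # first two bytes of a three-byte character
    2: "你".encode("utf-8")[2:],  # last byte of that character
    3: "好".encode("utf-8"),      # a whole character in one token
}


def decode(tokens: List[int]) -> str:
    """Decode toy tokens; incomplete byte sequences become U+FFFD."""
    raw = b"".join(TOY_VOCAB[t] for t in tokens)
    return raw.decode("utf-8", errors="replace")


def split_on_unicode(tokens: List[int]) -> Tuple[List[str], List[List[int]]]:
    decoded_full = decode(tokens)
    words: List[str] = []
    word_tokens: List[List[int]] = []
    current: List[int] = []
    offset = 0  # position in decoded_full where the current buffer starts

    for token in tokens:
        current.append(token)
        decoded = decode(current)
        # Flush when the buffer is valid unicode, or when the replacement
        # character really occurs at this position in the full text.
        if (
            REPLACEMENT_CHAR not in decoded
            or decoded_full[offset + decoded.index(REPLACEMENT_CHAR)]
            == REPLACEMENT_CHAR
        ):
            words.append(decoded)
            word_tokens.append(current)
            current = []
            offset += len(decoded)

    return words, word_tokens


words, groups = split_on_unicode([1, 2, 3])
print(words)   # ['你', '好']
print(groups)  # [[1, 2], [3]]
```

Tokens 1 and 2 are grouped because neither decodes to a valid character on its own, while token 3 forms a group by itself.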

split_tokens_on_spaces

Source: whisper/tokenizer.py:L311-327

First calls split_tokens_on_unicode() to get groups of tokens that each decode to valid unicode.
Then merges consecutive subwords that do not start with a space character into the same word group.
A subword that starts with a space character begins a new word group. Special tokens and standalone punctuation also begin a new group, as does the first subword in the sequence.
Returns the merged words and their combined token ID lists.
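The merge step can be sketched as a single pass over the subword groups. The function merge_on_spaces, its first_special_id parameter, and the sample IDs are hypothetical. The sketch also treats standalone punctuation and special-token IDs as group starts, which reflects our reading of the upstream code; treat that detail as an assumption.

```python
import string
from typing import List, Tuple


def merge_on_spaces(
    subwords: List[str],
    subword_tokens: List[List[int]],
    first_special_id: int,
) -> Tuple[List[str], List[List[int]]]:
    """Merge subword groups into words; a leading space starts a new word."""
    words: List[str] = []
    word_tokens: List[List[int]] = []

    for subword, tokens in zip(subwords, subword_tokens):
        special = tokens[0] >= first_special_id  # assumed: special IDs sit at the top of the vocabulary
        with_space = subword.startswith(" ")
        punctuation = subword.strip() in string.punctuation
        if special or with_space or punctuation or not words:
            words.append(subword)
            word_tokens.append(list(tokens))
        else:
            # Continuation piece: glue it onto the previous word.
            words[-1] += subword
            word_tokens[-1].extend(tokens)

    return words, word_tokens


# Made-up subword groups, as if returned by the unicode-splitting step.
subwords = [" Hel", "lo", " wor", "ld", "!"]
subword_tokens = [[10], [11], [12], [13], [14]]

words, groups = merge_on_spaces(subwords, subword_tokens, first_special_id=1000)
print(words)   # [' Hello', ' world', '!']
print(groups)  # [[10, 11], [12, 13], [14]]
```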

Example Usage

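from whisper.tokenizer import get_tokenizer

tokenizer = get_tokenizer(True, num_languages=100, language="en")
# Encode sample text so the example has token IDs to split
decoded_token_ids = tokenizer.encode(" Hello world")
words, word_tokens = tokenizer.split_to_word_tokens(decoded_token_ids)
for word, tokens in zip(words, word_tokens):
    print(f"'{word}' -> {tokens}")
# Example output (exact token IDs depend on the vocabulary):
# ' Hello' -> [2425]
# ' world' -> [1002]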

CJK Example

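from whisper.tokenizer import get_tokenizer

tokenizer_zh = get_tokenizer(True, num_languages=100, language="zh")
# Encode sample text so the example has token IDs to split
chinese_token_ids = tokenizer_zh.encode("你好世界")
words, word_tokens = tokenizer_zh.split_to_word_tokens(chinese_token_ids)
for word, tokens in zip(words, word_tokens):
    print(f"'{word}' -> {tokens}")
# Each entry is a run of tokens that decodes to valid unicode, often a single character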

Language Classification

Strategy	Languages	Rationale
Unicode splitting	zh, ja, th, lo, my, yue	Typically written without spaces between words; tokens are grouped at valid unicode boundaries instead
Space splitting	All others (en, fr, de, es, ...)	Words are delimited by spaces in the tokenized text

Links

Principle:Openai_Whisper_Word_Boundary_Detection

Metadata

2025-06-25 00:00 GMT
