Principle:Speechbrain Speechbrain Whisper Data Tokenization Pipeline

Field	Value
Concept	Building data pipelines that tokenize text using Whisper's built-in tokenizer with proper special tokens
Domains	Data_Engineering, ASR, Tokenization
Related Implementation	Implementation:Speechbrain_Speechbrain_Whisper_Dataio_Prepare

Overview

Whisper uses a byte-level BPE tokenizer with special tokens for language, task, and timestamps. The data pipeline must load and resample the audio and tokenize the text so that the model receives correctly formatted inputs during training. Unlike CTC-based ASR, which typically relies on an external tokenizer (SentencePiece, character-level), Whisper must be trained with its own built-in tokenizer so that encoding stays consistent between training and inference.

Tokenizer Structure

Whisper's tokenizer produces sequences with a fixed special-token structure:

[<|startoftranscript|>] [<|language|>] [<|task|>] [<|notimestamps|>] ... text tokens ... [<|endoftext|>]

The special tokens serve distinct roles:

<|startoftranscript|>: Marks the beginning of the transcription sequence (BOS).
Language token: Specifies the language (e.g., <|fr|> for French). Only present for multilingual models.
Task token: Either <|transcribe|> or <|translate|>.
<|notimestamps|>: Indicates that timestamp prediction is disabled.
<|endoftext|>: Marks the end of the sequence (EOS).
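
The ordering of these special tokens can be sketched with a small helper. This is a hypothetical illustration that works on token strings rather than real vocabulary IDs; build_prefix, wrap, and their parameters are names introduced here and are not part of any Whisper tokenizer API. The sketch assumes that passing None for the language or task simply omits that token.

```python
from typing import List, Optional

SOT = "<|startoftranscript|>"
EOT = "<|endoftext|>"
NO_TIMESTAMPS = "<|notimestamps|>"
TASKS = ("transcribe", "translate")


def build_prefix(
    language: Optional[str],
    task: Optional[str] = "transcribe",
    timestamps: bool = False,
) -> List[str]:
    """Return the special tokens that precede the text tokens (illustrative)."""
    if task is not None and task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    prefix = [SOT]
    if language is not None:
        # Per the list above, only multilingual models carry a language token.
        prefix.append(f"<|{language}|>")
    if task is not None:
        # Assumption of this sketch: callers may omit the task token as well.
        prefix.append(f"<|{task}|>")
    if not timestamps:
        prefix.append(NO_TIMESTAMPS)
    return prefix


def wrap(text_tokens: List[str], language: Optional[str], **kwargs) -> List[str]:
    """Surround text tokens with the prefix and the end-of-text marker."""
    return build_prefix(language, **kwargs) + list(text_tokens) + [EOT]


# Text tokens are shown as plain strings purely for readability.
french = wrap(["Bon", "jour"], language="fr")
print(french)
```

Working on strings keeps the sketch independent of any particular checkpoint's vocabulary; a real tokenizer would map each of these markers to an integer ID.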

Pipeline Components

The data pipeline consists of two sub-pipelines:

Audio Pipeline

The audio pipeline loads and resamples audio:

Takes the wav field from the CSV manifest.
Reads the audio file using torchaudio.
If the sample rate differs from the target (16 kHz), applies resampling via torchaudio.transforms.Resample.
Provides the sig (signal) tensor to the batch.
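
The control flow of the audio steps above can be sketched as follows. To keep the example runnable without audio files or extra dependencies, the loader and resampler are passed in as callables; audio_pipeline, fake_load, and naive_resampler are hypothetical names introduced here. In a real setup one would presumably pass torchaudio.load (which returns a waveform and its sample rate) and torchaudio.transforms.Resample (constructed from the original and target rates) instead of the toy stand-ins.

```python
from typing import Callable, Sequence, Tuple

TARGET_SAMPLE_RATE = 16_000  # Hz, the target rate named in the text above

Signal = Sequence[float]


def audio_pipeline(
    wav: str,
    load_audio: Callable[[str], Tuple[Signal, int]],
    make_resampler: Callable[[int, int], Callable[[Signal], Signal]],
    target_sr: int = TARGET_SAMPLE_RATE,
) -> Signal:
    """Load the file named by `wav` and return its signal at `target_sr`."""
    sig, sr = load_audio(wav)
    if sr != target_sr:
        # Build a resampler only when the rates actually differ.
        sig = make_resampler(sr, target_sr)(sig)
    return sig


# --- toy stand-ins so the sketch runs on its own ---
def fake_load(path: str) -> Tuple[Signal, int]:
    # Pretend every file holds 8 samples recorded at 32 kHz.
    return [float(i) for i in range(8)], 32_000


def naive_resampler(orig_sr: int, new_sr: int) -> Callable[[Signal], Signal]:
    # Crude integer-factor decimation; NOT how torchaudio resamples.
    step = orig_sr // new_sr
    return lambda sig: list(sig[::step])


sig = audio_pipeline("utt1.wav", fake_load, naive_resampler)
print(sig)
```

Injecting the loader and resampler is a choice made for this sketch only; it makes the "resample only if needed" branch easy to exercise in isolation.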

Text Pipeline

The text pipeline tokenizes transcription text and creates decoder inputs:

Takes the wrd (word) field from the CSV manifest.
Optionally normalizes text using tokenizer.normalize() if normalized_transcripts is enabled.
Encodes text into token IDs using tokenizer.encode(wrd, add_special_tokens=False).
Wraps tokens with special tokens using tokenizer.build_inputs_with_special_tokens(tokens_list), which prepends [<|startoftranscript|>, <|language|>, <|task|>, <|notimestamps|>] and appends [<|endoftext|>].
Creates tokens_bos: all tokens except the last (decoder input for teacher forcing).
Creates tokens_eos: all tokens except the first (decoder target for loss computation).
Creates tokens: the full token sequence (used for evaluation).
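
The text steps above can be sketched with a stand-in tokenizer. ToyTokenizer is a hypothetical class that only mimics the two method names mentioned above (encode and build_inputs_with_special_tokens); it does word-level lookup with placeholder IDs rather than byte-level BPE, and plain Python lists stand in for the LongTensor outputs. The optional normalization step is left out.

```python
from typing import Dict, List


class ToyTokenizer:
    """Stand-in exposing the two method names used in the text above."""

    PREFIX = [1, 2, 3, 4]  # placeholder IDs: SOT, language, task, notimestamps
    EOS = 0                # placeholder ID for <|endoftext|>

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}

    def encode(self, text: str, add_special_tokens: bool = True) -> List[int]:
        # Word-level lookup (not BPE); new words get IDs starting at 10.
        ids = [self.vocab.setdefault(w, 10 + len(self.vocab)) for w in text.split()]
        return self.build_inputs_with_special_tokens(ids) if add_special_tokens else ids

    def build_inputs_with_special_tokens(self, token_ids: List[int]) -> List[int]:
        return self.PREFIX + list(token_ids) + [self.EOS]


def text_pipeline(wrd: str, tokenizer: ToyTokenizer) -> Dict[str, List[int]]:
    """Produce the four token outputs described above from one transcript."""
    tokens_list = tokenizer.encode(wrd, add_special_tokens=False)
    full = tokenizer.build_inputs_with_special_tokens(tokens_list)
    return {
        "tokens_list": tokens_list,  # text tokens only
        "tokens_bos": full[:-1],     # everything but the last token (decoder input)
        "tokens_eos": full[1:],      # everything but the first token (decoder target)
        "tokens": full,              # full sequence
    }


out = text_pipeline("hello world", ToyTokenizer())
for key, value in out.items():
    print(key, value)
```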

Teacher Forcing

During training, the decoder uses teacher forcing: it receives the ground-truth previous tokens as input (tokens_bos) and is trained to predict the next token at each position (tokens_eos). This is implemented by shifting the full token sequence:

Full:       [BOS, lang, task, notimestamps, t1, t2, ..., tn, EOS]
tokens_bos: [BOS, lang, task, notimestamps, t1, t2, ..., tn]       (decoder input)
tokens_eos: [lang, task, notimestamps, t1, t2, ..., tn, EOS]       (decoder target)
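
To make the alignment concrete, the sketch below pairs each decoder position with the context it sees and the token it is trained to predict. It assumes a causally masked decoder, so position i attends only to positions 0 through i; teacher_forcing_pairs is a hypothetical helper introduced for illustration, and token names are kept as strings for readability.

```python
from typing import List, Tuple


def teacher_forcing_pairs(full: List[str]) -> List[Tuple[List[str], str]]:
    """For each decoder position, pair the visible context with its target."""
    tokens_bos, tokens_eos = full[:-1], full[1:]
    # Position i sees tokens_bos[0..i] and must predict tokens_eos[i].
    return [(tokens_bos[: i + 1], tokens_eos[i]) for i in range(len(tokens_bos))]


full = ["BOS", "lang", "task", "notimestamps", "t1", "t2", "EOS"]
pairs = teacher_forcing_pairs(full)
for context, target in pairs:
    print(f"{' '.join(context):<40} -> {target}")
```

Because both shifted sequences have the same length, the loss can be computed position by position without any further alignment.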

Output Keys

The pipeline produces batches with the following output keys:

id: Utterance identifier string.
sig: Audio signal tensor at 16 kHz (resampled if necessary).
tokens_list: Raw token IDs (without special tokens).
tokens_bos: Full token sequence without the final EOS token (LongTensor), used as decoder input.
tokens_eos: Full token sequence without the initial BOS token (LongTensor), used as the loss target.
tokens: Full token sequence (LongTensor) for evaluation.

Difference from CTC Tokenization

In CTC-based ASR, text is tokenized with an external tokenizer (e.g., SentencePiece) and the token vocabulary is independent of the acoustic model. For Whisper, the tokenizer is an integral part of the model and must be used consistently. The build_inputs_with_special_tokens method handles the language/task prefix insertion automatically based on the tokenizer's configuration, ensuring that training data matches the model's expected input format.
