Principle:Speechbrain Speechbrain Whisper Data Tokenization Pipeline
| Field | Value |
|---|---|
| Concept | Building data pipelines that tokenize text using Whisper's built-in tokenizer with proper special tokens |
| Domains | Data_Engineering, ASR, Tokenization |
| Related Implementation | Implementation:Speechbrain_Speechbrain_Whisper_Dataio_Prepare |
Overview
Whisper uses a byte-level BPE tokenizer with special tokens for language, task, and timestamps. The data pipeline must handle audio loading, resampling, and text tokenization correctly so that the model receives properly formatted inputs during training. Unlike CTC-based ASR, which typically relies on an external tokenizer (SentencePiece, character-level), Whisper requires its own built-in tokenizer so that text is encoded the same way in training and inference.
Tokenizer Structure
Whisper's tokenizer produces sequences with a fixed special-token structure:
[<|startoftranscript|>] [<|language|>] [<|task|>] [<|notimestamps|>] ... text tokens ... [<|endoftext|>]
The special tokens serve distinct roles:
- <|startoftranscript|>: Marks the beginning of the transcription sequence (BOS).
- Language token: Specifies the language (e.g., <|fr|> for French). Only present for multilingual models.
- Task token: Either <|transcribe|> or <|translate|>.
- <|notimestamps|>: Indicates that timestamp prediction is disabled.
- <|endoftext|>: Marks the end of the sequence (EOS).
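As a rough illustration of this layout, the sketch below assembles the sequence from token strings rather than real vocabulary IDs, assuming a multilingual model. The helper `wrap_with_special_tokens` and its parameters are hypothetical and exist only for this example; the actual tokenizer performs this step internally.

```python
def wrap_with_special_tokens(text_tokens, language="fr", task="transcribe"):
    """Illustrative only: mimic the special-token layout with plain strings.

    Assumes a multilingual model, so a language token is always inserted.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task}")
    prefix = [
        "<|startoftranscript|>",  # BOS
        f"<|{language}|>",        # language token
        f"<|{task}|>",            # task token
        "<|notimestamps|>",       # timestamp prediction disabled
    ]
    return prefix + list(text_tokens) + ["<|endoftext|>"]  # EOS


sequence = wrap_with_special_tokens(["bon", "jour"], language="fr")
print(sequence)
```

Using strings keeps the example independent of any particular checkpoint, since the numeric IDs of the special tokens depend on the vocabulary in use.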
Pipeline Components
The data pipeline consists of two sub-pipelines:
Audio Pipeline
The audio pipeline loads and resamples audio:
- Takes the wav field from the CSV manifest.
- Reads the audio file using torchaudio.
- If the file's sample rate differs from the 16 kHz target, resamples the signal with torchaudio.transforms.Resample.
- Provides the sig (signal) tensor to the batch.
Text Pipeline
The text pipeline tokenizes transcription text and creates decoder inputs:
- Takes the wrd (word) field from the CSV manifest.
- Optionally normalizes text using tokenizer.normalize() if normalized_transcripts is enabled.
- Encodes text into token IDs using tokenizer.encode(wrd, add_special_tokens=False).
- Wraps tokens with special tokens using tokenizer.build_inputs_with_special_tokens(tokens_list), which prepends [<|startoftranscript|>, <|language|>, <|task|>, <|notimestamps|>] and appends [<|endoftext|>].
- Creates tokens_bos: all tokens except the last (decoder input for teacher forcing).
- Creates tokens_eos: all tokens except the first (decoder target for loss computation).
- Creates tokens: the full token sequence (used for evaluation).
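The steps above can be sketched end to end with a stand-in tokenizer. `ToyTokenizer` is hypothetical: it only mirrors the method names mentioned in this section, and its IDs are made up rather than taken from Whisper's vocabulary. Plain Python lists are used where the real pipeline would build LongTensors, and the `normalized_transcripts` argument is an assumed way of passing the flag.

```python
class ToyTokenizer:
    """Hypothetical stand-in; the IDs below are NOT Whisper's vocabulary."""

    SOT, LANG, TASK, NO_TS, EOT = 1, 2, 3, 4, 5

    def __init__(self):
        self.vocab = {}

    def normalize(self, text):
        # Toy normalization: lowercase and collapse whitespace.
        return " ".join(text.lower().split())

    def encode(self, text, add_special_tokens=False):
        # Assign each unseen word the next free ID, starting at 100.
        ids = [self.vocab.setdefault(w, 100 + len(self.vocab)) for w in text.split()]
        if add_special_tokens:
            return self.build_inputs_with_special_tokens(ids)
        return ids

    def build_inputs_with_special_tokens(self, ids):
        return [self.SOT, self.LANG, self.TASK, self.NO_TS] + ids + [self.EOT]


def text_pipeline(wrd, tokenizer, normalized_transcripts=True):
    """Yield the text-side keys in the order described above."""
    if normalized_transcripts:
        wrd = tokenizer.normalize(wrd)
    tokens_list = tokenizer.encode(wrd, add_special_tokens=False)
    yield "tokens_list", tokens_list
    full = tokenizer.build_inputs_with_special_tokens(tokens_list)
    yield "tokens_bos", full[:-1]  # everything except the final EOS
    yield "tokens_eos", full[1:]   # everything except the leading BOS
    yield "tokens", full


example = dict(text_pipeline("Hello  World", ToyTokenizer()))
print(example)
```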
Teacher Forcing
During training, the decoder uses teacher forcing: it receives the ground-truth previous tokens as input (tokens_bos) and is trained to predict the next token at each position (tokens_eos). This is implemented by shifting the full token sequence:
Full: [BOS, lang, task, notimestamps, t1, t2, ..., tn, EOS]
tokens_bos: [BOS, lang, task, notimestamps, t1, t2, ..., tn] (decoder input)
tokens_eos: [lang, task, notimestamps, t1, t2, ..., tn, EOS] (decoder target)
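To make the alignment concrete, the sketch below pairs each decoder prefix with the token it is trained to predict next. It uses symbolic token names instead of IDs, and `training_pairs` is a hypothetical helper written for this illustration only.

```python
full = ["BOS", "lang", "task", "notimestamps", "t1", "t2", "EOS"]
tokens_bos = full[:-1]  # decoder input
tokens_eos = full[1:]   # decoder target


def training_pairs(tokens_bos, tokens_eos):
    """Pair the prefix visible at each position with its next-token target."""
    if len(tokens_bos) != len(tokens_eos):
        raise ValueError("input and target must have the same length")
    return [(tokens_bos[: i + 1], tokens_eos[i]) for i in range(len(tokens_eos))]


pairs = training_pairs(tokens_bos, tokens_eos)
for context, target in pairs:
    print(context, "->", target)
```

Because both sequences have the same length, position i of the decoder output is compared directly with position i of the target, with no further shifting needed inside the loss.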
Output Keys
The pipeline produces batches with the following output keys:
- id: Utterance identifier string.
- sig: Audio signal tensor (resampled to 16 kHz).
- tokens_list: Raw token IDs (without special tokens).
- tokens_bos: Full token sequence without the final <|endoftext|> (LongTensor), used as decoder input.
- tokens_eos: Full token sequence without the leading <|startoftranscript|> (LongTensor), used as the loss target.
- tokens: Full token sequence (LongTensor) for evaluation.
Difference from CTC Tokenization
In CTC-based ASR, text is tokenized with an external tokenizer (e.g., SentencePiece) and the token vocabulary is independent of the acoustic model. For Whisper, the tokenizer is an integral part of the model and must be used consistently. The build_inputs_with_special_tokens method handles the language/task prefix insertion automatically based on the tokenizer's configuration, ensuring that training data matches the model's expected input format.