Principle: BigScience Workshop Petals Tokenization
| Knowledge Sources | |
|---|---|
| Domains | NLP, Preprocessing |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Tokenization is the process of converting raw text into a sequence of integer token IDs that a language model can process, and of converting token IDs back into human-readable text after generation.
Description
Tokenization is a fundamental preprocessing step for all neural language models. A tokenizer splits input text into subword units (tokens) and maps each to a unique integer ID from the model's vocabulary. Modern tokenizers use algorithms like Byte-Pair Encoding (BPE), WordPiece, or SentencePiece to handle out-of-vocabulary words by decomposing them into known subword units.
In the context of distributed inference with Petals, tokenization happens entirely on the client side. The tokenizer must match the model being used — loading the tokenizer from the same model repository ensures vocabulary compatibility. The resulting input_ids tensor is then passed through the distributed model pipeline.
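A minimal sketch of this client-side step, using a toy vocabulary rather than a real model's (the vocabulary, IDs, and matching scheme below are illustrative, not from any actual tokenizer): the point is that only integer IDs, never raw text, enter the distributed pipeline.

```python
# Toy stand-in for a real tokenizer: maps subword strings to integer IDs.
vocab = {"Hello": 0, " world": 1, "<unk>": 2}
inv_vocab = {i: s for s, i in vocab.items()}

def encode(text, vocab):
    """Greedy longest-match encoding of text into token IDs."""
    ids, i = [], 0
    while i < len(text):
        match = None
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                match = text[i:j]
                break
        if match is None:
            ids.append(vocab["<unk>"])  # unknown character fallback
            i += 1
        else:
            ids.append(vocab[match])
            i += len(match)
    return ids

input_ids = encode("Hello world", vocab)  # [0, 1]
# In Petals, a tensor of IDs like this is all that is sent into the
# distributed pipeline; the raw text never leaves the client.
decoded = "".join(inv_vocab[i] for i in input_ids)  # "Hello world"
```

In a real Petals session the tokenizer would instead be loaded from the same Hugging Face repository as the model, which is what guarantees vocabulary compatibility.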
Usage
Use this principle whenever converting between text and token representations for language model input or output. Tokenization is always the first step before model inference and the last step after generation. The tokenizer must be loaded from the same model checkpoint as the model itself.
Theoretical Basis
Byte-Pair Encoding (BPE): The dominant tokenization algorithm for large language models.
- Start with a vocabulary of individual characters
- Iteratively merge the most frequent adjacent pair of tokens
- Repeat until vocabulary reaches target size
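The three steps above can be sketched in pure Python. This is a toy implementation on a tiny corpus, for illustration only; production tokenizers add byte-level fallback, pre-tokenization, and frequency tie-breaking rules not modeled here.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merges: start from single characters, then repeatedly
    merge the most frequent adjacent token pair across the corpus."""
    # Represent each word as a tuple of single-character tokens.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word in the corpus.
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

# Two merges on this corpus first join 'l'+'o', then 'lo'+'w',
# producing the subword "low" shared by all three words.
merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
```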
Key properties:
- Subword decomposition: Unknown words are split into known subword pieces
- Reversible: Token IDs can always be decoded back to text
- Fixed vocabulary: The vocabulary is determined during tokenizer training and frozen
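The subword-decomposition property can be seen with a toy fixed vocabulary: a word never seen whole ("lowest") still tokenizes, because matching falls back to smaller known pieces, and concatenating the pieces reverses the split. The vocabulary and greedy matching scheme here are illustrative assumptions, not a specific tokenizer's algorithm.

```python
def segment(word, vocab):
    """Greedy longest-match segmentation into known subword pieces."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No subword matched: fall back to the single character.
            pieces.append(word[i])
            i += 1
    return pieces

# Fixed vocabulary: subwords plus all single characters.
vocab = {"low", "er", "est", "l", "o", "w", "e", "r", "s", "t"}

segment("lowest", vocab)  # ['low', 'est'] -- unseen word, known pieces
segment("lower", vocab)   # ['low', 'er']
```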
Example (GPT-2 token IDs, via a Hugging Face tokenizer):
from transformers import AutoTokenizer
# Load the tokenizer from the same repository as the model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Encoding (text -> token IDs)
tokens = tokenizer.encode("Hello world") # [15496, 995]
# Decoding (token IDs -> text)
text = tokenizer.decode([15496, 995]) # "Hello world"