# Principle: Hugging Face Transformers Tokenization
| Knowledge Sources | Details |
|---|---|
| Domains | NLP, Training, Text Processing |
| Last Updated | 2026-02-13 00:00 GMT |
## Overview
Tokenization is the process of converting raw text into a sequence of discrete tokens (integer identifiers) that a neural network can process.
## Description
Transformer models cannot operate directly on strings. Tokenization bridges the gap between human-readable text and model-consumable numerical tensors by segmenting text into sub-word units and mapping each unit to an integer index in the model's vocabulary. The choice of tokenization algorithm (BPE, WordPiece, Unigram, SentencePiece) directly affects the vocabulary size, the ability to handle out-of-vocabulary words, and the sequence lengths the model must process.
In the Hugging Face Transformers library, tokenization is tightly coupled with the model: each pretrained model ships with a matching tokenizer that knows the correct vocabulary, special tokens (e.g., `[CLS]`, `[SEP]`, `<s>`, `</s>`), and encoding rules. Loading the wrong tokenizer for a model produces garbage input and meaningless outputs.
## Usage
Tokenization should be applied:
- After data loading, before feeding data to the model or Trainer.
- Using the tokenizer that matches the pretrained model being fine-tuned.
- Typically via `dataset.map()` with `batched=True` for efficient batch tokenization.
- Whenever text data needs to include special tokens, padding, truncation, or attention masks.
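A minimal sketch of this workflow, assuming the `bert-base-uncased` checkpoint; the `tokenize_function` name and `max_length=128` are illustrative choices, not requirements:

```python
from transformers import AutoTokenizer

# Load the tokenizer that matches the checkpoint being fine-tuned.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(batch):
    # Truncate to a fixed length; in practice padding is often deferred
    # to a data collator, but it is shown here for clarity.
    return tokenizer(batch["text"], padding="max_length",
                     truncation=True, max_length=128)

# With a datasets.Dataset named `dataset`, apply in batches:
# tokenized = dataset.map(tokenize_function, batched=True)
```

Using the matching tokenizer guarantees that the special tokens and vocabulary indices line up with what the model saw during pretraining.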
## Theoretical Basis
Modern tokenization algorithms solve the open-vocabulary problem by decomposing words into sub-word units:
Byte-Pair Encoding (BPE):
BPE iteratively merges the most frequent pair of adjacent symbols. Starting from a character-level vocabulary, it builds up to a target vocabulary size V:
1. Initialize vocabulary with all characters in the corpus.
2. Repeat until |vocabulary| == V:
a. Count frequency of all adjacent symbol pairs.
b. Merge the most frequent pair into a new symbol.
c. Add the new symbol to the vocabulary.
WordPiece:
Similar to BPE but selects merges based on likelihood rather than frequency:
score(pair) = freq(pair) / (freq(first) * freq(second))
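Under this score, a frequent pair built from rare parts outranks a merely frequent pair of ubiquitous parts. A toy illustration with hypothetical frequencies:

```python
def wordpiece_score(pair_freq, first_freq, second_freq):
    # Likelihood-based merge score: dividing by the part frequencies
    # penalizes pairs whose components are already very common.
    return pair_freq / (first_freq * second_freq)

# ("un", "##able") occurs 20 times; "un" 50 times, "##able" 25 times.
score_a = wordpiece_score(20, 50, 25)     # 0.016
# ("th", "##e") occurs 100 times, but both parts are extremely common.
score_b = wordpiece_score(100, 500, 400)  # 0.0005
# WordPiece merges ("un", "##able") first despite its lower raw count.
```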
Tokenizer output:
A tokenizer converts a string into a dictionary containing:
| Key | Description |
|---|---|
| input_ids | Integer token indices in the model vocabulary |
| attention_mask | Binary mask indicating real tokens (1) vs. padding (0) |
| token_type_ids | Segment IDs for models that distinguish sentence pairs |
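A hand-rolled sketch of how batch padding produces the first two fields (the token ids are hypothetical; a real tokenizer returns the same dictionary structure):

```python
def pad_batch(batch_ids, pad_id=0):
    # Pad every sequence to the length of the longest one and build
    # the matching attention mask (1 = real token, 0 = padding).
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
# batch["input_ids"][0]      -> [101, 7592, 102, 0, 0]
# batch["attention_mask"][0] -> [1, 1, 1, 0, 0]
```

The attention mask lets the model ignore padding positions, so sequences of different lengths can share one rectangular tensor.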