Principle: HuggingFace Alignment Handbook Tokenizer Loading
| Knowledge Sources | Details |
|---|---|
| Domains | NLP, Preprocessing |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A tokenizer initialization pattern that loads pretrained tokenizers from HuggingFace Hub with optional chat template override for instruction-tuned models.
Description
Tokenizer Loading initializes the text processing component that converts raw text into token IDs for model input. In the alignment-handbook, get_tokenizer wraps AutoTokenizer.from_pretrained and adds chat template configuration: if the training config specifies a chat_template (a Jinja2 template string), it overrides the tokenizer's default template.
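A minimal sketch of this pattern (function names, parameters, and defaults are illustrative, not the handbook's exact signature; the override is split into a small helper for clarity):

```python
def apply_chat_template_override(tokenizer, chat_template):
    # If the training config supplies a Jinja2 template string, it replaces
    # the tokenizer's default template; otherwise the default is kept.
    if chat_template is not None:
        tokenizer.chat_template = chat_template
    return tokenizer


def get_tokenizer(model_name_or_path, chat_template=None,
                  revision="main", trust_remote_code=False):
    # Imported here so the override helper above has no heavy dependencies.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        model_name_or_path,
        revision=revision,
        trust_remote_code=trust_remote_code,
    )
    return apply_chat_template_override(tokenizer, chat_template)
```

The key design point is that the override is unconditional when configured: whatever template ships with the pretrained tokenizer is discarded in favor of the one the training config specifies.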
This is critical for alignment training because the chat template defines how multi-turn conversations are formatted into token sequences. Different models use different chat formats (ChatML, Llama format, custom templates), and the template must match the training data format.
Usage
Use this principle whenever loading a tokenizer for alignment training. The tokenizer is needed by all trainers (SFT, DPO, ORPO) as the processing_class parameter for text-to-token conversion.
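In the alignment-handbook, the override is typically driven from a recipe YAML rather than code; a hedged sketch of the relevant fields (field names follow the handbook's config style but are illustrative, and the template string is truncated):

```yaml
# Model/tokenizer section of a training recipe (illustrative)
model_name_or_path: mistralai/Mistral-7B-v0.1
# Jinja2 string; when present, tokenizer loading overrides the default template
chat_template: "{% for message in messages %}..."
```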
Theoretical Basis
Chat templates use Jinja2 syntax to define conversation formatting; the loading flow that installs them looks like:
# Abstract tokenizer loading flow (NOT real implementation)
tokenizer = load_tokenizer(model_name, revision, trust_remote_code)
if custom_chat_template:
    tokenizer.chat_template = custom_chat_template
# Template converts: [{"role": "user", "content": "Hi"}] -> "<|user|>\nHi<|end|>"
The chat template is a Jinja2 string that controls:
- Role markers (e.g., <|user|>, <|assistant|>)
- Turn separators and end-of-sequence tokens
- System message handling
- Special mode tokens (e.g., <|thinking|> for SmolLM3 reasoning mode)
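To make these points concrete, here is a simplified ChatML-style template rendered with Jinja2 directly. The marker strings are illustrative only; real model templates also handle system messages, end-of-sequence tokens, and generation prompts:

```python
from jinja2 import Template

# Simplified ChatML-style template (illustrative, not any model's exact template).
# Jinja2 attribute access (m.role) falls back to dict lookup, so plain
# {"role": ..., "content": ...} message dicts work unchanged.
CHATML_LIKE = (
    "{% for m in messages %}"
    "<|im_start|>{{ m.role }}\n{{ m.content }}<|im_end|>\n"
    "{% endfor %}"
)

messages = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]

rendered = Template(CHATML_LIKE).render(messages=messages)
print(rendered)
```

Swapping in a different template string changes only the role markers and separators, not the message dicts, which is why the template must match whatever format the training data was rendered with.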