Implementation: Hugging Face Alignment Handbook get_tokenizer
| Knowledge Sources | |
|---|---|
| Domains | NLP, Preprocessing |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool for loading pretrained tokenizers with optional chat template override, provided by the alignment-handbook library.
Description
The get_tokenizer function loads a tokenizer via AutoTokenizer.from_pretrained and optionally overrides its chat template with a custom Jinja2 template from the training configuration. This ensures the tokenizer's conversation formatting matches the expected training data format.
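The override logic can be sketched as follows. This is a minimal illustration of the behavior described above, not the handbook's actual code: `FakeTokenizer`, `TrainingArgs`, and `apply_chat_template_override` are illustrative stand-ins for `PreTrainedTokenizer`, `SFTConfig`, and the relevant lines inside `get_tokenizer`.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FakeTokenizer:
    # Stand-in for a PreTrainedTokenizer's chat_template attribute
    chat_template: Optional[str] = "{{ default }}"

@dataclass
class TrainingArgs:
    # Stand-in for the training config's optional chat_template field
    chat_template: Optional[str] = None

def apply_chat_template_override(tokenizer: FakeTokenizer,
                                 training_args: TrainingArgs) -> FakeTokenizer:
    # Only override when the training config actually supplies a template;
    # otherwise the tokenizer keeps whatever template it shipped with.
    if training_args.chat_template is not None:
        tokenizer.chat_template = training_args.chat_template
    return tokenizer
```

The key design point is that the config is authoritative when present, so the conversation format seen at training time is pinned by the recipe rather than by whichever template the model repository happens to ship.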
Usage
Import this function whenever a tokenizer is needed for alignment training. It is called in every training script alongside get_model.
Code Reference
Source Location
- Repository: alignment-handbook
- File: src/alignment/model_utils.py (lines 23-34)
Signature
def get_tokenizer(model_args: ModelConfig, training_args: SFTConfig) -> PreTrainedTokenizer:
    """Get the tokenizer for the model.

    Args:
        model_args (ModelConfig): Model configuration with model_name_or_path,
            model_revision, and trust_remote_code.
        training_args (SFTConfig): Training configuration with optional
            chat_template field.

    Returns:
        PreTrainedTokenizer: The loaded tokenizer with chat template applied.
    """
Import
from alignment import get_tokenizer
from trl import ModelConfig
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_args | ModelConfig | Yes | Model configuration from TRL |
| model_args.model_name_or_path | str | Yes | HuggingFace model ID or local path for the tokenizer |
| model_args.model_revision | str | No | Git revision for the tokenizer |
| model_args.trust_remote_code | bool | No | Whether to trust remote code for custom tokenizers |
| training_args | SFTConfig | Yes | Training configuration |
| training_args.chat_template | Optional[str] | No | Jinja2 chat template string to override the tokenizer default |
Outputs
| Name | Type | Description |
|---|---|---|
| return | PreTrainedTokenizer | Loaded tokenizer with chat template set (either from model default or overridden by config) |
Usage Examples
Standard Tokenizer Loading
from alignment import get_tokenizer
tokenizer = get_tokenizer(model_args, training_args)
# Apply chat template to conversation
messages = [
{"role": "user", "content": "What is 2+2?"},
{"role": "assistant", "content": "4"},
]
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)
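For a tokenizer whose template is ChatML, the printed string follows the `<|im_start|>…<|im_end|>` pattern. The helper below is a hand-rolled rendering for illustration only; the real tokenizer renders its stored Jinja2 `chat_template` instead.

```python
def render_chatml(messages):
    # Illustrative equivalent of the ChatML format; apply_chat_template
    # produces this layout by rendering the tokenizer's Jinja2 template.
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

messages = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
]
print(render_chatml(messages))
```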
With Chat Template Fallback (SFT Script)
from alignment import get_model, get_tokenizer
from trl import setup_chat_format
tokenizer = get_tokenizer(model_args, training_args)
model = get_model(model_args, training_args)
# If the tokenizer has no chat template, fall back to ChatML
if tokenizer.chat_template is None:
    model, tokenizer = setup_chat_format(model, tokenizer, format="chatml")
DPO/ORPO Padding Token Setup
from alignment import get_tokenizer
tokenizer = get_tokenizer(model_args, training_args)
# DPO and ORPO scripts ensure pad_token is set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
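Setting a pad token matters because batched training pads every sequence in a batch to a common length, which is impossible without a pad token id. A minimal sketch of that padding step (`pad_batch` is a hypothetical helper for illustration, not part of the library):

```python
def pad_batch(seqs, pad_id):
    # Right-pad token-id sequences so every row has the batch's max length;
    # pad_id would come from tokenizer.pad_token_id in real training code.
    max_len = max(len(s) for s in seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in seqs]

print(pad_batch([[1, 2, 3], [4]], 0))  # → [[1, 2, 3], [4, 0, 0]]
```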
Related Pages
Implements Principle
Requires Environment