Implementation: Hugging Face Alignment Handbook get_tokenizer
| Knowledge Sources | |
|---|---|
| Domains | NLP, Preprocessing |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool for loading pretrained tokenizers with optional chat template override, provided by the alignment-handbook library.
Description
The get_tokenizer function loads a tokenizer via AutoTokenizer.from_pretrained and optionally overrides its chat template with a custom Jinja2 template from the training configuration. This ensures the tokenizer's conversation formatting matches the expected training data format.
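The override logic can be sketched as follows. This is a minimal illustration of the behavior described above, not the handbook's actual code: `FakeTokenizer`, `TrainingArgs`, and `apply_chat_template_override` are illustrative stand-ins for `PreTrainedTokenizer`, `SFTConfig`, and the relevant lines inside `get_tokenizer`.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FakeTokenizer:
    # Stand-in for a PreTrainedTokenizer's chat_template attribute
    chat_template: Optional[str] = "{{ default }}"

@dataclass
class TrainingArgs:
    # Stand-in for the training config's optional chat_template field
    chat_template: Optional[str] = None

def apply_chat_template_override(tokenizer: FakeTokenizer,
                                 training_args: TrainingArgs) -> FakeTokenizer:
    # Only override when the training config actually supplies a template;
    # otherwise the tokenizer keeps whatever template it shipped with.
    if training_args.chat_template is not None:
        tokenizer.chat_template = training_args.chat_template
    return tokenizer
```

The key design point is that the config is authoritative when present, so the conversation format seen at training time is pinned by the recipe rather than by whichever template the model repository happens to ship.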
Usage
Import this function whenever a tokenizer is needed for alignment training. It is called in every training script alongside get_model.
Code Reference
Source Location
- Repository: alignment-handbook
- File: src/alignment/model_utils.py (lines 23-34)
Signature
def get_tokenizer(model_args: ModelConfig, training_args: SFTConfig) -> PreTrainedTokenizer:
    """Get the tokenizer for the model.

    Args:
        model_args (ModelConfig): Model configuration with model_name_or_path,
            model_revision, and trust_remote_code.
        training_args (SFTConfig): Training configuration with optional
            chat_template field.

    Returns:
        PreTrainedTokenizer: The loaded tokenizer with chat template applied.
    """
Import
from alignment import get_tokenizer
from trl import ModelConfig
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_args | ModelConfig | Yes | Model configuration from TRL |
| model_args.model_name_or_path | str | Yes | HuggingFace model ID or local path for the tokenizer |
| model_args.model_revision | str | No | Git revision for the tokenizer |
| model_args.trust_remote_code | bool | No | Whether to trust remote code for custom tokenizers |
| training_args | SFTConfig | Yes | Training configuration |
| training_args.chat_template | Optional[str] | No | Jinja2 chat template string to override the tokenizer default |
Outputs
| Name | Type | Description |
|---|---|---|
| return | PreTrainedTokenizer | Loaded tokenizer with chat template set (either from model default or overridden by config) |
Usage Examples
Standard Tokenizer Loading
from alignment import get_tokenizer
tokenizer = get_tokenizer(model_args, training_args)
# Apply chat template to conversation
messages = [
{"role": "user", "content": "What is 2+2?"},
{"role": "assistant", "content": "4"},
]
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)
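For a tokenizer whose template is ChatML, the printed string follows the `<|im_start|>…<|im_end|>` pattern. The helper below is a hand-rolled rendering for illustration only; the real tokenizer renders its stored Jinja2 `chat_template` instead.

```python
def render_chatml(messages):
    # Illustrative equivalent of the ChatML format; apply_chat_template
    # produces this layout by rendering the tokenizer's Jinja2 template.
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

messages = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
]
print(render_chatml(messages))
```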
With Chat Template Fallback (SFT Script)
from alignment import get_model, get_tokenizer
from trl import setup_chat_format
tokenizer = get_tokenizer(model_args, training_args)
model = get_model(model_args, training_args)
# If the tokenizer has no chat template, fall back to ChatML
if tokenizer.chat_template is None:
    model, tokenizer = setup_chat_format(model, tokenizer, format="chatml")
DPO/ORPO Padding Token Setup
from alignment import get_tokenizer
tokenizer = get_tokenizer(model_args, training_args)
# DPO and ORPO scripts ensure pad_token is set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
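Setting a pad token matters because batched training pads every sequence in a batch to a common length, which is impossible without a pad token id. A minimal sketch of that padding step (`pad_batch` is a hypothetical helper for illustration, not part of the library):

```python
def pad_batch(seqs, pad_id):
    # Right-pad token-id sequences so every row has the batch's max length;
    # pad_id would come from tokenizer.pad_token_id in real training code.
    max_len = max(len(s) for s in seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in seqs]

print(pad_batch([[1, 2, 3], [4]], 0))  # → [[1, 2, 3], [4, 0, 0]]
```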
Related Pages
Implements Principle
Requires Environment