
Implementation:Huggingface Alignment handbook Get Tokenizer

From Leeroopedia


Knowledge Sources
Domains NLP, Preprocessing
Last Updated 2026-02-07 00:00 GMT

Overview

A utility for loading a pretrained tokenizer, with an optional chat-template override, provided by the alignment-handbook library.

Description

The get_tokenizer function loads a tokenizer via AutoTokenizer.from_pretrained and optionally overrides its chat template with a custom Jinja2 template from the training configuration. This ensures the tokenizer's conversation formatting matches the expected training data format.
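The override logic described above can be sketched in a few lines. This is a minimal, self-contained illustration using stub stand-ins for the real transformers/trl classes (the real function loads via AutoTokenizer.from_pretrained; the stub classes, field defaults, and template string here are assumptions for demonstration only):

```python
from dataclasses import dataclass
from typing import Optional

# Stub stand-ins for trl.ModelConfig, trl.SFTConfig, and a transformers
# tokenizer -- illustration only, not the real classes.
@dataclass
class ModelConfig:
    model_name_or_path: str
    model_revision: str = "main"
    trust_remote_code: bool = False

@dataclass
class SFTConfig:
    chat_template: Optional[str] = None

@dataclass
class StubTokenizer:
    chat_template: Optional[str] = None

def load_pretrained_stub(name: str, revision: str = "main",
                         trust_remote_code: bool = False) -> StubTokenizer:
    # Stand-in for AutoTokenizer.from_pretrained; returns a tokenizer
    # carrying whatever chat template shipped with the model (none here).
    return StubTokenizer(chat_template=None)

def get_tokenizer_sketch(model_args: ModelConfig,
                         training_args: SFTConfig) -> StubTokenizer:
    tokenizer = load_pretrained_stub(
        model_args.model_name_or_path,
        revision=model_args.model_revision,
        trust_remote_code=model_args.trust_remote_code,
    )
    # Override the model's default chat template only when the training
    # config supplies one; otherwise keep the tokenizer's own template.
    if training_args.chat_template is not None:
        tokenizer.chat_template = training_args.chat_template
    return tokenizer

custom = "{% for m in messages %}{{ m['content'] }}{% endfor %}"
tokenizer = get_tokenizer_sketch(
    ModelConfig("my-org/my-model"),
    SFTConfig(chat_template=custom),
)
print(tokenizer.chat_template == custom)  # True: config template wins
```

The key design point is precedence: a template given in the training configuration always replaces the model's default, so the tokenizer formats conversations exactly as the training data expects.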

Usage

Import this function whenever a tokenizer is needed for alignment training; the handbook's training scripts call it alongside get_model.

Code Reference

Source Location

Signature

def get_tokenizer(model_args: ModelConfig, training_args: SFTConfig) -> PreTrainedTokenizer:
    """Get the tokenizer for the model.

    Args:
        model_args (ModelConfig): Model configuration with model_name_or_path,
            model_revision, and trust_remote_code.
        training_args (SFTConfig): Training configuration with optional
            chat_template field.

    Returns:
        PreTrainedTokenizer: The loaded tokenizer with chat template applied.
    """

Import

from alignment import get_tokenizer
from trl import ModelConfig

I/O Contract

Inputs

Name Type Required Description
model_args ModelConfig Yes Model configuration from TRL
model_args.model_name_or_path str Yes HuggingFace model ID or local path for the tokenizer
model_args.model_revision str No Git revision for the tokenizer
model_args.trust_remote_code bool No Whether to trust remote code for custom tokenizers
training_args SFTConfig Yes Training configuration
training_args.chat_template Optional[str] No Jinja2 chat template string to override the tokenizer default

Outputs

Name Type Description
return PreTrainedTokenizer Loaded tokenizer with chat template set (either from model default or overridden by config)

Usage Examples

Standard Tokenizer Loading

from alignment import get_tokenizer

tokenizer = get_tokenizer(model_args, training_args)

# Apply chat template to conversation
messages = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
]
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)

With Chat Template Fallback (SFT Script)

from alignment import get_model, get_tokenizer
from trl import setup_chat_format

tokenizer = get_tokenizer(model_args, training_args)
model = get_model(model_args, training_args)

# If the tokenizer has no chat template, fall back to ChatML
if tokenizer.chat_template is None:
    model, tokenizer = setup_chat_format(model, tokenizer, format="chatml")

DPO/ORPO Padding Token Setup

from alignment import get_tokenizer

tokenizer = get_tokenizer(model_args, training_args)

# DPO and ORPO scripts ensure pad_token is set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
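In practice, the chat_template override is usually supplied through the recipe YAML rather than in code. A hedged sketch of such a recipe fragment (the model name and template string are placeholders, not taken from a real recipe):

```yaml
# Illustrative recipe fragment -- values are placeholders.
model_name_or_path: my-org/my-base-model
chat_template: "{% for message in messages %}{{ '<|' + message['role'] + '|>\n' + message['content'] + '\n' }}{% endfor %}"
```

When this config is parsed into training_args, get_tokenizer applies the template automatically; no extra code is needed in the training script.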

Related Pages

Implements Principle

Requires Environment
