Implementation: axolotl-ai-cloud/axolotl load_tokenizer
| Knowledge Sources | |
|---|---|
| Domains | NLP, Tokenization |
| Last Updated | 2026-02-06 23:00 GMT |
Overview
Concrete tool provided by the Axolotl framework for loading and configuring tokenizers with special tokens, chat templates, and model-specific patches.
Description
The load_tokenizer function loads a tokenizer from HuggingFace and applies extensive configuration. It handles:
- tokenizer class selection (AutoTokenizer or an explicitly configured type)
- fast vs. slow tokenizer selection
- legacy mode toggle
- special token configuration (pad, BOS, EOS, UNK)
- additional token registration
- chat template application
- Mistral common tokenizer support
- pre-tokenizer-load patches via PatchManager
The function also handles model-specific quirks, such as models whose EOS token must be overridden or tokenizers that require special handling for BOS token stripping.
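These configuration fields are typically set in the Axolotl YAML config. A minimal sketch (the field names come from the signature documented below; the model ID and token values are illustrative):

```yaml
# Hypothetical minimal config fragment; keys match the cfg fields
# documented in this article, values are examples only.
tokenizer_config: meta-llama/Llama-3.2-1B  # path or HF model ID
tokenizer_use_fast: true                   # prefer the fast (Rust) tokenizer
tokenizer_legacy: false                    # disable legacy tokenization mode
special_tokens:
  pad_token: "<|pad|>"
tokens:                                    # additional tokens to register
  - "<|custom_tag|>"
chat_template: llama3
```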
Usage
Called at the start of every training pipeline. The tokenizer is used for dataset tokenization and is also passed to the model loader for embedding resizing.
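The ordering matters because registering additional tokens can grow the vocabulary, so the model loader must resize its embedding matrix to `len(tokenizer)`. A minimal sketch of that dependency with stub objects (the classes below are illustrative stand-ins, not Axolotl or HuggingFace APIs):

```python
# Stub objects that mimic the tokenizer-then-model ordering; only the
# vocabulary-size bookkeeping is modeled here.
class StubTokenizer:
    def __init__(self, vocab_size: int):
        self.vocab = list(range(vocab_size))

    def add_tokens(self, new_tokens: list[str]) -> None:
        # Registering tokens grows the vocabulary.
        self.vocab.extend(new_tokens)

    def __len__(self) -> int:
        return len(self.vocab)


class StubModel:
    def __init__(self, embed_rows: int):
        self.embed_rows = embed_rows

    def resize_token_embeddings(self, n: int) -> None:
        # The embedding matrix must have one row per vocabulary entry.
        self.embed_rows = n


tokenizer = StubTokenizer(32000)
tokenizer.add_tokens(["<|custom_tag|>", "<|end_custom|>"])

model = StubModel(32000)
model.resize_token_embeddings(len(tokenizer))  # now 32002 rows
```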
Code Reference
Source Location
- Repository: axolotl
- File: src/axolotl/loaders/tokenizer.py
- Lines: L124-310
Signature
```python
def load_tokenizer(cfg: DictDefault) -> PreTrainedTokenizer:
    """Load and configure the tokenizer based on the provided config.

    Args:
        cfg: Config with tokenizer_config (path/model ID), tokenizer_type,
            tokenizer_use_fast, special_tokens, tokens, chat_template,
            tokenizer_use_mistral_common, tokenizer_legacy.

    Returns:
        PreTrainedTokenizer: Fully configured tokenizer with special tokens,
            chat template, and model-specific patches applied.
    """
```
Import
```python
from axolotl.loaders.tokenizer import load_tokenizer
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| cfg | DictDefault | Yes | Config with tokenizer_config, tokenizer_type, tokenizer_use_fast, special_tokens (dict), tokens (list), chat_template, tokenizer_use_mistral_common, tokenizer_legacy |
Outputs
| Name | Type | Description |
|---|---|---|
| return | PreTrainedTokenizer | Configured tokenizer with special tokens, chat template, additional tokens, and model-specific patches applied |
Usage Examples
Basic Tokenizer Loading
```python
from axolotl.loaders.tokenizer import load_tokenizer

# cfg.tokenizer_config = "meta-llama/Llama-3.2-1B"
# cfg.chat_template = "llama3"
tokenizer = load_tokenizer(cfg)

print(tokenizer.pad_token)      # "<|finetune_right_pad_id|>"
print(tokenizer.chat_template)  # Llama3 chat template
print(len(tokenizer))           # vocabulary size
```
Tokenizer with Custom Special Tokens
```python
# cfg.special_tokens = {
#     "pad_token": "<|pad|>",
#     "bos_token": "<|begin_of_text|>",
# }
# cfg.tokens = ["<|custom_tag|>", "<|end_custom|>"]
tokenizer = load_tokenizer(cfg)
```