
Implementation: Axolotl load_tokenizer (axolotl-ai-cloud)

From Leeroopedia


Knowledge Sources

  • Domains: NLP, Tokenization
  • Last Updated: 2026-02-06 23:00 GMT

Overview

A concrete tool provided by the Axolotl framework for loading and configuring tokenizers with special tokens, chat templates, and model-specific patches.

Description

The load_tokenizer function loads a tokenizer from HuggingFace and applies extensive configuration. It handles:

  • tokenizer class selection (AutoTokenizer or an explicit tokenizer type)
  • fast vs. slow tokenizer selection
  • the legacy mode toggle
  • special token configuration (pad, BOS, EOS, UNK)
  • additional token registration
  • chat template application
  • Mistral common tokenizer support
  • pre-tokenizer-load patches via PatchManager

The function also handles model-specific quirks, such as models whose EOS token must be overridden or tokenizers that require special handling for BOS token stripping.
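The configuration flow described above can be sketched with a stand-in tokenizer. This is a hypothetical stub for illustration only, not the real Axolotl or HuggingFace code; only the config keys (special_tokens, tokens, chat_template) come from this page.

```python
class StubTokenizer:
    """Stand-in for a HuggingFace tokenizer, for illustration only."""

    def __init__(self, vocab):
        self.vocab = dict(vocab)
        self.special_tokens = {}
        self.chat_template = None

    def add_special_tokens(self, mapping):
        # Record each special-token role and register any new token strings.
        added = 0
        for role, token in mapping.items():
            self.special_tokens[role] = token
            if token not in self.vocab:
                self.vocab[token] = len(self.vocab)
                added += 1
        return added

    def add_tokens(self, tokens):
        # Register additional (non-special) tokens, skipping duplicates.
        added = 0
        for token in tokens:
            if token not in self.vocab:
                self.vocab[token] = len(self.vocab)
                added += 1
        return added


def configure(tokenizer, cfg):
    """Mirror the documented config keys: special_tokens, tokens, chat_template."""
    if cfg.get("special_tokens"):
        tokenizer.add_special_tokens(cfg["special_tokens"])
    if cfg.get("tokens"):
        tokenizer.add_tokens(cfg["tokens"])
    if cfg.get("chat_template"):
        tokenizer.chat_template = cfg["chat_template"]
    return tokenizer
```

The real load_tokenizer performs these steps (plus patching and model-specific overrides) against an actual HuggingFace tokenizer instance.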

Usage

Called at the start of every training pipeline. The tokenizer is used for dataset tokenization and is also passed to the model loader for embedding resizing.
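The reason the tokenizer is passed to the model loader: registering special or additional tokens grows the vocabulary, so the model's embedding table must grow to match. A minimal sketch of that resize, using a hypothetical helper name (in practice, HuggingFace models expose this via resize_token_embeddings):

```python
import random


def resize_embeddings(embeddings, new_vocab_size, dim):
    """Grow an embedding table (a list of row vectors) to new_vocab_size rows.

    Hypothetical helper for illustration; the real resize happens inside
    the model loader against the actual embedding matrix.
    """
    while len(embeddings) < new_vocab_size:
        # New rows are randomly initialized: freshly added tokens have no
        # pretrained representation yet.
        embeddings.append([random.gauss(0.0, 0.02) for _ in range(dim)])
    return embeddings


# Pretrained vocab of 10 tokens with 4-dim embeddings; the tokenizer then
# grew to 12 tokens, so two new rows are appended.
table = [[0.0] * 4 for _ in range(10)]
table = resize_embeddings(table, 12, 4)
```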

Code Reference

Source Location

  • Repository: axolotl
  • File: src/axolotl/loaders/tokenizer.py
  • Lines: L124-310

Signature

def load_tokenizer(cfg: DictDefault) -> PreTrainedTokenizer:
    """Load and configure the tokenizer based on the provided config.

    Args:
        cfg: Config with tokenizer_config (path/model ID), tokenizer_type,
             tokenizer_use_fast, special_tokens, tokens, chat_template,
             tokenizer_use_mistral_common, tokenizer_legacy.

    Returns:
        PreTrainedTokenizer: Fully configured tokenizer with special tokens,
        chat template, and model-specific patches applied.
    """

Import

from axolotl.loaders.tokenizer import load_tokenizer

I/O Contract

Inputs

  • Name: cfg
  • Type: DictDefault
  • Required: Yes
  • Description: Config with tokenizer_config, tokenizer_type, tokenizer_use_fast, special_tokens (dict), tokens (list), chat_template, tokenizer_use_mistral_common, tokenizer_legacy

Outputs

  • Name: return
  • Type: PreTrainedTokenizer
  • Description: Configured tokenizer with special tokens, chat template, additional tokens, and model-specific patches applied
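A hypothetical minimal config illustrating the documented keys. Axolotl's DictDefault behaves like a dict with attribute access, so a plain dict stands in here; the specific values are examples, not defaults.

```python
# Example config covering the keys documented in the I/O contract above.
cfg = {
    "tokenizer_config": "meta-llama/Llama-3.2-1B",  # path or model ID
    "tokenizer_type": None,                          # None -> AutoTokenizer
    "tokenizer_use_fast": True,
    "special_tokens": {"pad_token": "<|pad|>"},
    "tokens": ["<|custom_tag|>"],
    "chat_template": "llama3",
    "tokenizer_use_mistral_common": False,
    "tokenizer_legacy": False,
}
```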

Usage Examples

Basic Tokenizer Loading

from axolotl.loaders.tokenizer import load_tokenizer

# cfg.tokenizer_config = "meta-llama/Llama-3.2-1B"
# cfg.chat_template = "llama3"
tokenizer = load_tokenizer(cfg)

print(tokenizer.pad_token)        # "<|finetune_right_pad_id|>"
print(tokenizer.chat_template)    # Llama3 chat template
print(len(tokenizer))             # Vocabulary size

Tokenizer with Custom Special Tokens

# cfg.special_tokens = {
#     "pad_token": "<|pad|>",
#     "bos_token": "<|begin_of_text|>",
# }
# cfg.tokens = ["<|custom_tag|>", "<|end_custom|>"]
tokenizer = load_tokenizer(cfg)

Related Pages

Implements Principle

Requires Environment
