Implementation: axolotl-ai-cloud/axolotl load_tokenizer
| Knowledge Sources | |
|---|---|
| Domains | NLP, Tokenization |
| Last Updated | 2026-02-06 23:00 GMT |
Overview
Concrete tool provided by the Axolotl framework for loading and configuring tokenizers with special tokens, chat templates, and model-specific patches.
Description
The load_tokenizer function loads a tokenizer from HuggingFace and applies extensive configuration. It handles:
- tokenizer class selection (AutoTokenizer or an explicitly configured type)
- fast vs. slow tokenizer selection
- legacy mode toggle
- special token configuration (pad, BOS, EOS, UNK)
- additional token registration
- chat template application
- Mistral common tokenizer support
- pre-tokenizer-load patches via PatchManager
The function also handles model-specific quirks, such as models whose EOS token must be overridden or tokenizers that require special handling for BOS token stripping.
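These configuration fields are typically set in the Axolotl YAML config. A minimal sketch (the field names come from the signature documented below; the model ID and token values are illustrative):

```yaml
# Hypothetical minimal config fragment; keys match the cfg fields
# documented in this article, values are examples only.
tokenizer_config: meta-llama/Llama-3.2-1B  # path or HF model ID
tokenizer_use_fast: true                   # prefer the fast (Rust) tokenizer
tokenizer_legacy: false                    # disable legacy tokenization mode
special_tokens:
  pad_token: "<|pad|>"
tokens:                                    # additional tokens to register
  - "<|custom_tag|>"
chat_template: llama3
```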
Usage
Called at the start of every training pipeline. The tokenizer is used for dataset tokenization and is also passed to the model loader for embedding resizing.
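The ordering matters because registering additional tokens can grow the vocabulary, so the model loader must resize its embedding matrix to `len(tokenizer)`. A minimal sketch of that dependency with stub objects (the classes below are illustrative stand-ins, not Axolotl or HuggingFace APIs):

```python
# Stub objects that mimic the tokenizer-then-model ordering; only the
# vocabulary-size bookkeeping is modeled here.
class StubTokenizer:
    def __init__(self, vocab_size: int):
        self.vocab = list(range(vocab_size))

    def add_tokens(self, new_tokens: list[str]) -> None:
        # Registering tokens grows the vocabulary.
        self.vocab.extend(new_tokens)

    def __len__(self) -> int:
        return len(self.vocab)


class StubModel:
    def __init__(self, embed_rows: int):
        self.embed_rows = embed_rows

    def resize_token_embeddings(self, n: int) -> None:
        # The embedding matrix must have one row per vocabulary entry.
        self.embed_rows = n


tokenizer = StubTokenizer(32000)
tokenizer.add_tokens(["<|custom_tag|>", "<|end_custom|>"])

model = StubModel(32000)
model.resize_token_embeddings(len(tokenizer))  # now 32002 rows
```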
Code Reference
Source Location
- Repository: axolotl
- File: src/axolotl/loaders/tokenizer.py
- Lines: L124-310
Signature
```python
def load_tokenizer(cfg: DictDefault) -> PreTrainedTokenizer:
    """Load and configure the tokenizer based on the provided config.

    Args:
        cfg: Config with tokenizer_config (path/model ID), tokenizer_type,
            tokenizer_use_fast, special_tokens, tokens, chat_template,
            tokenizer_use_mistral_common, tokenizer_legacy.

    Returns:
        PreTrainedTokenizer: Fully configured tokenizer with special tokens,
            chat template, and model-specific patches applied.
    """
```
Import
```python
from axolotl.loaders.tokenizer import load_tokenizer
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| cfg | DictDefault | Yes | Config with tokenizer_config, tokenizer_type, tokenizer_use_fast, special_tokens (dict), tokens (list), chat_template, tokenizer_use_mistral_common, tokenizer_legacy |
Outputs
| Name | Type | Description |
|---|---|---|
| return | PreTrainedTokenizer | Configured tokenizer with special tokens, chat template, additional tokens, and model-specific patches applied |
Usage Examples
Basic Tokenizer Loading
```python
from axolotl.loaders.tokenizer import load_tokenizer

# cfg.tokenizer_config = "meta-llama/Llama-3.2-1B"
# cfg.chat_template = "llama3"
tokenizer = load_tokenizer(cfg)

print(tokenizer.pad_token)      # "<|finetune_right_pad_id|>"
print(tokenizer.chat_template)  # Llama3 chat template
print(len(tokenizer))           # vocabulary size
```
Tokenizer with Custom Special Tokens
```python
# cfg.special_tokens = {
#     "pad_token": "<|pad|>",
#     "bos_token": "<|begin_of_text|>",
# }
# cfg.tokens = ["<|custom_tag|>", "<|end_custom|>"]
tokenizer = load_tokenizer(cfg)
```