
Implementation:Huggingface Transformers AutoTokenizer From Pretrained

From Leeroopedia
Knowledge Sources
Domains NLP, Training, Text Processing
Last Updated 2026-02-13 00:00 GMT

Overview

A concrete tool, provided by the HuggingFace Transformers library, for instantiating a pretrained tokenizer from a model name or path.

Description

AutoTokenizer.from_pretrained() is a factory class method that automatically resolves and instantiates the correct tokenizer class for a given pretrained model. It inspects the model's configuration to determine the tokenizer type (e.g., LlamaTokenizer, GPT2Tokenizer, BertTokenizer) and loads the corresponding vocabulary files and special token mappings. The method supports loading from the HuggingFace Hub, local directories, and single vocabulary files.
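The auto-resolution behavior described above can be observed directly: loading a BERT checkpoint returns a BERT-specific tokenizer object, not an AutoTokenizer instance. A minimal sketch (the exact class name of the returned object depends on the installed Transformers version):

```python
from transformers import AutoTokenizer

# AutoTokenizer inspects the checkpoint's config to pick the
# concrete tokenizer class, then instantiates it.
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

# The returned object is a model-specific tokenizer, not AutoTokenizer itself.
print(type(tokenizer).__name__)  # e.g. BertTokenizerFast, depending on version
```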

As of Transformers v5, all tokenizers use the fast tokenizers backend by default (backed by the Rust-based HuggingFace Tokenizers library). An alternative sentencepiece backend can be explicitly selected.

Usage

Use AutoTokenizer.from_pretrained() whenever you need a tokenizer that matches a specific pretrained model. This should be called before applying tokenization to your dataset and before passing data to the Trainer.

Code Reference

Source Location

  • Repository: transformers
  • File: src/transformers/models/auto/tokenization_auto.py (lines 504-615)

Signature

@classmethod
def from_pretrained(
    cls, pretrained_model_name_or_path, *inputs, **kwargs
) -> TokenizersBackend | SentencePieceBackend:

Import

from transformers import AutoTokenizer

I/O Contract

Inputs

  • pretrained_model_name_or_path (str or os.PathLike, required): Model ID on the HuggingFace Hub (e.g., "google-bert/bert-base-uncased") or a path to a local directory containing vocabulary files.
  • config (PreTrainedConfig, optional): Configuration object used to determine the tokenizer class. If not provided, it is loaded from the model path.
  • cache_dir (str or os.PathLike, optional): Directory in which to cache downloaded tokenizer files.
  • force_download (bool, optional, defaults to False): Whether to re-download files even if they are cached.
  • revision (str, optional, defaults to "main"): Model version to use -- a branch name, tag, or commit ID.
  • tokenizer_type (str, optional): Explicit tokenizer type to load, bypassing auto-detection.
  • backend (str, optional): Backend for tokenization: "tokenizers" (default, Rust-based) or "sentencepiece".
  • trust_remote_code (bool, optional, defaults to False): Whether to allow custom tokenizer code from the Hub.
  • **kwargs (dict, optional): Additional keyword arguments passed to the tokenizer __init__(), including special tokens (bos_token, eos_token, pad_token, etc.).
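A sketch of the special-token pass-through described for **kwargs, combined with revision pinning. The checkpoint "openai-community/gpt2" is used purely for illustration (it ships without a pad token, so supplying one at load time is a common pattern):

```python
from transformers import AutoTokenizer

# Pin a revision and override a special token via **kwargs.
# GPT-2 has no pad token by default, so we supply one here;
# the kwarg is forwarded to the tokenizer's __init__().
tokenizer = AutoTokenizer.from_pretrained(
    "openai-community/gpt2",
    revision="main",           # branch name, tag, or commit ID
    pad_token="<|endoftext|>",
)
print(tokenizer.pad_token)  # <|endoftext|>
```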

Outputs

  • tokenizer (PreTrainedTokenizer): An instantiated tokenizer matching the pretrained model, ready to encode and decode text.
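The "ready to encode and decode text" contract can be checked with a round trip. A minimal sketch using the BERT checkpoint from the examples below (note that the uncased model lowercases its input, so the round trip is not byte-identical):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

# Encode text to token IDs, then decode the IDs back to text.
ids = tokenizer.encode("Hello, world!")
text = tokenizer.decode(ids, skip_special_tokens=True)
print(text)  # "hello, world!" -- lowercased by the uncased model
```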

Usage Examples

Basic Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
tokens = tokenizer("Hello, world!", return_tensors="pt")
print(tokens["input_ids"])

Tokenizing a Dataset with dataset.map()

from transformers import AutoTokenizer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

dataset = load_dataset("imdb")
tokenized_dataset = dataset.map(tokenize_function, batched=True)

Using a Specific Backend

from transformers import AutoTokenizer

# Use the sentencepiece backend explicitly
tokenizer = AutoTokenizer.from_pretrained(
    "hf-internal-testing/llama-tokenizer",
    backend="sentencepiece"
)
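Loading from a Local Directory

The description above notes that from_pretrained() also accepts local directories. A minimal sketch of that round trip, saving a Hub tokenizer to disk with save_pretrained() and reloading it from the path (a temporary directory stands in for a real project folder):

```python
import tempfile

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

# Write the vocabulary and tokenizer config files to a local directory,
# then point from_pretrained() at that path instead of the Hub.
with tempfile.TemporaryDirectory() as tmp:
    tokenizer.save_pretrained(tmp)
    reloaded = AutoTokenizer.from_pretrained(tmp)

# The reloaded tokenizer behaves identically to the original.
print(reloaded("Hello, world!")["input_ids"])
```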

Related Pages

Implements Principle
