Implementation: HuggingFace Transformers AutoTokenizer.from_pretrained()
| Knowledge Sources | |
|---|---|
| Domains | NLP, Training, Text Processing |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Concrete tool for instantiating a pretrained tokenizer from a model name or path, provided by the HuggingFace Transformers library.
Description
AutoTokenizer.from_pretrained() is a factory class method that automatically resolves and instantiates the correct tokenizer class for a given pretrained model. It inspects the model's configuration to determine the tokenizer type (e.g., LlamaTokenizer, GPT2Tokenizer, BertTokenizer) and loads the corresponding vocabulary files and special token mappings. The method supports loading from the HuggingFace Hub, local directories, and single vocabulary files.
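This resolution step can be observed directly: the same call site returns a different concrete tokenizer class for each checkpoint. A minimal sketch (the exact class names depend on the installed Transformers version, so the code only prints them rather than asserting specific values):

```python
from transformers import AutoTokenizer

# The same factory call resolves to a different concrete tokenizer class
# per checkpoint, based on each model's configuration. Exact class names
# vary by Transformers version (e.g., BertTokenizerFast on v4.x).
for model_id in ("google-bert/bert-base-uncased", "openai-community/gpt2"):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    print(model_id, "->", type(tokenizer).__name__)
```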
As of Transformers v5, all tokenizers use the fast tokenizers backend by default (backed by the Rust-based HuggingFace Tokenizers library). An alternative sentencepiece backend can be explicitly selected.
Usage
Use AutoTokenizer.from_pretrained() whenever you need a tokenizer that matches a specific pretrained model. This should be called before applying tokenization to your dataset and before passing data to the Trainer.
Code Reference
Source Location
- Repository: transformers
- File: src/transformers/models/auto/tokenization_auto.py (lines 504-615)
Signature
```python
@classmethod
def from_pretrained(
    cls, pretrained_model_name_or_path, *inputs, **kwargs
) -> TokenizersBackend | SentencePieceBackend:
```
Import
```python
from transformers import AutoTokenizer
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| pretrained_model_name_or_path | str or os.PathLike | Yes | Model ID on the HuggingFace Hub (e.g., "google-bert/bert-base-uncased") or path to a local directory containing vocabulary files |
| config | PreTrainedConfig | No | Configuration object to determine the tokenizer class. If not provided, it is loaded from the model path |
| cache_dir | str or os.PathLike | No | Directory to cache downloaded tokenizer files |
| force_download | bool | No | Whether to re-download files even if cached (defaults to False) |
| revision | str | No | Model version to use: a branch name, tag, or commit ID (defaults to "main") |
| tokenizer_type | str | No | Explicit tokenizer type to load, bypassing auto-detection |
| backend | str | No | Backend for tokenization: "tokenizers" (default, Rust-based) or "sentencepiece" |
| trust_remote_code | bool | No | Whether to allow custom tokenizer code from the Hub (defaults to False) |
| **kwargs | dict | No | Additional arguments passed to the tokenizer __init__(), including special tokens (bos_token, eos_token, pad_token, etc.) |
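The **kwargs row above is commonly used to supply a missing special token at load time. A hedged sketch: GPT-2 (here the openai-community/gpt2 checkpoint, chosen for illustration) defines no pad token, so one is passed explicitly:

```python
from transformers import AutoTokenizer

# GPT-2 ships without a pad token. Passing pad_token as a keyword forwards
# it to the tokenizer constructor; reusing the EOS string is a common
# convention so padding does not introduce a new vocabulary entry.
tokenizer = AutoTokenizer.from_pretrained(
    "openai-community/gpt2",
    pad_token="<|endoftext|>",
)
print(tokenizer.pad_token_id == tokenizer.eos_token_id)  # True
```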
Outputs
| Name | Type | Description |
|---|---|---|
| tokenizer | PreTrainedTokenizer | An instantiated tokenizer matching the pretrained model, ready to encode and decode text |
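The returned tokenizer covers both directions of this contract. A quick round trip, sketched with an uncased BERT checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

# Encode a string to token IDs, then decode the IDs back to text.
ids = tokenizer.encode("Hello, world!")
text = tokenizer.decode(ids, skip_special_tokens=True,
                        clean_up_tokenization_spaces=True)
print(text)  # "hello, world!" (the uncased model lowercases input)
```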
Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
tokens = tokenizer("Hello, world!", return_tensors="pt")
print(tokens["input_ids"])
```
Tokenizing a Dataset with dataset.map()
```python
from transformers import AutoTokenizer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# The Llama-2 tokenizer defines no pad token; reuse EOS so that
# padding="max_length" below does not raise an error.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

dataset = load_dataset("imdb")
tokenized_dataset = dataset.map(tokenize_function, batched=True)
```
Using a Specific Backend
```python
from transformers import AutoTokenizer

# Use the sentencepiece backend explicitly
tokenizer = AutoTokenizer.from_pretrained(
    "hf-internal-testing/llama-tokenizer",
    backend="sentencepiece",
)
```
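Loading from a Local Directory
Since from_pretrained() also accepts a local path, a tokenizer saved with save_pretrained() can be reloaded without network access (a sketch using a temporary directory):

```python
import tempfile

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

# save_pretrained() writes the vocabulary and tokenizer config files;
# the directory can then be passed straight back to from_pretrained(),
# so no network access is needed at reload time.
with tempfile.TemporaryDirectory() as tmp_dir:
    tokenizer.save_pretrained(tmp_dir)
    reloaded = AutoTokenizer.from_pretrained(tmp_dir)
    print(reloaded("Hello, world!")["input_ids"])
```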