Implementation: HuggingFace Transformers AutoTokenizer.from_pretrained()
| Knowledge Sources | |
|---|---|
| Domains | NLP, Training, Text Processing |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Concrete tool for instantiating a pretrained tokenizer from a model name or path, provided by the HuggingFace Transformers library.
Description
AutoTokenizer.from_pretrained() is a factory class method that automatically resolves and instantiates the correct tokenizer class for a given pretrained model. It inspects the model's configuration to determine the tokenizer type (e.g., LlamaTokenizer, GPT2Tokenizer, BertTokenizer) and loads the corresponding vocabulary files and special token mappings. The method supports loading from the HuggingFace Hub, local directories, and single vocabulary files.
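This resolution step can be observed directly: the same call site returns a different concrete tokenizer class for each checkpoint. A minimal sketch (the exact class names depend on the installed Transformers version, so the code only prints them rather than asserting specific values):

```python
from transformers import AutoTokenizer

# The same factory call resolves to a different concrete tokenizer class
# per checkpoint, based on each model's configuration. Exact class names
# vary by Transformers version (e.g., BertTokenizerFast on v4.x).
for model_id in ("google-bert/bert-base-uncased", "openai-community/gpt2"):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    print(model_id, "->", type(tokenizer).__name__)
```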
As of Transformers v5, all tokenizers use the fast tokenizers backend by default (backed by the Rust-based HuggingFace Tokenizers library). An alternative sentencepiece backend can be explicitly selected.
Usage
Use AutoTokenizer.from_pretrained() whenever you need a tokenizer that matches a specific pretrained model. This should be called before applying tokenization to your dataset and before passing data to the Trainer.
Code Reference
Source Location
- Repository: transformers
- File: src/transformers/models/auto/tokenization_auto.py (lines 504-615)
Signature
```python
@classmethod
def from_pretrained(
    cls, pretrained_model_name_or_path, *inputs, **kwargs
) -> TokenizersBackend | SentencePieceBackend:
```
Import
```python
from transformers import AutoTokenizer
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| pretrained_model_name_or_path | str or os.PathLike | Yes | Model ID on the HuggingFace Hub (e.g., "google-bert/bert-base-uncased") or path to a local directory containing vocabulary files |
| config | PreTrainedConfig | No | Configuration object to determine the tokenizer class. If not provided, it is loaded from the model path |
| cache_dir | str or os.PathLike | No | Directory to cache downloaded tokenizer files |
| force_download | bool | No | Whether to re-download files even if cached (defaults to False) |
| revision | str | No | Model version to use: a branch name, tag, or commit ID (defaults to "main") |
| tokenizer_type | str | No | Explicit tokenizer type to load, bypassing auto-detection |
| backend | str | No | Backend for tokenization: "tokenizers" (default, Rust-based) or "sentencepiece" |
| trust_remote_code | bool | No | Whether to allow custom tokenizer code from the Hub (defaults to False) |
| **kwargs | dict | No | Additional arguments passed to the tokenizer __init__(), including special tokens (bos_token, eos_token, pad_token, etc.) |
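The **kwargs row above is commonly used to supply a missing special token at load time. A hedged sketch: GPT-2 (here the openai-community/gpt2 checkpoint, chosen for illustration) defines no pad token, so one is passed explicitly:

```python
from transformers import AutoTokenizer

# GPT-2 ships without a pad token. Passing pad_token as a keyword forwards
# it to the tokenizer constructor; reusing the EOS string is a common
# convention so padding does not introduce a new vocabulary entry.
tokenizer = AutoTokenizer.from_pretrained(
    "openai-community/gpt2",
    pad_token="<|endoftext|>",
)
print(tokenizer.pad_token_id == tokenizer.eos_token_id)  # True
```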
Outputs
| Name | Type | Description |
|---|---|---|
| tokenizer | PreTrainedTokenizer | An instantiated tokenizer matching the pretrained model, ready to encode and decode text |
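The returned tokenizer covers both directions of this contract. A quick round trip, sketched with an uncased BERT checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

# Encode a string to token IDs, then decode the IDs back to text.
ids = tokenizer.encode("Hello, world!")
text = tokenizer.decode(ids, skip_special_tokens=True,
                        clean_up_tokenization_spaces=True)
print(text)  # "hello, world!" (the uncased model lowercases input)
```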
Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
tokens = tokenizer("Hello, world!", return_tensors="pt")
print(tokens["input_ids"])
```
Tokenizing a Dataset with dataset.map()
```python
from transformers import AutoTokenizer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# The Llama-2 tokenizer defines no pad token; reuse EOS so that
# padding="max_length" below does not raise an error.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

dataset = load_dataset("imdb")
tokenized_dataset = dataset.map(tokenize_function, batched=True)
```
Using a Specific Backend
```python
from transformers import AutoTokenizer

# Use the sentencepiece backend explicitly
tokenizer = AutoTokenizer.from_pretrained(
    "hf-internal-testing/llama-tokenizer",
    backend="sentencepiece",
)
```
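Loading from a Local Directory
Since from_pretrained() also accepts a local path, a tokenizer saved with save_pretrained() can be reloaded without network access (a sketch using a temporary directory):

```python
import tempfile

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

# save_pretrained() writes the vocabulary and tokenizer config files;
# the directory can then be passed straight back to from_pretrained(),
# so no network access is needed at reload time.
with tempfile.TemporaryDirectory() as tmp_dir:
    tokenizer.save_pretrained(tmp_dir)
    reloaded = AutoTokenizer.from_pretrained(tmp_dir)
    print(reloaded("Hello, world!")["input_ids"])
```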