Implementation: bigscience-workshop/petals AutoTokenizer.from_pretrained
| Knowledge Sources | |
|---|---|
| Domains | NLP, Preprocessing |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Concrete tool for loading model-matched tokenizers provided by the HuggingFace Transformers library, used in Petals for client-side text encoding and decoding.
Description
AutoTokenizer.from_pretrained is the factory method of the HuggingFace Transformers AutoTokenizer auto-class; it downloads and instantiates the correct tokenizer class for a given model checkpoint. In Petals workflows, the tokenizer should be loaded from the same model repository as the distributed model to guarantee vocabulary compatibility. Tokenization runs entirely on the client; no remote computation is involved.
Key Petals-specific considerations:
- The model_name_or_path argument must match what was passed to AutoDistributedModelForCausalLM.from_pretrained
- For batch generation, set padding_side="left" to properly handle variable-length inputs
- The tokenizer handles both encoding (text to input_ids) and decoding (generated token IDs to text)
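The left-padding point can be illustrated with a toy sketch. This is plain Python with invented token ids (0 stands for the pad token); it is not the real tokenizer, only a demonstration of why padding side matters for generation:

```python
# Toy illustration: why batch generation wants left padding.
# Token ids are made up; 0 plays the role of the pad token.
PAD = 0

def pad_batch(sequences, side="left"):
    """Pad variable-length id sequences to equal length on the given side."""
    max_len = max(len(s) for s in sequences)
    padded = []
    for s in sequences:
        fill = [PAD] * (max_len - len(s))
        padded.append(fill + s if side == "left" else s + fill)
    return padded

batch = [[5, 6], [7, 8, 9]]
left = pad_batch(batch, side="left")    # [[0, 5, 6], [7, 8, 9]]
right = pad_batch(batch, side="right")  # [[5, 6, 0], [7, 8, 9]]
# With left padding every row ends in a real token, so the next token is
# predicted from real context; with right padding the short row ends in PAD,
# which corrupts autoregressive generation.
```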
Usage
Import this whenever you need to convert text to token IDs for model input, or decode generated tokens back to text. Always load the tokenizer from the same model checkpoint used for the distributed model.
Code Reference
Source Location
- Repository: transformers (external)
- File: external (transformers.AutoTokenizer)
Signature
class AutoTokenizer:
    @classmethod
    def from_pretrained(
        cls,
        pretrained_model_name_or_path: str,
        *inputs,
        **kwargs,
    ) -> PreTrainedTokenizer:
        """
        Instantiate one of the tokenizer classes from a pretrained model vocabulary.

        Args:
            pretrained_model_name_or_path: Model name or path (same as used for model loading)
            use_fast: Whether to use a Rust-based fast tokenizer (default: True)
            padding_side: "left" or "right" for batch padding direction
        """
Import
from transformers import AutoTokenizer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| pretrained_model_name_or_path | str | Yes | HuggingFace model name (must match the distributed model) |
| padding_side | str | No | "left" for batch generation, "right" for training |
| use_fast | bool | No | Use Rust-based fast tokenizer (default True) |
| token | Optional[str] | No | HuggingFace auth token for gated models |
Outputs
| Name | Type | Description |
|---|---|---|
| tokenizer | PreTrainedTokenizer | Tokenizer instance capable of encode/decode operations |
Usage Examples
Basic Tokenization for Petals
from transformers import AutoTokenizer
model_name = "petals-team/StableBeluga2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Encode text to token IDs
inputs = tokenizer("What is the capital of France?", return_tensors="pt")
# inputs["input_ids"] shape: [1, N]
# After generation, decode token IDs back to text
generated_ids = [1, 1724, 7483, 310, 3444, 338, 3681, 29889]
text = tokenizer.decode(generated_ids, skip_special_tokens=True)
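The encode/decode contract above can be mimicked with a toy word-level tokenizer. The vocabulary, the BOS token, and the word-level splitting are all invented for illustration; real tokenizers use subword vocabularies:

```python
# Toy word-level tokenizer mimicking the encode/decode contract.
# Vocabulary and special tokens are hypothetical, not StableBeluga2's.
VOCAB = {"<s>": 1, "What": 2, "is": 3, "AI": 4, "?": 5}
INV = {i: w for w, i in VOCAB.items()}
SPECIAL_IDS = {VOCAB["<s>"]}

def encode(words):
    # Prepend a BOS token, as many causal-LM tokenizers do.
    return [VOCAB["<s>"]] + [VOCAB[w] for w in words]

def decode(ids, skip_special_tokens=True):
    kept = [INV[i] for i in ids if not (skip_special_tokens and i in SPECIAL_IDS)]
    return " ".join(kept)

ids = encode(["What", "is", "AI", "?"])   # [1, 2, 3, 4, 5]
text = decode(ids)                        # "What is AI ?"
```

The round trip (text to ids to text) is exactly what a Petals client does around each call to the distributed model.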
Batch Tokenization
from transformers import AutoTokenizer
model_name = "petals-team/StableBeluga2"  # same checkpoint as the distributed model
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token # Required for padding
texts = ["Hello, how are you?", "What is AI?"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
# inputs["input_ids"] shape: [2, max_len]
# inputs["attention_mask"] shape: [2, max_len]