
Implementation:Bigscience workshop Petals AutoTokenizer From Pretrained

From Leeroopedia


Knowledge Sources
Domains: NLP, Preprocessing
Last Updated: 2026-02-09 14:00 GMT

Overview

AutoTokenizer.from_pretrained is the concrete tool for loading a model-matched tokenizer, provided by the HuggingFace Transformers library and used in Petals for client-side text encoding and decoding.

Description

AutoTokenizer.from_pretrained is a HuggingFace Transformers auto-class that downloads and instantiates the correct tokenizer for a given model. In Petals workflows, the tokenizer is always loaded from the same model repository as the distributed model to ensure vocabulary compatibility. The tokenizer runs entirely on the client — no remote computation is needed.

Key Petals-specific considerations:

  • The model_name_or_path argument must match what was passed to AutoDistributedModelForCausalLM.from_pretrained
  • For batch generation, set padding_side="left" to properly handle variable-length inputs
  • The tokenizer handles both encoding (text to input_ids) and decoding (generated token IDs to text)
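The effect of padding_side on batched inputs can be illustrated without loading a real tokenizer. The sketch below uses made-up token IDs and a hypothetical pad helper; with left padding, every row ends at a real token, which is what decoder-only generation needs (the model continues from the last position).

```python
# Illustrative sketch: why padding_side="left" matters for batch generation.
# Token IDs and the pad() helper are made up for demonstration only.
PAD = 0
seqs = [[5, 6, 7], [8, 9]]
max_len = max(len(s) for s in seqs)

def pad(seq, side):
    n = max_len - len(seq)
    return [PAD] * n + seq if side == "left" else seq + [PAD] * n

left = [pad(s, "left") for s in seqs]
right = [pad(s, "right") for s in seqs]
# left  -> [[5, 6, 7], [0, 8, 9]]  every row ends in a real token
# right -> [[5, 6, 7], [8, 9, 0]]  short row ends in PAD, breaking generation
```

With right padding, the model would be asked to continue the short sequence from a pad token; left padding avoids this by aligning all rows at the end.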

Usage

Import this whenever you need to convert text to token IDs for model input, or decode generated tokens back to text. Always load the tokenizer from the same model checkpoint used for the distributed model.

Code Reference

Source Location

  • Repository: transformers (external)
  • Module: transformers.AutoTokenizer (external)

Signature

class AutoTokenizer:
    @classmethod
    def from_pretrained(
        cls,
        pretrained_model_name_or_path: str,
        *inputs,
        **kwargs,
    ) -> PreTrainedTokenizer:
        """
        Instantiate one of the tokenizer classes from a pretrained model vocabulary.

        Args:
            pretrained_model_name_or_path: Model name or path (same as used for model loading)
            use_fast: Whether to use Rust-based fast tokenizer (default: True)
            padding_side: "left" or "right" for batch padding direction
        """

Import

from transformers import AutoTokenizer

I/O Contract

Inputs

  • pretrained_model_name_or_path (str, required): HuggingFace model name (must match the distributed model)
  • padding_side (str, optional): "left" for batch generation, "right" for training
  • use_fast (bool, optional): whether to use the Rust-based fast tokenizer (default: True)
  • token (Optional[str], optional): HuggingFace auth token for gated models

Outputs

  • tokenizer (PreTrainedTokenizer): tokenizer instance capable of encode/decode operations

Usage Examples

Basic Tokenization for Petals

from transformers import AutoTokenizer

model_name = "petals-team/StableBeluga2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Encode text to token IDs
inputs = tokenizer("What is the capital of France?", return_tensors="pt")
# inputs["input_ids"] shape: [1, N]

# After generation, decode token IDs back to text
# (IDs below are illustrative; real IDs come from model.generate)
generated_ids = [1, 1724, 7483, 310, 3444, 338, 3681, 29889]
text = tokenizer.decode(generated_ids, skip_special_tokens=True)

Batch Tokenization

from transformers import AutoTokenizer

model_name = "petals-team/StableBeluga2"
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # required: many causal LMs ship without a pad token

texts = ["Hello, how are you?", "What is AI?"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
# inputs["input_ids"] shape: [2, max_len]
# inputs["attention_mask"] shape: [2, max_len]
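One convenience of left padding is that every prompt in the batch ends at the same position, so newly generated tokens can be sliced off uniformly after generation. A minimal sketch with made-up token IDs (the real values would come from model.generate):

```python
# Illustrative sketch: with left-padded inputs, all rows share the same
# padded prompt length, so new tokens start at the same index in every row.
prompt_len = 3                   # padded prompt length (max_len above)
outputs = [[0, 8, 9, 42, 43],    # row 1: pad + prompt + 2 generated tokens
           [5, 6, 7, 99, 100]]   # row 2: prompt + 2 generated tokens
new_tokens = [row[prompt_len:] for row in outputs]
# new_tokens -> [[42, 43], [99, 100]]
```

Each slice can then be passed to tokenizer.decode (or the whole list to tokenizer.batch_decode) to recover the generated text.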

Related Pages

Implements Principle

Requires Environment
