

Implementation:Bigscience workshop Petals Tokenizer Decode

From Leeroopedia


Knowledge Sources
Domains NLP, Postprocessing
Last Updated 2026-02-09 14:00 GMT

Overview

A concrete tool, provided by the HuggingFace Transformers tokenizer, for converting generated token IDs back into text strings. It serves as the final output step in Petals generation workflows.

Description

tokenizer.decode() is a method on HuggingFace's PreTrainedTokenizer that converts a sequence of token IDs back into a human-readable string. In Petals workflows, this is called after model.generate() returns the generated token ID tensor.

The method handles subword merging, special token filtering, and whitespace normalization specific to each tokenizer type (BPE, SentencePiece, etc.).
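As a rough illustration of those steps, here is a toy sketch of the decode logic, assuming a made-up vocabulary and a BPE-style "##" continuation marker; it is not any real tokenizer's implementation, and real tokenizers (BPE, SentencePiece) each have their own merge rules:

```python
# Toy sketch of decode(): special-token filtering plus subword merging.
# VOCAB and the "##" convention are illustrative assumptions only.
SPECIAL = {"<s>", "</s>", "<pad>"}
VOCAB = {0: "<s>", 1: "Hello", 2: "##,", 3: "world", 4: "</s>"}

def toy_decode(token_ids, skip_special_tokens=False):
    pieces = [VOCAB[i] for i in token_ids]
    if skip_special_tokens:
        pieces = [p for p in pieces if p not in SPECIAL]
    # Subword merging: "##" pieces attach to the previous piece
    # without a space; others are joined with single spaces.
    text = ""
    for p in pieces:
        if p.startswith("##"):
            text += p[2:]
        else:
            text += (" " if text else "") + p
    return text

print(toy_decode([0, 1, 2, 3, 4], skip_special_tokens=True))
```

With skip_special_tokens=True the example prints "Hello, world"; without it, the "<s>" and "</s>" markers remain in the output.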

Usage

Use this method as the final step after any generation call to convert the output tensor to readable text. Set skip_special_tokens=True to remove control tokens from the output.
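One point worth noting when wiring this into a generation loop: for decoder-only models, the tensor returned by generate() typically begins with the prompt's token IDs, so decoding the whole sequence repeats the prompt in the output. A common pattern is to slice off the prompt length first, sketched here with plain lists rather than real tensors (the IDs and variable names are illustrative, not from Petals):

```python
# Sketch: generate() output usually starts with the prompt IDs.
# All IDs below are made up for illustration.
prompt_ids = [0, 15043, 29892]           # hypothetical encoded prompt
output_ids = [0, 15043, 29892, 3186, 2]  # prompt + newly generated IDs

# Keep only the continuation before decoding:
new_ids = output_ids[len(prompt_ids):]
# text = tokenizer.decode(new_ids, skip_special_tokens=True)
```

With real tensors the equivalent slice is along the sequence dimension, e.g. outputs[0][inputs.shape[1]:].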

Code Reference

Source Location

  • Repository: transformers (external)
  • File: External: transformers.PreTrainedTokenizer.decode

Signature

class PreTrainedTokenizer:
    def decode(
        self,
        token_ids: Union[int, List[int]],
        skip_special_tokens: bool = False,
        clean_up_tokenization_spaces: Optional[bool] = None,
        **kwargs,
    ) -> str:
        """
        Convert token IDs back to a string.

        Args:
            token_ids: Token ID(s) to decode
            skip_special_tokens: Whether to remove special tokens (<s>, </s>, <pad>)
            clean_up_tokenization_spaces: Whether to clean up extra spaces
        Returns:
            Decoded text string
        """

Import

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = tokenizer.decode(token_ids, skip_special_tokens=True)

I/O Contract

Inputs

  • token_ids (Union[int, List[int]], required) — Generated token IDs from model.generate() output
  • skip_special_tokens (bool, optional, default False) — Remove special tokens from the output; typically set to True
  • clean_up_tokenization_spaces (Optional[bool], optional) — Whether to clean up extra whitespace left by tokenization

Outputs

  • text (str) — Human-readable decoded text string

Usage Examples

Basic Decoding

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("petals-team/StableBeluga2")

# After model.generate() returns output tensor
generated_ids = outputs[0].tolist()  # Convert tensor to list

# Decode to text, removing special tokens
text = tokenizer.decode(generated_ids, skip_special_tokens=True)
print(text)

Batch Decoding

# Decode multiple sequences at once
texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for text in texts:
    print(text)
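Conceptually, batch_decode is a per-sequence loop over decode. A minimal sketch of that relationship, using a stand-in decode function rather than the real tokenizer (both function names below are illustrative):

```python
def fake_decode(ids, skip_special_tokens=False):
    # Stand-in for tokenizer.decode: render IDs as a space-joined string.
    return " ".join(str(i) for i in ids)

def batch_decode(sequences, **kwargs):
    # What tokenizer.batch_decode does conceptually: decode each row.
    return [fake_decode(seq, **kwargs) for seq in sequences]

print(batch_decode([[1, 2], [3, 4, 5]]))
```

Sequences in a batch may have different lengths (as above); any shared keyword arguments such as skip_special_tokens are forwarded to each per-sequence decode call.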

Related Pages

Implements Principle

Requires Environment
