

Implementation:Bigscience workshop Petals Tokenizer Decode

From Leeroopedia


Knowledge Sources
Domains NLP, Postprocessing
Last Updated 2026-02-09 14:00 GMT

Overview

A concrete tool, provided by the HuggingFace Transformers tokenizer, for converting generated token IDs back into text strings. It serves as the final output step in Petals generation workflows.

Description

tokenizer.decode() is a method on HuggingFace's PreTrainedTokenizer that converts a sequence of token IDs back into a human-readable string. In Petals workflows, this is called after model.generate() returns the generated token ID tensor.

The method handles subword merging, special token filtering, and whitespace normalization specific to each tokenizer type (BPE, SentencePiece, etc.).
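As a rough illustration of those steps, here is a toy sketch of the decode logic, assuming a made-up vocabulary and a BPE-style "##" continuation marker; it is not any real tokenizer's implementation, and real tokenizers (BPE, SentencePiece) each have their own merge rules:

```python
# Toy sketch of decode(): special-token filtering plus subword merging.
# VOCAB and the "##" convention are illustrative assumptions only.
SPECIAL = {"<s>", "</s>", "<pad>"}
VOCAB = {0: "<s>", 1: "Hello", 2: "##,", 3: "world", 4: "</s>"}

def toy_decode(token_ids, skip_special_tokens=False):
    pieces = [VOCAB[i] for i in token_ids]
    if skip_special_tokens:
        pieces = [p for p in pieces if p not in SPECIAL]
    # Subword merging: "##" pieces attach to the previous piece
    # without a space; others are joined with single spaces.
    text = ""
    for p in pieces:
        if p.startswith("##"):
            text += p[2:]
        else:
            text += (" " if text else "") + p
    return text

print(toy_decode([0, 1, 2, 3, 4], skip_special_tokens=True))
```

With skip_special_tokens=True the example prints "Hello, world"; without it, the "<s>" and "</s>" markers remain in the output.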

Usage

Use this method as the final step after any generation call to convert the output tensor to readable text. Set skip_special_tokens=True to remove control tokens from the output.
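One point worth noting when wiring this into a generation loop: for decoder-only models, the tensor returned by generate() typically begins with the prompt's token IDs, so decoding the whole sequence repeats the prompt in the output. A common pattern is to slice off the prompt length first, sketched here with plain lists rather than real tensors (the IDs and variable names are illustrative, not from Petals):

```python
# Sketch: generate() output usually starts with the prompt IDs.
# All IDs below are made up for illustration.
prompt_ids = [0, 15043, 29892]           # hypothetical encoded prompt
output_ids = [0, 15043, 29892, 3186, 2]  # prompt + newly generated IDs

# Keep only the continuation before decoding:
new_ids = output_ids[len(prompt_ids):]
# text = tokenizer.decode(new_ids, skip_special_tokens=True)
```

With real tensors the equivalent slice is along the sequence dimension, e.g. outputs[0][inputs.shape[1]:].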

Code Reference

Source Location

  • Repository: transformers (external)
  • File: External: transformers.PreTrainedTokenizer.decode

Signature

class PreTrainedTokenizer:
    def decode(
        self,
        token_ids: Union[int, List[int]],
        skip_special_tokens: bool = False,
        clean_up_tokenization_spaces: Optional[bool] = None,
        **kwargs,
    ) -> str:
        """
        Convert token IDs back to a string.

        Args:
            token_ids: Token ID(s) to decode
            skip_special_tokens: Whether to remove special tokens (<s>, </s>, <pad>)
            clean_up_tokenization_spaces: Whether to clean up extra spaces
        Returns:
            Decoded text string
        """

Import

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = tokenizer.decode(token_ids, skip_special_tokens=True)

I/O Contract

Inputs

  • token_ids (Union[int, List[int]], required) — Generated token IDs from model.generate() output
  • skip_special_tokens (bool, optional, default False) — Remove special tokens from the output; typically set to True
  • clean_up_tokenization_spaces (Optional[bool], optional) — Whether to clean up extra whitespace left by tokenization

Outputs

  • text (str) — Human-readable decoded text string

Usage Examples

Basic Decoding

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("petals-team/StableBeluga2")

# After model.generate() returns output tensor
generated_ids = outputs[0].tolist()  # Convert tensor to list

# Decode to text, removing special tokens
text = tokenizer.decode(generated_ids, skip_special_tokens=True)
print(text)

Batch Decoding

# Decode multiple sequences at once
texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for text in texts:
    print(text)
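Conceptually, batch_decode is a per-sequence loop over decode. A minimal sketch of that relationship, using a stand-in decode function rather than the real tokenizer (both function names below are illustrative):

```python
def fake_decode(ids, skip_special_tokens=False):
    # Stand-in for tokenizer.decode: render IDs as a space-joined string.
    return " ".join(str(i) for i in ids)

def batch_decode(sequences, **kwargs):
    # What tokenizer.batch_decode does conceptually: decode each row.
    return [fake_decode(seq, **kwargs) for seq in sequences]

print(batch_decode([[1, 2], [3, 4, 5]]))
```

Sequences in a batch may have different lengths (as above); any shared keyword arguments such as skip_special_tokens are forwarded to each per-sequence decode call.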

Related Pages

Implements Principle

Requires Environment
