Implementation: BigScience Workshop Petals Tokenizer Decode
| Knowledge Sources | |
|---|---|
| Domains | NLP, Postprocessing |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
A concrete utility, provided by the HuggingFace Transformers tokenizer, for converting generated token IDs back into text strings. It serves as the final output step in Petals generation workflows.
Description
tokenizer.decode() is a method on HuggingFace's PreTrainedTokenizer that converts a sequence of token IDs back into a human-readable string. In Petals workflows, this is called after model.generate() returns the generated token ID tensor.
The method handles subword merging, special token filtering, and whitespace normalization specific to each tokenizer type (BPE, SentencePiece, etc.).
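The merging and filtering steps can be sketched in plain Python. This is an illustrative toy, not the Transformers implementation: the vocabulary, the "Ġ" word-boundary marker (borrowed from GPT-2-style BPE), and the special-token set below are all assumptions for demonstration.

```python
# Toy sketch of a BPE-style decode pipeline (illustration only; the real
# logic lives inside each tokenizer class in HuggingFace Transformers).
SPECIAL_TOKENS = {"<s>", "</s>", "<pad>"}

# Hypothetical mini-vocabulary; "Ġ" marks a word boundary, GPT-2 style.
VOCAB = {0: "<s>", 1: "Hel", 2: "lo", 3: "Ġworld", 4: "</s>"}

def toy_decode(token_ids, skip_special_tokens=False):
    # 1. Look up each ID's subword string in the vocabulary.
    tokens = [VOCAB[i] for i in token_ids]
    # 2. Optionally filter out control/special tokens.
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    # 3. Merge subwords and normalize the boundary marker to a space.
    return "".join(tokens).replace("Ġ", " ").strip()

print(toy_decode([0, 1, 2, 3, 4], skip_special_tokens=True))   # -> Hello world
print(toy_decode([0, 1, 2, 3, 4], skip_special_tokens=False))  # -> <s>Hello world</s>
```

Note how `skip_special_tokens` changes only the filtering step; subword merging happens either way.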
Usage
Use this method as the final step after any generation call to convert the output tensor to readable text. Set skip_special_tokens=True to remove control tokens from the output.
Code Reference
Source Location
- Repository: transformers (external)
- File: external (transformers.PreTrainedTokenizer.decode)
Signature
class PreTrainedTokenizer:
    def decode(
        self,
        token_ids: Union[int, List[int]],
        skip_special_tokens: bool = False,
        clean_up_tokenization_spaces: Optional[bool] = None,
        **kwargs,
    ) -> str:
        """
        Convert token IDs back to a string.

        Args:
            token_ids: Token ID(s) to decode
            skip_special_tokens: Whether to remove special tokens (<s>, </s>, <pad>)
            clean_up_tokenization_spaces: Whether to clean up extra spaces

        Returns:
            Decoded text string
        """
Import
from transformers import AutoTokenizer

# Load the tokenizer that matches the model, then decode generated IDs
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = tokenizer.decode(token_ids, skip_special_tokens=True)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| token_ids | Union[int, List[int]] | Yes | Generated token IDs from model.generate() output |
| skip_special_tokens | bool | No | Remove special tokens from output (default False, typically set True) |
Outputs
| Name | Type | Description |
|---|---|---|
| text | str | Human-readable decoded text string |
Usage Examples
Basic Decoding
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("petals-team/StableBeluga2")
# After model.generate() returns output tensor
generated_ids = outputs[0].tolist() # Convert tensor to list
# Decode to text, removing special tokens
text = tokenizer.decode(generated_ids, skip_special_tokens=True)
print(text)
Batch Decoding
# Decode multiple sequences at once
texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for text in texts:
print(text)
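Conceptually, `batch_decode` is `decode` applied to each sequence in order, which is why padded rows come back as shorter strings once special tokens are stripped. The sketch below illustrates that equivalence with a toy vocabulary; the vocabulary, marker, and special-token set are assumptions, not Transformers internals.

```python
# Toy illustration: batch decoding as per-sequence decoding (not the
# actual HuggingFace implementation).
VOCAB = {0: "<s>", 1: "Hi", 2: "Ġthere", 3: "</s>", 4: "<pad>"}
SPECIAL = {"<s>", "</s>", "<pad>"}

def decode_one(ids, skip_special_tokens=False):
    toks = [VOCAB[i] for i in ids]
    if skip_special_tokens:
        toks = [t for t in toks if t not in SPECIAL]
    return "".join(toks).replace("Ġ", " ").strip()

def batch_decode(batch, skip_special_tokens=False):
    # One decoded string per input sequence, preserving order.
    return [decode_one(ids, skip_special_tokens) for ids in batch]

batch = [[0, 1, 2, 3], [0, 1, 3, 4]]  # second sequence is padded
print(batch_decode(batch, skip_special_tokens=True))  # -> ['Hi there', 'Hi']
```

With `skip_special_tokens=True`, the `<pad>` tokens in shorter sequences are filtered out rather than appearing in the decoded text.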