Implementation: Lucidrains x-transformers EntropyBasedTokenizer
| Knowledge Sources | |
|---|---|
| Domains | NLP, Tokenization, Preprocessing |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Concrete tool from the x-transformers library for segmenting byte or token sequences into variable-length tokens based on a prediction-entropy threshold.
Description
The EntropyBasedTokenizer implements entropy-based dynamic tokenization as described in the Byte Latent Transformer paper (Meta, 2024). Given a pre-trained decoder model and a sequence, it computes the prediction entropy at each position. Positions where entropy exceeds a threshold are marked as token boundaries, creating variable-length segments. High-entropy positions correspond to "surprising" tokens that make good boundary points. The tokenizer also supports a maximum token size constraint, which caps segment length in highly predictable (low-entropy) regions where the threshold would otherwise rarely be crossed. It can return either token lengths or the actual segmented subsequences.
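The core thresholding logic can be illustrated with a small pure-Python sketch. This is a hypothetical helper, not the library's implementation (the real tokenizer operates on batched PyTorch tensors), but it shows how per-position entropies become variable-length token boundaries:

```python
import math

def entropy_boundaries(logit_rows, threshold, max_token_size=None):
    # Hypothetical sketch: logit_rows is a list of per-position logit lists,
    # i.e. next-token predictions from a decoder at each sequence position.
    # Returns token lengths that always sum to len(logit_rows).
    lengths = []
    current = 0
    n = len(logit_rows)
    for i, logits in enumerate(logit_rows):
        # Softmax entropy of the prediction at this position.
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        probs = [e / z for e in exps]
        entropy = -sum(p * math.log(p) for p in probs if p > 0)

        current += 1
        at_max = max_token_size is not None and current == max_token_size
        # A boundary is placed where entropy exceeds the threshold, where the
        # max token size is hit, or at the end of the sequence.
        if entropy > threshold or at_max or i == n - 1:
            lengths.append(current)
            current = 0
    return lengths

# Peaked logits -> low entropy (predictable); uniform logits -> high entropy.
low = [10.0] + [0.0] * 255   # entropy ~0.13
high = [0.0] * 256           # entropy = ln(256) ~5.55
rows = [low] * 3 + [high] + [low] * 4 + [high] + [low] * 2
lengths = entropy_boundaries(rows, threshold=2.0, max_token_size=8)
# lengths == [4, 5, 2]  (boundaries at the two high-entropy positions + end)
```

Predictable stretches are absorbed into long tokens; each surprising position ends its token, which is exactly the "fewer tokens for predictable regions" behavior described above.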
Usage
Import this class when you need adaptive tokenization that creates fewer tokens for predictable regions and more tokens for surprising/complex regions. This is useful for byte-level models, adaptive computation, or any application where variable-length grouping of sequence elements improves efficiency.
Code Reference
Source Location
- Repository: Lucidrains_X_transformers
- File: x_transformers/entropy_based_tokenizer.py
- Lines: 33-167
Signature
class EntropyBasedTokenizer(Module):
def __init__(
self,
decoder: Module,
entropy_threshold: float,
max_token_size: int | None = None
):
"""
Args:
decoder: Pre-trained decoder model whose output logits determine entropy.
entropy_threshold: Entropy value above which a boundary is placed.
max_token_size: Maximum allowed token size (prevents excessively long tokens).
"""
Import
from x_transformers.entropy_based_tokenizer import EntropyBasedTokenizer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| seq | Tensor (b, n) or (n,) | Yes | Input token/byte sequence |
| lens | Tensor (b,) | No | Actual lengths for variable-length batches |
| return_segmented_seq | bool | No | If True, return actual segmented subsequences instead of lengths |
| decoder_forward_kwargs | dict | No | Additional kwargs passed to the decoder forward |
Outputs
| Name | Type | Description |
|---|---|---|
| forward() default | Tensor (b, max_tokens) | Token lengths, zero-padded |
| forward() with return_segmented_seq | list of list of Tensor | Nested list of segmented subsequences per batch element |
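To recover segments from the default zero-padded lengths output, drop the padding and cut the sequence into consecutive chunks. A pure-Python sketch of that relationship (for the library's tensor outputs, `torch.split` plays the same role; the helper name here is hypothetical):

```python
def split_by_lengths(seq, padded_lengths):
    # Drop zero padding, then cut seq into consecutive variable-length chunks.
    lengths = [l for l in padded_lengths if l > 0]
    assert sum(lengths) == len(seq)  # lengths must tile the sequence exactly
    segments, start = [], 0
    for l in lengths:
        segments.append(seq[start:start + l])
        start += l
    return segments

segments = split_by_lengths(list(range(10)), [4, 3, 3, 0, 0])
# segments == [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

This makes the contract between the two output modes concrete: the lengths row and the segmented subsequences carry the same information, with zeros serving only as batch padding.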
Usage Examples
Basic Entropy Tokenization
import torch
from x_transformers import TransformerWrapper, Decoder
from x_transformers.entropy_based_tokenizer import EntropyBasedTokenizer
# Use a pre-trained small decoder for entropy estimation
decoder = TransformerWrapper(
num_tokens=256,
max_seq_len=512,
attn_layers=Decoder(dim=128, depth=4, heads=4)
)
tokenizer = EntropyBasedTokenizer(
decoder=decoder,
entropy_threshold=2.0,
max_token_size=8
)
# Tokenize a byte sequence
seq = torch.randint(0, 256, (2, 100))
token_lengths = tokenizer(seq)
# token_lengths: shape (2, num_tokens) with the length of each variable-size token
Get Segmented Subsequences
# Get the actual segmented sequences
segments = tokenizer(seq, return_segmented_seq=True)
# segments[0] is a list of tensors, each being one variable-length token
for i, token in enumerate(segments[0]):
print(f"Token {i}: length={len(token)}, values={token}")