Implementation: Lucidrains x-transformers EntropyBasedTokenizer
| Knowledge Sources | |
|---|---|
| Domains | NLP, Tokenization, Preprocessing |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Concrete tool from the x-transformers library for segmenting byte or token sequences into variable-length tokens based on a prediction-entropy threshold.
Description
The EntropyBasedTokenizer implements entropy-based dynamic tokenization as described in the Byte Latent Transformer paper (Meta, 2024). Given a pre-trained decoder model and a sequence, it computes the prediction entropy at each position. Positions where entropy exceeds a threshold are marked as token boundaries, creating variable-length segments. High-entropy positions correspond to "surprising" tokens that make good boundary points. The tokenizer also supports a maximum token size constraint, which caps segment length in highly predictable (low-entropy) regions where the threshold would otherwise rarely be crossed. It can return either token lengths or the actual segmented subsequences.
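The core thresholding logic can be illustrated with a small pure-Python sketch. This is a hypothetical helper, not the library's implementation (the real tokenizer operates on batched PyTorch tensors), but it shows how per-position entropies become variable-length token boundaries:

```python
import math

def entropy_boundaries(logit_rows, threshold, max_token_size=None):
    # Hypothetical sketch: logit_rows is a list of per-position logit lists,
    # i.e. next-token predictions from a decoder at each sequence position.
    # Returns token lengths that always sum to len(logit_rows).
    lengths = []
    current = 0
    n = len(logit_rows)
    for i, logits in enumerate(logit_rows):
        # Softmax entropy of the prediction at this position.
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        probs = [e / z for e in exps]
        entropy = -sum(p * math.log(p) for p in probs if p > 0)

        current += 1
        at_max = max_token_size is not None and current == max_token_size
        # A boundary is placed where entropy exceeds the threshold, where the
        # max token size is hit, or at the end of the sequence.
        if entropy > threshold or at_max or i == n - 1:
            lengths.append(current)
            current = 0
    return lengths

# Peaked logits -> low entropy (predictable); uniform logits -> high entropy.
low = [10.0] + [0.0] * 255   # entropy ~0.13
high = [0.0] * 256           # entropy = ln(256) ~5.55
rows = [low] * 3 + [high] + [low] * 4 + [high] + [low] * 2
lengths = entropy_boundaries(rows, threshold=2.0, max_token_size=8)
# lengths == [4, 5, 2]  (boundaries at the two high-entropy positions + end)
```

Predictable stretches are absorbed into long tokens; each surprising position ends its token, which is exactly the "fewer tokens for predictable regions" behavior described above.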
Usage
Import this class when you need adaptive tokenization that creates fewer tokens for predictable regions and more tokens for surprising/complex regions. This is useful for byte-level models, adaptive computation, or any application where variable-length grouping of sequence elements improves efficiency.
Code Reference
Source Location
- Repository: Lucidrains_X_transformers
- File: x_transformers/entropy_based_tokenizer.py
- Lines: 33-167
Signature
class EntropyBasedTokenizer(Module):
def __init__(
self,
decoder: Module,
entropy_threshold: float,
max_token_size: int | None = None
):
"""
Args:
decoder: Pre-trained decoder model whose output logits determine entropy.
entropy_threshold: Entropy value above which a boundary is placed.
max_token_size: Maximum allowed token size (prevents excessively long tokens).
"""
Import
from x_transformers.entropy_based_tokenizer import EntropyBasedTokenizer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| seq | Tensor (b, n) or (n,) | Yes | Input token/byte sequence |
| lens | Tensor (b,) | No | Actual lengths for variable-length batches |
| return_segmented_seq | bool | No | If True, return actual segmented subsequences instead of lengths |
| decoder_forward_kwargs | dict | No | Additional kwargs passed to the decoder forward |
Outputs
| Name | Type | Description |
|---|---|---|
| forward() default | Tensor (b, max_tokens) | Token lengths, zero-padded |
| forward() with return_segmented_seq | list of list of Tensor | Nested list of segmented subsequences per batch element |
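To recover segments from the default zero-padded lengths output, drop the padding and cut the sequence into consecutive chunks. A pure-Python sketch of that relationship (for the library's tensor outputs, `torch.split` plays the same role; the helper name here is hypothetical):

```python
def split_by_lengths(seq, padded_lengths):
    # Drop zero padding, then cut seq into consecutive variable-length chunks.
    lengths = [l for l in padded_lengths if l > 0]
    assert sum(lengths) == len(seq)  # lengths must tile the sequence exactly
    segments, start = [], 0
    for l in lengths:
        segments.append(seq[start:start + l])
        start += l
    return segments

segments = split_by_lengths(list(range(10)), [4, 3, 3, 0, 0])
# segments == [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

This makes the contract between the two output modes concrete: the lengths row and the segmented subsequences carry the same information, with zeros serving only as batch padding.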
Usage Examples
Basic Entropy Tokenization
import torch
from x_transformers import TransformerWrapper, Decoder
from x_transformers.entropy_based_tokenizer import EntropyBasedTokenizer
# Use a pre-trained small decoder for entropy estimation
decoder = TransformerWrapper(
num_tokens=256,
max_seq_len=512,
attn_layers=Decoder(dim=128, depth=4, heads=4)
)
tokenizer = EntropyBasedTokenizer(
decoder=decoder,
entropy_threshold=2.0,
max_token_size=8
)
# Tokenize a byte sequence
seq = torch.randint(0, 256, (2, 100))
token_lengths = tokenizer(seq)
# token_lengths: shape (2, num_tokens) with the length of each variable-size token
Get Segmented Subsequences
# Get the actual segmented sequences
segments = tokenizer(seq, return_segmented_seq=True)
# segments[0] is a list of tensors, each being one variable-length token
for i, token in enumerate(segments[0]):
print(f"Token {i}: length={len(token)}, values={token}")