Implementation:Lucidrains X transformers EntropyBasedTokenizer

Knowledge Sources
Domains: NLP, Tokenization, Preprocessing
Last Updated: 2026-02-08 18:00 GMT

Overview

A concrete tool from the x-transformers library for segmenting byte or token sequences into variable-length tokens based on prediction-entropy thresholds.

Description

The EntropyBasedTokenizer implements entropy-based dynamic tokenization as described in the Byte Latent Transformer paper (Meta, 2024). Given a pre-trained decoder model and a sequence, it computes the prediction entropy at each position. Positions where the entropy exceeds a threshold are marked as token boundaries, producing variable-length segments. High-entropy positions correspond to "surprising" tokens, which make natural boundary points. The tokenizer also supports a maximum token size constraint, which caps segment length and prevents excessively long segments from forming over highly predictable or repetitive subsequences. It can return either the token lengths or the actual segmented subsequences.
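The boundary rule can be sketched in a few lines. The snippet below is illustrative only, not the library's internals: it assumes the decoder returns per-position logits of shape (batch, seq_len, vocab), and the entropy_boundaries helper and its simplified max-size handling are hypothetical.

import torch

def entropy_boundaries(logits, entropy_threshold, max_token_size=None):
    # logits: (batch, seq_len, vocab) from a pre-trained decoder
    probs = logits.softmax(dim=-1)

    # Shannon entropy of the next-token distribution at each position
    entropy = -(probs * probs.clamp(min=1e-10).log()).sum(dim=-1)  # (batch, seq_len)

    # place a boundary wherever the model is "surprised"
    boundaries = entropy > entropy_threshold                       # bool, (batch, seq_len)

    if max_token_size is not None:
        # simplified cap: force a boundary every max_token_size positions
        # (the library applies the cap per segment rather than globally)
        forced = torch.zeros_like(boundaries)
        forced[:, max_token_size - 1::max_token_size] = True
        boundaries = boundaries | forced

    return entropy, boundaries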

Usage

Import this class when you need adaptive tokenization that creates fewer tokens for predictable regions and more tokens for surprising/complex regions. This is useful for byte-level models, adaptive computation, or any application where variable-length grouping of sequence elements improves efficiency.

Code Reference

Source Location

x_transformers/entropy_based_tokenizer.py (in the lucidrains/x-transformers repository)
Signature

class EntropyBasedTokenizer(Module):
    def __init__(
        self,
        decoder: Module,
        entropy_threshold: float,
        max_token_size: int | None = None
    ):
        """
        Args:
            decoder: Pre-trained decoder model whose output logits determine entropy.
            entropy_threshold: Entropy value above which a boundary is placed.
            max_token_size: Maximum allowed token size (prevents excessively long tokens).
        """

Import

from x_transformers.entropy_based_tokenizer import EntropyBasedTokenizer

I/O Contract

Inputs

Name Type Required Description
seq Tensor (b, n) or (n,) Yes Input token/byte sequence
lens Tensor (b,) No Actual lengths for variable-length batches
return_segmented_seq bool No If True, return actual segmented subsequences instead of lengths
decoder_forward_kwargs dict No Additional kwargs passed to the decoder forward

Outputs

Name Type Description
forward() default Tensor (b, max_tokens) Token lengths, zero-padded
forward() with return_segmented_seq list of list of Tensor Nested list of segmented subsequences per batch element
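
A short example of the contract above with a variable-length batch. The call signature follows the Inputs table; the padding behaviour of the shorter row and the comment about length sums are assumptions, not guarantees from the library.

import torch
from x_transformers import TransformerWrapper, Decoder
from x_transformers.entropy_based_tokenizer import EntropyBasedTokenizer

decoder = TransformerWrapper(
    num_tokens=256,
    max_seq_len=512,
    attn_layers=Decoder(dim=128, depth=4, heads=4)
)
tokenizer = EntropyBasedTokenizer(decoder=decoder, entropy_threshold=2.0)

# variable-length batch: the second row has only 60 valid elements, the rest is padding
seq = torch.randint(0, 256, (2, 100))
lens = torch.tensor([100, 60])

token_lengths = tokenizer(seq, lens=lens)
# token_lengths: (2, max_tokens), zero-padded; the nonzero entries of each row
# should sum to that row's true length in `lens`

segments = tokenizer(seq, lens=lens, return_segmented_seq=True)
# segments: one list of variable-length tensors per batch element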

Usage Examples

Basic Entropy Tokenization

import torch
from x_transformers import TransformerWrapper, Decoder
from x_transformers.entropy_based_tokenizer import EntropyBasedTokenizer

# Use a pre-trained small decoder for entropy estimation
decoder = TransformerWrapper(
    num_tokens=256,
    max_seq_len=512,
    attn_layers=Decoder(dim=128, depth=4, heads=4)
)

tokenizer = EntropyBasedTokenizer(
    decoder=decoder,
    entropy_threshold=2.0,
    max_token_size=8
)

# Tokenize a byte sequence
seq = torch.randint(0, 256, (2, 100))
token_lengths = tokenizer(seq)
# token_lengths: shape (2, max_tokens), zero-padded; each entry is the length of one variable-size token
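
As a quick sanity check (assuming the zero-padded length output described in the I/O Contract), each row's lengths should account for every position of the input:

# every position of the 100-element input should be covered by exactly one token
assert (token_lengths.sum(dim=-1) == seq.shape[-1]).all()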

Get Segmented Subsequences

# Get the actual segmented sequences
segments = tokenizer(seq, return_segmented_seq=True)
# segments[0] is a list of tensors, one per variable-length token
for i, token in enumerate(segments[0]):
    print(f"Token {i}: length={len(token)}, values={token}")
