
Implementation:Neuml Txtai Tokenizer Pipeline

From Leeroopedia


Knowledge Sources
Domains: Text_Processing, Tokenization
Last Updated: 2026-02-09 17:00 GMT

Overview

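The Tokenizer pipeline class provides configurable text tokenization. It supports three splitting strategies: Unicode word-boundary segmentation (Unicode Standard Annex #29, abbreviated UTR #29 on this page), plain whitespace splitting, and custom regular expression patterns. Optional lowercasing, emoji filtering, alphanumeric filtering, and stop word removal can be applied to the result.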

Description

The Tokenizer class inherits from Pipeline and can be used either as a configured instance or through the static tokenize method. By default it segments text on word boundaries as defined by Unicode Standard Annex #29, Unicode Text Segmentation (abbreviated UTR #29 on this page), which handles text across languages more reliably than splitting on spaces. Two alternative splitting strategies are available: plain whitespace splitting and a custom regular expression. Optional filters are applied on top of the chosen strategy: lowercase normalization, emoji filtering, alphanumeric-only filtering, and stop word removal using a built-in stop word list. The static tokenize method accepts the same options but has different defaults: alphanum and stopwords default to True in the static method and to False in the constructor.
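
To make the layering concrete, here is a minimal standard-library sketch of a tokenizer that lowercases, splits, filters on alphanumeric characters, and removes stop words. It is not the txtai implementation: the function name `simple_tokenize`, the short `STOP_WORDS` set, and the use of `str.isalnum` are all assumptions chosen for illustration, and whitespace splitting stands in for the Unicode segmentation step.

```python
import string

# Hypothetical stop word list for illustration; not the library's built-in list
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "with"}


def simple_tokenize(text, lowercase=True, alphanum=True, stopwords=True):
    """Rough approximation of a layered tokenization pipeline."""
    # Step 1: optional lowercase normalization
    if lowercase:
        text = text.lower()

    # Step 2: split on whitespace and strip surrounding punctuation
    tokens = [token.strip(string.punctuation) for token in text.split()]
    tokens = [token for token in tokens if token]

    # Step 3: optionally keep only tokens made entirely of letters and digits
    if alphanum:
        tokens = [token for token in tokens if token.isalnum()]

    # Step 4: optionally drop stop words
    if stopwords:
        tokens = [token for token in tokens if token not in STOP_WORDS]

    return tokens


print(simple_tokenize("Machine Learning is transforming the world!"))
# ['machine', 'learning', 'transforming', 'world']
```

Each filter is independent of the others, which is why the options can be switched on and off individually.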

Usage

Use the Tokenizer pipeline when you need consistent, configurable text tokenization for preprocessing in search, indexing, or text analysis workflows. It is well-suited for preparing text for TF-IDF scoring, keyword extraction, or any downstream task that requires clean token sequences. Use the static method for one-off tokenization or the instance for repeated use with the same configuration.
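
As an illustration of why clean token sequences matter for TF-IDF, the sketch below computes term frequencies and a textbook `log(N / df)` inverse document frequency from pre-tokenized documents. The sample token lists and the helper names are made up for this example, and the formula is one common variant, not necessarily the one txtai's scoring uses.

```python
import math
from collections import Counter

# Token lists as a tokenizer might produce them (made-up sample data)
documents = [
    ["machine", "learning", "model"],
    ["machine", "translation"],
    ["search", "index", "model"],
]


def term_frequencies(tokens):
    """Count how often each token occurs in one document."""
    return Counter(tokens)


def inverse_document_frequencies(docs):
    """Compute log(N / df) for every term; rarer terms score higher."""
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    total = len(docs)
    return {term: math.log(total / count) for term, count in df.items()}


idf = inverse_document_frequencies(documents)
tf = term_frequencies(documents[0])

# TF-IDF weights for the first document
scores = {term: tf[term] * idf[term] for term in tf}
print(sorted(scores, key=scores.get, reverse=True)[0])
# learning
```

If stop words or inconsistent casing were left in the token lists, they would inflate the vocabulary and distort the document frequencies.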

Code Reference

Source Location

Signature

class Tokenizer(Pipeline):
    @staticmethod
    def tokenize(text, lowercase=True, emoji=True, alphanum=True,
                 stopwords=True, whitespace=False, regexp=None):
        """
        Static method to tokenize text with configurable options.

        Args:
            text: input text string
            lowercase: convert to lowercase (default: True)
            emoji: filter emoji characters (default: True)
            alphanum: keep only alphanumeric tokens (default: True)
            stopwords: remove stop words (default: True)
            whitespace: use whitespace splitting instead of UTR #29 (default: False)
            regexp: custom regular expression pattern for tokenization (default: None)

        Returns:
            list of token strings
        """

    def __init__(self, lowercase=True, emoji=True, alphanum=False,
                 stopwords=False, whitespace=False, regexp=None):
        """
        Creates a Tokenizer pipeline instance.

        Args:
            lowercase: convert to lowercase (default: True)
            emoji: filter emoji characters (default: True)
            alphanum: keep only alphanumeric tokens (default: False)
            stopwords: remove stop words (default: False)
            whitespace: use whitespace splitting instead of UTR #29 (default: False)
            regexp: custom regular expression pattern for tokenization (default: None)
        """

    def __call__(self, text):
        """
        Tokenizes the input text using the configured options.

        Args:
            text: input text string or list of strings

        Returns:
            list of token strings (or list of lists for batch input)
        """

Import

from txtai.pipeline import Tokenizer

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| text | str or list[str] | Yes (for __call__) | Input text to tokenize, or list of texts for batch tokenization |
| lowercase | bool | No | Convert text to lowercase before tokenization (default: True) |
| emoji | bool | No | Filter out emoji characters from tokens (default: True) |
| alphanum | bool | No | Keep only tokens containing alphanumeric characters (default: True for static, False for instance) |
| stopwords | bool | No | Remove common English stop words (default: True for static, False for instance) |
| whitespace | bool | No | Use simple whitespace splitting instead of Unicode UTR #29 segmentation (default: False) |
| regexp | str | No | Custom regular expression pattern for tokenization, overrides other splitting methods (default: None) |

Outputs

| Name | Type | Description |
|---|---|---|
| tokens | list[str] | List of token strings after applying all configured filters |
| tokens (batch) | list[list[str]] | List of token lists when batch input is provided |

Usage Examples

Basic Usage

from txtai.pipeline import Tokenizer

# Use the static method for quick tokenization
tokens = Tokenizer.tokenize("Machine Learning is transforming the world!")
print(tokens)
# ['machine', 'learning', 'transforming', 'world']
# (lowercased, stop words removed, alphanumeric only)

Instance-Based Tokenization

from txtai.pipeline import Tokenizer

# Create a tokenizer with custom settings
tokenizer = Tokenizer(
    lowercase=True,
    emoji=True,
    alphanum=True,
    stopwords=True
)

# Tokenize text
tokens = tokenizer("Natural Language Processing with Python 3.10")
print(tokens)
# ['natural', 'language', 'processing', 'python']
# ('with' is a stop word; '3.10' contains no letter, so the alphanumeric filter is expected to drop it)

Custom Regular Expression

from txtai.pipeline import Tokenizer

# Tokenize using a custom regex pattern (split on hyphens and whitespace)
tokenizer = Tokenizer(regexp=r"[\s\-]+")

tokens = tokenizer("state-of-the-art machine-learning model")
print(tokens)
# ['state', 'of', 'the', 'art', 'machine', 'learning', 'model']
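
The pattern in this example matches runs of whitespace and hyphens, so it describes the delimiters, not the tokens. The standard-library sketch below shows both readings of the same pattern. That the pipeline applies the pattern as a delimiter is an assumption inferred from the output shown above.

```python
import re

pattern = re.compile(r"[\s\-]+")
text = "state-of-the-art machine-learning model"

# Reading 1: every match is a delimiter between tokens
split_tokens = [token for token in pattern.split(text) if token]
print(split_tokens)
# ['state', 'of', 'the', 'art', 'machine', 'learning', 'model']

# Reading 2: every match is itself a token (only the separators are returned)
matched_tokens = pattern.findall(text)
print(matched_tokens)
# ['-', '-', '-', ' ', '-', ' ']
```

When writing a custom pattern, it is worth checking on a short sample which of the two readings applies.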

Whitespace-Based Splitting

from txtai.pipeline import Tokenizer

# Use simple whitespace splitting instead of UTR #29
tokenizer = Tokenizer(whitespace=True, lowercase=True, emoji=True)

tokens = tokenizer("Hello World! This is a test.")
print(tokens)
# ['hello', 'world!', 'this', 'is', 'a', 'test.']
# Note: punctuation is preserved with whitespace splitting
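
The effect of the splitting strategy on punctuation can be reproduced with the standard library alone. In this sketch, `str.split` plays the role of whitespace splitting, and `re.findall(r"\w+", ...)` is a rough stand-in for word-boundary segmentation. It is an approximation only and does not implement the Unicode rules.

```python
import re

text = "Hello World! This is a test."

# Whitespace splitting: punctuation stays attached to the neighboring word
whitespace_tokens = text.lower().split()
print(whitespace_tokens)
# ['hello', 'world!', 'this', 'is', 'a', 'test.']

# Word-character matching: punctuation is never part of a token
word_tokens = re.findall(r"\w+", text.lower())
print(word_tokens)
# ['hello', 'world', 'this', 'is', 'a', 'test']
```

The `\w+` approximation breaks down on contractions: it splits "don't" into "don" and "t", whereas the Unicode word-boundary rules keep an apostrophe between letters inside the word.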

Emoji Filtering

from txtai.pipeline import Tokenizer

# Demonstrate emoji filtering
text_with_emoji = "Great results from the model! Best performance ever!"

# With emoji filtering (default)
tokens_filtered = Tokenizer.tokenize(text_with_emoji, emoji=True)
print(tokens_filtered)

# Without emoji filtering
tokens_unfiltered = Tokenizer.tokenize(text_with_emoji, emoji=False)
print(tokens_unfiltered)
