Implementation:Neuml Txtai Tokenizer Pipeline

Knowledge Sources	Neuml_Txtai
Domains	Text_Processing, Tokenization
Last Updated	2026-02-09 17:00 GMT

Overview

The Tokenizer pipeline class provides configurable text tokenization with multiple strategies, including Unicode word boundary segmentation (UTR #29), emoji filtering, alphanumeric filtering, stop word removal, and custom regular expression patterns.

Description

The Tokenizer class inherits from Pipeline and offers both instance-based and static tokenization methods. Its default splitting strategy follows the word boundary rules of Unicode Standard Annex #29 (UAX #29, formerly Unicode Technical Report #29), which gives consistent word segmentation across most languages. On top of this, it layers configurable options: lowercase normalization, emoji detection and removal, alphanumeric-only filtering, stop word removal using a built-in stop word list, whitespace-based splitting as an alternative to UTR #29, and custom regular expression tokenization. The static tokenize method exposes the same options at class level, but with different defaults for alphanum and stopwords (both True, versus False for the constructor).
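
The layering described above can be illustrated with a minimal, standard-library sketch. It is not the txtai implementation: the name sketch_tokenize, the abbreviated STOP_WORDS set, and the punctuation handling are assumptions made for illustration, and plain whitespace splitting stands in for UTR #29 segmentation.

```python
import re

# Hypothetical, abbreviated stop word list for illustration only
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "with"}

def sketch_tokenize(text, lowercase=True, alphanum=True, stopwords=True, regexp=None):
    """Illustrative stand-in for the layered filters; not the txtai code."""
    if lowercase:
        text = text.lower()
    # Splitting step: custom pattern if given, otherwise plain whitespace
    tokens = re.split(regexp, text) if regexp else text.split()
    # Drop empty strings that re.split can produce at the edges
    tokens = [t for t in tokens if t]
    if alphanum:
        # Trim surrounding punctuation, then keep letter/digit-only tokens
        tokens = [t.strip(".,!?;:\"'()") for t in tokens]
        tokens = [t for t in tokens if t.isalnum()]
    if stopwords:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return tokens

print(sketch_tokenize("Machine Learning is transforming the world!"))
# ['machine', 'learning', 'transforming', 'world']
```

Each option is an independent step applied in sequence, so turning one off leaves the others unchanged.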

Usage

Use the Tokenizer pipeline when you need consistent, configurable text tokenization for preprocessing in search, indexing, or text analysis workflows. It is well-suited for preparing text for TF-IDF scoring, keyword extraction, or any downstream task that requires clean token sequences. Use the static method for one-off tokenization or the instance for repeated use with the same configuration.
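
As a rough sketch of the preprocessing role described above, the following counts term frequencies per document from token lists, which is the kind of input a TF-IDF scorer needs. The helper term_frequencies and the stand-in simple_tokenizer are hypothetical names used so the example runs without txtai. Any callable that maps a string to a list of tokens could be passed instead, which is assumed here to include a configured Tokenizer instance.

```python
from collections import Counter

def term_frequencies(texts, tokenizer):
    """Count token occurrences per document using any str -> list[str] callable."""
    return [Counter(tokenizer(text)) for text in texts]

# Stand-in tokenizer so the sketch runs without txtai installed
def simple_tokenizer(text):
    return text.lower().split()

docs = ["search engines index text", "text search needs clean text"]
counts = term_frequencies(docs, simple_tokenizer)
print(counts[1]["text"])
# 2
```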

Code Reference

Source Location

Repository: Neuml_Txtai
File: src/python/txtai/pipeline/data/tokenizer.py
Lines: 1-130

Signature

class Tokenizer(Pipeline):
    @staticmethod
    def tokenize(text, lowercase=True, emoji=True, alphanum=True,
                 stopwords=True, whitespace=False, regexp=None):
        """
        Static method to tokenize text with configurable options.

        Args:
            text: input text string
            lowercase: convert to lowercase (default: True)
            emoji: filter emoji characters (default: True)
            alphanum: keep only alphanumeric tokens (default: True)
            stopwords: remove stop words (default: True)
            whitespace: use whitespace splitting instead of UTR #29 (default: False)
            regexp: custom regular expression pattern for tokenization (default: None)

        Returns:
            list of token strings
        """

    def __init__(self, lowercase=True, emoji=True, alphanum=False,
                 stopwords=False, whitespace=False, regexp=None):
        """
        Creates a Tokenizer pipeline instance.

        Args:
            lowercase: convert to lowercase (default: True)
            emoji: filter emoji characters (default: True)
            alphanum: keep only alphanumeric tokens (default: False)
            stopwords: remove stop words (default: False)
            whitespace: use whitespace splitting instead of UTR #29 (default: False)
            regexp: custom regular expression pattern for tokenization (default: None)
        """

    def __call__(self, text):
        """
        Tokenizes the input text using the configured options.

        Args:
            text: input text string or list of strings

        Returns:
            list of token strings (or list of lists for batch input)
        """

Import

from txtai.pipeline import Tokenizer
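
The signatures above give the static method and the constructor different defaults for alphanum and stopwords. The sketch below shows one way that arrangement can work, with the static method building a configured instance and calling it. SketchTokenizer is a hypothetical stand-in that mirrors only those defaults; that the real static method delegates this way is an assumption, and the filtering logic is deliberately simplified.

```python
class SketchTokenizer:
    """Hypothetical stand-in mirroring only the default arguments shown above."""

    @staticmethod
    def tokenize(text, lowercase=True, alphanum=True, stopwords=True):
        # Static entry point: build a configured instance and call it once
        return SketchTokenizer(lowercase, alphanum, stopwords)(text)

    def __init__(self, lowercase=True, alphanum=False, stopwords=False):
        self.lowercase = lowercase
        self.alphanum = alphanum
        self.stopwords = stopwords

    def __call__(self, text):
        text = text.lower() if self.lowercase else text
        tokens = text.split()
        if self.alphanum:
            tokens = [t for t in tokens if t.isalnum()]
        if self.stopwords:
            # Tiny illustrative stop word set, not the built-in list
            tokens = [t for t in tokens if t not in {"is", "the", "a"}]
        return tokens

text = "The model is fast"
print(SketchTokenizer.tokenize(text))
# ['model', 'fast']
print(SketchTokenizer()(text))
# ['the', 'model', 'is', 'fast']
```

The same input yields different tokens depending on the entry point, so pass the options explicitly when the two must agree.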

I/O Contract

Inputs

Name	Type	Required	Description
text	str or list[str]	Yes (for __call__)	Input text to tokenize, or list of texts for batch tokenization
lowercase	bool	No	Convert text to lowercase before tokenization (default: True)
emoji	bool	No	Filter out emoji characters from tokens (default: True)
alphanum	bool	No	Keep only tokens containing alphanumeric characters (default: True for static, False for instance)
stopwords	bool	No	Remove common English stop words (default: True for static, False for instance)
whitespace	bool	No	Use simple whitespace splitting instead of Unicode UTR #29 segmentation (default: False)
regexp	str	No	Custom regular expression pattern for tokenization, overrides other splitting methods (default: None)

Outputs

Name	Type	Description
tokens	list[str]	List of token strings after applying all configured filters
tokens (batch)	list[list[str]]	List of token lists when batch input is provided
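
The single versus batch contract in the tables above can be sketched as a small dispatch helper. The names tokenize_any and simple_tokenizer are hypothetical and exist only so the example is self-contained. The sketch shows the shape of the return values, not how the pipeline handles batches internally.

```python
def tokenize_any(data, tokenizer):
    """Return list[str] for a single string, list[list[str]] for a list of strings."""
    if isinstance(data, str):
        return tokenizer(data)
    return [tokenizer(text) for text in data]

# Stand-in tokenizer so the sketch runs without txtai installed
def simple_tokenizer(text):
    return text.lower().split()

print(tokenize_any("Hello World", simple_tokenizer))
# ['hello', 'world']
print(tokenize_any(["Hello World", "Batch input"], simple_tokenizer))
# [['hello', 'world'], ['batch', 'input']]
```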

Usage Examples

Basic Usage

from txtai.pipeline import Tokenizer

# Use the static method for quick tokenization
tokens = Tokenizer.tokenize("Machine Learning is transforming the world!")
print(tokens)
# ['machine', 'learning', 'transforming', 'world']
# (lowercased, stop words removed, alphanumeric only)

Instance-Based Tokenization

from txtai.pipeline import Tokenizer

# Create a tokenizer with custom settings
tokenizer = Tokenizer(
    lowercase=True,
    emoji=True,
    alphanum=True,
    stopwords=True
)

# Tokenize text
tokens = tokenizer("Natural Language Processing with Python 3.10")
print(tokens)
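# e.g. ['natural', 'language', 'processing', 'python']
# ('with' is removed as a stop word; whether '3.10' survives depends on
# how the alphanumeric filter treats digit-only tokens)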

Custom Regular Expression

from txtai.pipeline import Tokenizer

# Tokenize using a custom regex pattern (split on hyphens and whitespace)
tokenizer = Tokenizer(regexp=r"[\s\-]+")

tokens = tokenizer("state-of-the-art machine-learning model")
print(tokens)
# ['state', 'of', 'the', 'art', 'machine', 'learning', 'model']

Whitespace-Based Splitting

from txtai.pipeline import Tokenizer

# Use simple whitespace splitting instead of UTR #29
tokenizer = Tokenizer(whitespace=True, lowercase=True, emoji=True)

tokens = tokenizer("Hello World! This is a test.")
print(tokens)
# ['hello', 'world!', 'this', 'is', 'a', 'test.']
# Note: punctuation is preserved with whitespace splitting

Emoji Filtering

from txtai.pipeline import Tokenizer

# Demonstrate emoji filtering
text_with_emoji = "Great results from the model! 🎉 Best performance ever! 🚀"

# With emoji filtering (default)
tokens_filtered = Tokenizer.tokenize(text_with_emoji, emoji=True)
print(tokens_filtered)

# Without emoji filtering
tokens_unfiltered = Tokenizer.tokenize(text_with_emoji, emoji=False)
print(tokens_unfiltered)

Related Pages

Principle:Neuml_Txtai_Text_Tokenization
