Implementation:Neuml Txtai Tokenizer Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Text_Processing, Tokenization |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
The Tokenizer pipeline class provides configurable text tokenization with several strategies: Unicode word boundary segmentation (UTR #29), emoji handling, alphanumeric filtering, stop word removal, whitespace splitting, and custom regular expression patterns.
Description
The Tokenizer class inherits from Pipeline and offers both instance-based and static tokenization. By default, it segments text on word boundaries as defined by Unicode Standard Annex #29 (UAX #29, formerly Unicode Technical Report #29 and abbreviated UTR #29 on this page), which handles text consistently across languages. Configurable options are layered on top: lowercase normalization, emoji tokenization, alphanumeric-only filtering, stop word removal using a built-in stop word list, whitespace-based splitting as an alternative to UTR #29, and custom regular expression tokenization. The static tokenize method exposes the same options at class level but with different defaults: alphanum and stopwords default to True for the static method and to False for an instance.
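The layering described above can be sketched with the standard library alone. This is a simplified illustration, not txtai's implementation: it splits on whitespace instead of applying UTR #29 segmentation, and the names sketch_tokenize, STOP_WORDS, and HAS_ALPHANUM, as well as the abbreviated stop word list, are assumptions made for the example.

```python
import re
import string

# Hypothetical, abbreviated stop word list for illustration only;
# it is not the library's built-in list.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to", "with"}

# Illustrative rule: a token must contain at least one ASCII letter or digit.
HAS_ALPHANUM = re.compile(r"[a-z0-9]", re.IGNORECASE)


def sketch_tokenize(text, lowercase=True, alphanum=True, stopwords=True):
    """Applies lowercase -> split -> alphanumeric filter -> stop word filter."""
    if lowercase:
        text = text.lower()

    # Split on whitespace and trim punctuation from both ends of each token
    tokens = [token.strip(string.punctuation) for token in text.split()]
    tokens = [token for token in tokens if token]

    if alphanum:
        tokens = [token for token in tokens if HAS_ALPHANUM.search(token)]

    if stopwords:
        tokens = [token for token in tokens if token.lower() not in STOP_WORDS]

    return tokens


print(sketch_tokenize("Machine Learning is transforming the world!"))
# ['machine', 'learning', 'transforming', 'world']
```

Each filter is an independent pass over the token list, so disabling one option leaves the others unaffected.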
Usage
Use the Tokenizer pipeline when you need consistent, configurable text tokenization for preprocessing in search, indexing, or text analysis workflows. It is well-suited for preparing text for TF-IDF scoring, keyword extraction, or any downstream task that requires clean token sequences. Use the static method for one-off tokenization or the instance for repeated use with the same configuration.
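As a rough illustration of the "configure once, reuse many times" pattern in a term-counting workflow, the sketch below counts token frequencies across documents. The helper term_frequencies and the stand_in tokenizer are hypothetical names introduced here so that the example runs without txtai installed; any callable that maps a string to a list of tokens can be passed instead.

```python
from collections import Counter


def term_frequencies(documents, tokenizer):
    """Counts how often each token occurs across all documents."""
    counts = Counter()
    for document in documents:
        counts.update(tokenizer(document))
    return counts


def stand_in(text):
    """Stand-in tokenizer (lowercase + whitespace split) used only for this sketch."""
    return text.lower().split()


documents = ["search engines index text", "text search needs clean tokens"]
frequencies = term_frequencies(documents, stand_in)
print(frequencies["text"])
# 2
```

With txtai installed, a configured Tokenizer instance could presumably be passed in place of stand_in, since the instance is callable on a string.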
Code Reference
Source Location
- Repository: Neuml_Txtai
- File: src/python/txtai/pipeline/data/tokenizer.py
- Lines: 1-130
Signature
class Tokenizer(Pipeline):
    @staticmethod
    def tokenize(text, lowercase=True, emoji=True, alphanum=True,
                 stopwords=True, whitespace=False, regexp=None):
        """
        Static method to tokenize text with configurable options.

        Args:
            text: input text string
            lowercase: convert to lowercase (default: True)
            emoji: tokenize emoji in text (default: True)
            alphanum: keep only alphanumeric tokens (default: True)
            stopwords: remove stop words (default: True)
            whitespace: use whitespace splitting instead of UTR #29 (default: False)
            regexp: custom regular expression pattern for tokenization (default: None)

        Returns:
            list of token strings
        """

    def __init__(self, lowercase=True, emoji=True, alphanum=False,
                 stopwords=False, whitespace=False, regexp=None):
        """
        Creates a Tokenizer pipeline instance.

        Args:
            lowercase: convert to lowercase (default: True)
            emoji: tokenize emoji in text (default: True)
            alphanum: keep only alphanumeric tokens (default: False)
            stopwords: remove stop words (default: False)
            whitespace: use whitespace splitting instead of UTR #29 (default: False)
            regexp: custom regular expression pattern for tokenization (default: None)
        """

    def __call__(self, text):
        """
        Tokenizes the input text using the configured options.

        Args:
            text: input text string or list of strings

        Returns:
            list of token strings (or list of lists for batch input)
        """
Import
from txtai.pipeline import Tokenizer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| text | str or list[str] | Yes (for __call__) | Input text to tokenize, or list of texts for batch tokenization |
| lowercase | bool | No | Convert text to lowercase before tokenization (default: True) |
| emoji | bool | No | Tokenize emoji in text (default: True) |
| alphanum | bool | No | Keep only tokens containing alphanumeric characters (default: True for static, False for instance) |
| stopwords | bool | No | Remove common English stop words (default: True for static, False for instance) |
| whitespace | bool | No | Use simple whitespace splitting instead of Unicode UTR #29 segmentation (default: False) |
| regexp | str | No | Custom regular expression pattern for tokenization, overrides other splitting methods (default: None) |
Outputs
| Name | Type | Description |
|---|---|---|
| tokens | list[str] | List of token strings after applying all configured filters |
| tokens (batch) | list[list[str]] | List of token lists when batch input is provided |
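The input and output shapes in the tables above can be illustrated with a small dispatch helper. This is a sketch under assumptions: tokenize_any is a hypothetical name that is not part of txtai, and it works with any callable that maps a string to a list of tokens.

```python
def tokenize_any(tokenizer, text):
    """Returns list[str] for a str input and list[list[str]] for a list input."""
    if isinstance(text, str):
        return tokenizer(text)

    # Batch input: tokenize each element separately, preserving order
    return [tokenizer(item) for item in text]


# str.split serves as a stand-in tokenizer so the sketch runs on its own
single = tokenize_any(str.split, "clean token sequences")
batch = tokenize_any(str.split, ["first text", "second text"])

print(single)
# ['clean', 'token', 'sequences']
print(batch)
# [['first', 'text'], ['second', 'text']]
```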
Usage Examples
Basic Usage
from txtai.pipeline import Tokenizer
# Use the static method for quick tokenization
tokens = Tokenizer.tokenize("Machine Learning is transforming the world!")
print(tokens)
# ['machine', 'learning', 'transforming', 'world']
# (lowercased, stop words removed, alphanumeric only)
Instance-Based Tokenization
from txtai.pipeline import Tokenizer
# Create a tokenizer with custom settings
tokenizer = Tokenizer(
lowercase=True,
emoji=True,
alphanum=True,
stopwords=True
)
# Tokenize text
tokens = tokenizer("Natural Language Processing with Python 3.10")
print(tokens)
# ['natural', 'language', 'processing', 'python']
# ('with' is a stop word; '3.10' contains no letter, so the alphanumeric filter drops it)
Custom Regular Expression
from txtai.pipeline import Tokenizer
# Tokenize using a custom regex pattern (split on hyphens and whitespace)
tokenizer = Tokenizer(regexp=r"[\s\-]+")
tokens = tokenizer("state-of-the-art machine-learning model")
print(tokens)
# ['state', 'of', 'the', 'art', 'machine', 'learning', 'model']
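A regular expression can describe either the separators between tokens or the tokens themselves, and the two readings produce different results for the same pattern. The sketch below uses only Python's re module, independent of txtai, to show both readings for this sentence. Which interpretation the regexp parameter applies is not stated on this page, so it is worth checking against the installed version.

```python
import re

text = "state-of-the-art machine-learning model"

# Separator pattern: split wherever a run of whitespace or hyphens occurs
split_tokens = re.split(r"[\s\-]+", text)

# Token pattern: match runs of characters that are neither whitespace nor hyphens
match_tokens = re.findall(r"[^\s\-]+", text)

print(split_tokens)
print(match_tokens)
# Both print ['state', 'of', 'the', 'art', 'machine', 'learning', 'model']
```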
Whitespace-Based Splitting
from txtai.pipeline import Tokenizer
# Use simple whitespace splitting instead of UTR #29
tokenizer = Tokenizer(whitespace=True, lowercase=True, emoji=True)
tokens = tokenizer("Hello World! This is a test.")
print(tokens)
# ['hello', 'world!', 'this', 'is', 'a', 'test.']
# Note: punctuation is preserved with whitespace splitting
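The difference between whitespace splitting and word-based segmentation can be seen with the standard library alone. In this sketch, the `\w+` pattern is only a rough stand-in for UTR #29 word boundaries, not an implementation of them.

```python
import re

text = "Hello World! This is a test."

# Whitespace splitting keeps punctuation attached to the neighboring word
whitespace_tokens = text.lower().split()

# Matching runs of word characters leaves punctuation out
word_tokens = re.findall(r"\w+", text.lower())

print(whitespace_tokens)
# ['hello', 'world!', 'this', 'is', 'a', 'test.']
print(word_tokens)
# ['hello', 'world', 'this', 'is', 'a', 'test']
```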
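Emoji Handling
from txtai.pipeline import Tokenizer
# Illustrative sample text with an emoji, written as an escape sequence (U+1F680)
text_with_emoji = "Great results from the model! \U0001F680 Best performance ever!"
# emoji=True (default): emoji are tokenized along with words.
# Instances are used here because alphanum=True, the static method's default,
# keeps only alphanumeric tokens regardless of the emoji setting.
tokens_with_emoji = Tokenizer(emoji=True)(text_with_emoji)
print(tokens_with_emoji)
# emoji=False: emoji are left out of the token list
tokens_without_emoji = Tokenizer(emoji=False)(text_with_emoji)
print(tokens_without_emoji)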
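Emoji detection itself can be approximated with the standard library. Python's re and unicodedata modules do not expose Unicode emoji properties, so the sketch below falls back on the general category "So" (Symbol, other). That is a coarse assumption: it also matches non-emoji symbols and misses emoji sequences that include joiners or variation selectors. The name is_symbol_token is hypothetical.

```python
import unicodedata


def is_symbol_token(token):
    """True if every character is in the Unicode 'So' (Symbol, other) category."""
    return bool(token) and all(unicodedata.category(ch) == "So" for ch in token)


# "\U0001F680" is the rocket emoji, written as an escape sequence
tokens = ["great", "results", "\U0001F680", "best"]
kept = [token for token in tokens if not is_symbol_token(token)]

print(kept)
# ['great', 'results', 'best']
```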