Implementation: NeuML txtai Data Tokenizers
| Knowledge Sources | |
|---|---|
| Domains | Training, NLP |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tooling from the txtai library for tokenizing and formatting raw datasets into model-ready training data.
Description
The txtai data package provides four task-specific tokenization classes -- Labels, Questions, Sequences, and Texts -- each inheriting from a common Data base class. Each class implements a process() method that defines how a batch of raw examples is tokenized and what output fields are produced. The shared __call__ method on the base class orchestrates the full dataset preparation: it iterates over training and (optionally) validation data, applies the task-specific process() function, and returns ready-to-train tokenized datasets.
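The division of labor between the shared __call__ and the task-specific process() can be sketched as follows. This is a simplified, hypothetical illustration of the pattern, not the actual txtai source; the toy subclass and its "tokenization" are invented for demonstration.

```python
# Simplified sketch of the Data / process() pattern described above.
# Class and method names mirror the txtai design; bodies are illustrative only.

class Data:
    def __init__(self, tokenizer, columns, maxlength):
        self.tokenizer = tokenizer
        self.columns = columns
        self.maxlength = maxlength

    def __call__(self, train, validation, workers=None):
        # Orchestrates preparation: tokenize train and optional validation data
        return (self.prepare(train),
                self.prepare(validation) if validation else None)

    def prepare(self, data):
        # Apply the task-specific process() hook to the batch
        return self.process(list(data))

    def process(self, data):
        raise NotImplementedError


class UppercaseData(Data):
    # Toy subclass: "tokenization" is just uppercasing, to show the hook
    def process(self, data):
        return [{"input": row["text"].upper()} for row in data]


train_tokens, val_tokens = UppercaseData(None, ("text",), 128)([{"text": "hello"}], None)
```

The point of the pattern is that subclasses only describe per-batch tokenization; iteration, validation handling, and output assembly live once in the base class.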
- Labels -- tokenizes text-classification datasets. Supports single-text and text-pair inputs. Produces input_ids, attention_mask, and a label field.
- Questions -- tokenizes extractive QA datasets. Handles stride-based chunking of long contexts, maps character-level answer spans to token-level start_positions/end_positions, and falls back to the CLS token when no answer exists.
- Sequences -- tokenizes sequence-to-sequence datasets. Tokenizes source and target independently, supports an optional source prefix string, and places target IDs in the labels field.
- Texts -- tokenizes raw text for language model pretraining. Concatenates all tokenized text into a flat stream and chunks it into fixed-length segments of maxlength.
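To make the stride behavior of Questions concrete, here is a standalone sketch of splitting a long token sequence into overlapping windows. This is illustrative logic only; in practice this work is delegated to the HuggingFace tokenizer (its stride and overflowing-tokens options), not hand-rolled like this.

```python
def windows(tokens, maxlength, stride):
    """Split tokens into windows of up to maxlength items, where each
    window overlaps the previous one by stride tokens (sketch only)."""
    step = maxlength - stride
    out = []
    for start in range(0, len(tokens), step):
        out.append(tokens[start:start + maxlength])
        if start + maxlength >= len(tokens):
            break
    return out

chunks = windows(list(range(10)), maxlength=4, stride=2)
# Each window shares its first `stride` tokens with the tail of the previous one
```

Overlapping windows matter for extractive QA because an answer span falling on a chunk boundary would otherwise be unrecoverable; with a stride, at least one window contains the full span.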
Usage
These classes are used internally by HFTrainer during the prepare() step, but they can also be instantiated directly when custom data pipelines are needed.
Code Reference
Source Location
- Repository: txtai
- Files:
- src/python/txtai/data/labels.py (Lines 13-42)
- src/python/txtai/data/questions.py (Lines 13-84)
- src/python/txtai/data/sequences.py (Lines 13-48)
- src/python/txtai/data/texts.py (Lines 15-68)
- src/python/txtai/data/base.py (Lines 27-40 for __call__)
Signature
# Base class callable (shared by all subclasses)
class Data:
    def __call__(self, train, validation, workers):
        ...

# Text classification
class Labels(Data):
    def __init__(self, tokenizer, columns, maxlength):
        ...

# Question answering
class Questions(Data):
    def __init__(self, tokenizer, columns, maxlength, stride):
        ...

# Sequence-to-sequence
class Sequences(Data):
    def __init__(self, tokenizer, columns, maxlength, prefix):
        ...

# Language modeling
class Texts(Data):
    def __init__(self, tokenizer, columns, maxlength):
        ...
Import
from txtai.data import Labels, Questions, Sequences, Texts
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| tokenizer | AutoTokenizer | Yes | HuggingFace tokenizer instance matching the base model. |
| columns | tuple | No | Column name mapping for the dataset. Defaults vary by task: ("text", None, "label") for Labels, ("question", "context", "answers") for Questions, ("source", "target") for Sequences, ("text", None) for Texts. |
| maxlength | int | Yes | Maximum token sequence length. Sequences are truncated to this value; Texts chunks are exactly this size. |
| stride | int | Yes (Questions only) | Overlap size in tokens when splitting long contexts into overlapping windows. |
| prefix | str | No (Sequences only) | String prepended to every source text before tokenization (e.g., "summarize: "). |
| train | Dataset / DataFrame / iterable | Yes (at __call__) | Training dataset in any of the supported formats. |
| validation | Dataset / DataFrame / iterable | No (at __call__) | Optional validation dataset. |
| workers | int | No (at __call__) | Number of concurrent tokenization processes. None uses the main process only. |
Outputs
| Name | Type | Description |
|---|---|---|
| result | tuple | A (train_tokens, validation_tokens) tuple. Each element is either a HuggingFace mapped Dataset (when the input was a HF Dataset) or a Tokens instance (a torch Dataset wrapping the tokenized dicts). validation_tokens is None when no validation data is provided. |
Usage Examples
Basic Example: Text Classification
from transformers import AutoTokenizer
from txtai.data import Labels
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Training data as list of dicts
train = [
    {"text": "This movie was great!", "label": 1},
    {"text": "Terrible experience.", "label": 0},
]
processor = Labels(tokenizer, columns=("text", None, "label"), maxlength=128)
train_tokens, val_tokens = processor(train, validation=None, workers=None)
Question Answering
from transformers import AutoTokenizer
from txtai.data import Questions
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
train = [
    {
        "question": "What is txtai?",
        "context": "txtai is an all-in-one embeddings database.",
        "answers": {"text": ["all-in-one embeddings database"], "answer_start": [12]},
    }
]
processor = Questions(tokenizer, columns=None, maxlength=384, stride=128)
train_tokens, val_tokens = processor(train, validation=None, workers=None)
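The character-to-token span mapping that Questions performs can be illustrated with a toy whitespace tokenizer. The helper below is hypothetical and exists only to show the idea; the real implementation relies on the HuggingFace tokenizer's offset mappings rather than anything like this.

```python
def char_span_to_token_span(context, answer_start, answer_text):
    """Map a character-level answer span to (start_token, end_token)
    indices using a toy whitespace tokenizer (sketch of offset mapping)."""
    # Record (start_char, end_char) offsets for each whitespace token
    offsets, pos = [], 0
    for token in context.split():
        start = context.index(token, pos)
        offsets.append((start, start + len(token)))
        pos = start + len(token)

    answer_end = answer_start + len(answer_text)
    start_token = end_token = None
    for i, (s, e) in enumerate(offsets):
        if s <= answer_start < e:   # token containing the first answer char
            start_token = i
        if s < answer_end <= e:     # token containing the last answer char
            end_token = i
    return start_token, end_token

span = char_span_to_token_span(
    "txtai is an all-in-one embeddings database.",
    12,
    "all-in-one embeddings database",
)
# span covers tokens "all-in-one" through "database."
```

When no token range contains the answer (for example, the answer lies outside the current stride window), the class falls back to the CLS position, as noted in the description above.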
Sequence-to-Sequence with Prefix
from transformers import AutoTokenizer
from txtai.data import Sequences
tokenizer = AutoTokenizer.from_pretrained("t5-small")
train = [
    {"source": "The quick brown fox jumps over the lazy dog.", "target": "A fox jumps over a dog."},
]
processor = Sequences(tokenizer, columns=None, maxlength=512, prefix="summarize: ")
train_tokens, val_tokens = processor(train, validation=None, workers=None)
Language Modeling
from transformers import AutoTokenizer
from txtai.data import Texts
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
train = [
    {"text": "txtai builds AI-powered semantic search applications."},
    {"text": "Embeddings databases enable similarity search over content."},
]
processor = Texts(tokenizer, columns=None, maxlength=128)
train_tokens, val_tokens = processor(train, validation=None, workers=None)
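The concatenate-and-chunk behavior of Texts can be sketched independently of any tokenizer. The function below is an illustrative stand-in that assumes token IDs are already available and that a trailing partial segment is dropped, a common convention in language-model data preparation; it is not the txtai implementation.

```python
def concat_and_chunk(batches, maxlength):
    """Concatenate per-example token ID lists into one flat stream,
    then split it into fixed-length segments of exactly maxlength,
    dropping any short remainder (sketch of the Texts chunking)."""
    stream = [tid for ids in batches for tid in ids]
    total = (len(stream) // maxlength) * maxlength
    return [stream[i:i + maxlength] for i in range(0, total, maxlength)]

segments = concat_and_chunk([[1, 2, 3], [4, 5], [6, 7, 8, 9]], maxlength=4)
```

Because chunking happens after concatenation, segment boundaries ignore the original example boundaries: every segment has exactly maxlength tokens, which keeps pretraining batches uniform.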