Implementation: NeuML txtai Data Tokenizers
| Knowledge Sources | |
|---|---|
| Domains | Training, NLP |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tooling from the txtai library for tokenizing and formatting raw datasets into model-ready training data.
Description
The txtai data package provides four task-specific tokenization classes -- Labels, Questions, Sequences, and Texts -- each inheriting from a common Data base class. Each class implements a process() method that defines how a batch of raw examples is tokenized and what output fields are produced. The shared __call__ method on the base class orchestrates the full dataset preparation: it iterates over training and (optionally) validation data, applies the task-specific process() function, and returns ready-to-train tokenized datasets.
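The division of labor between the shared __call__ and the task-specific process() can be sketched as follows. This is a simplified, hypothetical illustration of the pattern, not the actual txtai source; the toy subclass and its "tokenization" are invented for demonstration.

```python
# Simplified sketch of the Data / process() pattern described above.
# Class and method names mirror the txtai design; bodies are illustrative only.

class Data:
    def __init__(self, tokenizer, columns, maxlength):
        self.tokenizer = tokenizer
        self.columns = columns
        self.maxlength = maxlength

    def __call__(self, train, validation, workers=None):
        # Orchestrates preparation: tokenize train and optional validation data
        return (self.prepare(train),
                self.prepare(validation) if validation else None)

    def prepare(self, data):
        # Apply the task-specific process() hook to the batch
        return self.process(list(data))

    def process(self, data):
        raise NotImplementedError


class UppercaseData(Data):
    # Toy subclass: "tokenization" is just uppercasing, to show the hook
    def process(self, data):
        return [{"input": row["text"].upper()} for row in data]


train_tokens, val_tokens = UppercaseData(None, ("text",), 128)([{"text": "hello"}], None)
```

The point of the pattern is that subclasses only describe per-batch tokenization; iteration, validation handling, and output assembly live once in the base class.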
- Labels -- tokenizes text-classification datasets. Supports single-text and text-pair inputs. Produces input_ids, attention_mask, and a label field.
- Questions -- tokenizes extractive QA datasets. Handles stride-based chunking of long contexts, maps character-level answer spans to token-level start_positions/end_positions, and falls back to the CLS token when no answer exists.
- Sequences -- tokenizes sequence-to-sequence datasets. Tokenizes source and target independently, supports an optional source prefix string, and places target IDs in the labels field.
- Texts -- tokenizes raw text for language model pretraining. Concatenates all tokenized text into a flat stream and chunks it into fixed-length segments of maxlength.
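To make the stride behavior of Questions concrete, here is a standalone sketch of splitting a long token sequence into overlapping windows. This is illustrative logic only; in practice this work is delegated to the HuggingFace tokenizer (its stride and overflowing-tokens options), not hand-rolled like this.

```python
def windows(tokens, maxlength, stride):
    """Split tokens into windows of up to maxlength items, where each
    window overlaps the previous one by stride tokens (sketch only)."""
    step = maxlength - stride
    out = []
    for start in range(0, len(tokens), step):
        out.append(tokens[start:start + maxlength])
        if start + maxlength >= len(tokens):
            break
    return out

chunks = windows(list(range(10)), maxlength=4, stride=2)
# Each window shares its first `stride` tokens with the tail of the previous one
```

Overlapping windows matter for extractive QA because an answer span falling on a chunk boundary would otherwise be unrecoverable; with a stride, at least one window contains the full span.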
Usage
These classes are used internally by HFTrainer during the prepare() step, but they can also be instantiated directly when custom data pipelines are needed.
Code Reference
Source Location
- Repository: txtai
- Files:
- src/python/txtai/data/labels.py (Lines 13-42)
- src/python/txtai/data/questions.py (Lines 13-84)
- src/python/txtai/data/sequences.py (Lines 13-48)
- src/python/txtai/data/texts.py (Lines 15-68)
- src/python/txtai/data/base.py (Lines 27-40 for __call__)
Signature
# Base class callable (shared by all subclasses)
class Data:
    def __call__(self, train, validation, workers):
        ...

# Text classification
class Labels(Data):
    def __init__(self, tokenizer, columns, maxlength):
        ...

# Question answering
class Questions(Data):
    def __init__(self, tokenizer, columns, maxlength, stride):
        ...

# Sequence-to-sequence
class Sequences(Data):
    def __init__(self, tokenizer, columns, maxlength, prefix):
        ...

# Language modeling
class Texts(Data):
    def __init__(self, tokenizer, columns, maxlength):
        ...
Import
from txtai.data import Labels, Questions, Sequences, Texts
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| tokenizer | AutoTokenizer | Yes | HuggingFace tokenizer instance matching the base model. |
| columns | tuple | No | Column name mapping for the dataset. Defaults vary by task: ("text", None, "label") for Labels, ("question", "context", "answers") for Questions, ("source", "target") for Sequences, ("text", None) for Texts. |
| maxlength | int | Yes | Maximum token sequence length. Sequences are truncated to this value; Texts chunks are exactly this size. |
| stride | int | Yes (Questions only) | Overlap size in tokens when splitting long contexts into overlapping windows. |
| prefix | str | No (Sequences only) | String prepended to every source text before tokenization (e.g., "summarize: "). |
| train | Dataset / DataFrame / iterable | Yes (at __call__) | Training dataset in any of the supported formats. |
| validation | Dataset / DataFrame / iterable | No (at __call__) | Optional validation dataset. |
| workers | int | No (at __call__) | Number of concurrent tokenization processes. None uses the main process only. |
Outputs
| Name | Type | Description |
|---|---|---|
| result | tuple | A (train_tokens, validation_tokens) tuple. Each element is either a HuggingFace mapped Dataset (when the input was a HF Dataset) or a Tokens instance (a torch Dataset wrapping the tokenized dicts). validation_tokens is None when no validation data is provided. |
Usage Examples
Basic Example: Text Classification
from transformers import AutoTokenizer
from txtai.data import Labels
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Training data as list of dicts
train = [
    {"text": "This movie was great!", "label": 1},
    {"text": "Terrible experience.", "label": 0},
]
processor = Labels(tokenizer, columns=("text", None, "label"), maxlength=128)
train_tokens, val_tokens = processor(train, validation=None, workers=None)
Question Answering
from transformers import AutoTokenizer
from txtai.data import Questions
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
train = [
    {
        "question": "What is txtai?",
        "context": "txtai is an all-in-one embeddings database.",
        "answers": {"text": ["all-in-one embeddings database"], "answer_start": [12]},
    }
]
processor = Questions(tokenizer, columns=None, maxlength=384, stride=128)
train_tokens, val_tokens = processor(train, validation=None, workers=None)
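The character-to-token span mapping that Questions performs can be illustrated with a toy whitespace tokenizer. The helper below is hypothetical and exists only to show the idea; the real implementation relies on the HuggingFace tokenizer's offset mappings rather than anything like this.

```python
def char_span_to_token_span(context, answer_start, answer_text):
    """Map a character-level answer span to (start_token, end_token)
    indices using a toy whitespace tokenizer (sketch of offset mapping)."""
    # Record (start_char, end_char) offsets for each whitespace token
    offsets, pos = [], 0
    for token in context.split():
        start = context.index(token, pos)
        offsets.append((start, start + len(token)))
        pos = start + len(token)

    answer_end = answer_start + len(answer_text)
    start_token = end_token = None
    for i, (s, e) in enumerate(offsets):
        if s <= answer_start < e:   # token containing the first answer char
            start_token = i
        if s < answer_end <= e:     # token containing the last answer char
            end_token = i
    return start_token, end_token

span = char_span_to_token_span(
    "txtai is an all-in-one embeddings database.",
    12,
    "all-in-one embeddings database",
)
# span covers tokens "all-in-one" through "database."
```

When no token range contains the answer (for example, the answer lies outside the current stride window), the class falls back to the CLS position, as noted in the description above.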
Sequence-to-Sequence with Prefix
from transformers import AutoTokenizer
from txtai.data import Sequences
tokenizer = AutoTokenizer.from_pretrained("t5-small")
train = [
    {"source": "The quick brown fox jumps over the lazy dog.", "target": "A fox jumps over a dog."},
]
processor = Sequences(tokenizer, columns=None, maxlength=512, prefix="summarize: ")
train_tokens, val_tokens = processor(train, validation=None, workers=None)
Language Modeling
from transformers import AutoTokenizer
from txtai.data import Texts
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
train = [
    {"text": "txtai builds AI-powered semantic search applications."},
    {"text": "Embeddings databases enable similarity search over content."},
]
processor = Texts(tokenizer, columns=None, maxlength=128)
train_tokens, val_tokens = processor(train, validation=None, workers=None)
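The concatenate-and-chunk behavior of Texts can be sketched independently of any tokenizer. The function below is an illustrative stand-in that assumes token IDs are already available and that a trailing partial segment is dropped, a common convention in language-model data preparation; it is not the txtai implementation.

```python
def concat_and_chunk(batches, maxlength):
    """Concatenate per-example token ID lists into one flat stream,
    then split it into fixed-length segments of exactly maxlength,
    dropping any short remainder (sketch of the Texts chunking)."""
    stream = [tid for ids in batches for tid in ids]
    total = (len(stream) // maxlength) * maxlength
    return [stream[i:i + maxlength] for i in range(0, total, maxlength)]

segments = concat_and_chunk([[1, 2, 3], [4, 5], [6, 7, 8, 9]], maxlength=4)
```

Because chunking happens after concatenation, segment boundaries ignore the original example boundaries: every segment has exactly maxlength tokens, which keeps pretraining batches uniform.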