Implementation:Neuml Txtai Data Tokenizers

From Leeroopedia


Knowledge Sources
Domains: Training, NLP
Last Updated: 2026-02-09 00:00 GMT

Overview

A concrete tool, provided by the txtai library, for tokenizing and formatting raw datasets into model-ready training data.

Description

The txtai data package provides four task-specific tokenization classes -- Labels, Questions, Sequences, and Texts -- each inheriting from a common Data base class. Each class implements a process() method that defines how a batch of raw examples is tokenized and what output fields are produced. The shared __call__ method on the base class orchestrates the full dataset preparation: it iterates over training and (optionally) validation data, applies the task-specific process() function, and returns ready-to-train tokenized datasets.

  • Labels -- tokenizes text-classification datasets. Supports single-text and text-pair inputs. Produces input_ids, attention_mask, and a label field.
  • Questions -- tokenizes extractive QA datasets. Handles stride-based chunking of long contexts, maps character-level answer spans to token-level start_positions/end_positions, and falls back to the CLS token when no answer exists.
  • Sequences -- tokenizes sequence-to-sequence datasets. Tokenizes source and target independently, supports an optional source prefix string, and places target IDs in the labels field.
  • Texts -- tokenizes raw text for language model pretraining. Concatenates all tokenized text into a flat stream and chunks it into fixed-length segments of maxlength.
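The concatenate-and-chunk strategy used by Texts can be sketched in a few lines of pure Python. This is a simplified illustration of the idea, not the library's actual code, and chunk_tokens is a hypothetical helper name:

```python
def chunk_tokens(batches, maxlength):
    """Concatenate tokenized batches into one flat ID stream, then split
    the stream into fixed-length segments; a trailing partial segment
    shorter than maxlength is dropped."""
    stream = [tid for batch in batches for tid in batch]
    usable = (len(stream) // maxlength) * maxlength
    return [stream[i:i + maxlength] for i in range(0, usable, maxlength)]

# Three tokenized texts of uneven length become two fixed-size chunks
print(chunk_tokens([[1, 2, 3], [4, 5, 6, 7], [8]], maxlength=4))
# [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Chunking across example boundaries like this is the standard preparation for language model pretraining: no padding is wasted, at the cost of segments that may start mid-sentence.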

Usage

These classes are used internally by HFTrainer during the prepare() step, but they can also be instantiated directly when custom data pipelines are needed.

Code Reference

Source Location

  • Repository: txtai
  • Files:
    • src/python/txtai/data/labels.py (Lines 13-42)
    • src/python/txtai/data/questions.py (Lines 13-84)
    • src/python/txtai/data/sequences.py (Lines 13-48)
    • src/python/txtai/data/texts.py (Lines 15-68)
    • src/python/txtai/data/base.py (Lines 27-40 for __call__)

Signature

# Base class callable (shared by all subclasses)
class Data:
    def __call__(self, train, validation, workers):
        ...

# Text classification
class Labels(Data):
    def __init__(self, tokenizer, columns, maxlength):
        ...

# Question answering
class Questions(Data):
    def __init__(self, tokenizer, columns, maxlength, stride):
        ...

# Sequence-to-sequence
class Sequences(Data):
    def __init__(self, tokenizer, columns, maxlength, prefix):
        ...

# Language modeling
class Texts(Data):
    def __init__(self, tokenizer, columns, maxlength):
        ...

Import

from txtai.data import Labels, Questions, Sequences, Texts

I/O Contract

Inputs

  • tokenizer (AutoTokenizer, required) -- HuggingFace tokenizer instance matching the base model.
  • columns (tuple, optional) -- Column name mapping for the dataset. Defaults vary by task: ("text", None, "label") for Labels, ("question", "context", "answers") for Questions, ("source", "target") for Sequences, ("text", None) for Texts.
  • maxlength (int, required) -- Maximum token sequence length. Sequences are truncated to this value; Texts chunks are exactly this size.
  • stride (int, required; Questions only) -- Overlap size in tokens when splitting long contexts into overlapping windows.
  • prefix (str, optional; Sequences only) -- String prepended to every source text before tokenization (e.g., "summarize: ").
  • train (Dataset / DataFrame / iterable, required at __call__) -- Training dataset in any of the supported formats.
  • validation (Dataset / DataFrame / iterable, optional at __call__) -- Optional validation dataset.
  • workers (int, optional at __call__) -- Number of concurrent tokenization processes; None uses the main process only.
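The interaction between maxlength and stride can be illustrated with a small sketch. This is a hypothetical helper showing the windowing idea only, not the library implementation: consecutive windows hold up to maxlength tokens and advance by maxlength - stride, so adjacent windows share stride tokens.

```python
def windows(token_ids, maxlength, stride):
    """Split a long token sequence into overlapping windows: each window
    holds up to maxlength tokens; adjacent windows overlap by stride."""
    step = maxlength - stride
    out = []
    for start in range(0, len(token_ids), step):
        out.append(token_ids[start:start + maxlength])
        if start + maxlength >= len(token_ids):
            break
    return out

# maxlength=5, stride=2 -> windows advance 3 tokens at a time
print(windows(list(range(10)), maxlength=5, stride=2))
# [[0, 1, 2, 3, 4], [3, 4, 5, 6, 7], [6, 7, 8, 9]]
```

The overlap ensures that an answer span falling near a window boundary appears intact in at least one window, which is why Questions requires a stride while the other tasks simply truncate.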

Outputs

  • result (tuple) -- A (train_tokens, validation_tokens) tuple. Each element is either a HuggingFace mapped Dataset (when the input was a HF Dataset) or a Tokens instance (a torch Dataset wrapping the tokenized dicts). validation_tokens is None when no validation data is provided.

Usage Examples

Basic Example: Text Classification

from transformers import AutoTokenizer
from txtai.data import Labels

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Training data as list of dicts
train = [
    {"text": "This movie was great!", "label": 1},
    {"text": "Terrible experience.", "label": 0},
]

processor = Labels(tokenizer, columns=("text", None, "label"), maxlength=128)
train_tokens, val_tokens = processor(train, validation=None, workers=None)

Question Answering

from transformers import AutoTokenizer
from txtai.data import Questions

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

train = [
    {
        "question": "What is txtai?",
        "context": "txtai is an all-in-one embeddings database.",
        "answers": {"text": ["all-in-one embeddings database"], "answer_start": [12]},
    }
]

processor = Questions(tokenizer, columns=None, maxlength=384, stride=128)
train_tokens, val_tokens = processor(train, validation=None, workers=None)

Sequence-to-Sequence with Prefix

from transformers import AutoTokenizer
from txtai.data import Sequences

tokenizer = AutoTokenizer.from_pretrained("t5-small")

train = [
    {"source": "The quick brown fox jumps over the lazy dog.", "target": "A fox jumps over a dog."},
]

processor = Sequences(tokenizer, columns=None, maxlength=512, prefix="summarize: ")
train_tokens, val_tokens = processor(train, validation=None, workers=None)

Language Modeling

from transformers import AutoTokenizer
from txtai.data import Texts

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

train = [
    {"text": "txtai builds AI-powered semantic search applications."},
    {"text": "Embeddings databases enable similarity search over content."},
]

processor = Texts(tokenizer, columns=None, maxlength=128)
train_tokens, val_tokens = processor(train, validation=None, workers=None)

Related Pages

Implements Principle

Requires Environment
