Implementation:Neuml Txtai Texts Data

Knowledge Sources	Neuml_Txtai
Domains	NLP, Language Modeling, Training Data
Last Updated	2026-02-10 01:00 GMT

Overview

Concrete tool, provided by txtai, for tokenizing text datasets used in language model training.

Description

The Texts class extends the base Data class to tokenize text datasets as input for training language models. It supports both single-text and text-pair tokenization, depending on whether a second column is configured. After tokenization, the class concatenates the tokenized output of each batch and splits it into fixed-size chunks of maxlength tokens, dropping the trailing tokens that do not fill a complete chunk. This concatenate-and-chunk approach is standard for causal and masked language model pre-training, where the model processes uniform-length sequences. The class also requests a special tokens mask from the tokenizer so that special tokens can be excluded from masking during training.
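
The concatenate-and-chunk step described above can be sketched in plain Python. This is a minimal illustration, not the library's code: the function name concat_and_chunk and the sample token ids are hypothetical, and txtai may treat edge cases (such as a batch holding fewer than maxlength tokens) differently.

```python
from itertools import chain


def concat_and_chunk(inputs, maxlength):
    """Concatenate per-example token lists and split them into maxlength-sized chunks."""
    # Flatten each field (input_ids, attention_mask, ...) into one long list
    flat = {key: list(chain.from_iterable(values)) for key, values in inputs.items()}

    # All fields are assumed to share the same total length
    total = len(next(iter(flat.values()))) if flat else 0

    # Keep only whole chunks; the trailing remainder is dropped
    usable = (total // maxlength) * maxlength

    return {key: [values[i:i + maxlength] for i in range(0, usable, maxlength)] for key, values in flat.items()}


# Hypothetical tokenizer output for three short texts (10 tokens in total)
batch = {
    "input_ids": [[101, 7592, 102], [101, 2088, 999, 102], [101, 2748, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}

chunks = concat_and_chunk(batch, maxlength=4)
print(chunks["input_ids"])  # [[101, 7592, 102, 101], [2088, 999, 102, 101]]
```

With 10 tokens and a chunk size of 4, two full chunks are produced and the last two tokens are discarded. Note that chunk boundaries can fall in the middle of an original text.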

Usage

Use the Texts data processor when preparing data for language model training tasks such as masked language modeling (MLM) or causal language modeling (CLM). Configure it with a tokenizer, column names, and the desired maximum sequence length. The processor handles single text or text-pair inputs and returns uniform-length tokenized chunks ready for training.

Code Reference

Source Location

Repository: Neuml_Txtai
File: src/python/txtai/data/texts.py

Signature

class Texts(Data):
    def __init__(self, tokenizer, columns, maxlength)
    def process(self, data)
    def concat(self, inputs)

Import

from txtai.data.texts import Texts

I/O Contract

Inputs

Name	Type	Required	Description
tokenizer	PreTrainedTokenizer	Yes	Hugging Face model tokenizer instance
columns	tuple	Yes (may be None)	Tuple of (text_column, optional_second_column) for single or pair tokenization; an empty or None value falls back to ("text", None)
maxlength	int	Yes	Maximum sequence length for each output chunk
data	dict	Yes (process)	Batch of data in column-oriented format containing the configured text column(s)
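
To illustrate how the columns setting selects between single-text and text-pair tokenization, the sketch below builds the positional arguments that would be handed to a tokenizer from a column-oriented batch. The helper tokenizer_args and the sample batch are hypothetical and only mirror the behavior described in the table above; they are not part of txtai.

```python
def tokenizer_args(data, columns):
    """Select the text column(s) of a column-oriented batch as tokenizer arguments."""
    # Fall back to a single "text" column when no columns are configured
    text1, text2 = columns if columns else ("text", None)

    # One list of strings for single-text input, two parallel lists for text pairs
    return (data[text1], data[text2]) if text2 else (data[text1],)


# Column-oriented batch: each key maps to a list with one value per row
batch = {
    "text": ["first document", "second document"],
    "premise": ["a man is cooking", "a dog runs"],
    "hypothesis": ["someone prepares food", "an animal is moving"],
}

single = tokenizer_args(batch, ("text", None))
pair = tokenizer_args(batch, ("premise", "hypothesis"))

print(len(single), len(pair))  # 1 2
```

With a Hugging Face tokenizer, such a tuple could then be unpacked as tokenizer(*args, return_special_tokens_mask=True).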

Outputs

Name	Type	Description
tokenized chunks	dict	Dictionary of tokenized fields (input_ids, attention_mask, special_tokens_mask, etc.) where each value is a list of fixed-length chunks of size maxlength
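
The special_tokens_mask field marks positions such as [CLS] and [SEP] so that a masked language modeling step can skip them. The sketch below shows the idea in plain Python under simplifying assumptions: the helper mask_tokens is hypothetical, every selected token is replaced by the mask id (real collators typically also mix in random and unchanged tokens), and -100 is used as the ignored-label value, following the common PyTorch convention.

```python
import random


def mask_tokens(input_ids, special_tokens_mask, mask_token_id, probability=0.15, seed=None):
    """Randomly mask non-special tokens and return (masked_ids, labels)."""
    rng = random.Random(seed)
    masked, labels = list(input_ids), [-100] * len(input_ids)

    for i, special in enumerate(special_tokens_mask):
        # Positions flagged as special tokens are never masked
        if not special and rng.random() < probability:
            labels[i] = input_ids[i]
            masked[i] = mask_token_id

    return masked, labels


# Hypothetical chunk: special tokens at both ends, two regular tokens in between
chunk = [101, 7592, 2088, 102]
special = [1, 0, 0, 1]

# probability=1.0 masks every eligible position, which makes the result deterministic
masked, labels = mask_tokens(chunk, special, mask_token_id=103, probability=1.0, seed=0)
print(masked)  # [101, 103, 103, 102]
print(labels)  # [-100, 7592, 2088, -100]
```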

Usage Examples

from transformers import AutoTokenizer
from txtai.data.texts import Texts

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Create text data processor for MLM training
text_data = Texts(
    tokenizer=tokenizer,
    columns=("text", None),
    maxlength=512
)

# Prepare training and validation datasets
# train_data and val_data are assumed to be loaded elsewhere;
# each entry should have a "text" column
train_dataset, val_dataset = text_data(train_data, val_data, workers=4)

# For text-pair inputs (e.g., premise/hypothesis pairs), specify two columns
pair_data = Texts(
    tokenizer=tokenizer,
    columns=("premise", "hypothesis"),
    maxlength=512
)

Related Pages

Environment:Neuml_Txtai_Python_Core_Dependencies
