Implementation:Neuml Txtai Texts Data

From Leeroopedia


Knowledge Sources
Domains: NLP, Language Modeling, Training Data
Last Updated: 2026-02-10 01:00 GMT

Overview

Texts is a data processor provided by txtai that tokenizes text datasets for language model training.

Description

The Texts class extends the base Data class to tokenize text datasets as input for language model training. Depending on the configured columns, it tokenizes either single texts or text pairs. After tokenization, the class concatenates all tokenized output and splits it into fixed-size chunks of maxlength tokens, dropping any incomplete final chunk. This is a common approach in causal and masked language model pre-training, where the model processes uniform-length sequences. The class also requests a special tokens mask from the tokenizer so that special tokens can be handled correctly when masking during training.
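The concatenate-then-chunk step described above can be sketched in plain Python. This is a minimal illustration of the idea, not the txtai source: the function name `concat_and_chunk` is hypothetical, and the sketch assumes that an input shorter than `maxlength` yields no chunks, a case the description does not spell out.

```python
from itertools import chain


def concat_and_chunk(inputs, maxlength):
    """Hypothetical helper: join per-example token lists, then cut fixed-size chunks."""
    # Flatten each tokenizer field (input_ids, attention_mask, ...) into one long list
    joined = {key: list(chain.from_iterable(values)) for key, values in inputs.items()}

    # All fields share the same length, so measure any one of them
    total = len(next(iter(joined.values()))) if joined else 0

    # Round down to a multiple of maxlength so the incomplete tail is dropped
    usable = (total // maxlength) * maxlength

    return {
        key: [values[start:start + maxlength] for start in range(0, usable, maxlength)]
        for key, values in joined.items()
    }


# Three tokenized examples with 3 + 2 + 4 = 9 tokens in total
batch = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1]],
}
chunks = concat_and_chunk(batch, maxlength=4)
print(chunks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

With 9 tokens and a chunk size of 4, two chunks are produced and the ninth token is discarded. Chunk boundaries ignore the original example boundaries, so a chunk may contain the end of one text and the start of the next.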

Usage

Use the Texts data processor when preparing data for language model training tasks such as masked language modeling (MLM) or causal language modeling (CLM). Configure it with a tokenizer, column names, and the desired maximum sequence length. The processor handles single text or text-pair inputs and returns uniform-length tokenized chunks ready for training.
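To illustrate why a special tokens mask is useful for MLM, the sketch below selects positions to mask while skipping special tokens. It is a simplified stand-in for what a masking collator does during training, not txtai or Hugging Face code: the function name, the 15% default probability, and the example mask are assumptions made for illustration.

```python
import random


def choose_mask_positions(special_tokens_mask, probability=0.15, seed=0):
    """Hypothetical helper: pick positions to mask, never touching special tokens."""
    rng = random.Random(seed)

    # A value of 1 marks a special token; only positions marked 0 are candidates
    candidates = [i for i, flag in enumerate(special_tokens_mask) if flag == 0]

    # Sample each candidate independently with the given probability
    return [i for i in candidates if rng.random() < probability]


# Illustrative chunk: special tokens at both ends, five ordinary tokens between
special_tokens_mask = [1, 0, 0, 0, 0, 0, 1]

# With probability 1.0 every ordinary token is selected, and the special ones never are
print(choose_mask_positions(special_tokens_mask, probability=1.0))  # [1, 2, 3, 4, 5]
```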

Code Reference

Source Location

  • Repository: Neuml_Txtai
  • File: src/python/txtai/data/texts.py

Signature

class Texts(Data):
    def __init__(self, tokenizer, columns, maxlength): ...
    def process(self, data): ...
    def concat(self, inputs): ...

Import

from txtai.data.texts import Texts

I/O Contract

Inputs

Name | Type | Required | Description
tokenizer | PreTrainedTokenizer | Yes | Hugging Face model tokenizer instance
columns | tuple | No | Tuple of (text_column, optional_second_column) for single or pair tokenization; defaults to ("text", None)
maxlength | int | Yes | Maximum sequence length for each output chunk
data | dict | Yes (process) | Batch of data in column-oriented format containing the configured text column(s)

Outputs

Name | Type | Description
tokenized chunks | dict | Dictionary of tokenized fields (input_ids, attention_mask, special_tokens_mask, etc.) where each value is a list of fixed-length chunks of size maxlength
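As a rough illustration of the input contract, the sketch below turns row-oriented records into the column-oriented batch that process expects and builds the tokenizer arguments for the single-text and text-pair cases. Both helper names are hypothetical, and the logic is an assumption based on the tables above rather than a copy of the txtai implementation.

```python
def to_columns(rows):
    """Hypothetical helper: convert a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def tokenizer_args(batch, columns):
    """Hypothetical helper: pick one or two text columns, falling back to ("text", None)."""
    first, second = columns or ("text", None)
    return (batch[first], batch[second]) if second else (batch[first],)


rows = [
    {"premise": "A dog runs.", "hypothesis": "An animal moves."},
    {"premise": "It is raining.", "hypothesis": "The ground is dry."},
]
batch = to_columns(rows)

# Single-text case: one list of strings; pair case: two parallel lists
single = tokenizer_args({"text": ["first document", "second document"]}, None)
pair = tokenizer_args(batch, ("premise", "hypothesis"))
print(len(single), len(pair))  # 1 2
```

The resulting tuple would then be unpacked into the tokenizer call, for example `tokenizer(*pair, return_special_tokens_mask=True)`, where `return_special_tokens_mask` is a standard Hugging Face tokenizer argument.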

Usage Examples

from transformers import AutoTokenizer
from txtai.data.texts import Texts

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Create text data processor for MLM training
text_data = Texts(
    tokenizer=tokenizer,
    columns=("text", None),
    maxlength=512
)

# Prepare training and validation datasets
# train_data and val_data are assumed to be loaded elsewhere (not shown here);
# each entry in the dataset should have a "text" column
train_dataset, val_dataset = text_data(train_data, val_data, workers=4)

# For text-pair tasks (e.g., NLI), specify two columns
pair_data = Texts(
    tokenizer=tokenizer,
    columns=("premise", "hypothesis"),
    maxlength=512
)
