Implementation:Neuml Txtai Texts Data
| Knowledge Sources | |
|---|---|
| Domains | NLP, Language Modeling, Training Data |
| Last Updated | 2026-02-10 01:00 GMT |
Overview
A concrete tool, provided by txtai, that tokenizes text datasets for language model training.
Description
The Texts class extends the base Data class to tokenize text datasets as input for language model training. It tokenizes either a single text column or a pair of text columns, depending on the configured columns. After tokenization, the class concatenates the tokenized output of the batch and splits it into fixed-size chunks of maxlength tokens, dropping the incomplete remainder at the end. This concatenate-and-chunk approach is standard for causal and masked language model pre-training, where the model processes uniform-length sequences. The class also requests a special tokens mask from the tokenizer so that special tokens can be excluded from masking during training.
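The concatenate-and-chunk step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's code: the helper name `concat_and_chunk` and the toy batch are hypothetical, and the real `concat` method may treat edge cases (for example, a batch shorter than maxlength) differently.

```python
from itertools import chain


def concat_and_chunk(inputs, maxlength):
    """Concatenate per-example token lists and split them into maxlength-sized chunks."""
    # Flatten every field (input_ids, attention_mask, ...) into one long list
    flat = {key: list(chain.from_iterable(values)) for key, values in inputs.items()}

    # All fields share the same total length, so measure the first one
    total = len(next(iter(flat.values())))

    # Round down to a multiple of maxlength so the incomplete tail is dropped
    usable = (total // maxlength) * maxlength

    return {
        key: [values[i : i + maxlength] for i in range(0, usable, maxlength)]
        for key, values in flat.items()
    }


# Toy batch: three "documents" with 3, 4 and 2 token ids (9 tokens in total)
batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1]],
}

chunks = concat_and_chunk(batch, maxlength=4)
print(chunks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]; token 9 is dropped
```

Note that a chunk can span document boundaries: the first chunk here contains tokens from two different inputs. That is the expected trade-off of this approach, which favors uniform sequence lengths over keeping each document intact.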
Usage
Use the Texts data processor when preparing data for language model training tasks such as masked language modeling (MLM) or causal language modeling (CLM). Configure it with a tokenizer, column names, and the desired maximum sequence length. The processor handles single text or text-pair inputs and returns uniform-length tokenized chunks ready for training.
Code Reference
Source Location
- Repository: Neuml_Txtai
- File: src/python/txtai/data/texts.py
Signature
class Texts(Data):
    def __init__(self, tokenizer, columns, maxlength)
    def process(self, data)
    def concat(self, inputs)
Import
from txtai.data.texts import Texts
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| tokenizer | PreTrainedTokenizer | Yes | Hugging Face model tokenizer instance |
| columns | tuple | No | Tuple of (text_column, optional_second_column) for single or pair tokenization; the signature has no default value, so pass None to fall back to ("text", None) |
| maxlength | int | Yes | Maximum sequence length for each output chunk |
| data | dict | Yes (process) | Batch of data in column-oriented format containing the configured text column(s) |
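To make the single-text versus text-pair behavior concrete, the sketch below shows one way the configured columns could be mapped onto tokenizer arguments for a column-oriented batch. The helper `select_text_args` and the sample batches are hypothetical and exist only for illustration. Hugging Face tokenizers accept either one text argument or a text and text-pair argument, so the resulting tuple could be unpacked into a call such as `tokenizer(*args, return_special_tokens_mask=True)`.

```python
def select_text_args(data, columns):
    """Return the positional text arguments to pass to a tokenizer."""
    first, second = columns

    # Two columns configured: tokenize as (text, text_pair)
    if second:
        return (data[first], data[second])

    # One column configured: tokenize as a single text input
    return (data[first],)


# Column-oriented batches: each key maps to a list with one value per row
single = {"text": ["first doc", "second doc"]}
pairs = {"premise": ["A man eats."], "hypothesis": ["Someone is eating."]}

print(select_text_args(single, ("text", None)))
print(select_text_args(pairs, ("premise", "hypothesis")))
```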
Outputs
| Name | Type | Description |
|---|---|---|
| tokenized chunks | dict | Dictionary of tokenized fields (input_ids, attention_mask, special_tokens_mask, etc.) where each value is a list of fixed-length chunks of size maxlength |
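Because chunks are cut from a concatenated token stream, special tokens such as [CLS] and [SEP] can appear anywhere inside a chunk, not only at its edges. The special_tokens_mask field marks those positions with 1, which lets a masking routine skip them. The sketch below illustrates the idea with a hypothetical helper, `maskable_positions`, and a made-up mask; it is not part of txtai.

```python
def maskable_positions(special_tokens_mask):
    """Return the indices that are eligible for masking (mask value 0)."""
    return [i for i, flag in enumerate(special_tokens_mask) if flag == 0]


# One chunk of 8 positions; 1 marks a special token, 0 marks a regular token
mask = [1, 0, 0, 0, 1, 0, 0, 1]

print(maskable_positions(mask))  # [1, 2, 3, 5, 6]
```

In practice, a component such as the Hugging Face DataCollatorForLanguageModeling reads this field so that special tokens are never selected for masking.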
Usage Examples
from transformers import AutoTokenizer
from txtai.data.texts import Texts

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Create text data processor for MLM training
text_data = Texts(
    tokenizer=tokenizer,
    columns=("text", None),
    maxlength=512
)

# Prepare training and validation datasets
# train_data and val_data are assumed to be datasets loaded elsewhere;
# each entry should have a "text" column
train_dataset, val_dataset = text_data(train_data, val_data, workers=4)

# For text-pair inputs (e.g., NLI-style premise/hypothesis data), specify two columns
pair_data = Texts(
    tokenizer=tokenizer,
    columns=("premise", "hypothesis"),
    maxlength=512
)