Implementation:Neuml Txtai QA Data

Knowledge Sources	Neuml_Txtai
Domains	NLP, Question Answering, Training Data
Last Updated	2026-02-10 01:00 GMT

Overview

Concrete tool, provided by txtai, for tokenizing question-answering datasets into model training input.

Description

The Questions class extends the base Data class to tokenize question-answering datasets as input for training extractive QA models. Its main job is mapping character-level answer spans to token-level positions. The tokenizer processes question-context pairs with a configurable stride for sliding-window chunking, so a long context can yield several chunks per record. Each chunk is traced back to its source record through the tokenizer's overflow-to-sample mapping, and start/end token positions are computed for the answer within that chunk. When no answer is present, the CLS token index is used as both the start and end position. The class supports both left-padded and right-padded tokenizers by adjusting the sequence ID lookups and the truncation strategy accordingly.
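The character-to-token mapping described above can be sketched in plain Python. This is a minimal illustration, not txtai's actual code: the function name `char_to_token_span`, the hand-written offsets and sequence IDs, and the choice to fall back to the CLS index when a chunk does not fully contain the answer are all assumptions made for the example.

```python
def char_to_token_span(offsets, sequence_ids, answers, context_id=1, cls_index=0):
    """
    Hypothetical helper: maps a character-level answer to inclusive token indices.

    offsets: list of (start_char, end_char) per token, relative to its own segment
    sequence_ids: None for special tokens, 0/1 for the first/second segment
    answers: {"text": [...], "answer_start": [...]}, only the first answer is used
    """

    # No answer given: label the chunk with the CLS token position
    if not answers["answer_start"]:
        return cls_index, cls_index

    start_char = answers["answer_start"][0]
    end_char = start_char + len(answers["text"][0])

    # Token indices that belong to the context segment
    context = [i for i, seq in enumerate(sequence_ids) if seq == context_id]

    # Assumption: a chunk that does not fully cover the answer is labeled with CLS
    if not context or offsets[context[0]][0] > start_char or offsets[context[-1]][1] < end_char:
        return cls_index, cls_index

    # Context tokens whose character range overlaps the answer's range
    hits = [i for i in context if offsets[i][0] < end_char and offsets[i][1] > start_char]
    return (hits[0], hits[-1]) if hits else (cls_index, cls_index)


# Hand-written stand-in for tokenizer output (not produced by a real tokenizer):
# [CLS] who wrote it ? [SEP] ada wrote the code [SEP]
offsets = [(0, 0), (0, 3), (4, 9), (10, 12), (12, 13), (0, 0),
           (0, 3), (4, 9), (10, 13), (14, 18), (0, 0)]
sequence_ids = [None, 0, 0, 0, 0, None, 1, 1, 1, 1, None]

# Context is "Ada wrote the code"; "the code" starts at character 10
span = char_to_token_span(offsets, sequence_ids, {"text": ["the code"], "answer_start": [10]})
empty = char_to_token_span(offsets, sequence_ids, {"text": [], "answer_start": []})
```

Here `span` covers the tokens for "the" and "code", while `empty` points at the CLS token for both positions.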

Usage

Use the Questions data processor when training extractive question-answering models (e.g., fine-tuning BERT/RoBERTa for SQuAD-style tasks). It is configured with a tokenizer, column names for question/context/answers, maximum sequence length, and stride. The processor maps raw QA datasets into tokenized training-ready format with start_positions and end_positions labels.
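The effect of maxlength and stride on a long context can be illustrated with a simplified sketch. It assumes stride means the number of tokens shared by consecutive chunks, as in Hugging Face tokenizers, and it ignores the question and special tokens that a real tokenizer also fits into each window. The helper names `sliding_windows` and `chunk_batch` are hypothetical.

```python
def sliding_windows(tokens, window, stride):
    """Splits tokens into chunks of at most `window` items, overlapping by `stride` items."""

    if not 0 <= stride < window:
        raise ValueError("stride must be non-negative and smaller than window")

    step = window - stride
    chunks, start = [], 0
    while True:
        chunks.append(tokens[start:start + window])

        # Stop once the current chunk reaches the end of the sequence
        if start + window >= len(tokens):
            break
        start += step

    return chunks


def chunk_batch(samples, window, stride):
    """Chunks every sample and records which sample each chunk came from."""

    chunks, mapping = [], []
    for index, tokens in enumerate(samples):
        for chunk in sliding_windows(tokens, window, stride):
            chunks.append(chunk)
            # Plays the role of the tokenizer's overflow-to-sample mapping
            mapping.append(index)

    return chunks, mapping


# Two samples: one longer than the window, one shorter
samples = [list(range(10)), list(range(3))]
chunks, mapping = chunk_batch(samples, window=4, stride=2)
```

With these inputs the first sample yields four overlapping chunks and the second yields one, so `mapping` repeats index 0 four times before index 1. The answer labels for each chunk are then looked up from the source record that the mapping points to.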

Code Reference

Source Location

Repository: Neuml_Txtai
File: src/python/txtai/data/questions.py

Signature

class Questions(Data):
    def __init__(self, tokenizer, columns, maxlength, stride)
    def process(self, data)
    def tokenize(self, data)
    def answers(self, data, index)

Import

from txtai.data.questions import Questions

I/O Contract

Inputs

Name	Type	Required	Description
tokenizer	PreTrainedTokenizer	Yes	Hugging Face model tokenizer instance
columns	tuple	No	Tuple of (question, context, answers) column names; defaults to ("question", "context", "answers")
maxlength	int	Yes	Maximum sequence length for tokenization
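stride	int	Yes	Stride passed to the tokenizer for sliding-window chunking; with Hugging Face tokenizers this is the number of tokens shared by consecutive chunks when a long context is split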
data	dict	Yes (process)	Batch of data in column-oriented format with question, context, and answer columns

Outputs

Name	Type	Description
tokenized	dict	Tokenized output containing input_ids, attention_mask, start_positions, and end_positions lists suitable for QA model training

Usage Examples

from transformers import AutoTokenizer
from txtai.data.questions import Questions

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Create QA data processor
qa_data = Questions(
    tokenizer=tokenizer,
    columns=("question", "context", "answers"),
    maxlength=384,
    stride=128
)

# Prepare training and validation datasets
# train_data and val_data are assumed to be loaded beforehand (not shown here)
# Each dataset should have "question", "context", and "answers" columns
# answers format: {"text": ["answer text"], "answer_start": [char_offset]}
train_dataset, val_dataset = qa_data(train_data, val_data, workers=4)

Related Pages

Environment:Neuml_Txtai_Python_Core_Dependencies
