Implementation:Neuml Txtai QA Data

From Leeroopedia


Knowledge Sources
Domains NLP, Question Answering, Training Data
Last Updated 2026-02-10 01:00 GMT

Overview

A concrete tool, provided by txtai, for tokenizing question-answering datasets for model training.

Description

The Questions class extends the base Data class to tokenize question-answering datasets as input for training extractive QA models. Its main job is mapping character-level answer spans to token-level positions. The tokenizer processes question-context pairs with a configurable stride for sliding-window chunking, and a sample mapping ties each overflow chunk back to the record it came from so that start and end token positions can be computed for every chunk. When no answer is present, the CLS token index is used as the answer position. Both left-padded and right-padded tokenizers are supported: the class adjusts the sequence ID lookup and the truncation strategy to match the padding side.
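The character-to-token mapping described above can be illustrated with a minimal, self-contained sketch. It is not the txtai source: the function name answer_token_span, its parameters, and the hand-written offsets are hypothetical, and the fallback to CLS when the answer lies outside the chunk is an assumption borrowed from the common SQuAD preprocessing recipe rather than something this page states.

```python
def answer_token_span(offsets, sequence_ids, start_char, end_char, cls_index=0, context_id=1):
    """Map a character-level answer span to (start_token, end_token).

    offsets: (char_start, char_end) per token, relative to the token's own sequence
    sequence_ids: sequence id per token (None for special tokens and padding)
    context_id: sequence id of the context (hypothetical parameter, 1 when the
                question is tokenized first)
    """
    # No answer: label both positions with the CLS token index
    if start_char is None:
        return cls_index, cls_index

    # Token indexes that belong to the context sequence
    context = [i for i, sid in enumerate(sequence_ids) if sid == context_id]
    first, last = context[0], context[-1]

    # Assumption: an answer not fully inside this chunk is also labeled with CLS
    if offsets[first][0] > start_char or offsets[last][1] < end_char:
        return cls_index, cls_index

    # Walk forward to the last token that starts at or before start_char
    start = first
    while start <= last and offsets[start][0] <= start_char:
        start += 1
    start -= 1

    # Walk backward to the first token that ends at or after end_char
    end = last
    while end >= first and offsets[end][1] >= end_char:
        end -= 1
    end += 1

    return start, end


# Toy encoding of: [CLS] who ? [SEP] paris is nice [SEP]
# Context string is "Paris is nice"; offsets are hand-written for illustration
offsets = [(0, 0), (0, 3), (3, 4), (0, 0), (0, 5), (6, 8), (9, 13), (0, 0)]
sequence_ids = [None, 0, 0, None, 1, 1, 1, None]

print(answer_token_span(offsets, sequence_ids, 0, 5))        # (4, 4) -> "paris"
print(answer_token_span(offsets, sequence_ids, 6, 13))       # (5, 6) -> "is nice"
print(answer_token_span(offsets, sequence_ids, None, None))  # (0, 0) -> CLS
```

For a tokenizer where the context is encoded as the first sequence, the same sketch would be called with context_id=0, which is one way to picture the sequence ID adjustment mentioned above.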

Usage

Use the Questions data processor when training extractive question-answering models (e.g., fine-tuning BERT/RoBERTa for SQuAD-style tasks). It is configured with a tokenizer, column names for question/context/answers, maximum sequence length, and stride. The processor maps raw QA datasets into tokenized training-ready format with start_positions and end_positions labels.
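To make the interaction between maximum sequence length and stride concrete, the following simplified sketch splits token lists into overlapping windows and records which input sample each window came from. It is an illustration only: chunk_batch is a hypothetical helper, stride is treated as the number of overlapping tokens (the Hugging Face tokenizer convention), and a real tokenizer also reserves room in every window for the question and special tokens.

```python
def chunk_batch(batch, maxlength, stride):
    """Split each token list into overlapping windows.

    Returns (chunks, sample_mapping), where sample_mapping[i] is the index
    of the input sample that produced chunks[i].
    """
    if not 0 <= stride < maxlength:
        raise ValueError("stride must be >= 0 and smaller than maxlength")

    # Each new window starts (maxlength - stride) tokens after the previous one
    step = maxlength - stride

    chunks, sample_mapping = [], []
    for sample, tokens in enumerate(batch):
        start = 0
        while True:
            chunks.append(tokens[start:start + maxlength])
            sample_mapping.append(sample)

            # Stop once the current window reaches the end of the sample
            if start + maxlength >= len(tokens):
                break
            start += step

    return chunks, sample_mapping


# Two samples: one longer than maxlength, one shorter
batch = [list(range(10)), list(range(3))]
chunks, mapping = chunk_batch(batch, maxlength=4, stride=2)

print(chunks)   # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9], [0, 1, 2]]
print(mapping)  # [0, 0, 0, 0, 1]
```

The mapping list plays the role of the sample mapping: a single long record yields several training rows, and each row needs its own start and end labels.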

Code Reference

Source Location

  • Repository: Neuml_Txtai
  • File: src/python/txtai/data/questions.py

Signature

class Questions(Data):
    def __init__(self, tokenizer, columns, maxlength, stride)
    def process(self, data)
    def tokenize(self, data)
    def answers(self, data, index)

Import

from txtai.data.questions import Questions

I/O Contract

Inputs

Name | Type | Required | Description
tokenizer | PreTrainedTokenizer | Yes | Hugging Face model tokenizer instance
columns | tuple | No | Tuple of (question, context, answers) column names; defaults to ("question", "context", "answers")
maxlength | int | Yes | Maximum sequence length for tokenization
stride | int | Yes | Stride for the sliding window used when splitting long contexts (for Hugging Face tokenizers, the number of tokens that consecutive chunks overlap)
data | dict | Yes (for process) | Batch of data in column-oriented format with question, context, and answer columns
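The column-oriented batch expected by process can be pictured with a small sketch that pivots row-oriented records into one list per column. The helper to_columns and the sample records are hypothetical, and representing a missing answer with empty lists is an assumption that follows the SQuAD v2 convention.

```python
def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Hypothetical records in the answers format used on this page
rows = [
    {
        "question": "Who wrote it?",
        "context": "Ann wrote it.",
        "answers": {"text": ["Ann"], "answer_start": [0]},
    },
    {
        "question": "Where was it written?",
        "context": "No location is given.",
        # Assumption: no answer is represented with empty lists (SQuAD v2 style)
        "answers": {"text": [], "answer_start": []},
    },
]

batch = to_columns(rows)
print(batch["question"])  # ['Who wrote it?', 'Where was it written?']
print(sorted(batch))      # ['answers', 'context', 'question']
```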

Outputs

Name | Type | Description
tokenized | dict | Tokenized output containing input_ids, attention_mask, start_positions, and end_positions lists suitable for QA model training

Usage Examples

from transformers import AutoTokenizer
from txtai.data.questions import Questions

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Create QA data processor
qa_data = Questions(
    tokenizer=tokenizer,
    columns=("question", "context", "answers"),
    maxlength=384,
    stride=128
)

# Prepare training and validation datasets.
# train_data and val_data are assumed to be loaded beforehand (for example, as
# Hugging Face datasets) with "question", "context", and "answers" columns.
# answers format: {"text": ["answer text"], "answer_start": [char_offset]}
train_dataset, val_dataset = qa_data(train_data, val_data, workers=4)
