Implementation:Neuml Txtai QA Data
| Knowledge Sources | |
|---|---|
| Domains | NLP, Question Answering, Training Data |
| Last Updated | 2026-02-10 01:00 GMT |
Overview
Concrete tool, provided by txtai, for tokenizing question-answering datasets for model training.
Description
The Questions class extends the base Data class to tokenize question-answering datasets as input for training extractive QA models. It handles the complex mapping between character-level answer spans and token-level positions. The tokenizer processes question-context pairs with configurable stride for sliding window chunking, handles overflow tokens via sample mapping, and computes start/end token positions for each answer. When no answer is present, the CLS token index is used as the answer position. The class correctly handles both left-padded and right-padded tokenizers by adjusting sequence ID lookups and truncation strategy.
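The character-to-token mapping described above can be sketched in plain Python. This is a minimal illustration and not the txtai source: `answer_token_span` and its parameters are hypothetical names, the offsets and sequence IDs are hand-written stand-ins for what a fast tokenizer would return, and the extra guard for answers that fall outside the current chunk is an assumption borrowed from common QA preprocessing rather than something the description states. The `context_id` parameter stands in for the padding-side adjustment: the context is the second sequence (ID 1) when the question is encoded first, and ID 0 when the order is swapped.

```python
def answer_token_span(offsets, sequence_ids, start_char, end_char, cls_index=0, context_id=1):
    """Return inclusive (start_token, end_token) for a character-level answer span.

    offsets      - (char_start, char_end) per token, relative to its own text
    sequence_ids - None for special tokens, 0/1 for the first/second input text
    start_char   - answer start offset in the context, or None when unanswerable
    """
    # No answer: label the example with the CLS token position
    if start_char is None:
        return cls_index, cls_index

    # Tokens that belong to the context segment of this chunk
    context = [i for i, sid in enumerate(sequence_ids) if sid == context_id]
    if not context:
        return cls_index, cls_index
    first, last = context[0], context[-1]

    # Assumption: an answer not fully inside this chunk is also labeled with CLS
    if offsets[first][0] > start_char or offsets[last][1] < end_char:
        return cls_index, cls_index

    # Last context token that starts at or before the answer start
    start = first
    while start < last and offsets[start + 1][0] <= start_char:
        start += 1

    # First context token that ends at or after the answer end
    end = last
    while end > first and offsets[end - 1][1] >= end_char:
        end -= 1

    return start, end


# Hand-written encoding of: [CLS] who wrote it ? [SEP] ada wrote the code [SEP]
offsets = [(0, 0), (0, 3), (4, 9), (10, 12), (12, 13), (0, 0),
           (0, 3), (4, 9), (10, 13), (14, 18), (0, 0)]
sequence_ids = [None, 0, 0, 0, 0, None, 1, 1, 1, 1, None]

# "the code" occupies characters 10..18 of the context
span = answer_token_span(offsets, sequence_ids, 10, 18)  # (8, 9)
```

Restricting both scans to the context tokens keeps the search from drifting into special or padding tokens, whose offsets are typically (0, 0).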
Usage
Use the Questions data processor when training extractive question-answering models (e.g., fine-tuning BERT/RoBERTa for SQuAD-style tasks). It is configured with a tokenizer, column names for question/context/answers, maximum sequence length, and stride. The processor maps raw QA datasets into tokenized training-ready format with start_positions and end_positions labels.
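To make the roles of maximum sequence length and stride concrete, the following sketch shows sliding-window chunking with a sample mapping. It is a simplification under stated assumptions: `sliding_windows` and `chunk_batch` are hypothetical helpers, `window` stands for the room left for context tokens (a real tokenizer also reserves space for the question and special tokens), and `stride` is treated as the number of tokens shared by consecutive chunks, which is how Hugging Face tokenizers interpret it.

```python
def sliding_windows(tokens, window, stride):
    """Split tokens into chunks of at most `window` tokens that overlap by `stride`."""
    if not 0 <= stride < window:
        raise ValueError("stride must be non-negative and smaller than window")

    step = window - stride
    chunks, start = [], 0
    while True:
        chunks.append(tokens[start:start + window])
        # Stop once the current chunk reaches the end of the input
        if start + window >= len(tokens):
            break
        start += step
    return chunks


def chunk_batch(batch, window, stride):
    """Chunk every example and record which example each chunk came from."""
    chunks, sample_mapping = [], []
    for index, tokens in enumerate(batch):
        for chunk in sliding_windows(tokens, window, stride):
            chunks.append(chunk)
            sample_mapping.append(index)
    return chunks, sample_mapping


# Two toy "contexts": one long enough to overflow, one short
batch = [list(range(10)), list(range(3))]
chunks, sample_mapping = chunk_batch(batch, window=4, stride=2)
```

The sample mapping is what lets each chunk look up the answer of the example it was cut from, since one long context produces several training features.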
Code Reference
Source Location
- Repository: Neuml_Txtai
- File: src/python/txtai/data/questions.py
Signature
class Questions(Data):
    def __init__(self, tokenizer, columns, maxlength, stride)
    def process(self, data)
    def tokenize(self, data)
    def answers(self, data, index)
Import
from txtai.data.questions import Questions
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| tokenizer | PreTrainedTokenizer | Yes | Hugging Face model tokenizer instance |
| columns | tuple | No | Tuple of (question, context, answers) column names; pass None to use the default ("question", "context", "answers") |
| maxlength | int | Yes | Maximum sequence length for tokenization |
| stride | int | Yes | Stride for the sliding window used to split long contexts into chunks; in Hugging Face tokenizers this is the number of tokens consecutive chunks share |
| data | dict | Yes (process) | Batch of data in column-oriented format with question, context, and answer columns |
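As an illustration of the column-oriented batch format expected by process, the sketch below converts row-oriented records into a dict of columns. The `to_columns` helper and the sample record are hypothetical and only show the shape of the data; the answers layout follows the `{"text": [...], "answer_start": [...]}` format used in the usage example.

```python
def to_columns(rows):
    """Re-orient a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Illustrative records; an empty answers entry marks an unanswerable question
rows = [
    {
        "question": "Who wrote the code?",
        "context": "Ada wrote the code",
        "answers": {"text": ["Ada"], "answer_start": [0]},
    },
    {
        "question": "When was it written?",
        "context": "Ada wrote the code",
        "answers": {"text": [], "answer_start": []},
    },
]

batch = to_columns(rows)
```

Each column list has one entry per example, so index i across "question", "context", and "answers" describes the same record.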
Outputs
| Name | Type | Description |
|---|---|---|
| tokenized | dict | Tokenized output containing input_ids, attention_mask, start_positions, and end_positions lists suitable for QA model training |
Usage Examples
from transformers import AutoTokenizer
from txtai.data.questions import Questions

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Create QA data processor
qa_data = Questions(
    tokenizer=tokenizer,
    columns=("question", "context", "answers"),
    maxlength=384,
    stride=128,
)

# Prepare training and validation datasets
# train_data and val_data are placeholders for datasets loaded elsewhere
# Each dataset should have "question", "context", and "answers" columns
# answers format: {"text": ["answer text"], "answer_start": [char_offset]}
train_dataset, val_dataset = qa_data(train_data, val_data, workers=4)