Implementation:Huggingface Optimum QuestionAnsweringProcessing
| Knowledge Sources | |
|---|---|
| Domains | Preprocessing, NLP |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
A concrete tool, provided by the Hugging Face Optimum library, for preprocessing question answering datasets. It covers tokenizer padding, truncation, and stride handling.
Description
QuestionAnsweringProcessing is a TaskProcessor subclass for extractive QA tasks. It tokenizes question-context pairs in an order that follows the tokenizer's padding side (question first or context first), sets a stride automatically for long contexts, and applies the matching truncation. The default dataset is SQuAD v2 (squad_v2).
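The padding-side logic described above can be sketched in plain Python. This is a minimal illustration, not the Optimum source: the helper name arrange_qa_pair is hypothetical, and the mapping from padding side to argument order is an assumption modeled on the convention used in the Hugging Face question answering example scripts, where the context is always the sequence that gets truncated.

```python
from typing import Dict


def arrange_qa_pair(question: str, context: str, padding_side: str) -> Dict[str, str]:
    """Choose argument order and truncation strategy from the padding side.

    Hypothetical helper for illustration; the real processor may differ.
    """
    if padding_side not in ("left", "right"):
        raise ValueError(f"unexpected padding side: {padding_side!r}")

    # Assumption: right padding puts the question first, so only the
    # second sequence (the context) is truncated, and vice versa.
    pad_on_right = padding_side == "right"
    return {
        "text": question if pad_on_right else context,
        "text_pair": context if pad_on_right else question,
        "truncation": "only_second" if pad_on_right else "only_first",
    }


if __name__ == "__main__":
    print(arrange_qa_pair("Who wrote Hamlet?", "Hamlet is a tragedy by Shakespeare.", "right"))
    print(arrange_qa_pair("Who wrote Hamlet?", "Hamlet is a tragedy by Shakespeare.", "left"))
```

The returned keys match the text, text_pair, and truncation arguments that a PreTrainedTokenizerBase call accepts, so the dictionary could be unpacked into a tokenizer call under the same assumption.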
Usage
Use this processor when benchmarking or evaluating extractive question answering models. It handles the complexities of QA tokenization including padding side detection and stride configuration.
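To make the stride idea concrete, the sketch below splits a long token sequence into overlapping windows. It is a simplified model, assuming that stride means the number of tokens shared by consecutive windows (as in Hugging Face tokenizers); it ignores the question tokens and special tokens that a real tokenizer also budgets for, and the function name split_with_stride is hypothetical.

```python
from typing import List


def split_with_stride(tokens: List[int], max_length: int, stride: int) -> List[List[int]]:
    """Split tokens into windows of at most max_length that overlap by stride tokens."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")

    step = max_length - stride  # how far each new window advances
    windows: List[List[int]] = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_length])
        if start + max_length >= len(tokens):
            break  # this window already reaches the end of the sequence
        start += step
    return windows


if __name__ == "__main__":
    # Ten tokens, windows of four, two tokens of overlap between neighbors.
    print(split_with_stride(list(range(10)), max_length=4, stride=2))
    # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

The overlap means an answer span that straddles a window boundary still appears whole in at least one window, as long as the span is no longer than the stride.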
Code Reference
Source Location
- Repository: Huggingface_Optimum
- File: optimum/utils/preprocessing/question_answering.py
- Lines: 1-93
Signature
class QuestionAnsweringProcessing(TaskProcessor):
    ACCEPTED_PREPROCESSOR_CLASSES = (PreTrainedTokenizerBase,)
    DEFAULT_DATASET_ARGS = "squad_v2"
    DEFAULT_DATASET_DATA_KEYS = {"question": "question", "context": "context"}
    ALLOWED_DATA_KEY_NAMES = {"question", "context"}
    DEFAULT_REF_KEYS = ["answers"]
Import
from optimum.utils.preprocessing.question_answering import QuestionAnsweringProcessing
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | PretrainedConfig | Yes | The model configuration |
| preprocessor | PreTrainedTokenizerBase | Yes | Tokenizer for the model |
| preprocessor_kwargs | Dict[str, Any] | No | Override defaults (max_length, stride, padding, truncation) |
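The override behavior of preprocessor_kwargs can be pictured as a dictionary merge in which caller-supplied values win over defaults. The sketch below is only an illustration of that pattern: the function name resolve_tokenizer_kwargs is hypothetical, and the default values shown are placeholders chosen for the example, not the values Optimum computes.

```python
from typing import Any, Dict, Optional

# Placeholder defaults for illustration only; not taken from the Optimum source.
ILLUSTRATIVE_DEFAULTS: Dict[str, Any] = {
    "max_length": 384,
    "stride": 128,
    "padding": "max_length",
    "truncation": True,
}


def resolve_tokenizer_kwargs(preprocessor_kwargs: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return the defaults with any caller-supplied overrides applied on top."""
    resolved = dict(ILLUSTRATIVE_DEFAULTS)     # copy so the defaults stay untouched
    resolved.update(preprocessor_kwargs or {})  # caller values take precedence
    return resolved


if __name__ == "__main__":
    print(resolve_tokenizer_kwargs())
    print(resolve_tokenizer_kwargs({"max_length": 256, "stride": 64}))
```

With the real class, the equivalent step would be passing a dictionary such as {"max_length": 256, "stride": 64} as the third constructor argument.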
Outputs
| Name | Type | Description |
|---|---|---|
| dataset_processing_func output | Dict | Tokenized inputs: input_ids, attention_mask, and token_type_ids when the tokenizer returns them |
Usage Examples
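from transformers import AutoConfig, AutoTokenizer
from optimum.utils.preprocessing.question_answering import QuestionAnsweringProcessing

# Load the configuration and tokenizer of an extractive QA checkpoint.
config = AutoConfig.from_pretrained("deepset/roberta-base-squad2")
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

# Build the processor; preprocessor_kwargs is optional and omitted here.
processor = QuestionAnsweringProcessing(config, tokenizer)

# Load 100 samples from the smallest split of the default dataset (squad_v2).
dataset = processor.load_default_dataset(load_smallest_split=True, num_samples=100)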