Implementation:Huggingface Optimum TextClassificationProcessing
| Knowledge Sources | |
|---|---|
| Domains | Preprocessing, NLP |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
A concrete tool, provided by the Huggingface Optimum library, for preprocessing text classification datasets with single-sentence or sentence-pair tokenization.
Description
TextClassificationProcessing is a TaskProcessor subclass for text classification (and NLI) tasks. It supports both single-sentence and sentence-pair classification by mapping "primary" and optional "secondary" data keys to the tokenizer's text and text_pair arguments. The default dataset is GLUE SST-2.
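The key-to-argument mapping described above can be sketched without the library. This is a minimal illustration, not Optimum's implementation: `build_tokenizer_args` is a hypothetical helper, and the column names in the two examples are assumed.

```python
from typing import Any, Dict


def build_tokenizer_args(example: Dict[str, Any], data_keys: Dict[str, str]) -> Dict[str, Any]:
    """Map "primary"/"secondary" data keys to tokenizer-style arguments.

    Hypothetical helper for illustration; not part of Optimum.
    """
    # The primary key is required and feeds the `text` argument.
    args = {"text": example[data_keys["primary"]]}
    # The secondary key is optional and feeds `text_pair` (sentence-pair tasks such as NLI).
    secondary = data_keys.get("secondary")
    if secondary is not None:
        args["text_pair"] = example[secondary]
    return args


# Single-sentence classification: only `text` is produced.
single = build_tokenizer_args({"sentence": "A fine film.", "label": 1}, {"primary": "sentence"})

# Sentence-pair classification: both `text` and `text_pair` are produced.
pair = build_tokenizer_args(
    {"premise": "A man is eating.", "hypothesis": "Someone eats.", "label": 0},
    {"primary": "premise", "secondary": "hypothesis"},
)
```

The resulting dictionary could then be unpacked into a tokenizer call together with the preprocessing options, for example `tokenizer(**args, truncation=True)`.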
Usage
Use this processor when benchmarking text classification models. When data keys are not provided explicitly, it attempts to infer the primary (sentence) and secondary (hypothesis) columns from the dataset's column names.
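The column inference mentioned above might look roughly like the following. This is a sketch under stated assumptions rather than the library's code: `guess_data_keys` is a hypothetical name, and the candidate substrings are assumed and may differ from what Optimum actually checks.

```python
from typing import Dict, List, Optional, Sequence

# Assumed candidate substrings; the library's actual lists may differ.
PRIMARY_CANDIDATES = ("sentence", "text", "premise")
SECONDARY_CANDIDATES = ("hypothesis",)


def guess_data_keys(column_names: List[str]) -> Optional[Dict[str, str]]:
    """Return a data-key mapping, or None when no primary column can be found."""

    def first_match(candidates: Sequence[str], exclude: Optional[str] = None) -> Optional[str]:
        # Return the first column whose name contains any candidate substring.
        for name in column_names:
            if name != exclude and any(c in name.lower() for c in candidates):
                return name
        return None

    primary = first_match(PRIMARY_CANDIDATES)
    if primary is None:
        return None

    keys = {"primary": primary}
    # A secondary column is optional; it is only set for pair-style datasets.
    secondary = first_match(SECONDARY_CANDIDATES, exclude=primary)
    if secondary is not None:
        keys["secondary"] = secondary
    return keys
```

With this heuristic, columns such as `["sentence", "label", "idx"]` yield only a primary key, while `["premise", "hypothesis", "label"]` yield both. Passing the data keys explicitly remains the safer choice when column names are unusual.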
Code Reference
Source Location
- Repository: Huggingface_Optimum
- File: optimum/utils/preprocessing/text_classification.py
- Lines: 1-117
Signature
class TextClassificationProcessing(TaskProcessor):
    ACCEPTED_PREPROCESSOR_CLASSES = (PreTrainedTokenizerBase,)
    DEFAULT_DATASET_ARGS = {"path": "glue", "name": "sst2"}
    DEFAULT_DATASET_DATA_KEYS = {"primary": "sentence"}
    ALLOWED_DATA_KEY_NAMES = {"primary", "secondary"}
    DEFAULT_REF_KEYS = ["label"]
Import
from optimum.utils.preprocessing.text_classification import TextClassificationProcessing
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | PretrainedConfig | Yes | The model configuration |
| preprocessor | PreTrainedTokenizerBase | Yes | Tokenizer for the model |
| preprocessor_kwargs | Dict[str, Any] | No | Override defaults (padding, truncation, max_length) |
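The way `preprocessor_kwargs` overrides the defaults can be pictured as a simple dictionary merge. The sketch below is illustrative only: `resolve_preprocessor_kwargs` is a hypothetical helper, and the default values shown are placeholders, not values taken from the library.

```python
from typing import Any, Dict, Optional

# Placeholder defaults for illustration; not taken from Optimum.
ASSUMED_DEFAULTS: Dict[str, Any] = {
    "padding": "max_length",
    "truncation": True,
    "max_length": 128,
}


def resolve_preprocessor_kwargs(overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge user-supplied overrides on top of the assumed defaults."""
    # Copy first so the module-level defaults are never mutated.
    merged = dict(ASSUMED_DEFAULTS)
    if overrides:
        merged.update(overrides)
    return merged


# Only max_length changes; padding and truncation keep their default values.
kwargs = resolve_preprocessor_kwargs({"max_length": 64})
```

Keys that the caller does not mention keep their default values, so a partial override such as `{"max_length": 64}` is enough to shorten sequences.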
Outputs
| Name | Type | Description |
|---|---|---|
| dataset_processing_func output | Dict | Tokenized inputs with input_ids, attention_mask |
Usage Examples
from transformers import AutoConfig, AutoTokenizer
from optimum.utils.preprocessing.text_classification import TextClassificationProcessing

# Load the model configuration and its matching tokenizer.
config = AutoConfig.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Build the processor, then load 100 samples from the smallest split of the default dataset (GLUE SST-2).
processor = TextClassificationProcessing(config, tokenizer)
dataset = processor.load_default_dataset(load_smallest_split=True, num_samples=100)