Implementation:Huggingface Optimum TextClassificationProcessing

Knowledge Sources	Huggingface_Optimum
Domains	Preprocessing, NLP
Last Updated	2026-02-15 00:00 GMT

Overview

Concrete tool, provided by the Huggingface Optimum library, for preprocessing text classification datasets by tokenizing single sentences or sentence pairs.

Description

TextClassificationProcessing is a TaskProcessor subclass for text classification (and NLI) tasks. It supports both single-sentence and sentence-pair classification by mapping "primary" and optional "secondary" data keys to the tokenizer's text and text_pair arguments. The default dataset is GLUE SST-2.
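The key-to-argument mapping described above can be sketched as follows. This is an illustrative sketch, not the library's code: `tokenize_example` and `fake_tokenizer` are hypothetical names, and the stand-in tokenizer splits on whitespace so the example runs without downloading a model. A real `PreTrainedTokenizerBase` accepts the same `text` and `text_pair` keyword arguments.

```python
from typing import Any, Callable, Dict, Optional


def tokenize_example(
    example: Dict[str, Any],
    data_keys: Dict[str, str],
    tokenizer: Callable[..., Dict[str, Any]],
    **tokenizer_kwargs: Any,
) -> Dict[str, Any]:
    """Route the "primary"/"secondary" columns to text/text_pair (hypothetical helper)."""
    text = example[data_keys["primary"]]
    secondary = data_keys.get("secondary")
    # text_pair stays None for single-sentence tasks such as SST-2.
    text_pair = example[secondary] if secondary is not None else None
    return tokenizer(text=text, text_pair=text_pair, **tokenizer_kwargs)


def fake_tokenizer(text: str, text_pair: Optional[str] = None, **kwargs: Any) -> Dict[str, Any]:
    """Hypothetical stand-in: whitespace tokens instead of vocabulary ids."""
    tokens = text.split()
    if text_pair is not None:
        tokens = tokens + ["[SEP]"] + text_pair.split()
    return {"input_ids": tokens, "attention_mask": [1] * len(tokens)}


# Single sentence: only the "primary" key is set.
single = tokenize_example(
    {"sentence": "a fine film", "label": 1}, {"primary": "sentence"}, fake_tokenizer
)

# Sentence pair (NLI style): both keys are set.
pair = tokenize_example(
    {"premise": "it rains", "hypothesis": "it is wet", "label": 0},
    {"primary": "premise", "secondary": "hypothesis"},
    fake_tokenizer,
)
```

Passing `text_pair=None` for the single-sentence case keeps one code path for both task shapes, which is why the secondary key can remain optional.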

Usage

Use this processor when benchmarking text classification models. When data keys are not provided explicitly, it infers the primary (sentence) and secondary (hypothesis) columns from the dataset's column names.
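One plausible form of the column-name inference is a substring match against a short list of candidate names. The sketch below is an assumption about the shape of such a heuristic, not a copy of the library's implementation: `guess_data_keys`, `_first_match`, and both candidate tuples are hypothetical.

```python
from typing import Dict, List, Optional, Sequence

# Assumed candidate substrings; the real library may use a different list.
PRIMARY_CANDIDATES = ("sentence", "text", "premise")
SECONDARY_CANDIDATES = ("hypothesis",)


def _first_match(column_names: List[str], candidates: Sequence[str]) -> Optional[str]:
    """Return the first column whose name contains any candidate substring."""
    for name in column_names:
        if any(candidate in name for candidate in candidates):
            return name
    return None


def guess_data_keys(column_names: List[str]) -> Optional[Dict[str, str]]:
    """Infer data keys from column names, or None if no primary column is found."""
    primary = _first_match(column_names, PRIMARY_CANDIDATES)
    if primary is None:
        return None
    keys = {"primary": primary}
    secondary = _first_match(column_names, SECONDARY_CANDIDATES)
    if secondary is not None:
        keys["secondary"] = secondary
    return keys


sst2_keys = guess_data_keys(["sentence", "label", "idx"])
nli_keys = guess_data_keys(["premise", "hypothesis", "label"])
no_keys = guess_data_keys(["question", "label"])
```

Because the match is on substrings, a column such as `context` would also match the `text` candidate in this sketch, so passing data keys explicitly is the safer choice for datasets with unusual column names.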

Code Reference

Source Location

Repository: Huggingface_Optimum
File: optimum/utils/preprocessing/text_classification.py
Lines: 1-117

Signature

class TextClassificationProcessing(TaskProcessor):
    ACCEPTED_PREPROCESSOR_CLASSES = (PreTrainedTokenizerBase,)
    DEFAULT_DATASET_ARGS = {"path": "glue", "name": "sst2"}
    DEFAULT_DATASET_DATA_KEYS = {"primary": "sentence"}
    ALLOWED_DATA_KEY_NAMES = {"primary", "secondary"}
    DEFAULT_REF_KEYS = ["label"]

Import

from optimum.utils.preprocessing.text_classification import TextClassificationProcessing

I/O Contract

Inputs

Name	Type	Required	Description
config	PretrainedConfig	Yes	The model configuration
preprocessor	PreTrainedTokenizerBase	Yes	Tokenizer for the model
preprocessor_kwargs	Dict[str, Any]	No	Override defaults (padding, truncation, max_length)
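The override behaviour of preprocessor_kwargs in the table above can be pictured as a dictionary merge in which user-supplied values take precedence. This is a minimal sketch under stated assumptions: `resolve_tokenizer_kwargs` is a hypothetical helper, and the default values shown (including the 512-token fallback) are illustrative rather than taken from the library.

```python
from typing import Any, Dict, Optional


def resolve_tokenizer_kwargs(
    preprocessor_kwargs: Optional[Dict[str, Any]] = None,
    model_max_length: int = 512,  # assumed fallback, not the library's value
) -> Dict[str, Any]:
    """Merge assumed defaults with user overrides; overrides win on conflict."""
    defaults = {
        "padding": "max_length",
        "truncation": True,
        "max_length": model_max_length,
    }
    return {**defaults, **(preprocessor_kwargs or {})}


default_kwargs = resolve_tokenizer_kwargs()
short_kwargs = resolve_tokenizer_kwargs({"max_length": 128})
```

Keys that the caller does not mention keep their default, so overriding only `max_length` leaves padding and truncation unchanged.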

Outputs

Name	Type	Description
dataset_processing_func output	Dict	Tokenized inputs with input_ids, attention_mask

Usage Examples

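from transformers import AutoConfig, AutoTokenizer
from optimum.utils.preprocessing.text_classification import TextClassificationProcessing

# Load the model configuration and its matching tokenizer
# (both are downloaded from the Hugging Face Hub on first use).
config = AutoConfig.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Build the processor, then load a small sample of the default dataset (GLUE SST-2).
processor = TextClassificationProcessing(config, tokenizer)
# load_smallest_split selects the smallest split; num_samples caps the number of examples.
dataset = processor.load_default_dataset(load_smallest_split=True, num_samples=100)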

Related Pages

Environment:Huggingface_Optimum_Python_Core_Dependencies
