Implementation:Huggingface Optimum TextClassificationProcessing

From Leeroopedia
Knowledge Sources
Domains Preprocessing, NLP
Last Updated 2026-02-15 00:00 GMT

Overview

A concrete tool from the Huggingface Optimum library for preprocessing text classification datasets, tokenizing either single sentences or sentence pairs.

Description

TextClassificationProcessing is a TaskProcessor subclass for text classification (and NLI) tasks. It supports both single-sentence and sentence-pair classification by mapping "primary" and optional "secondary" data keys to the tokenizer's text and text_pair arguments. The default dataset is GLUE SST-2.
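The key mapping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Optimum's actual code: `tokenize_example` and `toy_tokenizer` are hypothetical names, and the toy tokenizer only stands in for a real `PreTrainedTokenizerBase`, whose call accepts `text` and `text_pair`.

```python
from typing import Any, Callable, Dict, Optional


def tokenize_example(
    example: Dict[str, Any],
    tokenizer: Callable[..., Dict[str, Any]],
    data_keys: Dict[str, str],
    **tokenizer_kwargs: Any,
) -> Dict[str, Any]:
    """Hypothetical helper: map "primary"/"secondary" data keys to text/text_pair."""
    text = example[data_keys["primary"]]
    # "secondary" is optional; single-sentence tasks simply omit it.
    secondary_column = data_keys.get("secondary")
    text_pair: Optional[str] = example[secondary_column] if secondary_column else None
    return tokenizer(text=text, text_pair=text_pair, **tokenizer_kwargs)


def toy_tokenizer(text: str, text_pair: Optional[str] = None, **kwargs: Any) -> Dict[str, Any]:
    """Stand-in tokenizer (assumption): whitespace tokens instead of vocabulary ids."""
    tokens = text.split()
    if text_pair is not None:
        tokens += ["[SEP]"] + text_pair.split()
    return {"input_ids": tokens, "attention_mask": [1] * len(tokens)}


# Single-sentence classification: only the "primary" key is set.
single = tokenize_example({"sentence": "a great movie"}, toy_tokenizer, {"primary": "sentence"})

# Sentence-pair classification (e.g. NLI): both keys are set.
pair = tokenize_example(
    {"premise": "it rains", "hypothesis": "the ground is wet"},
    toy_tokenizer,
    {"primary": "premise", "secondary": "hypothesis"},
)
```

With a real tokenizer in place of `toy_tokenizer`, the same call shape would produce integer `input_ids` rather than string tokens.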

Usage

Use this processor when benchmarking text classification models. When data keys are not provided explicitly, it infers the sentence column (and, for sentence-pair tasks, the hypothesis column) from the dataset's column names.
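One way such column inference could work is substring matching against candidate names. The sketch below is an assumption for illustration only: `guess_data_keys` and both candidate tuples are hypothetical, and the matching rules in Optimum itself may differ.

```python
from typing import Dict, List, Optional

# Illustrative candidate substrings (assumption); the real processor's lists may differ.
PRIMARY_CANDIDATES = ("sentence", "text", "premise")
SECONDARY_CANDIDATES = ("hypothesis",)


def guess_data_keys(column_names: List[str]) -> Optional[Dict[str, str]]:
    """Hypothetical helper: pick "primary"/"secondary" columns by name, or None."""
    primary = next(
        (c for c in column_names if any(k in c.lower() for k in PRIMARY_CANDIDATES)),
        None,
    )
    if primary is None:
        # No plausible text column found: the caller must pass data keys explicitly.
        return None

    keys = {"primary": primary}
    secondary = next(
        (
            c
            for c in column_names
            if c != primary and any(k in c.lower() for k in SECONDARY_CANDIDATES)
        ),
        None,
    )
    if secondary is not None:
        keys["secondary"] = secondary
    return keys


sst2_keys = guess_data_keys(["sentence", "label", "idx"])
nli_keys = guess_data_keys(["premise", "hypothesis", "label"])
no_keys = guess_data_keys(["label", "idx"])
```

Returning `None` when nothing matches keeps the failure explicit, so a caller can fall back to user-supplied data keys instead of tokenizing the wrong column.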

Code Reference

Source Location

Signature

class TextClassificationProcessing(TaskProcessor):
    ACCEPTED_PREPROCESSOR_CLASSES = (PreTrainedTokenizerBase,)
    DEFAULT_DATASET_ARGS = {"path": "glue", "name": "sst2"}
    DEFAULT_DATASET_DATA_KEYS = {"primary": "sentence"}
    ALLOWED_DATA_KEY_NAMES = {"primary", "secondary"}
    DEFAULT_REF_KEYS = ["label"]

Import

from optimum.utils.preprocessing.text_classification import TextClassificationProcessing

I/O Contract

Inputs

Name | Type | Required | Description
config | PretrainedConfig | Yes | The model configuration
preprocessor | PreTrainedTokenizerBase | Yes | Tokenizer for the model
preprocessor_kwargs | Dict[str, Any] | No | Overrides for the default tokenizer arguments (padding, truncation, max_length)
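The override behaviour of preprocessor_kwargs can be pictured as a dictionary merge in which user-supplied values win. This is a sketch under assumptions: `resolve_tokenizer_kwargs` is a hypothetical helper, and the default values shown are placeholders rather than the library's actual defaults.

```python
from typing import Any, Dict, Optional

# Placeholder defaults (assumption); the library's real default values may differ.
ASSUMED_DEFAULTS: Dict[str, Any] = {
    "padding": "max_length",
    "truncation": True,
    "max_length": 128,
}


def resolve_tokenizer_kwargs(
    preprocessor_kwargs: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: start from defaults, then apply user overrides."""
    resolved = dict(ASSUMED_DEFAULTS)  # copy so the defaults are never mutated
    resolved.update(preprocessor_kwargs or {})
    return resolved


default_kwargs = resolve_tokenizer_kwargs()
short_kwargs = resolve_tokenizer_kwargs({"max_length": 32})
```

Only the keys that are passed are replaced, so overriding `max_length` leaves padding and truncation at their defaults.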

Outputs

Name | Type | Description
dataset_processing_func output | Dict | Tokenized inputs with input_ids, attention_mask

Usage Examples

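from transformers import AutoConfig, AutoTokenizer
from optimum.utils.preprocessing.text_classification import TextClassificationProcessing

# Load the model configuration and its matching tokenizer
# (downloaded from the Hugging Face Hub unless already cached).
config = AutoConfig.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The tokenizer is passed as the preprocessor argument.
processor = TextClassificationProcessing(config, tokenizer)

# Load the default dataset (GLUE SST-2), restricted to the smallest split and 100 samples.
dataset = processor.load_default_dataset(load_smallest_split=True, num_samples=100)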
