Implementation:Huggingface Optimum TaskProcessor

Knowledge Sources	Huggingface_Optimum
Domains	Preprocessing, Data_Processing
Last Updated	2026-02-15 00:00 GMT

Overview

Abstract base class provided by the Hugging Face Optimum library for task-specific dataset preprocessing, loading, and column inference.

Description

TaskProcessor is the abstract base class for all task-specific dataset processors in Optimum. It defines the interface for:

Validating preprocessors against accepted types
Processing dataset examples via dataset_processing_func
Loading datasets from the Hub or local paths with automatic data key and reference key inference
Preparing datasets by applying the processing function via `dataset.map()`
Loading default datasets configured per task

Subclasses must implement dataset_processing_func, try_to_guess_data_keys, and try_to_guess_ref_keys to handle task-specific tokenization, data column mapping, and label detection.
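As a rough illustration of this contract, the following self-contained sketch uses a hypothetical stand-in class rather than Optimum itself. The class names SketchTaskProcessor and ToyTextProcessor, the substring heuristics, and the "primary" data-key name are assumptions made for the example; a real subclass would call its tokenizer or image processor inside dataset_processing_func.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class SketchTaskProcessor(ABC):
    """Hypothetical stand-in for the three abstract methods (not Optimum's code)."""

    @abstractmethod
    def dataset_processing_func(
        self, example: Dict[str, Any], data_keys: Dict[str, str], ref_keys: Optional[List[str]] = None
    ) -> Dict[str, Any]: ...

    @abstractmethod
    def try_to_guess_data_keys(self, column_names: List[str]) -> Optional[Dict[str, str]]: ...

    @abstractmethod
    def try_to_guess_ref_keys(self, column_names: List[str]) -> Optional[List[str]]: ...


class ToyTextProcessor(SketchTaskProcessor):
    """Toy subclass: whitespace splitting stands in for a real tokenizer."""

    def dataset_processing_func(self, example, data_keys, ref_keys=None):
        # Read the text from whichever column was mapped to the "primary" key.
        text = example[data_keys["primary"]]
        processed = {"tokens": text.lower().split()}
        # Carry reference (label) columns through unchanged.
        for key in ref_keys or []:
            processed[key] = example[key]
        return processed

    def try_to_guess_data_keys(self, column_names):
        # Assumed heuristic: the first column that looks like free text.
        for name in column_names:
            if "sentence" in name or "text" in name:
                return {"primary": name}
        return None

    def try_to_guess_ref_keys(self, column_names):
        # Assumed heuristic: the first column that looks like a label.
        for name in column_names:
            if "label" in name:
                return [name]
        return None


columns = ["idx", "sentence", "label"]
toy = ToyTextProcessor()
data_keys = toy.try_to_guess_data_keys(columns)
ref_keys = toy.try_to_guess_ref_keys(columns)
row = toy.dataset_processing_func(
    {"idx": 0, "sentence": "A fine film", "label": 1}, data_keys, ref_keys
)
```

Both guessing methods return None when no column matches, which mirrors the Optional return types in the signature and leaves the caller to decide how to handle a failed guess.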

Usage

Do not instantiate TaskProcessor directly. Use one of its concrete subclasses (TextClassificationProcessing, TokenClassificationProcessing, QuestionAnsweringProcessing, ImageClassificationProcessing) or obtain one via TaskProcessorsManager.for_task().
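The factory route can be pictured as a task-name registry. The sketch below is a hypothetical stand-in, not Optimum's implementation: this page does not show the signature of TaskProcessorsManager.for_task(), so the example assumes it looks up a processor class by task name and forwards the remaining arguments to that class's constructor. The names SketchProcessorsManager and TextClassificationStandIn are invented for illustration.

```python
from typing import Any, Dict, Type


class TextClassificationStandIn:
    """Hypothetical processor class; stores what a real one would receive."""

    def __init__(self, config: Any, preprocessor: Any):
        self.config = config
        self.preprocessor = preprocessor


class SketchProcessorsManager:
    """Hypothetical registry mapping task names to processor classes."""

    _TASK_TO_CLASS: Dict[str, Type] = {
        "text-classification": TextClassificationStandIn,
    }

    @classmethod
    def for_task(cls, task: str, *args: Any, **kwargs: Any) -> Any:
        # Fail early with the list of known tasks if the name is not registered.
        if task not in cls._TASK_TO_CLASS:
            known = ", ".join(sorted(cls._TASK_TO_CLASS))
            raise KeyError(f"No processor registered for task {task!r}; known tasks: {known}")
        # Forward constructor arguments to the selected class.
        return cls._TASK_TO_CLASS[task](*args, **kwargs)


processor = SketchProcessorsManager.for_task(
    "text-classification", {"num_labels": 2}, "my-tokenizer"
)
```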

Code Reference

Source Location

Repository: Huggingface_Optimum
File: optimum/utils/preprocessing/base.py
Lines: 1-251

Signature

class TaskProcessor(ABC):
    ACCEPTED_PREPROCESSOR_CLASSES: Tuple[Type, ...]
    DEFAULT_DATASET_ARGS: Union[str, Dict[str, Any]]
    DEFAULT_DATASET_DATA_KEYS: Dict[str, str]
    ALLOWED_DATA_KEY_NAMES: Set[str]
    DEFAULT_REF_KEYS: List[str]

    def __init__(
        self,
        config: "PretrainedConfig",
        preprocessor: Preprocessor,
        preprocessor_kwargs: Optional[Dict[str, Any]] = None,
    ):
        """
        Args:
            config: The model config.
            preprocessor: Tokenizer or image processor.
            preprocessor_kwargs: Extra keyword arguments for the preprocessor.
        """

    @abstractmethod
    def dataset_processing_func(
        self, example: Dict[str, Any], data_keys: Dict[str, str], ref_keys: Optional[List[str]] = None
    ) -> Dict[str, Any]: ...

    def prepare_dataset(
        self,
        dataset: Union["DatasetDict", "Dataset"],
        data_keys: Dict[str, str],
        ref_keys: Optional[List[str]] = None,
        split: Optional[str] = None,
    ) -> Union["DatasetDict", "Dataset"]: ...

    def load_dataset(self, path: str, ...) -> Union["DatasetDict", "Dataset"]: ...

    def load_default_dataset(self, ...) -> Union["DatasetDict", "Dataset"]: ...

    @abstractmethod
    def try_to_guess_data_keys(self, column_names: List[str]) -> Optional[Dict[str, str]]: ...

    @abstractmethod
    def try_to_guess_ref_keys(self, column_names: List[str]) -> Optional[List[str]]: ...

Import

from optimum.utils.preprocessing.base import TaskProcessor

I/O Contract

Inputs

Name	Type	Required	Description
config	PretrainedConfig	Yes	The model configuration
preprocessor	Preprocessor	Yes	A tokenizer or image processor matching ACCEPTED_PREPROCESSOR_CLASSES
preprocessor_kwargs	Dict[str, Any]	No	Additional kwargs passed to the preprocessor during processing
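
The requirement that the preprocessor match ACCEPTED_PREPROCESSOR_CLASSES could be enforced with an isinstance check at construction time. The following is a minimal sketch under that assumption; FakeTokenizer, FakeImageProcessor, SketchValidatingProcessor, and the choice of ValueError are all hypothetical and are not taken from the Optimum source.

```python
from typing import Any, Dict, Optional, Tuple, Type


class FakeTokenizer:
    """Hypothetical stand-in for a tokenizer class."""


class FakeImageProcessor:
    """Hypothetical stand-in for an image processor class."""


class SketchValidatingProcessor:
    # Only tokenizers are accepted by this hypothetical text processor.
    ACCEPTED_PREPROCESSOR_CLASSES: Tuple[Type, ...] = (FakeTokenizer,)

    def __init__(
        self,
        config: Any,
        preprocessor: Any,
        preprocessor_kwargs: Optional[Dict[str, Any]] = None,
    ):
        # Reject preprocessors whose type is not in the accepted tuple.
        if not isinstance(preprocessor, self.ACCEPTED_PREPROCESSOR_CLASSES):
            accepted = ", ".join(c.__name__ for c in self.ACCEPTED_PREPROCESSOR_CLASSES)
            raise ValueError(
                f"Got preprocessor of type {type(preprocessor).__name__}, expected one of: {accepted}"
            )
        self.config = config
        self.preprocessor = preprocessor
        # Normalize the optional kwargs to a dict so later code can unpack it.
        self.preprocessor_kwargs = preprocessor_kwargs or {}


ok = SketchValidatingProcessor(config={}, preprocessor=FakeTokenizer())
```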

Outputs

Name	Type	Description
load_dataset()	Dataset or DatasetDict	Processed dataset ready for model consumption
prepare_dataset()	Dataset or DatasetDict	Dataset with processing function applied via map()
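
To make the map() step concrete, the sketch below imitates it on a plain list of dictionaries instead of a datasets.Dataset. It assumes the non-batched map() behavior in which the dictionary returned by the function is merged into each row, so original columns are kept unless explicitly removed. The helper names prepare_rows and toy_processing_func are hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]


def toy_processing_func(
    example: Row, data_keys: Dict[str, str], ref_keys: Optional[List[str]] = None
) -> Row:
    # Whitespace splitting stands in for a real tokenizer call.
    return {"tokens": example[data_keys["primary"]].split()}


def prepare_rows(
    rows: List[Row],
    processing_func: Callable[..., Row],
    data_keys: Dict[str, str],
    ref_keys: Optional[List[str]] = None,
) -> List[Row]:
    # Merge the processed fields into a copy of each row, as map() would.
    return [{**row, **processing_func(row, data_keys, ref_keys)} for row in rows]


rows = [
    {"sentence": "Hello world", "label": 0},
    {"sentence": "Optimum is fast", "label": 1},
]
prepared = prepare_rows(rows, toy_processing_func, {"primary": "sentence"}, ["label"])
```

The input rows are left unmodified because each output row is a new dictionary.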

Usage Examples

Using a Concrete Subclass

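from transformers import AutoConfig, AutoTokenizer
from optimum.utils.preprocessing.text_classification import TextClassificationProcessing

# Load the model config and the matching tokenizer.
config = AutoConfig.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Build the task processor; the tokenizer must match ACCEPTED_PREPROCESSOR_CLASSES.
processor = TextClassificationProcessing(config, tokenizer)
# Load the default dataset configured for this task, limited to 100 samples of its smallest split.
dataset = processor.load_default_dataset(load_smallest_split=True, num_samples=100)
# Inspect which columns the processed dataset exposes.
print(dataset.column_names)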

Related Pages

Environment:Huggingface_Optimum_Python_Core_Dependencies
