Implementation:Huggingface Optimum TaskProcessor

From Leeroopedia
Revision as of 13:05, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Huggingface_Optimum_TaskProcessor.md)
Knowledge Sources
Domains Preprocessing, Data_Processing
Last Updated 2026-02-15 00:00 GMT

Overview

Abstract base class in the Hugging Face Optimum library that defines task-specific dataset preprocessing, loading, and column inference.

Description

TaskProcessor is the abstract base class for all task-specific dataset processors in Optimum. It defines the interface for:

  • Validating preprocessors against accepted types
  • Processing dataset examples via dataset_processing_func
  • Loading datasets from the Hub or local paths with automatic data key and reference key inference
  • Preparing datasets by applying the processing function via `dataset.map()`
  • Loading default datasets configured per task
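The first item in the list, validating a preprocessor against accepted types, can be pictured as an isinstance check against a class-level tuple. The following is a minimal sketch under stated assumptions, not Optimum's own code: FakeTokenizer, FakeImageProcessor, and validate_preprocessor are hypothetical names, and the choice of ValueError is an assumption.

```python
from typing import Any, Tuple, Type


class FakeTokenizer:
    """Hypothetical stand-in for a tokenizer class."""


class FakeImageProcessor:
    """Hypothetical stand-in for an image processor class."""


def validate_preprocessor(preprocessor: Any, accepted: Tuple[Type, ...]) -> None:
    # isinstance accepts a tuple of classes, which matches the
    # Tuple[Type, ...] annotation on ACCEPTED_PREPROCESSOR_CLASSES.
    if not isinstance(preprocessor, accepted):
        names = ", ".join(cls.__name__ for cls in accepted)
        raise ValueError(
            f"Preprocessor of type {type(preprocessor).__name__} is not supported; "
            f"expected one of: {names}."
        )


# Assumed for this sketch: a text task accepts tokenizers only.
ACCEPTED_PREPROCESSOR_CLASSES = (FakeTokenizer,)

# An accepted preprocessor passes silently.
validate_preprocessor(FakeTokenizer(), ACCEPTED_PREPROCESSOR_CLASSES)

# A preprocessor of the wrong kind is rejected.
try:
    validate_preprocessor(FakeImageProcessor(), ACCEPTED_PREPROCESSOR_CLASSES)
    rejected = False
except ValueError:
    rejected = True
```

Keeping the accepted types as a class attribute lets each task-specific subclass narrow the check without overriding the validation logic.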

Subclasses must implement dataset_processing_func, try_to_guess_data_keys, and try_to_guess_ref_keys to handle task-specific tokenization, data column mapping, and label detection.
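To make the three-method contract concrete, here is a dependency-free sketch. SketchProcessor and ToyTextProcessing are hypothetical stand-ins that only mirror the abstract method names from the signature; the "primary" data key, the column-name heuristics, and the whitespace "tokenizer" are assumptions chosen for illustration, not the library's behavior.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class SketchProcessor(ABC):
    """Hypothetical stand-in mirroring TaskProcessor's abstract methods."""

    @abstractmethod
    def dataset_processing_func(
        self, example: Dict[str, Any], data_keys: Dict[str, str], ref_keys: Optional[List[str]] = None
    ) -> Dict[str, Any]: ...

    @abstractmethod
    def try_to_guess_data_keys(self, column_names: List[str]) -> Optional[Dict[str, str]]: ...

    @abstractmethod
    def try_to_guess_ref_keys(self, column_names: List[str]) -> Optional[List[str]]: ...


class ToyTextProcessing(SketchProcessor):
    def dataset_processing_func(self, example, data_keys, ref_keys=None):
        # A real subclass would call its tokenizer here; splitting on
        # whitespace keeps the sketch free of external dependencies.
        return {"tokens": example[data_keys["primary"]].lower().split()}

    def try_to_guess_data_keys(self, column_names):
        # Pick the first column whose name looks like an input text column.
        for name in column_names:
            if "text" in name or "sentence" in name:
                return {"primary": name}
        return None  # signal that no data column could be inferred

    def try_to_guess_ref_keys(self, column_names):
        # Pick the first column whose name looks like a label column.
        for name in column_names:
            if "label" in name:
                return [name]
        return None


processor = ToyTextProcessing()
columns = ["idx", "sentence", "label"]
data_keys = processor.try_to_guess_data_keys(columns)
ref_keys = processor.try_to_guess_ref_keys(columns)
processed = processor.dataset_processing_func(
    {"idx": 0, "sentence": "A fine film", "label": 1}, data_keys, ref_keys
)
```

Returning None from the guessing methods, rather than raising, leaves the caller free to decide whether a missing column is an error or should be supplied explicitly.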

Usage

Do not instantiate TaskProcessor directly. Use one of its concrete subclasses (TextClassificationProcessing, TokenClassificationProcessing, QuestionAnsweringProcessing, ImageClassificationProcessing) or obtain one via TaskProcessorsManager.for_task().

Code Reference

Source Location

Signature

class TaskProcessor(ABC):
    ACCEPTED_PREPROCESSOR_CLASSES: Tuple[Type, ...]
    DEFAULT_DATASET_ARGS: Union[str, Dict[str, Any]]
    DEFAULT_DATASET_DATA_KEYS: Dict[str, str]
    ALLOWED_DATA_KEY_NAMES: Set[str]
    DEFAULT_REF_KEYS: List[str]

    def __init__(
        self,
        config: "PretrainedConfig",
        preprocessor: Preprocessor,
        preprocessor_kwargs: Optional[Dict[str, Any]] = None,
    ):
        """
        Args:
            config: The model config.
            preprocessor: Tokenizer or image processor.
            preprocessor_kwargs: Extra keyword arguments for the preprocessor.
        """

    @abstractmethod
    def dataset_processing_func(
        self, example: Dict[str, Any], data_keys: Dict[str, str], ref_keys: Optional[List[str]] = None
    ) -> Dict[str, Any]: ...

    def prepare_dataset(
        self,
        dataset: Union["DatasetDict", "Dataset"],
        data_keys: Dict[str, str],
        ref_keys: Optional[List[str]] = None,
        split: Optional[str] = None,
    ) -> Union["DatasetDict", "Dataset"]: ...

    def load_dataset(self, path: str, ...) -> Union["DatasetDict", "Dataset"]: ...

    def load_default_dataset(self, ...) -> Union["DatasetDict", "Dataset"]: ...

    @abstractmethod
    def try_to_guess_data_keys(self, column_names: List[str]) -> Optional[Dict[str, str]]: ...

    @abstractmethod
    def try_to_guess_ref_keys(self, column_names: List[str]) -> Optional[List[str]]: ...

Import

from optimum.utils.preprocessing.base import TaskProcessor

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| config | PretrainedConfig | Yes | The model configuration |
| preprocessor | Preprocessor | Yes | A tokenizer or image processor matching ACCEPTED_PREPROCESSOR_CLASSES |
| preprocessor_kwargs | Dict[str, Any] | No | Additional kwargs passed to the preprocessor during processing |

Outputs

| Name | Type | Description |
|------|------|-------------|
| load_dataset() | Dataset or DatasetDict | Processed dataset ready for model consumption |
| prepare_dataset() | Dataset or DatasetDict | Dataset with processing function applied via map() |
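The prepare_dataset() step, applying a per-example function that also needs data_keys and ref_keys, can be sketched without the datasets library. This is an illustration under assumptions: processing_func and prepare_rows are hypothetical names, a list of dicts stands in for a Dataset, and binding the keys with functools.partial is one plausible way to adapt the three-argument function to a one-argument map callback.

```python
from functools import partial
from typing import Any, Dict, List, Optional


def processing_func(
    example: Dict[str, Any], data_keys: Dict[str, str], ref_keys: Optional[List[str]] = None
) -> Dict[str, Any]:
    # Toy processing: count whitespace-separated words in the primary column.
    out: Dict[str, Any] = {"length": len(example[data_keys["primary"]].split())}
    if ref_keys is not None:
        # Carry reference (label) columns through unchanged.
        for key in ref_keys:
            out[key] = example[key]
    return out


def prepare_rows(
    rows: List[Dict[str, Any]], data_keys: Dict[str, str], ref_keys: Optional[List[str]] = None
) -> List[Dict[str, Any]]:
    # Bind the key arguments so the callback takes a single example,
    # which is the shape a map-style API expects.
    bound = partial(processing_func, data_keys=data_keys, ref_keys=ref_keys)
    # Merging the returned columns into each row imitates how a map call
    # adds or updates columns on every example.
    return [{**row, **bound(row)} for row in rows]


rows = [
    {"sentence": "a fine film", "label": 1},
    {"sentence": "dull", "label": 0},
]
prepared = prepare_rows(rows, {"primary": "sentence"}, ["label"])
```

With a real Dataset, the list comprehension would be replaced by a call to the dataset's map() method with the bound function.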

Usage Examples

Using a Concrete Subclass

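from transformers import AutoConfig, AutoTokenizer
from optimum.utils.preprocessing.text_classification import TextClassificationProcessing

# Load the model config and a tokenizer for the same checkpoint.
config = AutoConfig.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The tokenizer must be one of the subclass's accepted preprocessor types.
processor = TextClassificationProcessing(config, tokenizer)

# Load the task's default dataset, restricted to 100 samples of the smallest split.
dataset = processor.load_default_dataset(load_smallest_split=True, num_samples=100)
print(dataset.column_names)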
