Implementation:Huggingface Optimum TaskProcessor
| Knowledge Sources | |
|---|---|
| Domains | Preprocessing, Data_Processing |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Abstract base class in the Huggingface Optimum library for task-specific dataset preprocessing, loading, and column inference.
Description
TaskProcessor is the abstract base class for all task-specific dataset processors in Optimum. It defines the interface for:
- Validating preprocessors against accepted types
- Processing dataset examples via dataset_processing_func
- Loading datasets from the Hub or local paths with automatic data key and reference key inference
- Preparing datasets by applying the processing function via `dataset.map()`
- Loading default datasets configured per task
Subclasses must implement dataset_processing_func, try_to_guess_data_keys, and try_to_guess_ref_keys to handle task-specific tokenization, data column mapping, and label detection.
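To illustrate this contract, the sketch below uses a minimal stand-in base class rather than the real TaskProcessor, so it runs without Optimum installed. Only the three abstract method names and their signatures are taken from this page. The class names `MiniTaskProcessor` and `ToyTextProcessing`, the `"primary"` data key, the whitespace "tokenizer", and the column-name heuristics are hypothetical and chosen only for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class MiniTaskProcessor(ABC):
    """Hypothetical stand-in mirroring the three abstract methods of TaskProcessor."""

    @abstractmethod
    def dataset_processing_func(
        self, example: Dict[str, Any], data_keys: Dict[str, str], ref_keys: Optional[List[str]] = None
    ) -> Dict[str, Any]: ...

    @abstractmethod
    def try_to_guess_data_keys(self, column_names: List[str]) -> Optional[Dict[str, str]]: ...

    @abstractmethod
    def try_to_guess_ref_keys(self, column_names: List[str]) -> Optional[List[str]]: ...


class ToyTextProcessing(MiniTaskProcessor):
    """Hypothetical subclass: whitespace 'tokenization' plus simple column-name heuristics."""

    def dataset_processing_func(self, example, data_keys, ref_keys=None):
        # data_keys maps a logical input name to the dataset column that holds it.
        text = example[data_keys["primary"]]
        processed = {"tokens": text.lower().split()}
        # Carry reference (label) columns through unchanged.
        for key in ref_keys or []:
            processed[key] = example[key]
        return processed

    def try_to_guess_data_keys(self, column_names):
        # Pick the first column whose name looks like a text field (assumed heuristic).
        for name in column_names:
            if name in ("text", "sentence"):
                return {"primary": name}
        return None

    def try_to_guess_ref_keys(self, column_names):
        # Pick the first column whose name looks like a label field (assumed heuristic).
        for name in column_names:
            if name.startswith("label"):
                return [name]
        return None


processor = ToyTextProcessing()
columns = ["idx", "sentence", "label"]
data_keys = processor.try_to_guess_data_keys(columns)
ref_keys = processor.try_to_guess_ref_keys(columns)
row = {"idx": 0, "sentence": "A fine film", "label": 1}
result = processor.dataset_processing_func(row, data_keys, ref_keys)
```

Both guessing methods return None when no column matches, which matches the Optional return types in the signature and lets a caller fall back to explicitly supplied keys.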
Usage
Do not instantiate TaskProcessor directly. Use one of its concrete subclasses (TextClassificationProcessing, TokenClassificationProcessing, QuestionAnsweringProcessing, ImageClassificationProcessing) or obtain one via TaskProcessorsManager.for_task().
Code Reference
Source Location
- Repository: Huggingface_Optimum
- File: optimum/utils/preprocessing/base.py
- Lines: 1-251
Signature
class TaskProcessor(ABC):
    ACCEPTED_PREPROCESSOR_CLASSES: Tuple[Type, ...]
    DEFAULT_DATASET_ARGS: Union[str, Dict[str, Any]]
    DEFAULT_DATASET_DATA_KEYS: Dict[str, str]
    ALLOWED_DATA_KEY_NAMES: Set[str]
    DEFAULT_REF_KEYS: List[str]

    def __init__(
        self,
        config: "PretrainedConfig",
        preprocessor: Preprocessor,
        preprocessor_kwargs: Optional[Dict[str, Any]] = None,
    ):
        """
        Args:
            config: The model config.
            preprocessor: Tokenizer or image processor.
            preprocessor_kwargs: Extra keyword arguments for the preprocessor.
        """

    @abstractmethod
    def dataset_processing_func(
        self, example: Dict[str, Any], data_keys: Dict[str, str], ref_keys: Optional[List[str]] = None
    ) -> Dict[str, Any]: ...

    def prepare_dataset(
        self,
        dataset: Union["DatasetDict", "Dataset"],
        data_keys: Dict[str, str],
        ref_keys: Optional[List[str]] = None,
        split: Optional[str] = None,
    ) -> Union["DatasetDict", "Dataset"]: ...

    def load_dataset(self, path: str, ...) -> Union["DatasetDict", "Dataset"]: ...

    def load_default_dataset(self, ...) -> Union["DatasetDict", "Dataset"]: ...

    @abstractmethod
    def try_to_guess_data_keys(self, column_names: List[str]) -> Optional[Dict[str, str]]: ...

    @abstractmethod
    def try_to_guess_ref_keys(self, column_names: List[str]) -> Optional[List[str]]: ...
Import
from optimum.utils.preprocessing.base import TaskProcessor
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | PretrainedConfig | Yes | The model configuration |
| preprocessor | Preprocessor | Yes | A tokenizer or image processor matching ACCEPTED_PREPROCESSOR_CLASSES |
| preprocessor_kwargs | Dict[str, Any] | No | Additional kwargs passed to the preprocessor during processing |
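The requirement that the preprocessor match ACCEPTED_PREPROCESSOR_CLASSES can be pictured as an `isinstance` check in the constructor. The following is a minimal sketch under stated assumptions, not the library's code: `ValidatingProcessor`, `FakeTokenizer`, and `FakeImageProcessor` are hypothetical names, and the choice of `ValueError` and the wording of its message are assumptions.

```python
from typing import Any, Dict, Optional, Tuple, Type


class FakeTokenizer:
    """Hypothetical placeholder for an accepted preprocessor type."""


class FakeImageProcessor:
    """Hypothetical placeholder for a preprocessor type this processor rejects."""


class ValidatingProcessor:
    """Hypothetical processor that only accepts FakeTokenizer instances."""

    ACCEPTED_PREPROCESSOR_CLASSES: Tuple[Type, ...] = (FakeTokenizer,)

    def __init__(self, config: Any, preprocessor: Any, preprocessor_kwargs: Optional[Dict[str, Any]] = None):
        # isinstance accepts a tuple of types, so one check covers every accepted class.
        if not isinstance(preprocessor, self.ACCEPTED_PREPROCESSOR_CLASSES):
            accepted = ", ".join(cls.__name__ for cls in self.ACCEPTED_PREPROCESSOR_CLASSES)
            raise ValueError(
                f"Preprocessor of type {type(preprocessor).__name__} is not supported; accepted: {accepted}."
            )
        self.config = config
        self.preprocessor = preprocessor
        # Normalize the optional argument so later code can always treat it as a dict.
        self.preprocessor_kwargs = preprocessor_kwargs or {}


ok = ValidatingProcessor(config={}, preprocessor=FakeTokenizer())

try:
    ValidatingProcessor(config={}, preprocessor=FakeImageProcessor())
    rejected = False
except ValueError:
    rejected = True
```

Declaring the accepted types as a class attribute lets each subclass narrow the check to tokenizers or image processors without overriding the constructor.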
Outputs
| Name | Type | Description |
|---|---|---|
| load_dataset() | Dataset or DatasetDict | Processed dataset ready for model consumption |
| prepare_dataset() | Dataset or DatasetDict | Dataset with processing function applied via map() |
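As a rough picture of what "processing function applied via map()" means, the sketch below binds `data_keys` and `ref_keys` to a per-example function and applies it to every row. It is an illustration under stated assumptions: a list of dicts and a list comprehension stand in for a `Dataset` and its `map()` method, the helper names `processing_func` and `prepare_rows` are hypothetical, and the use of `functools.partial` is one plausible way to bind the keys rather than a statement about the library's internals.

```python
import functools
from typing import Any, Dict, List, Optional


def processing_func(
    example: Dict[str, Any], data_keys: Dict[str, str], ref_keys: Optional[List[str]] = None
) -> Dict[str, Any]:
    # Hypothetical per-example step: measure the text length and keep the reference columns.
    out = {"length": len(example[data_keys["primary"]])}
    for key in ref_keys or []:
        out[key] = example[key]
    return out


def prepare_rows(
    rows: List[Dict[str, Any]], data_keys: Dict[str, str], ref_keys: Optional[List[str]] = None
) -> List[Dict[str, Any]]:
    # Bind the key arguments once so the mapped callable takes a single example.
    bound = functools.partial(processing_func, data_keys=data_keys, ref_keys=ref_keys)
    # A list comprehension stands in for Dataset.map() here.
    return [bound(row) for row in rows]


rows = [{"text": "good", "label": 1}, {"text": "awful", "label": 0}]
prepared = prepare_rows(rows, data_keys={"primary": "text"}, ref_keys=["label"])
```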
Usage Examples
Using a Concrete Subclass
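from transformers import AutoConfig, AutoTokenizer
from optimum.utils.preprocessing.text_classification import TextClassificationProcessing

# Load the model config and a matching tokenizer (the preprocessor).
config = AutoConfig.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Build the task processor from the concrete subclass, not from TaskProcessor itself.
processor = TextClassificationProcessing(config, tokenizer)

# Load the default dataset for the task, keeping 100 samples from its smallest split.
dataset = processor.load_default_dataset(load_smallest_split=True, num_samples=100)
print(dataset.column_names)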