Implementation:Huggingface Optimum TokenClassificationProcessing
| Knowledge Sources | |
|---|---|
| Domains | Preprocessing, NLP |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Concrete tool, provided by the Huggingface Optimum library, for preprocessing token classification datasets whose inputs are already split into words.
Description
TokenClassificationProcessing is a TaskProcessor subclass for token-level classification tasks (NER, POS tagging, chunking). It passes `is_split_into_words=True` to the tokenizer, so each example is treated as a list of pre-split words rather than a batch of separate strings, and subword splitting is applied within each word. The default dataset is CoNLL-2003 (`conll2003`).
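To make the role of pre-split input concrete, the sketch below uses a hypothetical toy splitter (`toy_subword_split`, fixed-size chunks) in place of a real tokenizer. Neither function is part of Optimum or Transformers. It only illustrates that when the input arrives as a list of words, every subword piece can be traced back to the index of the word it came from.

```python
from typing import List, Tuple


def toy_subword_split(word: str, piece_len: int = 4) -> List[str]:
    """Hypothetical stand-in for a subword tokenizer: fixed-size chunks.

    Continuation pieces get a "##" prefix, mimicking WordPiece notation.
    """
    chunks = [word[i:i + piece_len] for i in range(0, len(word), piece_len)]
    return [c if i == 0 else "##" + c for i, c in enumerate(chunks)]


def tokenize_pre_split(words: List[str]) -> Tuple[List[str], List[int]]:
    """Split each word separately and record which word every piece belongs to."""
    pieces: List[str] = []
    word_ids: List[int] = []
    for idx, word in enumerate(words):
        for piece in toy_subword_split(word):
            pieces.append(piece)
            word_ids.append(idx)
    return pieces, word_ids


# A pre-split example in the style of the "tokens" column.
words = ["EU", "rejects", "German", "call"]
pieces, word_ids = tokenize_pre_split(words)
print(pieces)
print(word_ids)
```

Because the word boundaries are given rather than inferred, the mapping from pieces to words stays well defined even when a word is broken into several pieces.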
Usage
Use this processor when benchmarking token classification models. Working only from the dataset's column names, it infers the input column by looking for names such as "tokens", "text", or "sentence", and infers the reference columns by looking for tag columns.
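The column detection described above could be sketched roughly as follows. This is an illustration, not the library's code: the helper names `guess_data_keys` and `guess_ref_keys`, the use of substring matching, and the choice to return every tag column are all assumptions.

```python
from typing import Dict, List, Optional

# Assumed hints: substrings that suggest a column holds the input words.
DATA_HINTS = ("token", "text", "sentence")


def guess_data_keys(column_names: List[str]) -> Optional[Dict[str, str]]:
    """Return the first column whose name contains one of the input hints."""
    for name in column_names:
        if any(hint in name for hint in DATA_HINTS):
            return {"primary": name}
    return None  # nothing looked like an input column


def guess_ref_keys(column_names: List[str]) -> Optional[List[str]]:
    """Return every column whose name contains "tag" (assumed heuristic)."""
    tags = [name for name in column_names if "tag" in name]
    return tags or None


# Column names of a CoNLL-2003-style dataset.
columns = ["id", "tokens", "pos_tags", "chunk_tags", "ner_tags"]
data_keys = guess_data_keys(columns)
ref_keys = guess_ref_keys(columns)
print(data_keys, ref_keys)
```

When no column matches, both helpers return `None`, which leaves it to the caller to supply the keys explicitly.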
Code Reference
Source Location
- Repository: Huggingface_Optimum
- File: optimum/utils/preprocessing/token_classification.py
- Lines: 1-101
Signature
class TokenClassificationProcessing(TaskProcessor):
    ACCEPTED_PREPROCESSOR_CLASSES = (PreTrainedTokenizerBase,)
    DEFAULT_DATASET_ARGS = "conll2003"
    DEFAULT_DATASET_DATA_KEYS = {"primary": "tokens"}
    ALLOWED_DATA_KEY_NAMES = {"primary"}
    DEFAULT_REF_KEYS = ["ner_tags", "pos_tags", "chunk_tags"]
Import
from optimum.utils.preprocessing.token_classification import TokenClassificationProcessing
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | PretrainedConfig | Yes | The model configuration |
| preprocessor | PreTrainedTokenizerBase | Yes | Tokenizer for the model |
| preprocessor_kwargs | Dict[str, Any] | No | Override defaults (padding, truncation, max_length, is_split_into_words) |
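As a rough illustration of how `preprocessor_kwargs` can override the four defaults listed above, the sketch below separates them from any remaining tokenizer arguments. The function name `split_defaults_and_kwargs`, the `model_max_length` parameter, and the fallback values for `padding`, `truncation`, and `max_length` are assumptions made for this example. Only `is_split_into_words=True` is stated in this page.

```python
import copy
from typing import Any, Dict, Optional, Tuple


def split_defaults_and_kwargs(
    preprocessor_kwargs: Optional[Dict[str, Any]] = None,
    model_max_length: int = 512,  # assumed stand-in for the tokenizer's limit
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Separate the overridable defaults from the remaining tokenizer kwargs."""
    # Copy so the caller's dictionary is not mutated by pop().
    kwargs = copy.deepcopy(preprocessor_kwargs) if preprocessor_kwargs else {}
    defaults = {
        "padding": kwargs.pop("padding", "max_length"),           # assumed default
        "truncation": kwargs.pop("truncation", True),             # assumed default
        "max_length": kwargs.pop("max_length", model_max_length),  # assumed default
        "is_split_into_words": kwargs.pop("is_split_into_words", True),
    }
    return defaults, kwargs


user_kwargs = {"max_length": 128, "return_tensors": "np"}
defaults, extra = split_defaults_and_kwargs(user_kwargs)
print(defaults)
print(extra)
```

Keys that match one of the four defaults replace it, and anything else is passed through unchanged, so a caller can shorten `max_length` without restating the other settings.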
Outputs
| Name | Type | Description |
|---|---|---|
| dataset_processing_func output | Dict | Tokenized inputs from pre-split word sequences |
Usage Examples
from transformers import AutoConfig, AutoTokenizer
from optimum.utils.preprocessing.token_classification import TokenClassificationProcessing

# Load the configuration and tokenizer of a token classification checkpoint.
config = AutoConfig.from_pretrained("dslim/bert-base-NER")
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")

# Build the processor, then load 100 samples from the smallest split of the default dataset.
processor = TokenClassificationProcessing(config, tokenizer)
dataset = processor.load_default_dataset(load_smallest_split=True, num_samples=100)