Implementation:Huggingface Optimum TokenClassificationProcessing

From Leeroopedia
Revision as of 13:05, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Huggingface_Optimum_TokenClassificationProcessing.md)
Knowledge Sources
Domains: Preprocessing, NLP
Last Updated: 2026-02-15 00:00 GMT

Overview

A concrete tool, provided by the Hugging Face Optimum library, for preprocessing token classification datasets whose inputs are already split into words.

Description

TokenClassificationProcessing is a TaskProcessor subclass for token-level classification tasks (NER, POS tagging, chunking). It tokenizes pre-split word sequences using `is_split_into_words=True` to handle subword tokenization correctly. The default dataset is CoNLL-2003.
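
To illustrate why pre-split input matters, here is a minimal sketch using a toy subword splitter. The helpers `toy_subword_split` and `tokenize_pre_split` are hypothetical stand-ins, not part of Optimum or Transformers, and the fixed-length chunking is not a real WordPiece algorithm. The point is only that when words arrive pre-split, every subword piece can be traced back to the index of its source word, which is what keeps word-level tags meaningful.

```python
from typing import List, Tuple


def toy_subword_split(word: str, max_piece_len: int = 4) -> List[str]:
    """Toy stand-in for a subword tokenizer: cut a word into fixed-size pieces.

    Continuation pieces get a "##" prefix, mimicking WordPiece notation only.
    """
    if not word:
        return []
    pieces = [word[i:i + max_piece_len] for i in range(0, len(word), max_piece_len)]
    return [pieces[0]] + ["##" + piece for piece in pieces[1:]]


def tokenize_pre_split(words: List[str]) -> Tuple[List[str], List[int]]:
    """Tokenize a pre-split sentence and record the source word of each piece."""
    tokens: List[str] = []
    word_ids: List[int] = []
    for word_index, word in enumerate(words):
        for piece in toy_subword_split(word):
            tokens.append(piece)
            word_ids.append(word_index)
    return tokens, word_ids


tokens, word_ids = tokenize_pre_split(["EU", "rejects", "German"])
print(tokens)    # ['EU', 'reje', '##cts', 'Germ', '##an']
print(word_ids)  # [0, 1, 1, 2, 2]
```

With a real fast tokenizer from Transformers, the analogous mapping is exposed through the `word_ids()` method of the returned encoding.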

Usage

Use this processor when benchmarking token classification models. It automatically detects the input column (one named "tokens", "text", or "sentence") and the tag column from the dataset's column names.
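
The column detection described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the function names `guess_data_keys` and `guess_ref_keys` are hypothetical, and substring matching with a first-match-wins rule is assumed.

```python
from typing import Dict, List, Optional

# Name fragments taken from the description above; substring matching is an assumption.
INPUT_HINTS = ("tokens", "text", "sentence")
TAG_HINT = "tag"


def guess_data_keys(column_names: List[str]) -> Optional[Dict[str, str]]:
    """Return the first column that looks like the word-level input, if any."""
    for name in column_names:
        if any(hint in name for hint in INPUT_HINTS):
            return {"primary": name}
    return None


def guess_ref_keys(column_names: List[str]) -> Optional[List[str]]:
    """Return the first column that looks like a tag (label) column, if any."""
    for name in column_names:
        if TAG_HINT in name:
            return [name]
    return None


columns = ["id", "tokens", "pos_tags", "chunk_tags", "ner_tags"]
print(guess_data_keys(columns))  # {'primary': 'tokens'}
print(guess_ref_keys(columns))   # ['pos_tags']
```

Under the first-match assumption, a dataset with several tag columns resolves to whichever comes first, so passing the keys explicitly is the safer choice when a specific tag set is needed.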

Code Reference

Source Location

Signature

class TokenClassificationProcessing(TaskProcessor):
    ACCEPTED_PREPROCESSOR_CLASSES = (PreTrainedTokenizerBase,)
    DEFAULT_DATASET_ARGS = "conll2003"
    DEFAULT_DATASET_DATA_KEYS = {"primary": "tokens"}
    ALLOWED_DATA_KEY_NAMES = {"primary"}
    DEFAULT_REF_KEYS = ["ner_tags", "pos_tags", "chunk_tags"]

Import

from optimum.utils.preprocessing.token_classification import TokenClassificationProcessing

I/O Contract

Inputs

Name | Type | Required | Description
config | PretrainedConfig | Yes | The model configuration
preprocessor | PreTrainedTokenizerBase | Yes | Tokenizer for the model
preprocessor_kwargs | Dict[str, Any] | No | Override defaults (padding, truncation, max_length, is_split_into_words)
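
As a rough illustration of how `preprocessor_kwargs` might override the four defaults listed above, consider the sketch below. The helper `split_defaults_and_kwargs` is hypothetical, and the fallback values for `padding`, `truncation`, and `max_length` are assumptions made for the example; only `is_split_into_words=True` is stated on this page.

```python
import copy
from typing import Any, Dict, Optional, Tuple


def split_defaults_and_kwargs(
    preprocessor_kwargs: Optional[Dict[str, Any]],
    model_max_length: int = 512,  # assumed stand-in for the tokenizer's maximum length
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Separate the overridable defaults from any remaining tokenizer arguments."""
    # Copy so that the caller's dictionary is never mutated.
    kwargs = copy.deepcopy(preprocessor_kwargs) if preprocessor_kwargs else {}
    defaults = {
        "padding": kwargs.pop("padding", "max_length"),       # assumed fallback
        "truncation": kwargs.pop("truncation", True),         # assumed fallback
        "max_length": kwargs.pop("max_length", model_max_length),
        "is_split_into_words": kwargs.pop("is_split_into_words", True),
    }
    return defaults, kwargs


user_kwargs = {"max_length": 128, "return_offsets_mapping": True}
defaults, extra = split_defaults_and_kwargs(user_kwargs)
print(defaults["max_length"])           # 128
print(defaults["is_split_into_words"])  # True
print(extra)                            # {'return_offsets_mapping': True}
```

Keys that are not among the four defaults pass through untouched, so they can still be forwarded to the tokenizer call.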

Outputs

Name | Type | Description
dataset_processing_func output | Dict | Tokenized inputs from pre-split word sequences

Usage Examples

from transformers import AutoConfig, AutoTokenizer
from optimum.utils.preprocessing.token_classification import TokenClassificationProcessing

# Load the configuration and tokenizer of an NER checkpoint.
config = AutoConfig.from_pretrained("dslim/bert-base-NER")
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")

# Build the processor, then load 100 samples from the smallest split
# of the default dataset (CoNLL-2003).
processor = TokenClassificationProcessing(config, tokenizer)
dataset = processor.load_default_dataset(load_smallest_split=True, num_samples=100)
