

Implementation:Microsoft LoRA Run NER

From Leeroopedia
Revision as of 15:43, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Microsoft_LoRA_Run_NER.md)



Overview

run_ner.py is a fine-tuning script for token classification tasks such as Named Entity Recognition (NER). It is built on AutoModelForTokenClassification, DataCollatorForTokenClassification, and the seqeval metric.

Description

This script fine-tunes transformer models on token-level classification tasks such as NER, part-of-speech tagging, and chunking. Its main complication is subword tokenization alignment: the datasets carry one label per word, while the tokenizer may split a word into several subword tokens, so word-level labels must be mapped onto those subword tokens.

Key implementation details:

  • Fast tokenizer requirement: The script needs a PreTrainedTokenizerFast because it calls word_ids() to map each subword token back to its original word when assigning labels.
  • Label discovery: Supports two modes:
    • If the label column uses ClassLabel from the datasets library, extracts label names directly from features[label_column_name].feature.names.
    • Otherwise, iterates over the training data to discover unique labels, sorts them, and builds a label_to_id mapping.
  • Subword label alignment: The tokenize_and_align_labels() function:
    • Uses is_split_into_words=True since inputs are pre-tokenized word lists.
    • Assigns -100 (ignored in loss) to special tokens (word_id is None).
    • For the first subword token of each word, assigns the corresponding label.
    • For subsequent subword tokens of the same word, assigns either the label (if label_all_tokens=True) or -100.
  • Column detection: Auto-detects the tokens column for text and the {task_name}_tags column for labels (e.g., ner_tags, pos_tags).
  • Data collation: Uses DataCollatorForTokenClassification for dynamic padding of variable-length token sequences.
  • Metrics: Uses the seqeval metric for entity-level evaluation. Supports two reporting modes:
    • Default: Overall precision, recall, F1, and accuracy.
    • Entity-level (--return_entity_level_metrics): Per-entity-type metrics unpacked from nested dictionaries.
  • Three-phase pipeline: Supports do_train, do_eval, and do_predict. Test predictions are saved as space-separated label sequences in test_predictions.txt.
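
The subword label alignment rules listed above can be sketched in plain Python. This is a minimal illustration and not the script's actual code: the function name align_labels and the example values are hypothetical, and the word_ids list stands in for what a fast tokenizer's word_ids() would return for one sentence.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # label value that the loss function skips


def align_labels(
    word_ids: List[Optional[int]],
    word_labels: List[int],
    label_all_tokens: bool = False,
) -> List[int]:
    """Map word-level label ids onto subword tokens (hypothetical helper)."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens have no source word, so they are ignored.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous_word_id:
            # The first subword of a word carries the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later subwords repeat the label or are ignored.
            aligned.append(word_labels[word_id] if label_all_tokens else IGNORE_INDEX)
        previous_word_id = word_id
    return aligned


# Hypothetical sentence of three words; the second word splits into two subwords.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [3, 0, 7]

print(align_labels(word_ids, word_labels))
# [-100, 3, 0, -100, 7, -100]
print(align_labels(word_ids, word_labels, label_all_tokens=True))
# [-100, 3, 0, 0, 7, -100]
```

With label_all_tokens left at False, only one position per word contributes to the loss, which keeps long words from being over-weighted.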

Usage

Use this script when you need to:

  • Fine-tune models on NER, POS tagging, or chunking tasks
  • Handle subword-to-word label alignment for token classification
  • Train on standard datasets (CoNLL-2003, etc.) or custom CSV/JSON token-level data
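
For custom data whose labels are plain strings rather than ClassLabel features, the label discovery step described earlier can be approximated as follows. This is a sketch under assumptions: discover_labels is a hypothetical name, and the tag strings are made-up examples.

```python
from typing import Dict, Iterable, List, Tuple


def discover_labels(
    label_sequences: Iterable[List[str]],
) -> Tuple[List[str], Dict[str, int]]:
    """Collect unique labels, sort them, and build a label-to-id mapping."""
    unique = set()
    for sequence in label_sequences:
        unique.update(sequence)
    label_list = sorted(unique)  # sorting makes the ids reproducible across runs
    label_to_id = {label: index for index, label in enumerate(label_list)}
    return label_list, label_to_id


# Hypothetical training labels for two sentences.
train_labels = [["B-PER", "I-PER", "O"], ["O", "B-LOC"]]
label_list, label_to_id = discover_labels(train_labels)

print(label_list)   # ['B-LOC', 'B-PER', 'I-PER', 'O']
print(label_to_id)  # {'B-LOC': 0, 'B-PER': 1, 'I-PER': 2, 'O': 3}
```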

Code Reference

Source Location

Property     Value
File         examples/NLU/examples/token-classification/run_ner.py
Lines        501
Module       run_ner
Entry Point  main()

Signature/CLI

python run_ner.py \
    --model_name_or_path MODEL_NAME \
    --dataset_name DATASET_NAME \
    --output_dir OUTPUT_DIR \
    --do_train \
    --do_eval \
    [--do_predict] \
    [--dataset_config_name CONFIG] \
    [--task_name ner] \
    [--train_file TRAIN_FILE] \
    [--validation_file VALIDATION_FILE] \
    [--test_file TEST_FILE] \
    [--pad_to_max_length] \
    [--label_all_tokens] \
    [--return_entity_level_metrics] \
    [--max_train_samples N] \
    [--max_val_samples N] \
    [--max_test_samples N]

Import

from transformers import (
    AutoConfig,
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    HfArgumentParser,
    PreTrainedTokenizerFast,
    Trainer,
    TrainingArguments,
    set_seed,
)
from datasets import ClassLabel, load_dataset, load_metric

I/O Contract

Inputs

Parameter                      Type  Required  Default  Description
--model_name_or_path           str   Yes       -        Pretrained model name or path
--output_dir                   str   Yes       -        Directory for checkpoints and results
--dataset_name                 str   No        None     HuggingFace dataset name (e.g., conll2003)
--task_name                    str   No        ner      Task name used for label column detection ({task}_tags)
--train_file                   str   No        None     Custom CSV/JSON training file
--validation_file              str   No        None     Custom CSV/JSON validation file
--test_file                    str   No        None     Custom CSV/JSON test file
--pad_to_max_length            flag  No        False    Pad to model max length (required for TPU)
--label_all_tokens             flag  No        False    Label all subword tokens (not just first)
--return_entity_level_metrics  flag  No        False    Report per-entity-type metrics
--max_train_samples            int   No        None     Truncate training set for debugging
--max_val_samples              int   No        None     Truncate validation set for debugging
--max_test_samples             int   No        None     Truncate test set for debugging
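
How --task_name drives column detection can be illustrated with a short sketch. The helper name detect_columns is hypothetical, and the fallback to the first and second columns when the expected names are missing is an assumption rather than something this page confirms.

```python
from typing import List, Tuple


def detect_columns(column_names: List[str], task_name: str = "ner") -> Tuple[str, str]:
    """Pick the text and label columns for a token classification dataset."""
    label_candidate = f"{task_name}_tags"
    # Prefer the conventional names described in the Description section.
    # Assumed fallback: first column for text, second column for labels.
    text_column = "tokens" if "tokens" in column_names else column_names[0]
    label_column = label_candidate if label_candidate in column_names else column_names[1]
    return text_column, label_column


# Column names in the style of a CoNLL-like dataset.
columns = ["id", "tokens", "pos_tags", "chunk_tags", "ner_tags"]

print(detect_columns(columns))         # ('tokens', 'ner_tags')
print(detect_columns(columns, "pos"))  # ('tokens', 'pos_tags')
```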

Outputs

Output              Location                           Description
Trained model       {output_dir}/                      Saved model, config, and tokenizer
Training metrics    {output_dir}/train_results.json    Loss, runtime, samples per second
Evaluation metrics  {output_dir}/eval_results.json     Precision, recall, F1, accuracy (overall or per-entity)
Test metrics        {output_dir}/test_results.json     Test set seqeval metrics
Test predictions    {output_dir}/test_predictions.txt  Predicted labels, space-separated per line
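
Because test_predictions.txt stores one space-separated label sequence per line, it can be loaded back with a few lines of Python. The sketch below assumes one line per test sentence; the helper names and sample labels are hypothetical.

```python
from pathlib import Path
from typing import List


def parse_predictions(text: str) -> List[List[str]]:
    """Split prediction text into one list of labels per line."""
    return [line.split() for line in text.splitlines()]


def read_predictions(path: str) -> List[List[str]]:
    """Load a predictions file such as {output_dir}/test_predictions.txt."""
    return parse_predictions(Path(path).read_text(encoding="utf-8"))


# Hypothetical file content for two test sentences.
sample = "B-PER I-PER O\nO B-LOC\n"
print(parse_predictions(sample))
# [['B-PER', 'I-PER', 'O'], ['O', 'B-LOC']]
```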

Usage Examples

Fine-tune on CoNLL-2003 NER

python examples/NLU/examples/token-classification/run_ner.py \
    --model_name_or_path bert-base-cased \
    --dataset_name conll2003 \
    --do_train \
    --do_eval \
    --do_predict \
    --per_device_train_batch_size 16 \
    --learning_rate 2e-5 \
    --num_train_epochs 3 \
    --output_dir /tmp/ner_output

Fine-tune with entity-level metrics

python examples/NLU/examples/token-classification/run_ner.py \
    --model_name_or_path roberta-base \
    --dataset_name conll2003 \
    --do_train \
    --do_eval \
    --return_entity_level_metrics \
    --per_device_train_batch_size 32 \
    --learning_rate 5e-5 \
    --num_train_epochs 5 \
    --output_dir /tmp/ner_entity_metrics
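
With --return_entity_level_metrics, the per-entity results arrive as nested dictionaries and are unpacked into a flat set of metrics. A minimal sketch of that unpacking is shown below; the function name flatten_metrics, the "{entity}_{metric}" key naming, and all numeric values are assumptions made for illustration.

```python
from typing import Any, Dict


def flatten_metrics(results: Dict[str, Any]) -> Dict[str, Any]:
    """Unpack nested per-entity dictionaries into flat metric names."""
    flat = {}
    for key, value in results.items():
        if isinstance(value, dict):
            # Per-entity entry, e.g. "PER": {"precision": ..., "recall": ...}
            for name, score in value.items():
                flat[f"{key}_{name}"] = score
        else:
            # Overall scores are already flat.
            flat[key] = value
    return flat


# Hypothetical results with made-up numbers.
results = {
    "PER": {"precision": 0.9, "recall": 0.8, "f1": 0.85, "number": 10},
    "overall_f1": 0.85,
}
print(flatten_metrics(results))
# {'PER_precision': 0.9, 'PER_recall': 0.8, 'PER_f1': 0.85, 'PER_number': 10, 'overall_f1': 0.85}
```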

POS tagging with custom data

python examples/NLU/examples/token-classification/run_ner.py \
    --model_name_or_path bert-base-uncased \
    --task_name pos \
    --train_file /path/to/train.json \
    --validation_file /path/to/val.json \
    --test_file /path/to/test.json \
    --do_train \
    --do_eval \
    --do_predict \
    --label_all_tokens \
    --output_dir /tmp/pos_output
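
The custom files passed above need a tokens column and, with --task_name pos, a pos_tags column. The sketch below builds such a file, assuming a JSON Lines layout (one JSON object per line); that layout, the helper names, and the example sentences and tags are assumptions, not details confirmed by this page.

```python
import json
from typing import Dict, List

# Hypothetical word-level examples: one tag per token.
examples = [
    {"tokens": ["The", "cat", "sleeps"], "pos_tags": ["DET", "NOUN", "VERB"]},
    {"tokens": ["Dogs", "bark"], "pos_tags": ["NOUN", "VERB"]},
]


def to_jsonl(records: List[Dict[str, List[str]]]) -> str:
    """Serialize records as JSON Lines after checking token/tag lengths."""
    for record in records:
        if len(record["tokens"]) != len(record["pos_tags"]):
            raise ValueError("each token needs exactly one tag")
    return "".join(json.dumps(record) + "\n" for record in records)


def write_jsonl(records: List[Dict[str, List[str]]], path: str) -> None:
    """Write the serialized records to a file such as train.json."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(to_jsonl(records))


print(to_jsonl(examples), end="")
```

Checking that every sentence has as many tags as tokens before training catches the most common formatting mistake in hand-built token-level data.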
