Implementation:Microsoft LoRA Run NER
Overview
run_ner.py is a token classification fine-tuning script for Named Entity Recognition (NER) and similar tasks. It is built on AutoModelForTokenClassification, DataCollatorForTokenClassification, and the seqeval metric.
Description
This script fine-tunes transformer models on token-level classification tasks such as NER, part-of-speech tagging, and chunking. It handles subword tokenization alignment, that is, mapping word-level labels onto the subword tokens produced by the tokenizer.
Key implementation details:
- Fast tokenizer requirement: Requires PreTrainedTokenizerFast because it uses word_ids() to align subword tokens back to original words for label assignment.
- Label discovery: Supports two modes:
  - If the label column uses ClassLabel from the datasets library, extracts label names directly from features[label_column_name].feature.names.
  - Otherwise, iterates over the training data to discover unique labels, sorts them, and builds a label_to_id mapping.
- Subword label alignment: The tokenize_and_align_labels() function:
  - Uses is_split_into_words=True since inputs are pre-tokenized word lists.
  - Assigns -100 (ignored in loss) to special tokens (word_id is None).
  - For the first subword token of each word, assigns the corresponding label.
  - For subsequent subword tokens of the same word, assigns either the label (if label_all_tokens=True) or -100.
- Column detection: Auto-detects the tokens column for text and the {task_name}_tags column for labels (e.g., ner_tags, pos_tags).
- Data collation: Uses DataCollatorForTokenClassification for dynamic padding of variable-length token sequences.
- Metrics: Uses the seqeval metric for entity-level evaluation. Supports two reporting modes:
  - Default: Overall precision, recall, F1, and accuracy.
  - Entity-level (--return_entity_level_metrics): Per-entity-type metrics unpacked from nested dictionaries.
- Three-phase pipeline: Supports do_train, do_eval, and do_predict. Test predictions are saved as space-separated label sequences in test_predictions.txt.
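The subword label alignment rules can be sketched in plain Python. This is a minimal illustration rather than the script's own code: the function name align_labels, the IGNORE_INDEX constant, and the hand-written word_ids list are assumptions that stand in for what a fast tokenizer's word_ids() would return.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # label value that the loss function skips


def align_labels(
    word_ids: List[Optional[int]],
    word_labels: List[int],
    label_all_tokens: bool = False,
) -> List[int]:
    """Map word-level labels onto subword tokens.

    word_ids gives, for each token, the index of the word it came from,
    or None for special tokens (hypothetical input mimicking word_ids()).
    """
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            # Special token: never contributes to the loss
            aligned.append(IGNORE_INDEX)
        elif word_id != previous_word_id:
            # First subword of a word: take the word's label
            aligned.append(word_labels[word_id])
        else:
            # Later subword of the same word: label or ignore
            aligned.append(word_labels[word_id] if label_all_tokens else IGNORE_INDEX)
        previous_word_id = word_id
    return aligned


# Hypothetical sentence: three words, the second split into two subwords
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [3, 0, 7]
first_only = align_labels(word_ids, word_labels)
all_tokens = align_labels(word_ids, word_labels, label_all_tokens=True)
```

With label_all_tokens left at its default, only the first subword of each word carries a label, so the loss counts each word once regardless of how many pieces the tokenizer splits it into.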
Usage
Use this script when you need to:
- Fine-tune models on NER, POS tagging, or chunking tasks
- Handle subword-to-word label alignment for token classification
- Train on standard datasets (CoNLL-2003, etc.) or custom CSV/JSON token-level data
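For custom files, one plausible layout is JSON Lines with a tokens column and a {task_name}_tags column holding string labels. The exact file layout is an assumption here, since this page does not specify it. With string labels, the label-discovery path applies (collect unique labels, sort, build label_to_id), which the sketch below mimics with a hypothetical get_label_list helper and made-up records.

```python
import json
import os
import tempfile

# Hypothetical records: one sentence per line, words already split
records = [
    {"tokens": ["Alice", "visited", "Paris"], "ner_tags": ["B-PER", "O", "B-LOC"]},
    {"tokens": ["Bob", "stayed", "home"], "ner_tags": ["B-PER", "O", "O"]},
]

# Write the records as JSON Lines (assumed layout, not confirmed by the source)
path = os.path.join(tempfile.mkdtemp(), "train.json")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")


def get_label_list(label_sequences):
    """Collect the unique labels across all sequences and sort them."""
    unique = set()
    for sequence in label_sequences:
        unique.update(sequence)
    return sorted(unique)


with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

label_list = get_label_list(row["ner_tags"] for row in loaded)
label_to_id = {label: i for i, label in enumerate(label_list)}
```

Sorting the labels keeps the label-to-id assignment stable across runs on the same data.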
Code Reference
Source Location
| Property | Value |
|---|---|
| File | examples/NLU/examples/token-classification/run_ner.py |
| Lines | 501 |
| Module | run_ner |
| Entry Point | main() |
Signature/CLI
python run_ner.py \
--model_name_or_path MODEL_NAME \
--dataset_name DATASET_NAME \
--output_dir OUTPUT_DIR \
--do_train \
--do_eval \
[--do_predict] \
[--dataset_config_name CONFIG] \
[--task_name ner] \
[--train_file TRAIN_FILE] \
[--validation_file VALIDATION_FILE] \
[--test_file TEST_FILE] \
[--pad_to_max_length] \
[--label_all_tokens] \
[--return_entity_level_metrics] \
[--max_train_samples N] \
[--max_val_samples N] \
[--max_test_samples N]
Import
from transformers import (
AutoConfig,
AutoModelForTokenClassification,
AutoTokenizer,
DataCollatorForTokenClassification,
HfArgumentParser,
PreTrainedTokenizerFast,
Trainer,
TrainingArguments,
set_seed,
)
from datasets import ClassLabel, load_dataset, load_metric
I/O Contract
Inputs
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| --model_name_or_path | str | Yes | - | Pretrained model name or path |
| --output_dir | str | Yes | - | Directory for checkpoints and results |
| --dataset_name | str | No | None | HuggingFace dataset name (e.g., conll2003) |
| --task_name | str | No | ner | Task name used for label column detection ({task}_tags) |
| --train_file | str | No | None | Custom CSV/JSON training file |
| --validation_file | str | No | None | Custom CSV/JSON validation file |
| --test_file | str | No | None | Custom CSV/JSON test file |
| --pad_to_max_length | flag | No | False | Pad to model max length (required for TPU) |
| --label_all_tokens | flag | No | False | Label all subword tokens (not just first) |
| --return_entity_level_metrics | flag | No | False | Report per-entity-type metrics |
| --max_train_samples | int | No | None | Truncate training set for debugging |
| --max_val_samples | int | No | None | Truncate validation set for debugging |
| --max_test_samples | int | No | None | Truncate test set for debugging |
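The way --task_name drives label column detection can be illustrated with a short sketch. The helper name detect_columns is hypothetical, and the fallback to the first and second columns when the expected names are missing is an assumption of this sketch rather than something this page states.

```python
def detect_columns(column_names, task_name="ner"):
    """Pick the text and label columns from a dataset's column names."""
    # Prefer a column literally named "tokens" for the input words
    text_column = "tokens" if "tokens" in column_names else column_names[0]
    # Prefer "{task_name}_tags" for the labels, e.g. ner_tags or pos_tags
    expected_label = f"{task_name}_tags"
    label_column = expected_label if expected_label in column_names else column_names[1]
    return text_column, label_column


# CoNLL-style column names with the task switched to POS tagging
conll_style = detect_columns(["id", "tokens", "pos_tags", "ner_tags"], task_name="pos")
# Custom file with other names: falls back to positional columns (assumed)
custom_style = detect_columns(["words", "labels"])
```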
Outputs
| Output | Location | Description |
|---|---|---|
| Trained model | {output_dir}/ | Saved model, config, and tokenizer |
| Training metrics | {output_dir}/train_results.json | Loss, runtime, samples per second |
| Evaluation metrics | {output_dir}/eval_results.json | Precision, recall, F1, accuracy (overall or per-entity) |
| Test metrics | {output_dir}/test_results.json | Test set seqeval metrics |
| Test predictions | {output_dir}/test_predictions.txt | Predicted labels, space-separated per line |
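Because test_predictions.txt stores one sentence per line with labels separated by spaces, it can be read back with a few lines of Python. The sketch below assumes that layout and uses a hypothetical read_predictions helper; the file contents are made up for illustration.

```python
import os
import tempfile


def read_predictions(path):
    """Return one list of label strings per line of the predictions file."""
    with open(path, encoding="utf-8") as f:
        return [line.split() for line in f]


# Write a small stand-in file (made-up labels) and read it back
demo_path = os.path.join(tempfile.mkdtemp(), "test_predictions.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("B-PER O B-LOC\n")
    f.write("O O\n")

predictions = read_predictions(demo_path)
```

Each inner list can then be zipped with the corresponding test sentence's words, provided the sentences are kept in the same order as the test set.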
Usage Examples
Fine-tune on CoNLL-2003 NER
python examples/NLU/examples/token-classification/run_ner.py \
--model_name_or_path bert-base-cased \
--dataset_name conll2003 \
--do_train \
--do_eval \
--do_predict \
--per_device_train_batch_size 16 \
--learning_rate 2e-5 \
--num_train_epochs 3 \
--output_dir /tmp/ner_output
Fine-tune with entity-level metrics
python examples/NLU/examples/token-classification/run_ner.py \
--model_name_or_path roberta-base \
--dataset_name conll2003 \
--do_train \
--do_eval \
--return_entity_level_metrics \
--per_device_train_batch_size 32 \
--learning_rate 5e-5 \
--num_train_epochs 5 \
--output_dir /tmp/ner_entity_metrics
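With --return_entity_level_metrics, per-entity results arrive as nested dictionaries and are unpacked into flat keys. A minimal sketch of that unpacking follows; the helper name flatten_metrics, the key naming scheme, and the sample scores are illustrative assumptions, not output from a real run.

```python
def flatten_metrics(results):
    """Unpack nested per-entity dictionaries into flat '{entity}_{metric}' keys."""
    flat = {}
    for key, value in results.items():
        if isinstance(value, dict):
            # Per-entity block, e.g. {"precision": ..., "recall": ...}
            for name, score in value.items():
                flat[f"{key}_{name}"] = score
        else:
            # Overall metric, already a scalar
            flat[key] = value
    return flat


# Made-up scores in the nested shape described above
sample_results = {
    "PER": {"precision": 0.5, "recall": 0.5, "f1": 0.5, "number": 4},
    "overall_f1": 0.5,
}
flat_results = flatten_metrics(sample_results)
```

Flat keys are convenient because logging and results files typically expect one scalar per metric name.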
POS tagging with custom data
python examples/NLU/examples/token-classification/run_ner.py \
--model_name_or_path bert-base-uncased \
--task_name pos \
--train_file /path/to/train.json \
--validation_file /path/to/val.json \
--test_file /path/to/test.json \
--do_train \
--do_eval \
--do_predict \
--label_all_tokens \
--output_dir /tmp/pos_output
Related Pages
- Environment:Microsoft_LoRA_NLU_Conda_Environment
- Implementation:Microsoft_LoRA_Run_GLUE_No_Trainer - Sentence-level classification counterpart
- Implementation:Microsoft_LoRA_Run_XNLI - Multilingual sequence classification