Implementation:Microsoft LoRA Run MLM
| Knowledge Sources | |
|---|---|
| Domains | NLP, Language_Modeling |
| Last Updated | 2026-02-10 06:00 GMT |
Overview
HuggingFace Transformers example script for fine-tuning masked language models (BERT, RoBERTa, ALBERT) on custom text datasets.
Description
run_mlm.py fine-tunes masked language models using the HuggingFace Trainer API. It supports any model from the HuggingFace Model Hub that has a masked LM head (e.g., BERT, RoBERTa, ALBERT), loaded via AutoModelForMaskedLM. The masked language modeling objective randomly masks a configurable percentage of input tokens (default 15%) and trains the model to predict the original tokens. Token masking is handled at batch time by DataCollatorForLanguageModeling. The script supports two distinct data processing modes: line-by-line (each line is a separate sequence) and concatenation (texts are joined and split into fixed-length chunks). This script is part of the modified Transformers fork used by Microsoft LoRA for NLU experiments.
Usage
Use this script to fine-tune a masked language model on a custom text corpus. It accepts local files (CSV, JSON, TXT) as well as datasets from the HuggingFace Hub. If the dataset has no validation split, the script creates one from the training data, sized by validation_split_percentage (default 5%). The line_by_line flag controls whether each line is treated as a separate sequence or whether all lines are concatenated and chunked. The script also supports checkpoint resumption, distributed training, mixed precision (FP16), and TPU execution, and it runs against the LoRA-modified Transformers fork.
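The automatic validation split can be illustrated with a small helper. This is a minimal sketch, assuming the percent-slice syntax that the `datasets` library accepts for its `split` argument; the helper name `validation_split_specs` is hypothetical and does not appear in run_mlm.py.

```python
def validation_split_specs(validation_split_percentage=5):
    """Return (train_spec, validation_spec) slice strings for a train-only dataset.

    Hypothetical helper: the first N percent of the training split is reserved
    for validation and the remainder is kept for training.
    """
    if not 0 < validation_split_percentage < 100:
        raise ValueError("validation_split_percentage must be between 1 and 99")
    pct = validation_split_percentage
    train_spec = f"train[{pct}%:]"       # everything after the first pct percent
    validation_spec = f"train[:{pct}%]"  # the first pct percent
    return train_spec, validation_spec
```

Each returned string could then be passed as the `split` argument of `datasets.load_dataset`, one call per split.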
Code Reference
Source Location
- Repository: Microsoft_LoRA
- File: examples/NLU/examples/language-modeling/run_mlm.py
- Lines: 1-479
Signature
# Script entry point via HfArgumentParser
# Key dataclasses (help metadata omitted for brevity):
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ModelArguments:
    model_name_or_path: Optional[str] = field(default=None)
    model_type: Optional[str] = field(default=None)
    config_name: Optional[str] = field(default=None)
    tokenizer_name: Optional[str] = field(default=None)
    cache_dir: Optional[str] = field(default=None)
    use_fast_tokenizer: bool = field(default=True)
    model_revision: str = field(default="main")
    use_auth_token: bool = field(default=False)

@dataclass
class DataTrainingArguments:
    dataset_name: Optional[str] = field(default=None)
    dataset_config_name: Optional[str] = field(default=None)
    train_file: Optional[str] = field(default=None)
    validation_file: Optional[str] = field(default=None)
    overwrite_cache: bool = field(default=False)
    validation_split_percentage: Optional[int] = field(default=5)
    max_seq_length: Optional[int] = field(default=None)
    preprocessing_num_workers: Optional[int] = field(default=None)
    mlm_probability: float = field(default=0.15)
    line_by_line: bool = field(default=False)
    pad_to_max_length: bool = field(default=False)
    max_train_samples: Optional[int] = field(default=None)
    max_val_samples: Optional[int] = field(default=None)
Import
# Script is run directly, not imported
python examples/NLU/examples/language-modeling/run_mlm.py \
    --model_name_or_path bert-base-uncased \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm
Key Components
Model Loading
The script uses AutoModelForMaskedLM to load any compatible masked language model. It either loads a pretrained checkpoint given by model_name_or_path or trains from scratch by instantiating the config class registered for model_type in CONFIG_MAPPING; the accepted model_type values are derived from MODEL_FOR_MASKED_LM_MAPPING. A model_name_or_path containing .ckpt is treated as a TensorFlow checkpoint and converted on load. After loading, the token embedding layer is resized to match the tokenizer vocabulary with model.resize_token_embeddings(len(tokenizer)).
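The loading logic described above can be sketched as follows. This is an illustrative outline rather than the script's exact code: the function names `looks_like_tf_checkpoint` and `load_masked_lm` are hypothetical, and the sketch assumes a Transformers version that exports `CONFIG_MAPPING`, `AutoConfig`, and `AutoModelForMaskedLM`.

```python
def looks_like_tf_checkpoint(model_name_or_path):
    """Heuristic: treat any path containing '.ckpt' as a TensorFlow checkpoint."""
    return ".ckpt" in model_name_or_path


def load_masked_lm(tokenizer, model_name_or_path=None, model_type=None):
    """Load a pretrained masked LM, or build a fresh one from its config class."""
    if not model_name_or_path and not model_type:
        raise ValueError("Provide model_name_or_path or model_type")

    # Imported lazily so the helper above works without transformers installed.
    from transformers import CONFIG_MAPPING, AutoConfig, AutoModelForMaskedLM

    if model_name_or_path:
        config = AutoConfig.from_pretrained(model_name_or_path)
        model = AutoModelForMaskedLM.from_pretrained(
            model_name_or_path,
            from_tf=looks_like_tf_checkpoint(model_name_or_path),
            config=config,
        )
    else:
        # Training from scratch: default config for the architecture, random weights.
        config = CONFIG_MAPPING[model_type]()
        model = AutoModelForMaskedLM.from_config(config)

    # Keep the embedding matrix in sync with the tokenizer vocabulary.
    model.resize_token_embeddings(len(tokenizer))
    return model
```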
Data Processing Pipeline
The script supports two processing modes:
Line-by-line mode (--line_by_line):
- Each non-empty line is tokenized as an individual sequence
- Sequences are truncated to max_seq_length and optionally padded to max length
- The return_special_tokens_mask=True flag is used so DataCollatorForLanguageModeling can avoid masking special tokens
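A minimal sketch of the line-by-line steps listed above, assuming `tokenizer` is any callable with the keyword interface of a HuggingFace tokenizer. The function names are hypothetical and the code is not copied from run_mlm.py.

```python
def drop_blank_lines(lines):
    """Keep only lines that contain at least one non-whitespace character."""
    return [line for line in lines if len(line) > 0 and not line.isspace()]


def tokenize_line_by_line(examples, tokenizer, max_seq_length, pad_to_max_length=False):
    """Tokenize each non-empty line as its own sequence."""
    texts = drop_blank_lines(examples["text"])
    return tokenizer(
        texts,
        padding="max_length" if pad_to_max_length else False,
        truncation=True,
        max_length=max_seq_length,
        # Lets the data collator skip special tokens when choosing positions to mask.
        return_special_tokens_mask=True,
    )
```

A function of this shape would typically be applied with `datasets.map(..., batched=True)`, so that `examples["text"]` is a list of lines.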
Concatenation mode (default):
- All texts are tokenized with return_special_tokens_mask=True
- Tokenized sequences are concatenated and split into fixed-length chunks of max_seq_length tokens via group_texts; if max_seq_length is not set, it defaults to the tokenizer's model_max_length, capped at 1024
- Remainders smaller than max_seq_length are dropped
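The concatenate-and-chunk behaviour can be sketched in plain Python. This is a simplified illustration of what a `group_texts` function does on a batch of tokenized examples; `resolve_max_seq_length` is a hypothetical helper name that captures the default and the 1024 cap.

```python
from itertools import chain


def resolve_max_seq_length(requested, tokenizer_model_max_length, cap=1024):
    """Pick the chunk size: explicit value if given, otherwise the capped tokenizer limit."""
    if requested is None:
        return min(tokenizer_model_max_length, cap)
    return min(requested, tokenizer_model_max_length)


def group_texts(examples, max_seq_length):
    """Concatenate every column, then cut it into chunks of max_seq_length."""
    concatenated = {key: list(chain.from_iterable(values)) for key, values in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Drop the trailing remainder that does not fill a whole chunk.
    total_length = (total_length // max_seq_length) * max_seq_length
    return {
        key: [tokens[i:i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for key, tokens in concatenated.items()
    }
```

For example, three tokenized lines with 3, 2, and 4 tokens and a chunk size of 4 yield two chunks of 4 tokens; the ninth token is discarded.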
Data Collator
DataCollatorForLanguageModeling handles dynamic masking at batch time. It randomly selects mlm_probability (default 15%) of the non-special input tokens and applies the standard BERT masking strategy to the selected positions: 80% are replaced with the tokenizer's mask token (e.g., [MASK]), 10% are replaced with a random token, and 10% are left unchanged. The special_tokens_mask ensures that special tokens (CLS, SEP, PAD) are never selected for masking.
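The 80/10/10 strategy can be illustrated on a single sequence in pure Python. This is a sketch of the technique, not the collator's tensor-based implementation; the function name `mask_tokens`, its parameters, and the use of Python's `random` module are assumptions made for illustration. The label value -100 is the index that PyTorch's cross-entropy loss ignores by default.

```python
import random

IGNORE_INDEX = -100  # positions with this label do not contribute to the loss


def mask_tokens(input_ids, special_tokens_mask, mask_token_id, vocab_size,
                mlm_probability=0.15, rng=None):
    """Return (masked_input_ids, labels) for one sequence using BERT-style masking."""
    if rng is None:
        rng = random.Random()
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, (token, is_special) in enumerate(zip(input_ids, special_tokens_mask)):
        # Special tokens are never selected; others are selected with mlm_probability.
        if is_special or rng.random() >= mlm_probability:
            continue
        labels[i] = token  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token_id              # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels
```

Because the selection is redrawn on every call, the same sentence receives a different mask pattern in each epoch, which is what "dynamic masking" refers to.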
Training Loop
The Trainer class handles training with the DataCollatorForLanguageModeling data collator. The script supports checkpoint resumption via get_last_checkpoint. After training completes, the model and tokenizer are saved, and train metrics are logged and persisted.
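Checkpoint discovery can be sketched without the library. The function below is a hypothetical stand-in for `get_last_checkpoint`, assuming checkpoints are stored in subdirectories named `checkpoint-<step>` inside the output directory.

```python
import os
import re

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir):
    """Return the checkpoint directory with the highest step, or None if there is none."""
    candidates = []
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            # Compare steps numerically so checkpoint-1000 ranks above checkpoint-500.
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    return os.path.join(output_dir, max(candidates)[1])
```

The returned path would then be handed to the training call so that training resumes from that checkpoint.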
Evaluation
During evaluation, the script computes loss on the validation set and derives perplexity as math.exp(eval_loss). Both metrics are logged and saved to disk.
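As a small worked example of the perplexity computation: the evaluation loss is a mean cross-entropy in nats, so exponentiating it gives perplexity. The helper name `add_perplexity` is hypothetical.

```python
import math


def add_perplexity(metrics):
    """Return a copy of the eval metrics with perplexity = exp(eval_loss) added."""
    result = dict(metrics)
    result["perplexity"] = math.exp(result["eval_loss"])
    return result
```

A loss of 0 corresponds to a perplexity of 1 (a perfect model), and a loss of ln(8) corresponds to a perplexity of 8, meaning the model is on average as uncertain as a uniform choice among 8 tokens.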
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_name_or_path | str | No* | Pretrained model name or path (required unless model_type is set for training from scratch) |
| model_type | str | No* | Model type for training from scratch (e.g., bert, roberta, albert) |
| dataset_name | str | No** | HuggingFace dataset name (alternative to train_file) |
| dataset_config_name | str | No | Configuration name for the HuggingFace dataset |
| train_file | str | No** | Path to training text file (CSV, JSON, or TXT; alternative to dataset_name) |
| validation_file | str | No | Path to validation text file |
| max_seq_length | int | No | Maximum sequence length after tokenization (defaults to the tokenizer's model_max_length, capped at 1024) |
| mlm_probability | float | No | Ratio of tokens to mask (default: 0.15) |
| line_by_line | bool | No | Treat each line as a separate sequence (default: False) |
| pad_to_max_length | bool | No | Pad all samples to max_seq_length (default: False) |
| output_dir | str | Yes | Directory to save model checkpoints and metrics |
| max_train_samples | int | No | Truncate training examples to this count (for debugging) |
| max_val_samples | int | No | Truncate validation examples to this count (for debugging) |
| validation_split_percentage | int | No | Percentage of train set used as validation if no validation split exists (default: 5) |
| preprocessing_num_workers | int | No | Number of processes for data preprocessing |
| overwrite_cache | bool | No | Whether to overwrite cached preprocessed datasets (default: False) |
* Either model_name_or_path or model_type must be provided.
** Either dataset_name or train_file/validation_file must be provided.
Outputs
| Name | Type | Description |
|---|---|---|
| model checkpoints | Files | Saved to output_dir; includes model weights and tokenizer files |
| training metrics | Dict/JSON | Loss, train_samples logged and saved as train_results.json |
| eval metrics | Dict/JSON | eval_loss, eval_samples, perplexity saved as eval_results.json |
| trainer state | JSON | Trainer state (such as the global step and log history) saved as trainer_state.json; optimizer and scheduler states are stored with each checkpoint for resumption |
Usage Examples
Fine-tune BERT on WikiText
python examples/NLU/examples/language-modeling/run_mlm.py \
--model_name_or_path bert-base-uncased \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--do_train \
--do_eval \
--output_dir ./output/mlm-bert \
--per_device_train_batch_size 8 \
--num_train_epochs 3
Fine-tune RoBERTa with Line-by-Line Processing
python examples/NLU/examples/language-modeling/run_mlm.py \
--model_name_or_path roberta-base \
--train_file ./data/train.txt \
--validation_file ./data/valid.txt \
--do_train \
--do_eval \
--line_by_line \
--max_seq_length 128 \
--mlm_probability 0.15 \
--output_dir ./output/mlm-roberta \
--overwrite_output_dir
Load Arguments from JSON
python examples/NLU/examples/language-modeling/run_mlm.py config.json
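A config file for this mode can be generated from Python. The sketch below assumes the script switches to JSON parsing when it receives exactly one argument ending in `.json`; the helper names are hypothetical, and the keys shown are the same argument names used in the command-line examples above.

```python
import json


def uses_json_config(argv):
    """True when the only argument after the script name is a .json file."""
    return len(argv) == 2 and argv[1].endswith(".json")


# Keys mirror the command-line flags without the leading dashes.
example_config = {
    "model_name_or_path": "bert-base-uncased",
    "dataset_name": "wikitext",
    "dataset_config_name": "wikitext-2-raw-v1",
    "do_train": True,
    "do_eval": True,
    "output_dir": "./output/mlm-bert",
}


def write_config(path, config):
    """Write the argument dictionary as an indented JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)
```

Calling `write_config("config.json", example_config)` produces a file that can be passed to the script as its single argument.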
Internal Details
Minimum Version Check
The script enforces a minimum Transformers version of 4.4.0 by calling check_min_version("4.4.0") at module level, so the check runs before any arguments are parsed.
Supported Model Types
The list of supported model types is dynamically derived from MODEL_FOR_MASKED_LM_MAPPING, which includes architectures like BERT, RoBERTa, ALBERT, DistilBERT, ELECTRA, and others registered in the Transformers model mapping.
Difference from run_clm.py
Unlike run_clm.py, which uses default_data_collator and sets labels equal to input_ids, run_mlm.py uses DataCollatorForLanguageModeling to mask tokens dynamically at each training step. This implements the masked language modeling (MLM) objective, in which the model predicts randomly masked tokens rather than the next token.
TPU Support
The _mp_fn(index) function provides an entry point for xla_spawn to enable TPU execution.