Implementation:Microsoft LoRA Legacy Run Language Modeling
Overview
General-purpose language model fine-tuning script supporting causal (CLM), masked (MLM), and permutation (PLM) language modeling across GPT, GPT-2, CTRL, BERT, RoBERTa, and XLNet architectures.
Description
run_language_modeling.py is a legacy HuggingFace Transformers example script included in the Microsoft LoRA NLU example directory. It uses the HuggingFace Trainer API with HfArgumentParser for structured argument parsing via three dataclasses: ModelArguments, DataTrainingArguments, and the built-in TrainingArguments. The script automatically selects the appropriate data collator based on the model type and configuration flags:
- XLNet: DataCollatorForPermutationLanguageModeling with configurable plm_probability and max_span_length
- MLM with whole word masking: DataCollatorForWholeWordMask
- Standard MLM: DataCollatorForLanguageModeling with configurable mlm_probability
- CLM (default): DataCollatorForLanguageModeling with mlm=False
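The selection rules in the list above can be sketched as a small decision function. This is an illustrative sketch, not the script's code: the helper name select_collator is hypothetical, it returns class names as strings so that it runs without Transformers installed, and the max_span_length default of 5 is an assumption taken from the XLNet usage example.

```python
from typing import Any, Dict, Tuple


def select_collator(
    model_type: str,
    mlm: bool = False,
    whole_word_mask: bool = False,
    mlm_probability: float = 0.15,
    plm_probability: float = 1 / 6,
    max_span_length: int = 5,  # assumed default, taken from the XLNet usage example
) -> Tuple[str, Dict[str, Any]]:
    """Return the collator class name and the keyword arguments it would receive."""
    if model_type == "xlnet":
        return (
            "DataCollatorForPermutationLanguageModeling",
            {"plm_probability": plm_probability, "max_span_length": max_span_length},
        )
    if mlm and whole_word_mask:
        return "DataCollatorForWholeWordMask", {"mlm_probability": mlm_probability}
    # Standard MLM when mlm is True, causal LM when mlm is False.
    return "DataCollatorForLanguageModeling", {"mlm": mlm, "mlm_probability": mlm_probability}


# Print the choice for one representative of each case in the list.
for model_type, flags in [
    ("xlnet", {}),
    ("bert", {"mlm": True, "whole_word_mask": True}),
    ("roberta", {"mlm": True}),
    ("gpt2", {}),
]:
    print(model_type, select_collator(model_type, **flags))
```

Checking the XLNet case first mirrors the order of the list: the permutation collator depends only on the model type, while the other three cases depend on the masking flags.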
Dataset loading supports both single-file and multi-file (glob pattern) training data via TextDataset or LineByLineTextDataset, with Chinese whole word mask support through LineByLineWithRefDataset. The script supports both fine-tuning existing models and training from scratch, and computes perplexity on the evaluation set.
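A minimal sketch of how the dataset class might be chosen for each input file, assuming the behavior described above. The helper names dataset_class_for and plan_train_datasets are hypothetical, class names are returned as strings so the sketch runs without Transformers, and reference files are ignored for glob input as a simplifying assumption.

```python
import os
import tempfile
from glob import glob
from typing import List, Optional, Tuple


def dataset_class_for(line_by_line: bool, ref_path: Optional[str] = None) -> str:
    """Name the dataset class that would be built for one input file."""
    if line_by_line:
        if ref_path is not None:
            return "LineByLineWithRefDataset"
        return "LineByLineTextDataset"
    return "TextDataset"


def plan_train_datasets(
    train_data_file: Optional[str],
    train_data_files: Optional[str],
    line_by_line: bool = False,
    train_ref_file: Optional[str] = None,
) -> List[Tuple[str, Optional[str]]]:
    """List (dataset class name, file path) pairs for the training data."""
    if train_data_files:
        # One dataset per matched file; the script would wrap these in a ConcatDataset.
        # Sorting is added here only to make the order deterministic.
        return [(dataset_class_for(line_by_line), path) for path in sorted(glob(train_data_files))]
    return [(dataset_class_for(line_by_line, train_ref_file), train_data_file)]


# Demo with two throwaway shard files.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(2):
        with open(os.path.join(tmp, f"train_{i}.txt"), "w") as fh:
            fh.write("example text\n")
    plan = plan_train_datasets(None, os.path.join(tmp, "train_*.txt"))
    print([(name, os.path.basename(path)) for name, path in plan])
```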
This script is part of the HuggingFace Transformers library (legacy examples) bundled in the Microsoft LoRA repository.
⚠️ DEPRECATED: This file resides in the legacy/ directory and is not actively maintained. Prefer modern equivalents where available.
Usage
Use this script to fine-tune a language model, or to train one from scratch, with a causal, masked, or permutation language modeling objective. It is suitable for domain adaptation of pretrained models on custom text corpora.
Code Reference
Source Location
| Property | Value |
|---|---|
| File path | examples/NLU/examples/legacy/run_language_modeling.py |
| Lines | 364 |
| Module | run_language_modeling |
Key Classes and Functions
| Name | Type | Signature / Description |
|---|---|---|
| ModelArguments | dataclass | Fields: model_name_or_path, model_type, config_name, tokenizer_name, cache_dir |
| DataTrainingArguments | dataclass | Fields: train_data_file, train_data_files, eval_data_file, train_ref_file, eval_ref_file, line_by_line, mlm, whole_word_mask, mlm_probability, plm_probability, max_span_length, block_size, overwrite_cache |
| get_dataset | function | get_dataset(args, tokenizer, evaluate=False, cache_dir=None). Returns TextDataset, LineByLineTextDataset, LineByLineWithRefDataset, or ConcatDataset |
| main | function | Entry point: parses args, builds model/tokenizer/data collator, runs Trainer for training and evaluation |
| _mp_fn | function | TPU spawn entry point |
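To illustrate the idea behind dataclass-driven argument parsing, the sketch below builds one command-line option per dataclass field using only the standard library. It is a simplified stand-in for HfArgumentParser, not its implementation: the helper parse_into_dataclass is hypothetical, and the dataclass contains only a subset of the documented DataTrainingArguments fields with the defaults listed on this page.

```python
import argparse
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DataTrainingArguments:
    """Subset of the documented fields, with the documented defaults."""

    train_data_file: Optional[str] = None
    eval_data_file: Optional[str] = None
    line_by_line: bool = False
    mlm: bool = False
    mlm_probability: float = 0.15
    plm_probability: float = 1 / 6
    block_size: int = -1


def parse_into_dataclass(cls, argv):
    """Build one argparse option per dataclass field, then instantiate cls."""
    parser = argparse.ArgumentParser()
    for f in fields(cls):
        if isinstance(f.default, bool):
            # Boolean fields become on/off flags such as --mlm.
            parser.add_argument(f"--{f.name}", action="store_true")
        else:
            # Infer the value type from the default; None defaults are read as strings.
            value_type = str if f.default is None else type(f.default)
            parser.add_argument(f"--{f.name}", type=value_type, default=f.default)
    return cls(**vars(parser.parse_args(argv)))


args = parse_into_dataclass(
    DataTrainingArguments,
    ["--train_data_file", "train.txt", "--mlm", "--mlm_probability", "0.2"],
)
print(args)
```

Grouping options into dataclasses keeps model, data, and training settings separate while still exposing every field as a flag.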
CLI Usage
    python run_language_modeling.py \
        --model_name_or_path gpt2 \
        --train_data_file /path/to/train.txt \
        --eval_data_file /path/to/eval.txt \
        --output_dir /path/to/output \
        --do_train \
        --do_eval \
        --per_device_train_batch_size 4
I/O Contract
Inputs
| Input | Type | Description |
|---|---|---|
| --model_name_or_path | Optional[str] | Pretrained model checkpoint, or None to train from scratch |
| --model_type | Optional[str] | Model type for training from scratch (e.g., gpt2, bert, xlnet) |
| --train_data_file | Optional[str] | Single text file for training |
| --train_data_files | Optional[str] | Glob pattern for multiple training files |
| --eval_data_file | Optional[str] | Text file for evaluation (required if --do_eval) |
| --line_by_line | flag | Treat each line as a separate sequence |
| --mlm | flag | Use masked language modeling loss (required for BERT/RoBERTa) |
| --whole_word_mask | flag | Use whole word masking (requires --mlm) |
| --mlm_probability | float (default 0.15) | Ratio of tokens to mask for MLM |
| --plm_probability | float (default 1/6) | Span-to-context ratio for XLNet PLM |
| --block_size | int (default -1) | Input sequence length; defaults to model max length |
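The constraints in the table above can be sketched as two small helpers. These are illustrative only: the names resolve_block_size and check_arguments are hypothetical, the error messages are invented for the example, and capping an explicit block size at the model maximum is an assumption about how an oversized value would be handled.

```python
from typing import Optional

# Model types that the inputs table says need --mlm.
MLM_ONLY_MODEL_TYPES = {"bert", "roberta"}


def resolve_block_size(block_size: int, model_max_length: int) -> int:
    """Map the --block_size value to the sequence length actually used."""
    if block_size <= 0:
        # The default of -1 means "use the model's maximum length".
        return model_max_length
    # Assumption: an explicit size is capped at what the model supports.
    return min(block_size, model_max_length)


def check_arguments(
    model_type: str,
    mlm: bool,
    whole_word_mask: bool,
    do_eval: bool,
    eval_data_file: Optional[str],
) -> None:
    """Raise ValueError when a documented flag requirement is not met."""
    if do_eval and eval_data_file is None:
        raise ValueError("--do_eval requires --eval_data_file")
    if model_type in MLM_ONLY_MODEL_TYPES and not mlm:
        raise ValueError(f"{model_type} models require --mlm")
    if whole_word_mask and not mlm:
        raise ValueError("--whole_word_mask requires --mlm")


print(resolve_block_size(-1, 1024))
print(resolve_block_size(512, 1024))
check_arguments("gpt2", mlm=False, whole_word_mask=False, do_eval=True, eval_data_file="eval.txt")
```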
Outputs
| Output | Type | Description |
|---|---|---|
| Saved model | directory | Model, tokenizer, and config saved to output_dir |
| eval_results_lm.txt | text file | Perplexity evaluation results |
| Return value | Dict[str, float] | Dictionary with perplexity key |
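Perplexity is conventionally the exponential of the mean cross-entropy loss, so the returned dictionary can be sketched as follows. The helper names are hypothetical, the loss value of 3.0 is made up for the example, and the "key = value" line layout is an assumption about how the results file is written.

```python
import math
from typing import Dict


def perplexity_result(eval_loss: float) -> Dict[str, float]:
    """Convert a mean cross-entropy loss (in nats) into a perplexity dictionary."""
    return {"perplexity": math.exp(eval_loss)}


def format_results(result: Dict[str, float]) -> str:
    """Render one 'key = value' line per metric (the layout is an assumption)."""
    return "".join(f"{key} = {value}\n" for key, value in sorted(result.items()))


# Example with a made-up evaluation loss.
result = perplexity_result(eval_loss=3.0)
print(format_results(result), end="")
```

A loss of 0 corresponds to a perplexity of 1, the best possible value; higher losses grow the perplexity exponentially.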
Usage Examples
Fine-tuning GPT-2 with Causal LM
    python run_language_modeling.py \
        --model_name_or_path gpt2 \
        --train_data_file /data/wikitext/train.txt \
        --eval_data_file /data/wikitext/valid.txt \
        --output_dir /output/gpt2_finetuned \
        --do_train \
        --do_eval \
        --per_device_train_batch_size 4 \
        --num_train_epochs 3 \
        --block_size 512 \
        --overwrite_output_dir
Fine-tuning BERT with Masked LM
    python run_language_modeling.py \
        --model_name_or_path bert-base-uncased \
        --train_data_file /data/corpus/train.txt \
        --eval_data_file /data/corpus/valid.txt \
        --output_dir /output/bert_mlm \
        --do_train \
        --do_eval \
        --mlm \
        --mlm_probability 0.15 \
        --line_by_line \
        --per_device_train_batch_size 8 \
        --overwrite_output_dir
Fine-tuning XLNet with Permutation LM
    python run_language_modeling.py \
        --model_name_or_path xlnet-base-cased \
        --train_data_file /data/corpus/train.txt \
        --eval_data_file /data/corpus/valid.txt \
        --output_dir /output/xlnet_plm \
        --do_train \
        --do_eval \
        --per_device_train_batch_size 4 \
        --plm_probability 0.1667 \
        --max_span_length 5 \
        --overwrite_output_dir
Training from Multiple Files
    python run_language_modeling.py \
        --model_name_or_path gpt2 \
        --train_data_files "/data/shards/train_*.txt" \
        --eval_data_file /data/shards/valid.txt \
        --output_dir /output/gpt2_multi \
        --do_train \
        --do_eval \
        --overwrite_output_dir