Implementation:Microsoft LoRA Legacy Run Language Modeling

Overview

General-purpose language model fine-tuning script. It supports causal language modeling (CLM) for GPT, GPT-2, and CTRL, masked language modeling (MLM) for BERT and RoBERTa, and permutation language modeling (PLM) for XLNet.

Description

run_language_modeling.py is a legacy HuggingFace Transformers example script included in the Microsoft LoRA NLU example directory. It uses the HuggingFace Trainer API with HfArgumentParser for structured argument parsing via three dataclasses: ModelArguments, DataTrainingArguments, and the built-in TrainingArguments. The script automatically selects the appropriate data collator based on the model type and configuration flags:

XLNet: DataCollatorForPermutationLanguageModeling with configurable plm_probability and max_span_length
MLM with whole word masking: DataCollatorForWholeWordMask
Standard MLM: DataCollatorForLanguageModeling with configurable mlm_probability
CLM (default): DataCollatorForLanguageModeling with mlm=False
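
The selection order above can be sketched as a small decision function. This is a minimal, dependency-free illustration rather than the script's actual code: the name `select_collator` is hypothetical, and it returns the collator class name with its keyword arguments instead of instantiating the Transformers classes. The default values are taken from the Inputs table and the XLNet usage example on this page.

```python
from typing import Any, Dict, Tuple


def select_collator(
    model_type: str,
    mlm: bool = False,
    whole_word_mask: bool = False,
    mlm_probability: float = 0.15,
    plm_probability: float = 1 / 6,
    max_span_length: int = 5,
) -> Tuple[str, Dict[str, Any]]:
    """Return (collator class name, kwargs) following the order listed above.

    Hypothetical helper for illustration; it only names the classes.
    """
    if model_type == "xlnet":
        # XLNet is checked first and uses permutation language modeling.
        return (
            "DataCollatorForPermutationLanguageModeling",
            {"plm_probability": plm_probability, "max_span_length": max_span_length},
        )
    if mlm and whole_word_mask:
        # Whole word masking only applies together with the MLM objective.
        return "DataCollatorForWholeWordMask", {"mlm_probability": mlm_probability}
    if mlm:
        # Standard token-level masking.
        return (
            "DataCollatorForLanguageModeling",
            {"mlm": True, "mlm_probability": mlm_probability},
        )
    # Causal language modeling is the fallback when no MLM flag is set.
    return "DataCollatorForLanguageModeling", {"mlm": False}
```

For example, `select_collator("gpt2")` falls through to the CLM branch, while `select_collator("bert", mlm=True, whole_word_mask=True)` selects the whole word mask collator.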

Training data can be a single file (--train_data_file) or several files matched by a glob pattern (--train_data_files); multiple files are combined into a ConcatDataset. Each file is loaded as a TextDataset, or as a LineByLineTextDataset when --line_by_line is set, and LineByLineWithRefDataset adds Chinese whole word mask support. The script can fine-tune an existing model or train one from scratch, and it computes perplexity on the evaluation set.
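
The dataset choice can be sketched with two small helpers. Both function names are hypothetical, and the sketch assumes (as the `train_ref_file` and `eval_ref_file` fields suggest) that supplying a reference file in line-by-line mode is what selects LineByLineWithRefDataset. It returns class names and file lists only, so it runs without Transformers installed; the glob results are sorted here purely to make the output deterministic.

```python
from glob import glob
from typing import List, Optional


def dataset_class_name(line_by_line: bool, has_ref_file: bool) -> str:
    """Name the dataset class that would wrap a single text file."""
    if line_by_line:
        # Assumption: a reference file switches to the Chinese whole word mask dataset.
        return "LineByLineWithRefDataset" if has_ref_file else "LineByLineTextDataset"
    return "TextDataset"


def resolve_train_files(
    train_data_file: Optional[str],
    train_data_files: Optional[str],
) -> List[str]:
    """Expand the training inputs into a list of file paths.

    A glob pattern yields one path per matching file (each would become
    one member of a ConcatDataset); otherwise the single file is used.
    """
    if train_data_files:
        return sorted(glob(train_data_files))
    return [train_data_file] if train_data_file else []
```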

This script is part of the HuggingFace Transformers library (legacy examples) bundled in the Microsoft LoRA repository.

⚠️ DEPRECATED: This file resides in the legacy/ directory and is not actively maintained. Prefer modern equivalents where available.

Usage

Use this script to fine-tune a language model, or to train one from scratch, with a causal, masked, or permutation language modeling objective. It is suitable for domain adaptation of pretrained models on custom text corpora.

Code Reference

Source Location

Property	Value
File path	`examples/NLU/examples/legacy/run_language_modeling.py`
Lines	364
Module	`run_language_modeling`

Key Classes and Functions

Name	Type	Signature / Description
`ModelArguments`	dataclass	Fields: `model_name_or_path`, `model_type`, `config_name`, `tokenizer_name`, `cache_dir`
`DataTrainingArguments`	dataclass	Fields: `train_data_file`, `train_data_files`, `eval_data_file`, `train_ref_file`, `eval_ref_file`, `line_by_line`, `mlm`, `whole_word_mask`, `mlm_probability`, `plm_probability`, `max_span_length`, `block_size`, `overwrite_cache`
`get_dataset`	function	`get_dataset(args, tokenizer, evaluate=False, cache_dir=None)` -- returns `TextDataset`, `LineByLineTextDataset`, `LineByLineWithRefDataset`, or `ConcatDataset`
`main`	function	Entry point: parses args, builds model/tokenizer/data collator, runs Trainer for training and evaluation
`_mp_fn`	function	TPU spawn entry point
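
To illustrate how dataclass fields become command-line flags, here is a simplified stand-in built on `argparse`. It is not the script's code: the script relies on HfArgumentParser, and the trimmed-down `ModelArgs` and `DataArgs` classes and the `parse_into_dataclasses` helper below are hypothetical. The sketch handles only string, int, float, and boolean fields.

```python
import argparse
from dataclasses import dataclass, fields
from typing import Optional, Sequence, Tuple


@dataclass
class ModelArgs:  # hypothetical subset of ModelArguments
    model_name_or_path: Optional[str] = None
    model_type: Optional[str] = None


@dataclass
class DataArgs:  # hypothetical subset of DataTrainingArguments
    train_data_file: Optional[str] = None
    mlm: bool = False
    mlm_probability: float = 0.15
    block_size: int = -1


def parse_into_dataclasses(dataclass_types: Sequence[type], argv: Sequence[str]) -> Tuple:
    """Build one parser from all dataclass fields, then split the result back."""
    parser = argparse.ArgumentParser()
    for dc in dataclass_types:
        for f in fields(dc):
            if isinstance(f.default, bool):
                # Boolean fields become presence flags such as --mlm.
                parser.add_argument(f"--{f.name}", action="store_true")
            else:
                # Infer the type from the default; None defaults are treated as strings.
                arg_type = type(f.default) if f.default is not None else str
                parser.add_argument(f"--{f.name}", type=arg_type, default=f.default)
    parsed = vars(parser.parse_args(list(argv)))
    return tuple(
        dc(**{f.name: parsed[f.name] for f in fields(dc)}) for dc in dataclass_types
    )
```

Calling `parse_into_dataclasses((ModelArgs, DataArgs), ["--model_name_or_path", "gpt2", "--mlm"])` returns one populated instance per dataclass, which mirrors the one-object-per-argument-group structure described above.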

CLI Usage

python run_language_modeling.py \
  --model_name_or_path gpt2 \
  --train_data_file /path/to/train.txt \
  --eval_data_file /path/to/eval.txt \
  --output_dir /path/to/output \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 4

I/O Contract

Inputs

Input	Type	Description
`--model_name_or_path`	`Optional[str]`	Pretrained model checkpoint for weight initialization; omit to train from scratch
`--model_type`	`Optional[str]`	Model type for training from scratch (e.g., `gpt2`, `bert`, `xlnet`)
`--train_data_file`	`Optional[str]`	Single text file for training
`--train_data_files`	`Optional[str]`	Glob pattern for multiple training files
`--eval_data_file`	`Optional[str]`	Text file for evaluation (required if `--do_eval`)
`--line_by_line`	flag	Treat each line as a separate sequence
`--mlm`	flag	Use masked language modeling loss (required for BERT/RoBERTa)
`--whole_word_mask`	flag	Use whole word masking (requires `--mlm`)
`--mlm_probability`	`float` (default 0.15)	Ratio of tokens to mask for MLM
`--plm_probability`	`float` (default 1/6)	Span-to-context ratio for XLNet PLM
`--block_size`	`int` (default -1)	Input sequence length after tokenization; the default of -1 uses the model's maximum input length
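
The `--block_size` default can be read as the following rule. This is a sketch under assumptions: the helper name `resolve_block_size` is hypothetical, and capping an explicit value at the tokenizer's maximum length is assumed behavior rather than something this page states.

```python
def resolve_block_size(block_size: int, tokenizer_max_len: int) -> int:
    """Resolve the effective input length from the CLI value.

    Non-positive values (including the default -1) fall back to the
    tokenizer's maximum length. Assumption: explicit values are capped
    at that maximum so blocks never exceed what the model accepts.
    """
    if block_size <= 0:
        return tokenizer_max_len
    return min(block_size, tokenizer_max_len)
```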

Outputs

Output	Type	Description
Saved model	directory	Model, tokenizer, and config saved to `output_dir`
`eval_results_lm.txt`	text file	Perplexity evaluation results
Return value	`Dict[str, float]`	Dictionary with `perplexity` key
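
The perplexity value can be derived from the evaluation loss. The sketch below assumes the loss is the mean cross-entropy in nats, which is the usual convention; the function name `perplexity_from_loss` is hypothetical.

```python
import math
from typing import Dict


def perplexity_from_loss(eval_loss: float) -> Dict[str, float]:
    """Convert a mean cross-entropy loss (in nats) into a results dictionary."""
    # Perplexity is the exponential of the average negative log-likelihood.
    return {"perplexity": math.exp(eval_loss)}
```

Lower is better: a loss of 0 corresponds to a perplexity of 1, meaning the model assigns probability 1 to every evaluated token.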

Usage Examples

Fine-tuning GPT-2 with Causal LM

python run_language_modeling.py \
  --model_name_or_path gpt2 \
  --train_data_file /data/wikitext/train.txt \
  --eval_data_file /data/wikitext/valid.txt \
  --output_dir /output/gpt2_finetuned \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 4 \
  --num_train_epochs 3 \
  --block_size 512 \
  --overwrite_output_dir

Fine-tuning BERT with Masked LM

python run_language_modeling.py \
  --model_name_or_path bert-base-uncased \
  --train_data_file /data/corpus/train.txt \
  --eval_data_file /data/corpus/valid.txt \
  --output_dir /output/bert_mlm \
  --do_train \
  --do_eval \
  --mlm \
  --mlm_probability 0.15 \
  --line_by_line \
  --per_device_train_batch_size 8 \
  --overwrite_output_dir

Fine-tuning XLNet with Permutation LM

python run_language_modeling.py \
  --model_name_or_path xlnet-base-cased \
  --train_data_file /data/corpus/train.txt \
  --eval_data_file /data/corpus/valid.txt \
  --output_dir /output/xlnet_plm \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 4 \
  --plm_probability 0.1667 \
  --max_span_length 5 \
  --overwrite_output_dir

Training from Multiple Files

python run_language_modeling.py \
  --model_name_or_path gpt2 \
  --train_data_files "/data/shards/train_*.txt" \
  --eval_data_file /data/shards/valid.txt \
  --output_dir /output/gpt2_multi \
  --do_train \
  --do_eval \
  --overwrite_output_dir
