

Implementation:Microsoft LoRA Legacy Run Language Modeling

From Leeroopedia



Overview

General-purpose language model fine-tuning script supporting causal (CLM), masked (MLM), and permutation (PLM) language modeling across GPT, GPT-2, CTRL, BERT, RoBERTa, and XLNet architectures.

Description

run_language_modeling.py is a legacy HuggingFace Transformers example script included in the Microsoft LoRA NLU example directory. It uses the HuggingFace Trainer API with HfArgumentParser for structured argument parsing via three dataclasses: ModelArguments, DataTrainingArguments, and the built-in TrainingArguments. The script automatically selects the appropriate data collator based on the model type and configuration flags:

  • XLNet: DataCollatorForPermutationLanguageModeling with configurable plm_probability and max_span_length
  • MLM with whole word masking: DataCollatorForWholeWordMask
  • Standard MLM: DataCollatorForLanguageModeling with configurable mlm_probability
  • CLM (default): DataCollatorForLanguageModeling with mlm=False
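
The selection rules in the list above can be sketched as a small pure-Python function. This is an illustrative sketch, not the script's actual code: the helper name choose_collator is hypothetical, it returns the collator class name and keyword arguments instead of constructing a real collator, and the max_span_length default of 5 is an assumption taken from the XLNet example on this page.

```python
from typing import Any, Dict, Tuple


def choose_collator(
    model_type: str,
    mlm: bool,
    whole_word_mask: bool,
    mlm_probability: float = 0.15,
    plm_probability: float = 1 / 6,
    max_span_length: int = 5,  # assumed default, not confirmed by this page
) -> Tuple[str, Dict[str, Any]]:
    """Return (collator class name, keyword arguments) for the given flags."""
    # XLNet always uses permutation language modeling.
    if model_type == "xlnet":
        return (
            "DataCollatorForPermutationLanguageModeling",
            {"plm_probability": plm_probability, "max_span_length": max_span_length},
        )
    # Whole word masking only applies when MLM is enabled as well.
    if mlm and whole_word_mask:
        return "DataCollatorForWholeWordMask", {"mlm_probability": mlm_probability}
    # Standard MLM (mlm=True) and CLM (mlm=False) share one collator class.
    return (
        "DataCollatorForLanguageModeling",
        {"mlm": mlm, "mlm_probability": mlm_probability},
    )
```

For example, choose_collator("gpt2", mlm=False, whole_word_mask=False) selects DataCollatorForLanguageModeling with mlm=False, which matches the CLM default.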

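Dataset loading goes through get_dataset. Training data can be a single file (--train_data_file) or several files matched by a glob pattern (--train_data_files), in which case the per-file datasets are combined into a ConcatDataset. Each file is loaded as a TextDataset by default, or as a LineByLineTextDataset when --line_by_line is set. Chinese whole word masking is supported through LineByLineWithRefDataset, which additionally reads a reference file (--train_ref_file, --eval_ref_file). The script supports both fine-tuning existing models and training from scratch, and it reports perplexity on the evaluation set.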
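
As a rough illustration of the dataset choice, the sketch below maps the relevant flags to a dataset class name and expands a glob pattern into a file list. The helper names dataset_kind and expand_train_files are hypothetical, the real datasets are not constructed, and the ValueError raised when a reference file is given without both --mlm and --whole_word_mask is an assumption about the script's behavior.

```python
from glob import glob
from typing import List, Optional


def dataset_kind(
    line_by_line: bool,
    ref_path: Optional[str] = None,
    mlm: bool = False,
    whole_word_mask: bool = False,
) -> str:
    """Return the name of the dataset class that would load one file."""
    if not line_by_line:
        return "TextDataset"
    if ref_path is not None:
        # Assumption: a reference file only makes sense for whole word MLM.
        if not (mlm and whole_word_mask):
            raise ValueError("a reference file needs both --mlm and --whole_word_mask")
        return "LineByLineWithRefDataset"
    return "LineByLineTextDataset"


def expand_train_files(
    train_data_file: Optional[str], train_data_files: Optional[str]
) -> List[str]:
    """Return the training files; the glob pattern wins over the single file."""
    if train_data_files:
        # Each matched file would become one dataset inside a ConcatDataset.
        return sorted(glob(train_data_files))
    return [train_data_file] if train_data_file else []
```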

This script is part of the HuggingFace Transformers library (legacy examples) bundled in the Microsoft LoRA repository.

⚠️ DEPRECATED: This file resides in the legacy/ directory and is not actively maintained. Prefer modern equivalents where available.

Usage

Use this script when you need to fine-tune a language model, or train one from scratch, with a causal, masked, or permutation language modeling objective. It is suitable for domain adaptation of pretrained models on custom text corpora.

Code Reference

Source Location

Property | Value
File path | examples/NLU/examples/legacy/run_language_modeling.py
Lines | 364
Module | run_language_modeling

Key Classes and Functions

Name | Type | Signature / Description
ModelArguments | dataclass | Fields: model_name_or_path, model_type, config_name, tokenizer_name, cache_dir
DataTrainingArguments | dataclass | Fields: train_data_file, train_data_files, eval_data_file, train_ref_file, eval_ref_file, line_by_line, mlm, whole_word_mask, mlm_probability, plm_probability, max_span_length, block_size, overwrite_cache
get_dataset | function | get_dataset(args, tokenizer, evaluate=False, cache_dir=None) -- returns TextDataset, LineByLineTextDataset, LineByLineWithRefDataset, or ConcatDataset
main | function | Entry point: parses args, builds model/tokenizer/data collator, runs Trainer for training and evaluation
_mp_fn | function | TPU spawn entry point
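
To show how a dataclass can drive command-line parsing, here is a minimal sketch that uses the standard library's argparse as a stand-in for HfArgumentParser. The field names come from the table above; the Optional[str] types with None defaults and the helper parse_model_args are assumptions made for illustration, not the script's code.

```python
import argparse
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ModelArguments:
    # Field names follow the table above; types and defaults are assumed.
    model_name_or_path: Optional[str] = None
    model_type: Optional[str] = None
    config_name: Optional[str] = None
    tokenizer_name: Optional[str] = None
    cache_dir: Optional[str] = None


def parse_model_args(argv: List[str]) -> ModelArguments:
    """Build one --flag per dataclass field and parse argv into the dataclass."""
    parser = argparse.ArgumentParser()
    for field in fields(ModelArguments):
        parser.add_argument(f"--{field.name}", type=str, default=field.default)
    # Flags owned by other argument groups (e.g. --do_train) are ignored here.
    namespace, _unknown = parser.parse_known_args(argv)
    return ModelArguments(**vars(namespace))
```

Calling parse_model_args(["--model_name_or_path", "gpt2", "--do_train"]) fills model_name_or_path and leaves the remaining fields at None.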

CLI Usage

python run_language_modeling.py \
  --model_name_or_path gpt2 \
  --train_data_file /path/to/train.txt \
  --eval_data_file /path/to/eval.txt \
  --output_dir /path/to/output \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 4

I/O Contract

Inputs

Input | Type | Description
--model_name_or_path | Optional[str] | Pretrained model checkpoint; omit it to train from scratch
--model_type | Optional[str] | Model type for training from scratch (e.g., gpt2, bert, xlnet)
--train_data_file | Optional[str] | Single text file for training
--train_data_files | Optional[str] | Glob pattern for multiple training files
--eval_data_file | Optional[str] | Text file for evaluation (required if --do_eval)
--line_by_line | flag | Treat each line as a separate sequence
--mlm | flag | Use masked language modeling loss (required for BERT/RoBERTa)
--whole_word_mask | flag | Use whole word masking (requires --mlm)
--mlm_probability | float (default 0.15) | Ratio of tokens to mask for MLM
--plm_probability | float (default 1/6) | Span-to-context ratio for XLNet PLM
--block_size | int (default -1) | Input sequence length; defaults to model max length
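
The constraints in the table can be sketched as two small checks. This is a minimal sketch under stated assumptions: the helper names validate_args and resolve_block_size are hypothetical, the set of MLM-only model types is limited to the two named on this page, and clamping an oversized --block_size to the model maximum is an assumption about the script's behavior.

```python
from typing import Optional

# Model types that this page says require --mlm.
MLM_ONLY_MODEL_TYPES = {"bert", "roberta"}


def validate_args(
    model_type: str, mlm: bool, do_eval: bool, eval_data_file: Optional[str]
) -> None:
    """Raise ValueError when the flag combination breaks a documented rule."""
    if do_eval and eval_data_file is None:
        raise ValueError("--do_eval requires --eval_data_file")
    if model_type in MLM_ONLY_MODEL_TYPES and not mlm:
        raise ValueError(f"{model_type} needs the --mlm flag")


def resolve_block_size(block_size: int, model_max_length: int) -> int:
    """Map the default of -1 (or any non-positive value) to the model maximum."""
    if block_size <= 0:
        return model_max_length
    # Assumption: requests above the model maximum are clamped to it.
    return min(block_size, model_max_length)
```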

Outputs

Output | Type | Description
Saved model | directory | Model, tokenizer, and config saved to output_dir
eval_results_lm.txt | text file | Perplexity evaluation results
Return value | Dict[str, float] | Dictionary with perplexity key
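
Perplexity is conventionally the exponential of the mean evaluation loss, so the returned dictionary can be sketched as below. The helper names perplexity_from_loss and format_results are hypothetical, and the "key = value" line format for eval_results_lm.txt is an assumption about the file layout.

```python
import math
from typing import Dict


def perplexity_from_loss(eval_loss: float) -> Dict[str, float]:
    """Convert a mean cross-entropy loss (in nats) into a perplexity result."""
    return {"perplexity": math.exp(eval_loss)}


def format_results(results: Dict[str, float]) -> str:
    """Render results as 'key = value' lines (assumed file format)."""
    return "".join(f"{key} = {value}\n" for key, value in sorted(results.items()))
```

A loss of 0.0 corresponds to a perplexity of exactly 1.0, and lower loss always means lower perplexity.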

Usage Examples

Fine-tuning GPT-2 with Causal LM

python run_language_modeling.py \
  --model_name_or_path gpt2 \
  --train_data_file /data/wikitext/train.txt \
  --eval_data_file /data/wikitext/valid.txt \
  --output_dir /output/gpt2_finetuned \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 4 \
  --num_train_epochs 3 \
  --block_size 512 \
  --overwrite_output_dir

Fine-tuning BERT with Masked LM

python run_language_modeling.py \
  --model_name_or_path bert-base-uncased \
  --train_data_file /data/corpus/train.txt \
  --eval_data_file /data/corpus/valid.txt \
  --output_dir /output/bert_mlm \
  --do_train \
  --do_eval \
  --mlm \
  --mlm_probability 0.15 \
  --line_by_line \
  --per_device_train_batch_size 8 \
  --overwrite_output_dir

Fine-tuning XLNet with Permutation LM

python run_language_modeling.py \
  --model_name_or_path xlnet-base-cased \
  --train_data_file /data/corpus/train.txt \
  --eval_data_file /data/corpus/valid.txt \
  --output_dir /output/xlnet_plm \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 4 \
  --plm_probability 0.1667 \
  --max_span_length 5 \
  --overwrite_output_dir

Training from Multiple Files

python run_language_modeling.py \
  --model_name_or_path gpt2 \
  --train_data_files "/data/shards/train_*.txt" \
  --eval_data_file /data/shards/valid.txt \
  --output_dir /output/gpt2_multi \
  --do_train \
  --do_eval \
  --overwrite_output_dir
