Implementation:Microsoft LoRA Run PLM
| Knowledge Sources | |
|---|---|
| Domains | NLP, Language_Modeling |
| Last Updated | 2026-02-10 06:00 GMT |
Overview
HuggingFace Transformers example script for fine-tuning permutation language models (XLNet) on custom text datasets.
Description
run_plm.py fine-tunes permutation language models using the HuggingFace Trainer API. Unlike causal or masked language modeling, permutation language modeling (PLM) trains the model to predict tokens in a randomly permuted order, allowing it to capture bidirectional context without using explicit mask tokens. The script is designed for XLNet and uses XLNetLMHeadModel (rather than an Auto class) along with DataCollatorForPermutationLanguageModeling to generate the permutation-based training targets. The collator is parameterized by plm_probability (ratio of masked span length to context length, default 1/6) and max_span_length (maximum span of masked tokens, default 5). The script supports two data processing modes: line-by-line (each line is a separate sequence) and concatenation (texts are joined and split into fixed-length chunks). This script is part of the modified Transformers fork used by Microsoft LoRA for NLU experiments.
Usage
Use this script when fine-tuning an XLNet-style permutation language model on a custom text corpus. It accepts both local files (CSV, JSON, TXT) and datasets from the HuggingFace dataset hub. If no validation split exists, the script creates one from the training data, sized by validation_split_percentage. The line_by_line flag controls whether each line is treated as a separate sequence or whether all lines are concatenated and chunked. The script also supports checkpoint resumption, distributed training, mixed-precision (FP16) training, and TPU execution, and it runs against the LoRA-modified Transformers fork.
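The automatic validation split can be illustrated with a minimal sketch. It assumes the percentage-slicing syntax of the `datasets` library (for example `train[:5%]`); the helper name `split_specs` is hypothetical and does not appear in the script.

```python
def split_specs(validation_split_percentage: int = 5) -> dict:
    """Build split strings that carve a validation set out of the train split."""
    if not 0 < validation_split_percentage < 100:
        raise ValueError("validation_split_percentage must be between 1 and 99")
    pct = validation_split_percentage
    return {
        "validation": f"train[:{pct}%]",  # first pct% of the training data
        "train": f"train[{pct}%:]",       # the remainder stays in training
    }


specs = split_specs(5)
print(specs)
```

Each string is in the form accepted by the `split` argument of `datasets.load_dataset`.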
Code Reference
Source Location
- Repository: Microsoft_LoRA
- File: examples/NLU/examples/language-modeling/run_plm.py
- Lines: 1-460
Signature
# Script entry point via HfArgumentParser
# Key dataclasses:
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ModelArguments:
    model_name_or_path: Optional[str] = field(default=None)
    config_name: Optional[str] = field(default=None)
    tokenizer_name: Optional[str] = field(default=None)
    cache_dir: Optional[str] = field(default=None)
    use_fast_tokenizer: bool = field(default=True)
    model_revision: str = field(default="main")
    use_auth_token: bool = field(default=False)

@dataclass
class DataTrainingArguments:
    dataset_name: Optional[str] = field(default=None)
    dataset_config_name: Optional[str] = field(default=None)
    train_file: Optional[str] = field(default=None)
    validation_file: Optional[str] = field(default=None)
    overwrite_cache: bool = field(default=False)
    validation_split_percentage: Optional[int] = field(default=5)
    max_seq_length: int = field(default=512)
    preprocessing_num_workers: Optional[int] = field(default=None)
    plm_probability: float = field(default=1/6)
    max_span_length: int = field(default=5)
    line_by_line: bool = field(default=False)
    pad_to_max_length: bool = field(default=False)
    max_train_samples: Optional[int] = field(default=None)
    max_val_samples: Optional[int] = field(default=None)
Import
# Script is run directly, not imported
python examples/NLU/examples/language-modeling/run_plm.py \
--model_name_or_path xlnet-base-cased \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--output_dir /tmp/test-plm
Key Components
Model Loading
The script instantiates XLNetLMHeadModel directly (not AutoModelForCausalLM or AutoModelForMaskedLM) to load an XLNet model with a language modeling head. When no model_name_or_path is provided, it falls back to a default XLNetConfig() and trains from scratch. A model path that contains .ckpt is treated as a TensorFlow checkpoint and converted on load. After loading, the token embedding layer is resized to match the tokenizer vocabulary size.
Note that ModelArguments in this script does not include model_type, unlike run_clm.py and run_mlm.py. The script defaults to XLNetConfig when no pretrained model is specified, since permutation language modeling is specific to the XLNet architecture.
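The loading decisions described above can be sketched without any library dependency. This is an illustration only: `plan_model_loading` is a hypothetical helper, and the lookup order (config_name first, then model_name_or_path, then a fresh XLNetConfig) is an assumption based on the usual pattern of the Transformers example scripts.

```python
from typing import Optional


def plan_model_loading(model_name_or_path: Optional[str] = None,
                       config_name: Optional[str] = None) -> dict:
    """Describe which config source and loading path would be chosen."""
    if config_name:
        config_source = config_name             # explicit config wins (assumed order)
    elif model_name_or_path:
        config_source = model_name_or_path      # config read from the pretrained model
    else:
        config_source = "XLNetConfig()"         # fresh default config
    return {
        "config_source": config_source,
        "from_scratch": not model_name_or_path,  # no pretrained weights to load
        # A path containing ".ckpt" is treated as a TensorFlow checkpoint.
        "from_tf": bool(model_name_or_path) and ".ckpt" in model_name_or_path,
    }


print(plan_model_loading("xlnet-base-cased"))
print(plan_model_loading())
```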
Data Processing Pipeline
The script supports two processing modes:
Line-by-line mode (--line_by_line):
- Each non-empty line is tokenized as an individual sequence
- Sequences are truncated to max_seq_length and optionally padded to max length
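A rough sketch of line-by-line processing follows. It uses a whitespace split as a stand-in for the real XLNet tokenizer, and it pads on the right for simplicity; `toy_tokenize`, `tokenize_line_by_line`, and the `<pad>` string are all hypothetical and chosen for illustration.

```python
from typing import List


def toy_tokenize(line: str) -> List[str]:
    """Stand-in tokenizer: splits on whitespace (the script uses the XLNet tokenizer)."""
    return line.split()


def tokenize_line_by_line(lines: List[str], max_seq_length: int,
                          pad_to_max_length: bool = False,
                          pad_token: str = "<pad>") -> List[List[str]]:
    sequences = []
    for line in lines:
        # Skip empty and whitespace-only lines.
        if len(line) == 0 or line.isspace():
            continue
        tokens = toy_tokenize(line)[:max_seq_length]  # truncate to max_seq_length
        if pad_to_max_length:
            tokens = tokens + [pad_token] * (max_seq_length - len(tokens))
        sequences.append(tokens)
    return sequences


lines = ["the quick brown fox jumps", "", "   ", "hello world"]
result = tokenize_line_by_line(lines, max_seq_length=3, pad_to_max_length=True)
print(result)
```

A real tokenizer decides the padding side and inserts special tokens, so the output shape here is only indicative.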
Concatenation mode (default):
- All texts are tokenized
- Tokenized sequences are concatenated and split into fixed-length chunks via group_texts
- max_seq_length defaults to 512 (not dynamically inferred like run_clm.py or run_mlm.py)
- Remainders smaller than max_seq_length are dropped
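The concatenate-and-chunk step can be sketched as below. This version takes max_seq_length as an explicit parameter so that it is self-contained, which is an assumption about the interface; the grouping function in the script may instead read the value from its enclosing scope.

```python
from typing import Dict, List


def group_texts(examples: Dict[str, List[List[int]]],
                max_seq_length: int) -> Dict[str, List[List[int]]]:
    """Concatenate all sequences per key, then split into fixed-length chunks."""
    concatenated = {
        key: [token for seq in seqs for token in seq]
        for key, seqs in examples.items()
    }
    total_length = len(next(iter(concatenated.values())))
    # Drop the remainder that does not fill a complete chunk.
    total_length = (total_length // max_seq_length) * max_seq_length
    return {
        key: [tokens[i:i + max_seq_length]
              for i in range(0, total_length, max_seq_length)]
        for key, tokens in concatenated.items()
    }


batch = {"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]]}
chunks = group_texts(batch, max_seq_length=4)
print(chunks)  # {'input_ids': [[1, 2, 3, 4], [5, 6, 7, 8]]}
```

Nine input tokens yield two chunks of four; the trailing token is discarded, which matches the remainder rule above.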
Data Collator
DataCollatorForPermutationLanguageModeling handles the construction of permutation-based training targets at batch time. It is parameterized by:
- plm_probability (default: 1/6): the ratio of the length of a span of masked tokens to the surrounding context length
- max_span_length (default: 5): the maximum length of a contiguous span of masked tokens
This collator generates the permutation masks and target mappings that XLNet requires for its two-stream self-attention mechanism, in which each target token is predicted from the tokens that precede it in a sampled factorization order rather than in left-to-right order.
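How the two parameters interact can be shown with a simplified span sampler in plain Python. This is a sketch of the span selection only, assuming the procedure of sampling a span, reserving a context of length span_length / plm_probability around it, and advancing by that context; the function name `sample_span_mask` is hypothetical, and the real collator operates on tensors and additionally builds the permutation mask and target mapping.

```python
import random
from typing import List, Optional


def sample_span_mask(seq_len: int, plm_probability: float = 1 / 6,
                     max_span_length: int = 5,
                     rng: Optional[random.Random] = None) -> List[bool]:
    """Return a boolean mask marking which positions become prediction targets."""
    if not 0 < plm_probability <= 1:
        raise ValueError("plm_probability must be in (0, 1]")
    if rng is None:
        rng = random.Random()
    masked = [False] * seq_len
    cur_len = 0
    while cur_len < seq_len:
        # Sample a span length in [1, max_span_length].
        span_length = rng.randint(1, max_span_length)
        # Reserve a context window so the span covers about plm_probability of it.
        context_length = int(span_length / plm_probability)
        # Place the span at a random offset inside the context window.
        start = cur_len + rng.randint(0, context_length - span_length)
        for i in range(start, min(start + span_length, seq_len)):
            masked[i] = True
        cur_len += context_length
    return masked


mask = sample_span_mask(64, rng=random.Random(0))
print(sum(mask), "of", len(mask), "positions masked")
```

With the defaults, each window of about six times the span length contains one masked span, so roughly one sixth of the positions end up as targets.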
Training Loop
The Trainer class handles training with the DataCollatorForPermutationLanguageModeling data collator. The script supports checkpoint resumption via get_last_checkpoint. After training completes, the model and tokenizer are saved, and train metrics are logged and persisted.
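Checkpoint discovery can be pictured with the stand-in below. It assumes the Trainer's `checkpoint-<step>` directory naming and picks the directory with the highest step; `find_last_checkpoint` is a hypothetical simplification written for this page, not the library's get_last_checkpoint.

```python
import os
import re
import tempfile
from typing import Optional

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir: str) -> Optional[str]:
    """Return the checkpoint directory with the highest step, or None."""
    candidates = []
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            # Compare steps numerically so that 1500 sorts after 500.
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    return os.path.join(output_dir, max(candidates)[1])


with tempfile.TemporaryDirectory() as output_dir:
    for step in (500, 1000, 1500):
        os.makedirs(os.path.join(output_dir, f"checkpoint-{step}"))
    last = find_last_checkpoint(output_dir)
    print(os.path.basename(last))  # checkpoint-1500
```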
Evaluation
During evaluation, the script computes loss on the validation set and derives perplexity as math.exp(eval_loss). Both metrics are logged and saved to disk.
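The perplexity derivation is a single exponential of the evaluation loss. The sketch below adds an overflow guard, which is an addition for illustration and not something the description above attributes to the script; `perplexity_from_loss` is a hypothetical helper name.

```python
import math


def perplexity_from_loss(eval_loss: float) -> float:
    """Perplexity is exp(mean cross-entropy loss in nats)."""
    try:
        return math.exp(eval_loss)
    except OverflowError:
        # Very large losses overflow a float; report infinity instead (added guard).
        return float("inf")


metrics = {"eval_loss": 2.0}
metrics["perplexity"] = perplexity_from_loss(metrics["eval_loss"])
print(round(metrics["perplexity"], 3))  # 7.389
```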
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_name_or_path | str | No | Pretrained model name or path (defaults to XLNetConfig from scratch if not set) |
| dataset_name | str | No* | HuggingFace dataset name (alternative to train_file) |
| dataset_config_name | str | No | Configuration name for the HuggingFace dataset |
| train_file | str | No* | Path to training text file (CSV, JSON, or TXT; alternative to dataset_name) |
| validation_file | str | No | Path to validation text file |
| max_seq_length | int | No | Maximum sequence length after tokenization (default: 512) |
| plm_probability | float | No | Ratio of masked span length to context length (default: 1/6) |
| max_span_length | int | No | Maximum length of a span of masked tokens (default: 5) |
| line_by_line | bool | No | Treat each line as a separate sequence (default: False) |
| pad_to_max_length | bool | No | Pad all samples to max_seq_length (default: False) |
| output_dir | str | Yes | Directory to save model checkpoints and metrics |
| max_train_samples | int | No | Truncate training examples to this count (for debugging) |
| max_val_samples | int | No | Truncate validation examples to this count (for debugging) |
| validation_split_percentage | int | No | Percentage of train set used as validation if no validation split exists (default: 5) |
| preprocessing_num_workers | int | No | Number of processes for data preprocessing |
| overwrite_cache | bool | No | Whether to overwrite cached preprocessed datasets (default: False) |
* Either dataset_name or train_file/validation_file must be provided.
Outputs
| Name | Type | Description |
|---|---|---|
| model checkpoints | Files | Saved to output_dir; includes model weights and tokenizer files |
| training metrics | Dict/JSON | Training loss and train_samples, logged and saved as train_results.json |
| eval metrics | Dict/JSON | eval_loss, eval_samples, perplexity saved as eval_results.json |
| trainer state | JSON | Trainer state (global step, epoch, log history) saved for checkpoint resumption; optimizer and scheduler states are stored alongside the model in each checkpoint directory |
Usage Examples
Fine-tune XLNet on WikiText
python examples/NLU/examples/language-modeling/run_plm.py \
--model_name_or_path xlnet-base-cased \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--do_train \
--do_eval \
--output_dir ./output/plm-xlnet \
--per_device_train_batch_size 8 \
--num_train_epochs 3
Fine-tune with Custom PLM Parameters
python examples/NLU/examples/language-modeling/run_plm.py \
--model_name_or_path xlnet-base-cased \
--train_file ./data/train.txt \
--validation_file ./data/valid.txt \
--do_train \
--do_eval \
--max_seq_length 256 \
--plm_probability 0.2 \
--max_span_length 10 \
--output_dir ./output/plm-custom \
--overwrite_output_dir
Fine-tune with Line-by-Line Processing
python examples/NLU/examples/language-modeling/run_plm.py \
--model_name_or_path xlnet-base-cased \
--train_file ./data/train.txt \
--do_train \
--line_by_line \
--pad_to_max_length \
--max_seq_length 128 \
--output_dir ./output/plm-linebyline
Load Arguments from JSON
python examples/NLU/examples/language-modeling/run_plm.py config.json
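A config.json for the command above could be generated as follows. This is a sketch that assumes the JSON keys mirror the command-line flag names one to one; the values are taken from the earlier examples on this page.

```python
import json

# Keys are assumed to match the script's argument names without the leading dashes.
config = {
    "model_name_or_path": "xlnet-base-cased",
    "dataset_name": "wikitext",
    "dataset_config_name": "wikitext-2-raw-v1",
    "do_train": True,
    "do_eval": True,
    "max_seq_length": 256,
    "plm_probability": 0.2,
    "max_span_length": 10,
    "output_dir": "./output/plm-xlnet",
}

with open("config.json", "w", encoding="utf-8") as handle:
    json.dump(config, handle, indent=2)
```

Boolean flags such as do_train are written as JSON true/false rather than as bare switches.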
Internal Details
Minimum Version Check
The script enforces a minimum Transformers version of 4.4.0 via check_min_version("4.4.0"), which runs at module load before any arguments are parsed.
Comparison with Other Language Modeling Scripts
| Feature | run_clm.py | run_mlm.py | run_plm.py |
|---|---|---|---|
| Objective | Next-token prediction | Masked token prediction | Permutation-order prediction |
| Model class | AutoModelForCausalLM | AutoModelForMaskedLM | XLNetLMHeadModel |
| Data collator | default_data_collator | DataCollatorForLanguageModeling | DataCollatorForPermutationLanguageModeling |
| Model types | GPT-2, GPT, CTRL | BERT, RoBERTa, ALBERT | XLNet |
| Default seq length | Tokenizer model_max_length, capped at 1024 (block_size) | Tokenizer model_max_length, capped at 1024 (max_seq_length) | 512 (max_seq_length) |
| Masking params | None | mlm_probability (0.15) | plm_probability (1/6), max_span_length (5) |
| line_by_line | No | Yes | Yes |
Permutation Language Modeling
XLNet's permutation language modeling maximizes the expected log-likelihood of a sequence over permutations of its factorization order, so each token learns to use context from both sides. During training, DataCollatorForPermutationLanguageModeling selects contiguous spans of tokens (up to max_span_length each) to mask, so that roughly a plm_probability fraction of the sequence becomes prediction targets. The model uses two-stream self-attention (content stream and query stream) to predict masked tokens while attending only to tokens that come earlier in the permuted order.
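The attention constraint implied by a factorization order can be made concrete with a small sketch. It assumes the convention that a 1 at `[i][j]` means position i may not attend to position j, and it builds the query-stream view, where a position also cannot see itself. The helper `permutation_mask` is hypothetical; the collator constructs its mask with tensor operations and treats non-target tokens differently.

```python
from typing import List


def permutation_mask(order: List[int]) -> List[List[int]]:
    """Build an attention mask from a factorization order.

    order[k] is the sequence position predicted at step k.
    mask[i][j] == 1 means position i may NOT attend to position j.
    """
    n = len(order)
    rank = [0] * n
    for step, position in enumerate(order):
        rank[position] = step
    # Position i may attend to j only if j comes strictly earlier in the order.
    return [[0 if rank[j] < rank[i] else 1 for j in range(n)] for i in range(n)]


mask = permutation_mask([2, 0, 1])
for row in mask:
    print(row)
```

For the order `[2, 0, 1]`, position 2 is predicted first and sees nothing, position 0 sees only position 2, and position 1 sees both of the others.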
TPU Support
The _mp_fn(index) function provides an entry point for xla_spawn to enable TPU execution.