Implementation:Microsoft LoRA Run CLM

Knowledge Sources	Microsoft_LoRA HuggingFace Transformers
Domains	NLP, Language_Modeling
Last Updated	2026-02-10 06:00 GMT

Overview

HuggingFace Transformers example script for fine-tuning causal language models (GPT-2, GPT, CTRL) on custom text datasets.

Description

run_clm.py fine-tunes auto-regressive language models using the HuggingFace Trainer API. It supports any model from the HuggingFace Model Hub that has a causal LM head (e.g., GPT-2, GPT, CTRL), loaded via AutoModelForCausalLM. The script handles dataset loading via the datasets library, tokenization, and training with configurable hyperparameters through HfArgumentParser. Text data is tokenized and then concatenated into fixed-length blocks (controlled by block_size) before being fed to the model. Labels are a copy of input_ids, enabling the standard causal LM next-token-prediction objective. Training and evaluation are orchestrated by the Trainer class with default_data_collator. This script is part of the modified Transformers fork used by Microsoft LoRA for NLU experiments.

Usage

Use this script when fine-tuning a causal language model on a custom text corpus. It accepts both local files (CSV, JSON, TXT) and datasets from the HuggingFace dataset hub. If the dataset has no validation split, the script creates one from the training data using a configurable percentage (validation_split_percentage, default 5). The script supports checkpoint resumption, distributed training, mixed precision (FP16), and TPU execution via _mp_fn. It ships with the LoRA-modified Transformers fork.
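The automatic validation split can be sketched with the percent-slicing syntax of the datasets library. The helper name split_specs is hypothetical and exists only for illustration; the assumption is that the script builds equivalent split strings and passes them to load_dataset.

```python
def split_specs(validation_split_percentage: int = 5) -> dict:
    """Return split strings in the `datasets` slicing syntax.

    Hypothetical helper for illustration; not part of run_clm.py.
    """
    if not 0 < validation_split_percentage < 100:
        raise ValueError("validation_split_percentage must be between 1 and 99")
    pct = validation_split_percentage
    return {
        "validation": f"train[:{pct}%]",  # first pct% of the train split
        "train": f"train[{pct}%:]",       # the remaining rows
    }


specs = split_specs(5)
print(specs["validation"])
print(specs["train"])
```

Each string would be passed as the split argument of load_dataset, so the two subsets do not overlap.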

Code Reference

Source Location

Repository: Microsoft_LoRA
File: examples/NLU/examples/language-modeling/run_clm.py
Lines: 1-444

Signature

# Script entry point via HfArgumentParser
# Key dataclasses (field defaults only):
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ModelArguments:
    model_name_or_path: Optional[str] = field(default=None)
    model_type: Optional[str] = field(default=None)
    config_name: Optional[str] = field(default=None)
    tokenizer_name: Optional[str] = field(default=None)
    cache_dir: Optional[str] = field(default=None)
    use_fast_tokenizer: bool = field(default=True)
    model_revision: str = field(default="main")
    use_auth_token: bool = field(default=False)

@dataclass
class DataTrainingArguments:
    dataset_name: Optional[str] = field(default=None)
    dataset_config_name: Optional[str] = field(default=None)
    train_file: Optional[str] = field(default=None)
    validation_file: Optional[str] = field(default=None)
    max_train_samples: Optional[int] = field(default=None)
    max_val_samples: Optional[int] = field(default=None)
    block_size: Optional[int] = field(default=None)
    overwrite_cache: bool = field(default=False)
    validation_split_percentage: Optional[int] = field(default=5)
    preprocessing_num_workers: Optional[int] = field(default=None)

Import

# Script is run directly, not imported
python examples/NLU/examples/language-modeling/run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --output_dir /tmp/test-clm

Key Components

Model Loading

The script uses AutoModelForCausalLM to load any compatible causal language model. It supports loading from a pretrained checkpoint via model_name_or_path or training from scratch using a model_type and CONFIG_MAPPING. TensorFlow checkpoints are detected automatically (.ckpt extension) and converted. After loading, the token embedding layer is resized to match the tokenizer vocabulary with model.resize_token_embeddings(len(tokenizer)).
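The branching described above can be sketched without importing Transformers. The function plan_model_loading is a hypothetical helper that only reports which path would be taken; the explicit ValueError when neither argument is set is an assumption added for clarity, not behavior confirmed from the script.

```python
from typing import Optional


def plan_model_loading(model_name_or_path: Optional[str], model_type: Optional[str]) -> dict:
    """Describe which loading path would be taken (hypothetical helper)."""
    if model_name_or_path:
        return {
            "mode": "pretrained",
            "source": model_name_or_path,
            # A ".ckpt" substring signals a TensorFlow checkpoint to convert.
            "from_tf": ".ckpt" in model_name_or_path,
        }
    if model_type:
        # No checkpoint given: build a fresh config for this model type
        # and initialize the weights randomly.
        return {"mode": "from_scratch", "source": model_type, "from_tf": False}
    # Assumption: fail loudly when neither argument is supplied.
    raise ValueError("Provide model_name_or_path or model_type")


print(plan_model_loading("gpt2", None))
print(plan_model_loading(None, "gpt2"))
```

In the real script the "pretrained" branch corresponds to AutoModelForCausalLM.from_pretrained and the "from_scratch" branch to building the model from a config taken from CONFIG_MAPPING.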

Data Processing Pipeline

1. Datasets are loaded via load_dataset from the HuggingFace datasets library.
2. All texts are tokenized by tokenize_function, which reads the text column if present and the first column otherwise.
3. Tokenized sequences are concatenated and split into fixed-length chunks by group_texts. When block_size is not set, it defaults to the tokenizer's model_max_length, capped at 1024.
4. Labels are set to a copy of input_ids so the model learns to predict the next token.
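The chunking steps above can be sketched in plain Python. This is a simplified rendering of the concatenate-then-split idea, not the script's exact code; resolve_block_size is a hypothetical helper name, and the cap of 1024 is taken from the description.

```python
from typing import Dict, List, Optional


def resolve_block_size(requested: Optional[int], model_max_length: int, cap: int = 1024) -> int:
    """Pick the chunk length: a capped default, or the request clipped to the model limit."""
    if requested is None:
        return min(model_max_length, cap)
    return min(requested, model_max_length)


def group_texts(examples: Dict[str, List[List[int]]], block_size: int) -> Dict[str, List[List[int]]]:
    """Concatenate all sequences per key, drop the remainder, and cut into blocks."""
    concatenated = {k: [tok for seq in seqs for tok in seq] for k, seqs in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Drop the tail that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [tokens[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, tokens in concatenated.items()
    }
    # Labels are a plain copy of the inputs.
    result["labels"] = [block.copy() for block in result["input_ids"]]
    return result


batch = {"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]]}
blocks = group_texts(batch, block_size=4)
print(blocks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]; token 9 is dropped
```

Labels can be an unshifted copy because the one-position shift between inputs and targets is applied inside the model's loss computation.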

Training Loop

The Trainer class handles training with default_data_collator. The script supports checkpoint resumption via get_last_checkpoint. After training completes, the model and tokenizer are saved, and train metrics (including sample count) are logged and persisted.
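To illustrate checkpoint discovery, the sketch below is a simplified stand-in for get_last_checkpoint, not the library's implementation. It assumes checkpoints live in subdirectories named checkpoint-<step> under output_dir; find_last_checkpoint is a hypothetical name.

```python
import os
import re
import tempfile
from typing import Optional

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(folder: str) -> Optional[str]:
    """Return the checkpoint-<step> subdirectory with the highest step, if any."""
    candidates = [
        name
        for name in os.listdir(folder)
        if _CHECKPOINT_RE.match(name) and os.path.isdir(os.path.join(folder, name))
    ]
    if not candidates:
        return None
    # Compare steps numerically; a string comparison would rank 500 above 1500.
    latest = max(candidates, key=lambda name: int(_CHECKPOINT_RE.match(name).group(1)))
    return os.path.join(folder, latest)


with tempfile.TemporaryDirectory() as output_dir:
    for step in (500, 1000, 1500):
        os.mkdir(os.path.join(output_dir, f"checkpoint-{step}"))
    last_name = os.path.basename(find_last_checkpoint(output_dir))
    print(last_name)
```

The returned path is the kind of value that would be handed to the Trainer to resume training from that step.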

Evaluation

During evaluation, the script computes loss on the validation set and derives perplexity as math.exp(eval_loss). Both metrics are logged and saved to disk.
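As a minimal sketch of the perplexity step: perplexity is the exponential of the mean per-token cross-entropy loss. The overflow guard below is an addition for robustness and is an assumption, not something the description attributes to the script; perplexity_from_loss is a hypothetical name.

```python
import math


def perplexity_from_loss(eval_loss: float) -> float:
    """Perplexity is exp(mean cross-entropy), with the loss measured in nats."""
    try:
        return math.exp(eval_loss)
    except OverflowError:
        # Assumed guard: a very large loss maps to infinite perplexity.
        return float("inf")


metrics = {"eval_loss": 3.0}
metrics["perplexity"] = perplexity_from_loss(metrics["eval_loss"])
print(metrics)
```

A loss of 0 gives a perplexity of 1, the best possible value; higher loss means the model is less certain about the next token.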

I/O Contract

Inputs

Name	Type	Required	Description
model_name_or_path	str	No*	Pretrained model name or path (required unless model_type is set for training from scratch)
model_type	str	No*	Model type for training from scratch (e.g., gpt2, ctrl, openai-gpt)
dataset_name	str	No**	HuggingFace dataset name (alternative to train_file)
dataset_config_name	str	No	Configuration name for the HuggingFace dataset
train_file	str	No**	Path to training text file (CSV, JSON, or TXT; alternative to dataset_name)
validation_file	str	No	Path to validation text file
block_size	int	No	Sequence length for tokenized text blocks (defaults to the tokenizer's model_max_length, capped at 1024)
output_dir	str	Yes	Directory to save model checkpoints and metrics
max_train_samples	int	No	Truncate training examples to this count (for debugging)
max_val_samples	int	No	Truncate validation examples to this count (for debugging)
validation_split_percentage	int	No	Percentage of train set used as validation if no validation split exists (default: 5)
preprocessing_num_workers	int	No	Number of processes for data preprocessing
overwrite_cache	bool	No	Whether to overwrite cached preprocessed datasets (default: False)

* Either model_name_or_path or model_type must be provided.

** Either dataset_name or train_file/validation_file must be provided.
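The data-source requirement can be sketched as a dataclass check. DataSourceArgs is a hypothetical, trimmed-down stand-in for DataTrainingArguments; the error messages and the exact placement of the extension check are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_EXTENSIONS = {"csv", "json", "txt"}


@dataclass
class DataSourceArgs:
    """Hypothetical stand-in holding only the data-source fields."""

    dataset_name: Optional[str] = None
    train_file: Optional[str] = None
    validation_file: Optional[str] = None

    def __post_init__(self) -> None:
        # At least one data source must be given.
        if self.dataset_name is None and self.train_file is None and self.validation_file is None:
            raise ValueError("Need either a dataset name or a training/validation file.")
        # Local files must use one of the supported extensions.
        for path in (self.train_file, self.validation_file):
            if path is not None and path.rsplit(".", 1)[-1] not in ALLOWED_EXTENSIONS:
                raise ValueError(f"Unsupported file extension: {path}")


args = DataSourceArgs(train_file="./data/train.txt", validation_file="./data/valid.txt")
print(args)
```

Running the check in __post_init__ means an invalid combination is rejected as soon as the arguments are parsed, before any dataset is downloaded.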

Outputs

Name	Type	Description
model checkpoints	Files	Saved to output_dir; includes model weights and tokenizer files
training metrics	Dict/JSON	Loss, train_samples logged and saved as train_results.json
eval metrics	Dict/JSON	eval_loss, eval_samples, perplexity saved as eval_results.json
trainer state	JSON	Trainer state (global step, log history) saved as trainer_state.json; optimizer and scheduler states are stored with each checkpoint for resumption

Usage Examples

Fine-tune GPT-2 on WikiText

python examples/NLU/examples/language-modeling/run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --output_dir ./output/clm-gpt2 \
    --per_device_train_batch_size 8 \
    --num_train_epochs 3

Fine-tune on a Custom Text File

python examples/NLU/examples/language-modeling/run_clm.py \
    --model_name_or_path gpt2 \
    --train_file ./data/train.txt \
    --validation_file ./data/valid.txt \
    --do_train \
    --do_eval \
    --block_size 512 \
    --output_dir ./output/clm-custom \
    --overwrite_output_dir

Load Arguments from JSON

python examples/NLU/examples/language-modeling/run_clm.py config.json
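A config.json for this invocation could be produced as sketched below. The keys mirror the command-line flags used in the earlier examples; the assumption is that the script switches to JSON parsing when its only argument ends in .json. The helper is_json_config_invocation is a hypothetical name, and the file is written to a temporary directory for the sake of the example.

```python
import json
import os
import tempfile

# Keys correspond to the CLI flags, without the leading dashes.
config = {
    "model_name_or_path": "gpt2",
    "dataset_name": "wikitext",
    "dataset_config_name": "wikitext-2-raw-v1",
    "do_train": True,
    "do_eval": True,
    "output_dir": "./output/clm-gpt2",
}


def is_json_config_invocation(argv: list) -> bool:
    """True when the only CLI argument is a path ending in .json."""
    return len(argv) == 2 and argv[1].endswith(".json")


config_path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(config_path, "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)

print(is_json_config_invocation(["run_clm.py", config_path]))
```

Boolean flags such as do_train are written as JSON true/false rather than as bare switches.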

Internal Details

Minimum Version Check

The script enforces a minimum Transformers version of 4.4.0 by calling check_min_version("4.4.0") at module level, so the check runs before any arguments are parsed.

Supported Model Types

The list of supported model types is dynamically derived from MODEL_FOR_CAUSAL_LM_MAPPING, which includes architectures like GPT-2, GPT, CTRL, and others registered in the Transformers model mapping.

TPU Support

The _mp_fn(index) function provides an entry point for xla_spawn to enable TPU execution.

Related Pages

Environment:Microsoft_LoRA_NLU_Conda_Environment
