Principle:Microsoft LoRA HF Trainer LoRA Training
Overview
HF Trainer LoRA Training describes how the HuggingFace Trainer class orchestrates LoRA fine-tuning on GLUE benchmark tasks within the microsoft/LoRA repository. The Trainer provides a high-level training loop that handles distributed training, gradient accumulation, mixed-precision training, evaluation strategies, checkpointing, and logging, while LoRA's parameter freezing integrates transparently with the Trainer's optimizer construction.
Trainer and LoRA Integration
The key insight behind using HuggingFace Trainer with LoRA is that parameter freezing is invisible to the Trainer. When parameters have requires_grad=False, they are automatically excluded from:
- Optimizer state construction -- Only trainable parameters (LoRA matrices + classifier head) receive optimizer states (momentum, variance in Adam). This dramatically reduces GPU memory for the optimizer, since the vast majority of model parameters are frozen.
- Gradient computation -- The backward pass does not compute or store gradients for frozen parameters.
- Gradient clipping -- Only trainable parameter gradients are clipped.
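The exclusion follows mechanically from filtering on requires_grad. A minimal sketch (the toy module below is illustrative, not code from the repo):

```python
import torch
import torch.nn as nn

# Toy module standing in for a LoRA-augmented linear layer:
# a frozen pretrained weight plus small trainable A/B factors.
class ToyLoRALinear(nn.Module):
    def __init__(self, dim=768, r=8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim))
        self.weight.requires_grad = False                  # frozen pretrained weight
        self.lora_A = nn.Parameter(torch.randn(r, dim))    # trainable
        self.lora_B = nn.Parameter(torch.zeros(dim, r))    # trainable

model = ToyLoRALinear()

# The Trainer constructs the optimizer over trainable parameters only,
# so frozen weights get no Adam momentum/variance buffers.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=5e-4)

total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in trainable)
print(total_params, trainable_params)
```

Here only the two LoRA factors (12,288 of 602,112 parameters) ever appear in the optimizer state.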
For a RoBERTa-base model (125M parameters) with LoRA rank 8, only approximately 0.3M LoRA parameters and approximately 0.6M classifier parameters are trainable. The optimizer memory footprint is reduced by roughly 99.3% compared to full fine-tuning.
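The approximately 0.3M figure can be verified with back-of-envelope arithmetic, assuming LoRA is applied to the query and value projections of each layer (the configuration used in the paper's NLU experiments):

```python
# Trainable LoRA parameter count for roberta-base with rank 8,
# assuming adaptation of W_q and W_v in every transformer layer.
layers = 12            # transformer layers in roberta-base
hidden = 768           # hidden dimension
r = 8                  # LoRA rank
adapted_per_layer = 2  # W_q and W_v

# Each adapted d x d matrix adds A (r x d) and B (d x r).
params_per_matrix = r * hidden + hidden * r
total_lora = layers * adapted_per_layer * params_per_matrix
print(total_lora)  # 294912, i.e. ~0.3M trainable LoRA parameters
```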
Distributed Training
The NLU experiments use PyTorch's torch.distributed.launch for multi-GPU data-parallel training:
python -m torch.distributed.launch --nproc_per_node=8 \
examples/text-classification/run_glue.py \
--model_name_or_path roberta-base \
--task_name mnli \
--do_train --do_eval \
--apply_lora --lora_r 8 --lora_alpha 16 \
...
The Trainer internally uses DistributedDataParallel (DDP) to synchronize gradients across GPUs. Since only LoRA parameters have gradients, the all-reduce communication volume is minimal.
Training Loop
The Trainer's training loop (invoked via trainer.train()) performs the following per step:
- Forward pass -- Input tokens pass through the model; LoRA layers compute W @ x + (alpha/r) * B @ A @ x
- Loss computation -- Cross-entropy for classification tasks, MSE for STS-B regression
- Backward pass -- Gradients flow only through LoRA matrices and the classifier head
- Gradient accumulation -- When gradient_accumulation_steps > 1, gradients are accumulated before the optimizer step
- Optimizer step -- AdamW updates only the trainable parameters
- Learning rate scheduling -- Warmup followed by linear decay (configurable via warmup_ratio or warmup_steps)
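The forward computation in the first step above can be sketched as follows (shapes and initialization are illustrative; loralib initializes A with a Gaussian and B to zero):

```python
import torch

def lora_forward(x, W, A, B, alpha=16, r=8):
    """y = W x + (alpha/r) * B (A x): frozen weight plus scaled low-rank update."""
    scaling = alpha / r
    return x @ W.T + scaling * (x @ A.T) @ B.T

d = 768
x = torch.randn(4, d)            # batch of token representations
W = torch.randn(d, d)            # frozen pretrained weight
A = torch.randn(8, d) * 0.01     # LoRA A factor (small Gaussian init)
B = torch.zeros(d, 8)            # LoRA B factor (zero init)

y = lora_forward(x, W, A, B)
# With B = 0 the low-rank term contributes nothing, so at the start
# of training the output equals the frozen pretrained projection.
assert torch.allclose(y, x @ W.T)
```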
Evaluation Strategy
The Trainer supports multiple evaluation strategies controlled by --evaluation_strategy:
- epoch -- Evaluate at the end of each epoch (used by RoBERTa configs)
- steps -- Evaluate every N steps (used by DeBERTa V2 configs with --eval_steps 500)
- no -- No intermediate evaluation
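In code, these strategies map onto standard TrainingArguments options; a sketch using values from the RoBERTa-base recipe in this article (argument names are stock HF Trainer options, not repo-specific):

```python
from transformers import TrainingArguments

# Config fragment only; pass to Trainer(args=args, ...) alongside
# the LoRA-augmented model and GLUE datasets.
args = TrainingArguments(
    output_dir="./output",
    evaluation_strategy="epoch",      # or "steps" with eval_steps=500
    save_strategy="epoch",
    learning_rate=5e-4,
    num_train_epochs=30,
    per_device_train_batch_size=16,
    warmup_ratio=0.06,
    weight_decay=0.1,
    seed=0,
)
```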
For MNLI, the evaluation loop handles both matched and mismatched validation sets:
tasks = [data_args.task_name]
eval_datasets = [eval_dataset]
if data_args.task_name == "mnli":
    tasks.append("mnli-mm")
    eval_datasets.append(datasets["validation_mismatched"])
Checkpointing
The Trainer saves checkpoints according to --save_strategy (epoch or steps). Each checkpoint includes:
- Full model state dict -- All parameters including frozen pretrained weights (this is a HuggingFace Trainer default behavior)
- Optimizer state -- Only for trainable parameters
- Scheduler state -- Learning rate schedule progress
- Trainer state -- Global step, epoch, best metric
Because the Trainer saves the full state dict (not just LoRA weights), a post-hoc extraction step is needed to isolate LoRA-only weights for deployment.
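The extraction step can be sketched as a key filter over the checkpoint's state dict, assuming loralib's naming convention (lora_A / lora_B in parameter names); the keep_classifier flag is a hypothetical convenience for also retaining the task head:

```python
import torch

def extract_lora_state(full_state, keep_classifier=True):
    """Filter a full checkpoint down to LoRA (and optionally classifier) weights."""
    keep = {k: v for k, v in full_state.items() if "lora_" in k}
    if keep_classifier:
        keep.update({k: v for k, v in full_state.items()
                     if k.startswith("classifier.")})
    return keep

# Toy state dict mimicking checkpoint contents.
full = {
    "roberta.encoder.layer.0.attention.self.query.weight": torch.zeros(768, 768),
    "roberta.encoder.layer.0.attention.self.query.lora_A": torch.zeros(8, 768),
    "roberta.encoder.layer.0.attention.self.query.lora_B": torch.zeros(768, 8),
    "classifier.dense.weight": torch.zeros(768, 768),
}
small = extract_lora_state(full)
print(sorted(small))  # frozen pretrained weight dropped; LoRA factors + head kept
```

In practice the filtered dict is saved with torch.save and merged back into a pretrained base model at deployment time.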
Typical Configurations
RoBERTa-base
| Parameter | Value |
|---|---|
| Model | roberta-base |
| LoRA rank (r) | 8 |
| LoRA alpha | 16 |
| Learning rate | 5e-4 |
| Epochs | 30 |
| Max sequence length | 512 |
| Per-device batch size | 16 |
| Warmup ratio | 0.06 |
| Weight decay | 0.1 |
| Evaluation strategy | epoch |
| GPUs | 8 |
DeBERTa V2 XXL
| Parameter | Value |
|---|---|
| Model | microsoft/deberta-v2-xxlarge |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| Learning rate | 1e-4 |
| Epochs | 5 |
| Max sequence length | 256 |
| Per-device batch size | 8 |
| Warmup steps | 1000 |
| Weight decay | 0 |
| Evaluation strategy | steps (every 500) |
| FP16 | Yes |
| Classifier dropout | 0.15 |
| Deterministic algorithms | Yes |
| GPUs | 8 |
Seed and Reproducibility
The training scripts set multiple reproducibility controls:
- --seed 0 -- Sets Python, NumPy, and PyTorch random seeds via set_seed()
- CUBLAS_WORKSPACE_CONFIG=":16:8" -- Ensures deterministic cuBLAS operations
- PYTHONHASHSEED=0 -- Ensures deterministic Python hash ordering
- --use_deterministic_algorithms -- Enables PyTorch deterministic mode (DeBERTa V2 configs)
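Taken together, these controls correspond roughly to the following setup (a sketch; the repo sets the environment variables in its launch scripts rather than in Python):

```python
import os
import random

import numpy as np
import torch

# Must be set before cuBLAS initializes; hence the shell export in practice.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"
# Note: PYTHONHASHSEED only affects hash ordering if set before the
# interpreter starts, another reason these live in the launch scripts.
os.environ["PYTHONHASHSEED"] = "0"

# Equivalent of transformers' set_seed(0):
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)

# --use_deterministic_algorithms (DeBERTa V2 configs):
torch.use_deterministic_algorithms(True)
```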
Metadata
| Field | Value |
|---|---|
| Source | Repo (microsoft/LoRA) |
| Domains | Training, NLU, LoRA |
| Related | Implementation:Microsoft_LoRA_Run_GLUE_Training |