Principle:Microsoft LoRA HF Trainer LoRA Training
Overview
HF Trainer LoRA Training describes how the HuggingFace Trainer class orchestrates LoRA fine-tuning on GLUE benchmark tasks within the microsoft/LoRA repository. The Trainer provides a high-level training loop that handles distributed training, gradient accumulation, mixed-precision training, evaluation strategies, checkpointing, and logging, while LoRA's parameter freezing integrates transparently with the Trainer's optimizer construction.
Trainer and LoRA Integration
The key insight behind using HuggingFace Trainer with LoRA is that parameter freezing is invisible to the Trainer. When parameters have requires_grad=False, they are automatically excluded from:
- Optimizer state construction -- Only trainable parameters (LoRA matrices + classifier head) receive optimizer states (momentum, variance in Adam). This dramatically reduces GPU memory for the optimizer, since the vast majority of model parameters are frozen.
- Gradient computation -- The backward pass does not compute or store gradients for frozen parameters.
- Gradient clipping -- Only trainable parameter gradients are clipped.
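The exclusion follows mechanically from filtering on requires_grad. A minimal sketch (the toy module below is illustrative, not code from the repo):

```python
import torch
import torch.nn as nn

# Toy module standing in for a LoRA-augmented linear layer:
# a frozen pretrained weight plus small trainable A/B factors.
class ToyLoRALinear(nn.Module):
    def __init__(self, dim=768, r=8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim))
        self.weight.requires_grad = False                  # frozen pretrained weight
        self.lora_A = nn.Parameter(torch.randn(r, dim))    # trainable
        self.lora_B = nn.Parameter(torch.zeros(dim, r))    # trainable

model = ToyLoRALinear()

# The Trainer constructs the optimizer over trainable parameters only,
# so frozen weights get no Adam momentum/variance buffers.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=5e-4)

total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in trainable)
print(total_params, trainable_params)
```

Here only the two LoRA factors (12,288 of 602,112 parameters) ever appear in the optimizer state.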
For a RoBERTa-base model (125M parameters) with LoRA rank 8, only approximately 0.3M LoRA parameters and approximately 0.6M classifier parameters are trainable. The optimizer memory footprint is reduced by roughly 99.3% compared to full fine-tuning.
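The approximately 0.3M figure can be verified with back-of-envelope arithmetic, assuming LoRA is applied to the query and value projections of each layer (the configuration used in the paper's NLU experiments):

```python
# Trainable LoRA parameter count for roberta-base with rank 8,
# assuming adaptation of W_q and W_v in every transformer layer.
layers = 12            # transformer layers in roberta-base
hidden = 768           # hidden dimension
r = 8                  # LoRA rank
adapted_per_layer = 2  # W_q and W_v

# Each adapted d x d matrix adds A (r x d) and B (d x r).
params_per_matrix = r * hidden + hidden * r
total_lora = layers * adapted_per_layer * params_per_matrix
print(total_lora)  # 294912, i.e. ~0.3M trainable LoRA parameters
```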
Distributed Training
The NLU experiments use PyTorch's torch.distributed.launch for multi-GPU data-parallel training:
python -m torch.distributed.launch --nproc_per_node=8 \
examples/text-classification/run_glue.py \
--model_name_or_path roberta-base \
--task_name mnli \
--do_train --do_eval \
--apply_lora --lora_r 8 --lora_alpha 16 \
...
The Trainer internally uses DistributedDataParallel (DDP) to synchronize gradients across GPUs. Since only LoRA parameters have gradients, the all-reduce communication volume is minimal.
Training Loop
The Trainer's training loop (invoked via trainer.train()) performs the following per step:
- Forward pass -- Input tokens pass through the model; LoRA layers compute W @ x + (alpha/r) * B @ A @ x
- Loss computation -- Cross-entropy for classification tasks, MSE for STS-B regression
- Backward pass -- Gradients flow only through LoRA matrices and the classifier head
- Gradient accumulation -- When gradient_accumulation_steps > 1, gradients are accumulated before the optimizer step
- Optimizer step -- AdamW updates only the trainable parameters
- Learning rate scheduling -- Warmup followed by linear decay (configurable via warmup_ratio or warmup_steps)
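The forward computation in the first step above can be sketched as follows (shapes and initialization are illustrative; loralib initializes A with a Gaussian and B to zero):

```python
import torch

def lora_forward(x, W, A, B, alpha=16, r=8):
    """y = W x + (alpha/r) * B (A x): frozen weight plus scaled low-rank update."""
    scaling = alpha / r
    return x @ W.T + scaling * (x @ A.T) @ B.T

d = 768
x = torch.randn(4, d)            # batch of token representations
W = torch.randn(d, d)            # frozen pretrained weight
A = torch.randn(8, d) * 0.01     # LoRA A factor (small Gaussian init)
B = torch.zeros(d, 8)            # LoRA B factor (zero init)

y = lora_forward(x, W, A, B)
# With B = 0 the low-rank term contributes nothing, so at the start
# of training the output equals the frozen pretrained projection.
assert torch.allclose(y, x @ W.T)
```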
Evaluation Strategy
The Trainer supports multiple evaluation strategies controlled by --evaluation_strategy:
- epoch -- Evaluate at the end of each epoch (used by RoBERTa configs)
- steps -- Evaluate every N steps (used by DeBERTa V2 configs with --eval_steps 500)
- no -- No intermediate evaluation
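In code, these strategies map onto standard TrainingArguments options; a sketch using values from the RoBERTa-base recipe in this article (argument names are stock HF Trainer options, not repo-specific):

```python
from transformers import TrainingArguments

# Config fragment only; pass to Trainer(args=args, ...) alongside
# the LoRA-augmented model and GLUE datasets.
args = TrainingArguments(
    output_dir="./output",
    evaluation_strategy="epoch",      # or "steps" with eval_steps=500
    save_strategy="epoch",
    learning_rate=5e-4,
    num_train_epochs=30,
    per_device_train_batch_size=16,
    warmup_ratio=0.06,
    weight_decay=0.1,
    seed=0,
)
```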
For MNLI, the evaluation loop handles both matched and mismatched validation sets:
tasks = [data_args.task_name]
eval_datasets = [eval_dataset]
if data_args.task_name == "mnli":
    tasks.append("mnli-mm")
    eval_datasets.append(datasets["validation_mismatched"])
Checkpointing
The Trainer saves checkpoints according to --save_strategy (epoch or steps). Each checkpoint includes:
- Full model state dict -- All parameters including frozen pretrained weights (this is a HuggingFace Trainer default behavior)
- Optimizer state -- Only for trainable parameters
- Scheduler state -- Learning rate schedule progress
- Trainer state -- Global step, epoch, best metric
Because the Trainer saves the full state dict (not just LoRA weights), a post-hoc extraction step is needed to isolate LoRA-only weights for deployment.
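The extraction step can be sketched as a key filter over the checkpoint's state dict, assuming loralib's naming convention (lora_A / lora_B in parameter names); the keep_classifier flag is a hypothetical convenience for also retaining the task head:

```python
import torch

def extract_lora_state(full_state, keep_classifier=True):
    """Filter a full checkpoint down to LoRA (and optionally classifier) weights."""
    keep = {k: v for k, v in full_state.items() if "lora_" in k}
    if keep_classifier:
        keep.update({k: v for k, v in full_state.items()
                     if k.startswith("classifier.")})
    return keep

# Toy state dict mimicking checkpoint contents.
full = {
    "roberta.encoder.layer.0.attention.self.query.weight": torch.zeros(768, 768),
    "roberta.encoder.layer.0.attention.self.query.lora_A": torch.zeros(8, 768),
    "roberta.encoder.layer.0.attention.self.query.lora_B": torch.zeros(768, 8),
    "classifier.dense.weight": torch.zeros(768, 768),
}
small = extract_lora_state(full)
print(sorted(small))  # frozen pretrained weight dropped; LoRA factors + head kept
```

In practice the filtered dict is saved with torch.save and merged back into a pretrained base model at deployment time.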
Typical Configurations
RoBERTa-base
| Parameter | Value |
|---|---|
| Model | roberta-base |
| LoRA rank (r) | 8 |
| LoRA alpha | 16 |
| Learning rate | 5e-4 |
| Epochs | 30 |
| Max sequence length | 512 |
| Per-device batch size | 16 |
| Warmup ratio | 0.06 |
| Weight decay | 0.1 |
| Evaluation strategy | epoch |
| GPUs | 8 |
DeBERTa V2 XXL
| Parameter | Value |
|---|---|
| Model | microsoft/deberta-v2-xxlarge |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| Learning rate | 1e-4 |
| Epochs | 5 |
| Max sequence length | 256 |
| Per-device batch size | 8 |
| Warmup steps | 1000 |
| Weight decay | 0 |
| Evaluation strategy | steps (every 500) |
| FP16 | Yes |
| Classifier dropout | 0.15 |
| Deterministic algorithms | Yes |
| GPUs | 8 |
Seed and Reproducibility
The training scripts set multiple reproducibility controls:
- --seed 0 -- Sets Python, NumPy, and PyTorch random seeds via set_seed()
- CUBLAS_WORKSPACE_CONFIG=":16:8" -- Ensures deterministic cuBLAS operations
- PYTHONHASHSEED=0 -- Ensures deterministic Python hash ordering
- --use_deterministic_algorithms -- Enables PyTorch deterministic mode (DeBERTa V2 configs)
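Taken together, these controls correspond roughly to the following setup (a sketch; the repo sets the environment variables in its launch scripts rather than in Python):

```python
import os
import random

import numpy as np
import torch

# Must be set before cuBLAS initializes; hence the shell export in practice.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"
# Note: PYTHONHASHSEED only affects hash ordering if set before the
# interpreter starts, another reason these live in the launch scripts.
os.environ["PYTHONHASHSEED"] = "0"

# Equivalent of transformers' set_seed(0):
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)

# --use_deterministic_algorithms (DeBERTa V2 configs):
torch.use_deterministic_algorithms(True)
```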
Metadata
| Field | Value |
|---|---|
| Source | Repo (microsoft/LoRA) |
| Domains | Training, NLU, LoRA |
| Related | Implementation:Microsoft_LoRA_Run_GLUE_Training |