
Principle:Microsoft LoRA HF Trainer LoRA Training

From Leeroopedia


Overview

HF Trainer LoRA Training describes how the HuggingFace Trainer class orchestrates LoRA fine-tuning on GLUE benchmark tasks within the microsoft/LoRA repository. The Trainer provides a high-level training loop that handles distributed training, gradient accumulation, mixed-precision training, evaluation strategies, checkpointing, and logging, while LoRA's parameter freezing integrates transparently with the Trainer's optimizer construction.

Trainer and LoRA Integration

The key insight behind using the HuggingFace Trainer with LoRA is that parameter freezing requires no Trainer-specific support. When parameters have requires_grad=False, they are automatically excluded from:

  • Optimizer state construction -- Only trainable parameters (LoRA matrices + classifier head) receive optimizer states (momentum, variance in Adam). This dramatically reduces GPU memory for the optimizer, since the vast majority of model parameters are frozen.
  • Gradient computation -- The backward pass does not compute or store gradients for frozen parameters.
  • Gradient clipping -- Only trainable parameter gradients are clipped.

For a RoBERTa-base model (125M parameters) with LoRA rank 8, only approximately 0.3M LoRA parameters and approximately 0.6M classifier parameters are trainable. The optimizer memory footprint is reduced by roughly 99.3% compared to full fine-tuning.
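These counts can be reproduced with back-of-envelope arithmetic. The sketch below assumes LoRA is applied to the query and value projections only (as in the microsoft/LoRA GLUE experiments) and uses RoBERTa-base's published dimensions; the exact head size depends on the task's label count.

```python
# Back-of-envelope count of trainable parameters for roberta-base + LoRA r=8,
# assuming LoRA adapts only the query and value projections in each layer.
hidden = 768      # roberta-base hidden size
layers = 12       # transformer layers
r = 8             # LoRA rank
num_labels = 3    # e.g. MNLI

# Each adapted projection adds A (r x hidden) and B (hidden x r).
lora_per_proj = r * hidden + hidden * r
lora_total = lora_per_proj * 2 * layers        # q and v in every layer

# Classification head: dense (hidden x hidden + bias) + out_proj (+ bias).
head_total = hidden * hidden + hidden + hidden * num_labels + num_labels

full_model = 125_000_000
trainable = lora_total + head_total
print(f"LoRA params:       {lora_total:,}")          # ~0.3M
print(f"Classifier params: {head_total:,}")          # ~0.6M
print(f"Trainable fraction: {trainable / full_model:.3%}")  # ~0.7%
```

The trainable fraction of roughly 0.7% is the flip side of the ~99.3% optimizer-memory reduction quoted above.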

Distributed Training

The NLU experiments use PyTorch's torch.distributed.launch for multi-GPU data-parallel training:

python -m torch.distributed.launch --nproc_per_node=8 \
    examples/text-classification/run_glue.py \
    --model_name_or_path roberta-base \
    --task_name mnli \
    --do_train --do_eval \
    --apply_lora --lora_r 8 --lora_alpha 16 \
    ...

The Trainer internally uses DistributedDataParallel (DDP) to synchronize gradients across GPUs. Since only LoRA parameters have gradients, the all-reduce communication volume is minimal.
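The communication saving can be estimated directly: DDP all-reduces only gradients of parameters that require them, so with roughly 0.9M trainable parameters (the ~0.3M LoRA plus ~0.6M classifier parameters noted above) instead of 125M, the per-step gradient traffic shrinks by two orders of magnitude. A rough fp32 estimate:

```python
# Rough per-step DDP all-reduce volume with fp32 gradients; only parameters
# that have gradients are synchronized. Counts are approximate.
bytes_per_grad = 4
full_ft_params = 125_000_000          # full fine-tuning of roberta-base
lora_params = 300_000 + 600_000       # LoRA matrices + classifier head

full_volume = full_ft_params * bytes_per_grad / 1e6   # MB per step
lora_volume = lora_params * bytes_per_grad / 1e6
print(f"full fine-tuning: {full_volume:.0f} MB/step")
print(f"LoRA:             {lora_volume:.1f} MB/step")
```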

Training Loop

The Trainer's training loop (invoked via trainer.train()) performs the following per step:

  • Forward pass -- Input tokens pass through the model; LoRA layers compute W @ x + (alpha / r) * (B @ A) @ x
  • Loss computation -- Cross-entropy for classification tasks, MSE for STS-B regression
  • Backward pass -- Gradients flow only through LoRA matrices and the classifier head
  • Gradient accumulation -- When gradient_accumulation_steps > 1, gradients are accumulated before the optimizer step
  • Optimizer step -- AdamW updates only the trainable parameters
  • Learning rate scheduling -- Warmup followed by linear decay (configurable via warmup_ratio or warmup_steps)
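The forward computation in the first step above can be sketched in NumPy. The dimensions here are toy values; the zero initialization of B is the real LoRA convention, which guarantees the adapted layer starts out identical to the frozen pretrained layer.

```python
import numpy as np

# Minimal sketch of the LoRA forward pass: y = W x + (alpha/r) * B A x,
# with B initialized to zero so training starts from the pretrained output.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 16, 8, 16

W = rng.standard_normal((d_out, d_in))   # frozen pretrained weight
A = rng.standard_normal((r, d_in))       # trainable, Gaussian init
B = np.zeros((d_out, r))                 # trainable, zero init

def lora_forward(x):
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((4, d_in))
y0 = lora_forward(x)
assert np.allclose(y0, x @ W.T)          # zero B => identical to base layer

# Once training nudges B away from zero, the low-rank update contributes.
B += 0.01 * rng.standard_normal(B.shape)
y1 = lora_forward(x)
print("max update magnitude:", np.abs(y1 - y0).max())
```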

Evaluation Strategy

The Trainer supports multiple evaluation strategies controlled by --evaluation_strategy:

  • epoch -- Evaluate at the end of each epoch (used by RoBERTa configs)
  • steps -- Evaluate every N steps (used by DeBERTa V2 configs with --eval_steps 500)
  • no -- No intermediate evaluation

For MNLI, the evaluation loop handles both matched and mismatched validation sets:

tasks = [data_args.task_name]
eval_datasets = [eval_dataset]
if data_args.task_name == "mnli":
    tasks.append("mnli-mm")
    eval_datasets.append(datasets["validation_mismatched"])

Checkpointing

The Trainer saves checkpoints according to --save_strategy (epoch or steps). Each checkpoint includes:

  • Full model state dict -- All parameters including frozen pretrained weights (this is a HuggingFace Trainer default behavior)
  • Optimizer state -- Only for trainable parameters
  • Scheduler state -- Learning rate schedule progress
  • Trainer state -- Global step, epoch, best metric

Because the Trainer saves the full state dict (not just LoRA weights), a post-hoc extraction step is needed to isolate LoRA-only weights for deployment.
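That extraction amounts to filtering the checkpoint's state dict down to keys containing "lora_", plus the classifier head, which is also trained. The microsoft/LoRA repo ships a similar helper (loralib.lora_state_dict, which operates on a model rather than a raw dict); the standalone sketch below uses plain dicts with string stand-ins for tensors.

```python
# Post-hoc LoRA extraction: keep only the trainable entries of a full
# checkpoint state dict (LoRA matrices plus the classifier head).
def extract_lora_state_dict(state_dict, extra_prefixes=("classifier.",)):
    return {
        name: tensor
        for name, tensor in state_dict.items()
        if "lora_" in name or name.startswith(extra_prefixes)
    }

full = {
    "roberta.encoder.layer.0.attention.self.query.weight": "frozen W_q",
    "roberta.encoder.layer.0.attention.self.query.lora_A": "trainable A",
    "roberta.encoder.layer.0.attention.self.query.lora_B": "trainable B",
    "classifier.out_proj.weight": "trainable head",
}
small = extract_lora_state_dict(full)
print(sorted(small))  # only the lora_* and classifier.* entries survive
```

The resulting dict is what gets shipped for deployment: a few megabytes instead of the full pretrained model.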

Typical Configurations

RoBERTa-base

Parameter Value
Model roberta-base
LoRA rank (r) 8
LoRA alpha 16
Learning rate 5e-4
Epochs 30
Max sequence length 512
Per-device batch size 16
Warmup ratio 0.06
Weight decay 0.1
Evaluation strategy epoch
GPUs 8

DeBERTa V2 XXL

Parameter Value
Model microsoft/deberta-v2-xxlarge
LoRA rank (r) 16
LoRA alpha 32
Learning rate 1e-4
Epochs 5
Max sequence length 256
Per-device batch size 8
Warmup steps 1000
Weight decay 0
Evaluation strategy steps (every 500)
FP16 Yes
Classifier dropout 0.15
Deterministic algorithms Yes
GPUs 8

Seed and Reproducibility

The training scripts set multiple reproducibility controls:

  • --seed 0 -- Sets Python, NumPy, and PyTorch random seeds via set_seed()
  • CUBLAS_WORKSPACE_CONFIG=":16:8" -- Ensures deterministic cuBLAS operations
  • PYTHONHASHSEED=0 -- Ensures deterministic Python hash ordering
  • --use_deterministic_algorithms -- Enables PyTorch deterministic mode (DeBERTa V2 configs)
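A hedged sketch of these controls in standard-library Python; transformers' set_seed() additionally seeds NumPy, PyTorch, and CUDA, and the environment variables must be exported before the process (and CUDA) start to take full effect.

```python
import os
import random

# Environment controls from the list above. PYTHONHASHSEED only affects a
# process if set before interpreter startup; setting it here documents the
# intent but does not change the current process's hashing.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"  # deterministic cuBLAS
os.environ["PYTHONHASHSEED"] = "0"

def set_seed_sketch(seed: int = 0) -> None:
    random.seed(seed)
    # transformers.set_seed(seed) additionally does:
    #   np.random.seed(seed); torch.manual_seed(seed)
    #   torch.cuda.manual_seed_all(seed)

set_seed_sketch(0)
a = random.random()
set_seed_sketch(0)
b = random.random()
print(a == b)  # same seed => same draw
```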

Metadata

Field Value
Source Repo (microsoft/LoRA)
Domains Training, NLU, LoRA
Related Implementation:Microsoft_LoRA_Run_GLUE_Training
