Principle: Distributed LoRA Training (Microsoft LoRA)
Overview
Distributed LoRA Training is the principle of fine-tuning a LoRA-augmented GPT-2 model across multiple GPUs using PyTorch's distributed data parallel (DDP) framework. This approach combines data parallelism, gradient accumulation, learning rate scheduling, optional mixed precision, and LoRA-only checkpoint saving to enable efficient and reproducible training of low-rank adapted language models.
Description
Data Parallelism
The training script uses torch.distributed.launch to spawn one process per GPU. Each process loads the full model and receives a disjoint partition of the training data via torch.utils.data.distributed.DistributedSampler. Gradients are synchronized across processes through PyTorch's DDP wrapper, ensuring that all replicas maintain identical parameters after each optimizer step.
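The disjoint partitioning can be illustrated with a small sketch. This is plain Python mimicking `DistributedSampler`'s rank-strided split of dataset indices, not the torch implementation itself (which additionally pads the index list so every rank gets the same number of samples):

```python
# Sketch of how DistributedSampler hands each rank a disjoint slice of the
# dataset: indices are strided by world_size, so no sample is seen twice
# within one pass (illustrative only; torch also handles shuffling/padding).

def partition_indices(num_samples: int, world_size: int, rank: int) -> list[int]:
    """Return the rank-strided subset of dataset indices for one process."""
    return list(range(rank, num_samples, world_size))

# Example: 8 samples split across 4 processes.
parts = [partition_indices(8, 4, r) for r in range(4)]
# Each sample appears in exactly one partition.
assert sorted(i for p in parts for i in p) == list(range(8))
```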
Gradient Accumulation
When the effective batch size exceeds GPU memory capacity, the --grad_acc parameter specifies how many forward-backward passes to accumulate before performing a single optimizer step. The loss is divided by the accumulation factor (_lm_loss / args.grad_acc) to maintain correct gradient scaling. The optimizer step and gradient zeroing only occur when train_step % args.grad_acc == 0.
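The accumulation logic can be sketched with plain floats standing in for the loss and gradient tensors (the 1-based `train_step` counter matches the modulo check above; `accumulate` is an illustrative name, not a function from the training script):

```python
# Sketch of gradient accumulation: each micro-batch loss is scaled by
# 1/grad_acc before "backward", so the accumulated gradient matches what a
# single large-batch step would produce.

def accumulate(micro_losses: list[float], grad_acc: int) -> float:
    """Return the accumulated value after one full accumulation cycle."""
    acc = 0.0
    for train_step, lm_loss in enumerate(micro_losses, start=1):
        acc += lm_loss / grad_acc          # _lm_loss / args.grad_acc, then backward()
        if train_step % grad_acc == 0:     # optimizer.step() + zero_grad() fire here
            return acc
    return acc

# Accumulating 4 scaled micro-batches equals the mean over the 4 losses.
assert accumulate([2.0, 4.0, 6.0, 8.0], 4) == 5.0
```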
Learning Rate Scheduling
The training pipeline supports four learning rate scheduling strategies:
- linear -- Linear warmup followed by linear decay to zero. This is the default scheduler.
- cosine -- Cosine annealing with warmup, decaying from max_lr to min_lr = 0 following a cosine curve.
- cycle -- Cyclic scheduling with user-defined interval steps and learning rates, enabling multi-phase training with different rates.
- constant -- Constant learning rate after an initial linear warmup period.
All schedulers support a warmup_step parameter that controls the number of steps during which the learning rate increases linearly from zero.
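The default linear schedule can be sketched as a pure function of the step count (a minimal illustration of warmup-then-decay; the function name and signature are illustrative, not the script's API):

```python
def linear_schedule(step: int, max_lr: float, warmup_step: int, total_steps: int) -> float:
    """Linear warmup from 0 to max_lr over warmup_step steps, then linear
    decay back to 0 at total_steps (the default scheduler's shape)."""
    if step < warmup_step:
        return max_lr * step / warmup_step
    return max_lr * max(0.0, (total_steps - step) / (total_steps - warmup_step))

# Halfway through warmup the rate is half of max_lr; it peaks at warmup_step
# and reaches zero at the final step.
assert linear_schedule(50, 1.0, 100, 1000) == 0.5
assert linear_schedule(100, 1.0, 100, 1000) == 1.0
assert linear_schedule(1000, 1.0, 100, 1000) == 0.0
```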
Label Smoothing
Label smoothing regularization replaces the hard one-hot target distribution with a mixture: (1 - epsilon) * one_hot + epsilon * uniform. This is controlled by the --label_smooth parameter (default 0.0). When enabled (>0.0001), the loss is computed as:
loss = (1 - label_smooth) * nll_loss + label_smooth * smooth_loss
where smooth_loss is the negative mean of log probabilities across the vocabulary.
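The mixture can be sketched for a single position over a toy vocabulary of log probabilities (pure Python; `smoothed_loss` is an illustrative helper, not the script's function):

```python
import math

def smoothed_loss(log_probs: list[float], target: int, label_smooth: float) -> float:
    """Mix the NLL of the target token with the mean NLL over the vocabulary."""
    nll_loss = -log_probs[target]                     # standard cross-entropy term
    smooth_loss = -sum(log_probs) / len(log_probs)    # negative mean log prob
    return (1 - label_smooth) * nll_loss + label_smooth * smooth_loss

# With label_smooth = 0 this reduces to the plain NLL of the target.
lp = [math.log(p) for p in (0.7, 0.2, 0.1)]
assert abs(smoothed_loss(lp, 0, 0.0) - (-math.log(0.7))) < 1e-12
```

Because the smooth term penalizes low probability anywhere in the vocabulary, any `label_smooth > 0` increases the loss whenever the model is confidently peaked on the target.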
FP16 Mixed Precision
When --fp16 is enabled, the training script uses NVIDIA Apex's amp (Automatic Mixed Precision) with optimization level O1 to train in half-precision where safe. This reduces memory consumption and accelerates training on GPUs with Tensor Cores. Gradient clipping under FP16 uses amp.master_params(optimizer) to clip in FP32 master weights.
LoRA-Only Checkpoint Saving
At regular intervals (controlled by --save_interval), the training script saves only the LoRA parameters using:
torch.save({'model_state_dict': lora.lora_state_dict(model)}, model_path)
The lora.lora_state_dict() function filters the model's state dict to include only LoRA-specific parameters (the low-rank A and B matrices), resulting in checkpoint files that are orders of magnitude smaller than full model checkpoints. The final epoch checkpoint saves the complete model state dict.
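The core of that filtering can be sketched on plain dicts. loralib's `lora_state_dict()` essentially keeps entries whose names contain `lora_` (it also takes a `bias` argument to optionally include bias terms, omitted here); the helper name below is illustrative:

```python
def lora_only_state_dict(state_dict: dict) -> dict:
    """Keep only entries whose names contain 'lora_' (the low-rank A/B matrices)."""
    return {k: v for k, v in state_dict.items() if "lora_" in k}

full = {
    "transformer.h.0.attn.c_attn.weight": "frozen pretrained weight",
    "transformer.h.0.attn.c_attn.lora_A": "low-rank A",
    "transformer.h.0.attn.c_attn.lora_B": "low-rank B",
}
# Only the two LoRA matrices survive, which is why checkpoints are tiny.
assert set(lora_only_state_dict(full)) == {
    "transformer.h.0.attn.c_attn.lora_A",
    "transformer.h.0.attn.c_attn.lora_B",
}
```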
Optimizer
The training uses AdamW with decoupled weight decay. Key optimizer parameters include:
- lr (default: 0.00001) -- Learning rate.
- weight_decay (default: 0.01) -- Weight decay coefficient.
- adam_beta1 (default: 0.9) -- First moment exponential decay rate.
- adam_beta2 (default: 0.98) -- Second moment exponential decay rate.
- adam_epsilon (default: 1e-6) -- Numerical stability epsilon.
When lora_dim > 0, lora.mark_only_lora_as_trainable(model) is called before optimizer creation, ensuring that only the LoRA parameters receive gradients while all pretrained weights remain frozen.
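The freezing logic can be sketched with mock parameter objects. loralib's real `mark_only_lora_as_trainable()` iterates `model.named_parameters()` on an `nn.Module` (and has a `bias` option); the sketch below only models the name-based `requires_grad` toggle:

```python
class Param:
    """Minimal stand-in for a named torch parameter (illustrative only)."""
    def __init__(self, name: str):
        self.name = name
        self.requires_grad = True

def mark_only_lora_trainable(params: list[Param]) -> None:
    """Freeze everything except parameters whose names contain 'lora_'."""
    for p in params:
        p.requires_grad = "lora_" in p.name

params = [Param("attn.weight"), Param("attn.lora_A"), Param("attn.lora_B")]
mark_only_lora_trainable(params)
# Pretrained weights are frozen; only the A/B matrices receive gradients.
assert [p.requires_grad for p in params] == [False, True, True]
```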
Theoretical Basis
Distributed data parallelism is mathematically equivalent to training with a larger batch size: each GPU processes batch_size samples, and with N GPUs the effective batch size is N * batch_size * grad_acc. The allreduce operation averages gradients across replicas, producing the same gradient as a single-GPU run with the full batch. LoRA's parameter efficiency makes distributed training particularly effective: with only ~0.35M trainable parameters for GPT-2 Medium rank 4, the gradient synchronization overhead is minimal.
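The averaging equivalence can be checked numerically with a toy sketch (plain floats standing in for gradient tensors; `allreduce_mean` illustrates what DDP's allreduce-then-divide does, it is not a torch API):

```python
def allreduce_mean(per_gpu_grads: list[float]) -> float:
    """Average gradients across replicas, as DDP's allreduce does."""
    return sum(per_gpu_grads) / len(per_gpu_grads)

# Two GPUs, each averaging gradients over its local batch of 2 samples,
# yield the same gradient as one GPU averaging over all 4 samples.
local_batches = [[1.0, 3.0], [5.0, 7.0]]            # per-sample "gradients"
per_gpu = [sum(b) / len(b) for b in local_batches]  # local mean per replica
assert allreduce_mean(per_gpu) == sum(sum(b) for b in local_batches) / 4
```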
Metadata
| Field | Value |
|---|---|
| Source | microsoft/LoRA |
| Domains | Training, NLG |
| Type | External Tool Doc |
| Last Updated | 2026-02-10 |