Environment: Microsoft LoRA PyTorch CUDA Environment
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, Deep_Learning |
| Last Updated | 2026-02-10 05:30 GMT |
Overview
PyTorch GPU environment with CUDA and the NCCL backend, required for training LoRA-adapted models across all workflows (NLG and NLU).
Description
This environment provides the core GPU-accelerated context for all LoRA training and inference. The loralib package itself depends only on PyTorch (`torch`) and requires Python >= 3.6. Training scripts use `torch.distributed` with the NCCL backend for multi-GPU support. The NLG scripts support four distributed platforms: `local` (torch.distributed.launch), `k8s` (Kubernetes with OpenMPI), `philly` (Microsoft internal), and `azure` (Horovod). All platforms require NVIDIA CUDA GPUs.
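The low-rank update at the heart of LoRA can be sketched numerically. The NumPy snippet below illustrates the reparameterization h = W0·x + (alpha/r)·B·A·x from the LoRA method; it is not code from the repository, and the sizes `d_out`, `d_in`, `r`, and `alpha` are arbitrary illustrative values.

```python
import numpy as np

# LoRA reparameterizes a frozen weight W0 with a trainable low-rank update:
#     h = W0 @ x + (alpha / r) * (B @ A) @ x
# Only A and B are trained; W0 stays frozen.
rng = np.random.default_rng(0)

d_out, d_in, r, alpha = 8, 16, 4, 8   # illustrative sizes, not from the repo

W0 = rng.normal(size=(d_out, d_in))   # frozen pretrained weight
A = rng.normal(size=(r, d_in))        # trainable, random init
B = np.zeros((d_out, r))              # trainable, zero init

def lora_forward(x):
    # B starts at zero, so training begins exactly at the pretrained behavior
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
assert np.allclose(lora_forward(x), W0 @ x)  # identical to base model at init
```

Because `B` is initialized to zero, the adapted model is exactly the pretrained model at the start of training, which is why LoRA fine-tuning can begin from any checkpoint without a warm-up.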
Usage
Use this environment for any workflow in the Microsoft LoRA repository. It is the mandatory prerequisite for the LoRA Integration, GPT-2 NLG Finetuning, and NLU GLUE Finetuning workflows. The model is explicitly moved to CUDA via `lm_net = lm_net.cuda()` in the NLG training script, and distributed training requires NCCL.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | NCCL backend requires Linux; Windows/macOS not supported for distributed training |
| Hardware | NVIDIA GPU with CUDA support | Multi-GPU recommended for distributed training |
| Python | >= 3.6 | Specified in `setup.py:L21` |
Dependencies
System Packages
- `cuda-toolkit` (CUDA 10.1 for NLG, CUDA 11.1 for NLU)
- `nccl` (required for `torch.distributed` backend)
Python Packages
- `loralib` == 0.1.2 (pin this release, or install the latest with `pip install loralib`)
- `torch` >= 1.7.1 (NLG uses 1.7.1+cu101, NLU uses 1.9.0+cu111)
- `numpy`
NLG-Specific Packages (examples/NLG/requirement.txt)
- `torch` == 1.7.1+cu101
- `transformers` == 3.3.1
- `spacy`
- `tqdm`
- `tensorboard`
- `progress`
Credentials
No API keys or credentials required for the core loralib package. The NLG pretrained checkpoint download uses public S3 URLs. The NLU workflow may require access to Hugging Face model hub for downloading RoBERTa/DeBERTa checkpoints.
Quick Install
# Install loralib (core package)
pip install loralib
# For NLG example (GPT-2 fine-tuning)
pip install torch==1.7.1 transformers==3.3.1 spacy tqdm tensorboard progress
# For NLU example (GLUE fine-tuning), use the conda environment:
# conda env create -f examples/NLU/environment.yml
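After installing, a quick sanity check can confirm that the core packages resolve to importable modules. The `missing_packages` helper below is hypothetical (not part of the repository) and uses only the standard library:

```python
import importlib.util

def missing_packages(names):
    # A package is "missing" if the import system cannot locate it.
    return [n for n in names if importlib.util.find_spec(n) is None]

# Core dependencies listed in this document; an empty list means all resolve.
print(missing_packages(["loralib", "torch", "numpy"]))
```

Note that this only checks importability, not versions or CUDA builds; `torch.cuda.is_available()` is the usual follow-up check for GPU support.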
Code Evidence
Python version requirement from `setup.py:21`:
python_requires='>=3.6',
CUDA device usage from `examples/NLG/src/gpt2_ft.py:326`:
lm_net = lm_net.cuda()
NCCL distributed backend from `examples/NLG/src/gpu.py:59`:
dist.init_process_group(backend='nccl')
local_rank = torch.distributed.get_rank()
torch.cuda.set_device(local_rank)
device = torch.device('cuda', local_rank)
Kubernetes platform environment variables from `examples/NLG/src/gpu.py:97-101`:
master_uri = f"tcp://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}"
local_rank = int(os.environ['OMPI_COMM_WORLD_LOCAL_RANK'])
world_size = int(os.environ['OMPI_COMM_WORLD_SIZE'])
world_rank = int(os.environ['OMPI_COMM_WORLD_RANK'])
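To illustrate how those variables combine, here is a self-contained sketch with placeholder values (the address, port, and ranks are made up, not taken from the repository):

```python
import os

# Placeholder values standing in for what OpenMPI/Kubernetes would export.
os.environ.update({
    "MASTER_ADDR": "10.0.0.1",
    "MASTER_PORT": "29500",
    "OMPI_COMM_WORLD_LOCAL_RANK": "0",   # GPU index on this node
    "OMPI_COMM_WORLD_SIZE": "8",         # total processes across all nodes
    "OMPI_COMM_WORLD_RANK": "3",         # this process's global rank
})

# Same derivation as in examples/NLG/src/gpu.py:
master_uri = f"tcp://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}"
local_rank = int(os.environ['OMPI_COMM_WORLD_LOCAL_RANK'])
world_size = int(os.environ['OMPI_COMM_WORLD_SIZE'])
world_rank = int(os.environ['OMPI_COMM_WORLD_RANK'])

# world_rank indexes across all nodes; local_rank selects the GPU on one node.
print(master_uri, world_rank, world_size, local_rank)
```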
Optional fp16 via NVIDIA Apex from `examples/NLG/src/gpt2_ft.py:266-270`:
if args.fp16:
    try:
        from apex import amp
    except Exception as e:
        warnings.warn('Could not import amp, apex may not be installed')
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: NCCL error` | NCCL not installed or GPU not available | Install NCCL and ensure NVIDIA drivers are loaded |
| `Could not import amp, apex may not be installed` | NVIDIA Apex not installed for fp16 training | Build NVIDIA Apex from source (`github.com/NVIDIA/apex`); the `apex` package on PyPI is unrelated. Alternatively, omit the `--fp16` flag |
| `CUDA out of memory` | Insufficient VRAM for model size | Reduce `--train_batch_size` or use a smaller model card (e.g. `gpt2.sm`) |
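For the out-of-memory case, one generic mitigation is to retry with a smaller batch. The sketch below shows the pattern with a stand-in `train_step`; it is not part of the repository's scripts:

```python
def run_with_backoff(train_step, batch_size, min_batch=1):
    """Retry train_step with halved batch sizes on CUDA OOM errors."""
    while batch_size >= min_batch:
        try:
            return train_step(batch_size)
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise               # unrelated error: propagate
            batch_size //= 2        # OOM: retry with half the batch
    raise RuntimeError("could not fit even the minimum batch size")
```

In practice, shrinking the batch is usually combined with more gradient-accumulation steps so the effective batch size stays constant.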
Compatibility Notes
- Distributed Platforms: Four platforms supported: `local` (torch.distributed.launch), `k8s` (Kubernetes/OpenMPI), `philly` (Microsoft internal), `azure` (Horovod). Set via `--platform` flag.
- Azure/Horovod: Requires `horovod` package with torch support. Uses `hvd.DistributedOptimizer` instead of DDP.
- FP16: Requires NVIDIA Apex (`from apex import amp`). Used with `opt_level="O1"`.
- CUBLAS Reproducibility: NLU scripts set `CUBLAS_WORKSPACE_CONFIG=":16:8"` for deterministic results.
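As a minimal sketch of the reproducibility note: the cuBLAS workspace variable must be set before any cuBLAS kernel launches, so the NLU scripts export it up front. In PyTorch it pairs with `torch.use_deterministic_algorithms(True)` (available since 1.8); only the environment-variable part is shown here to stay framework-free:

```python
import os

# Must be set before the first cuBLAS kernel runs, i.e. at the very top
# of the training script, before any CUDA work.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"

# The training script would then opt into deterministic kernels, e.g.:
#   torch.use_deterministic_algorithms(True)
print(os.environ["CUBLAS_WORKSPACE_CONFIG"])
```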