Environment: Microsoft LoRA PyTorch CUDA Environment
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, Deep_Learning |
| Last Updated | 2026-02-10 05:30 GMT |
Overview
PyTorch GPU environment with CUDA and the NCCL backend, required for training LoRA-adapted models across all workflows (NLG and NLU).
Description
This environment provides the core GPU-accelerated context for all LoRA training and inference. The loralib package itself depends only on PyTorch (`torch`) and requires Python >= 3.6. Training scripts use `torch.distributed` with the NCCL backend for multi-GPU support. The NLG scripts support four distributed platforms: `local` (torch.distributed.launch), `k8s` (Kubernetes with OpenMPI), `philly` (Microsoft internal), and `azure` (Horovod). All platforms require NVIDIA CUDA GPUs.
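The low-rank update at the heart of LoRA can be sketched numerically. The NumPy snippet below illustrates the reparameterization h = W0·x + (alpha/r)·B·A·x from the LoRA method; it is not code from the repository, and the sizes `d_out`, `d_in`, `r`, and `alpha` are arbitrary illustrative values.

```python
import numpy as np

# LoRA reparameterizes a frozen weight W0 with a trainable low-rank update:
#     h = W0 @ x + (alpha / r) * (B @ A) @ x
# Only A and B are trained; W0 stays frozen.
rng = np.random.default_rng(0)

d_out, d_in, r, alpha = 8, 16, 4, 8   # illustrative sizes, not from the repo

W0 = rng.normal(size=(d_out, d_in))   # frozen pretrained weight
A = rng.normal(size=(r, d_in))        # trainable, random init
B = np.zeros((d_out, r))              # trainable, zero init

def lora_forward(x):
    # B starts at zero, so training begins exactly at the pretrained behavior
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
assert np.allclose(lora_forward(x), W0 @ x)  # identical to base model at init
```

Because `B` is initialized to zero, the adapted model is exactly the pretrained model at the start of training, which is why LoRA fine-tuning can begin from any checkpoint without a warm-up.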
Usage
Use this environment for any workflow in the Microsoft LoRA repository. It is the mandatory prerequisite for the LoRA Integration, GPT-2 NLG Finetuning, and NLU GLUE Finetuning workflows. The model is explicitly moved to CUDA via `lm_net = lm_net.cuda()` in the NLG training script, and distributed training requires NCCL.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | NCCL backend requires Linux; Windows/macOS not supported for distributed training |
| Hardware | NVIDIA GPU with CUDA support | Multi-GPU recommended for distributed training |
| Python | >= 3.6 | Specified in `setup.py:L21` |
Dependencies
System Packages
- `cuda-toolkit` (CUDA 10.1 for NLG, CUDA 11.1 for NLU)
- `nccl` (required for `torch.distributed` backend)
Python Packages
- `loralib` == 0.1.2 (pin this release, or install the latest with `pip install loralib`)
- `torch` >= 1.7.1 (NLG uses 1.7.1+cu101, NLU uses 1.9.0+cu111)
- `numpy`
NLG-Specific Packages (examples/NLG/requirement.txt)
- `torch` == 1.7.1+cu101
- `transformers` == 3.3.1
- `spacy`
- `tqdm`
- `tensorboard`
- `progress`
Credentials
No API keys or credentials required for the core loralib package. The NLG pretrained checkpoint download uses public S3 URLs. The NLU workflow may require access to Hugging Face model hub for downloading RoBERTa/DeBERTa checkpoints.
Quick Install
# Install loralib (core package)
pip install loralib
# For NLG example (GPT-2 fine-tuning)
pip install torch==1.7.1 transformers==3.3.1 spacy tqdm tensorboard progress
# For NLU example (GLUE fine-tuning), use the conda environment:
# conda env create -f examples/NLU/environment.yml
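After installing, a quick sanity check can confirm that the core packages resolve to importable modules. The `missing_packages` helper below is hypothetical (not part of the repository) and uses only the standard library:

```python
import importlib.util

def missing_packages(names):
    # A package is "missing" if the import system cannot locate it.
    return [n for n in names if importlib.util.find_spec(n) is None]

# Core dependencies listed in this document; an empty list means all resolve.
print(missing_packages(["loralib", "torch", "numpy"]))
```

Note that this only checks importability, not versions or CUDA builds; `torch.cuda.is_available()` is the usual follow-up check for GPU support.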
Code Evidence
Python version requirement from `setup.py:21`:
python_requires='>=3.6',
CUDA device usage from `examples/NLG/src/gpt2_ft.py:326`:
lm_net = lm_net.cuda()
NCCL distributed backend from `examples/NLG/src/gpu.py:59`:
dist.init_process_group(backend='nccl')
local_rank = torch.distributed.get_rank()
torch.cuda.set_device(local_rank)
device = torch.device('cuda', local_rank)
Kubernetes platform environment variables from `examples/NLG/src/gpu.py:97-101`:
master_uri = f"tcp://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}"
local_rank = int(os.environ['OMPI_COMM_WORLD_LOCAL_RANK'])
world_size = int(os.environ['OMPI_COMM_WORLD_SIZE'])
world_rank = int(os.environ['OMPI_COMM_WORLD_RANK'])
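To illustrate how those variables combine, here is a self-contained sketch with placeholder values (the address, port, and ranks are made up, not taken from the repository):

```python
import os

# Placeholder values standing in for what OpenMPI/Kubernetes would export.
os.environ.update({
    "MASTER_ADDR": "10.0.0.1",
    "MASTER_PORT": "29500",
    "OMPI_COMM_WORLD_LOCAL_RANK": "0",   # GPU index on this node
    "OMPI_COMM_WORLD_SIZE": "8",         # total processes across all nodes
    "OMPI_COMM_WORLD_RANK": "3",         # this process's global rank
})

# Same derivation as in examples/NLG/src/gpu.py:
master_uri = f"tcp://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}"
local_rank = int(os.environ['OMPI_COMM_WORLD_LOCAL_RANK'])
world_size = int(os.environ['OMPI_COMM_WORLD_SIZE'])
world_rank = int(os.environ['OMPI_COMM_WORLD_RANK'])

# world_rank indexes across all nodes; local_rank selects the GPU on one node.
print(master_uri, world_rank, world_size, local_rank)
```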
Optional fp16 via NVIDIA Apex from `examples/NLG/src/gpt2_ft.py:266-270`:
if args.fp16:
    try:
        from apex import amp
    except Exception as e:
        warnings.warn('Could not import amp, apex may not be installed')
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: NCCL error` | NCCL not installed or GPU not available | Install NCCL and ensure NVIDIA drivers are loaded |
| `Could not import amp, apex may not be installed` | NVIDIA Apex not installed for fp16 training | Build NVIDIA Apex from source (`github.com/NVIDIA/apex`); the `apex` package on PyPI is unrelated. Alternatively, omit the `--fp16` flag |
| `CUDA out of memory` | Insufficient VRAM for model size | Reduce `--train_batch_size` or use a smaller model card (e.g. `gpt2.sm`) |
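For the out-of-memory case, one generic mitigation is to retry with a smaller batch. The sketch below shows the pattern with a stand-in `train_step`; it is not part of the repository's scripts:

```python
def run_with_backoff(train_step, batch_size, min_batch=1):
    """Retry train_step with halved batch sizes on CUDA OOM errors."""
    while batch_size >= min_batch:
        try:
            return train_step(batch_size)
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise               # unrelated error: propagate
            batch_size //= 2        # OOM: retry with half the batch
    raise RuntimeError("could not fit even the minimum batch size")
```

In practice, shrinking the batch is usually combined with more gradient-accumulation steps so the effective batch size stays constant.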
Compatibility Notes
- Distributed Platforms: Four platforms supported: `local` (torch.distributed.launch), `k8s` (Kubernetes/OpenMPI), `philly` (Microsoft internal), `azure` (Horovod). Set via `--platform` flag.
- Azure/Horovod: Requires `horovod` package with torch support. Uses `hvd.DistributedOptimizer` instead of DDP.
- FP16: Requires NVIDIA Apex (`from apex import amp`). Used with `opt_level="O1"`.
- CUBLAS Reproducibility: NLU scripts set `CUBLAS_WORKSPACE_CONFIG=":16:8"` for deterministic results.
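As a minimal sketch of the reproducibility note: the cuBLAS workspace variable must be set before any cuBLAS kernel launches, so the NLU scripts export it up front. In PyTorch it pairs with `torch.use_deterministic_algorithms(True)` (available since 1.8); only the environment-variable part is shown here to stay framework-free:

```python
import os

# Must be set before the first cuBLAS kernel runs, i.e. at the very top
# of the training script, before any CUDA work.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"

# The training script would then opt into deterministic kernels, e.g.:
#   torch.use_deterministic_algorithms(True)
print(os.environ["CUBLAS_WORKSPACE_CONFIG"])
```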