Environment:Huggingface Optimum Tensor Parallelization Environment
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Training, GPU_Acceleration |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Multi-GPU environment with `torch.distributed` process groups, `torch.compile`, Transformers >= 4.20.0, `safetensors`, and `huggingface_hub` for automatic tensor parallelism via PyTorch FX.
Description
This environment provides the infrastructure for automatic tensor parallelism using PyTorch FX graph transformations. It requires multiple GPUs with NCCL-based `torch.distributed` communication, PyTorch's `torch.compile` with fullgraph mode, and internal `torch.fx` APIs for graph manipulation. The parallelization system initializes models on a `meta` device for memory efficiency, then shards weights across devices. Weights are loaded from `safetensors` files via `huggingface_hub`.
Usage
Use this environment when running the Automatic Tensor Parallelization workflow. This requires a multi-GPU setup with an initialized `torch.distributed` process group. The parallelization system uses `torch.compile` to capture the computation graph and then applies parallel transformation passes.
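The capture step can be illustrated with a toy `torch.compile` backend. `inspect_backend` below is a hypothetical stand-in for optimum's parallelization backend: a backend receives the captured FX `GraphModule`, and optimum's real backend rewrites that graph with its parallelization passes before execution.

```python
import torch

captured = {}

def inspect_backend(gm: torch.fx.GraphModule, example_inputs):
    # A torch.compile backend receives the whole captured FX graph
    # (fullgraph=True forbids graph breaks). optimum's real backend
    # applies its transformation passes here before returning a callable.
    captured["num_nodes"] = len(list(gm.graph.nodes))
    return gm.forward  # run the captured graph unmodified

@torch.compile(fullgraph=True, backend=inspect_backend)
def scale_shift(x):
    return x * 2.0 + 1.0

out = scale_shift(torch.ones(4))
```

The backend is only invoked on the first call, when Dynamo traces the function into an FX graph.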
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | NCCL requires Linux for multi-GPU communication |
| Hardware | Multiple NVIDIA GPUs | World size > 1 required for parallelism |
| Interconnect | NVLink or PCIe | NCCL for inter-GPU communication |
| Disk | Sufficient for model weights | Safetensors files loaded from disk |
Dependencies
Required Packages
- `torch` >= 2.1.0 (torch.compile, torch.fx, torch.distributed)
- `transformers` >= 4.20.0 (FX features availability check)
- `safetensors` (weight loading from safetensors files)
- `huggingface_hub` (model downloading and weight map collection)
Required PyTorch Modules
- `torch.distributed` (ProcessGroup, all_reduce, all_gather, scatter)
- `torch.compile` with `fullgraph=True` mode
- `torch.fx` (GraphModule, Graph, Node)
- `torch._decomp.core_aten_decompositions` (ATen op decomposition)
- `torch._subclasses.functional_tensor` (FunctionalTensor, FunctionalTensorMode)
- `torch.fx.experimental.proxy_tensor` (ProxyTorchDispatchMode, track_tensor_tree)
Credentials
- `HF_TOKEN`: HuggingFace API token for downloading gated models.
- `HF_HUB_OFFLINE`: Set to `1` to skip downloads and use only local files (checked in `optimum/fx/parallelization/utils.py:385`).
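As a rough sketch (not optimum's exact parsing, which lives at the path above), the offline switch boils down to an environment check like this; `hub_offline` is a hypothetical helper:

```python
import os

def hub_offline() -> bool:
    # HF_HUB_OFFLINE set to a truthy value skips downloads and uses
    # only locally cached files; unset or "0" allows network access.
    return os.environ.get("HF_HUB_OFFLINE", "0").strip().lower() in {"1", "true", "yes", "on"}
```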
Quick Install
```shell
# Install required packages (quote the specifiers so ">=" is not treated as a shell redirect)
pip install optimum "torch>=2.1.0" "transformers>=4.20.0" safetensors huggingface_hub

# Launch with torchrun for multi-GPU
torchrun --nproc_per_node=NUM_GPUS your_script.py
```
Code Evidence
Transformers version check from `optimum/fx/utils.py:21-27`:
```python
_TRANSFORMERS_MIN_VERSION = version.parse("4.20.0.dev0")

transformers_version = version.parse(transformers.__version__)
_fx_features_available = (_TRANSFORMERS_MIN_VERSION.major, _TRANSFORMERS_MIN_VERSION.minor) <= (
    transformers_version.major,
    transformers_version.minor,
)
```
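The effect of that check can be reproduced with a small hypothetical helper (the real code uses `packaging.version`): only `(major, minor)` are compared, so patch and dev suffixes are ignored.

```python
def fx_features_available(installed: str, minimum: str = "4.20.0") -> bool:
    # Compare only (major, minor): "4.20.1" and "4.20.0.dev0"
    # are treated the same as "4.20.0".
    def major_minor(v: str) -> tuple:
        major, minor = v.split(".")[:2]
        return int(major), int(minor)

    return major_minor(minimum) <= major_minor(installed)
```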
FX features guard decorator from `optimum/fx/utils.py:34-43`:
```python
def check_if_available(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        if not are_fx_features_available():
            raise ImportError(
                f"Found an incompatible version of transformers. Found version "
                f"{transformers_version}, but only {_TRANSFORMERS_MIN_VERSION} "
                f"and above are supported."
            )
        return func(*args, **kwargs)

    return wrapper
```
torch.compile with fullgraph mode from `optimum/fx/parallelization/api.py:124`:
```python
model = torch.compile(model, fullgraph=True, backend=backend)
```
Distributed process group requirement from `optimum/fx/parallelization/core.py:144`:
```python
tp_group: dist.ProcessGroup
```
Divisibility constraint from `optimum/fx/parallelization/utils.py:40-45`:
```python
def ensure_divisibility(numerator: int, denominator: int) -> None:
    if numerator % denominator != 0:
        raise RuntimeError(
            f"{numerator} is not divisible by {denominator}, check if the parallel "
            "dimension of weight parameters is divisible by parallelism level"
            "(world size of tensor parallel group)"
        )
```
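In practice this check gates per-rank shard sizes; a hypothetical illustration (`shard_size` is not an optimum function):

```python
def shard_size(dim: int, world_size: int) -> int:
    # Column/row-parallel layers split one weight dimension evenly across
    # the tensor-parallel group; uneven splits are rejected up front.
    if dim % world_size != 0:
        raise RuntimeError(f"{dim} is not divisible by {world_size}")
    return dim // world_size
```

For example, a 4096-wide projection on 4 GPUs gives 1024 columns per rank, while `world_size=3` raises.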
Safetensors weight loading from `optimum/fx/parallelization/passes.py:486-492`:
```python
from safetensors import safe_open

with safe_open(ctx.weight_map[target.source], framework="pt", device="cpu") as fp:
    tensor_slice = fp.get_slice(target.source)
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `Found an incompatible version of transformers` | Transformers < 4.20.0 | `pip install -U "transformers>=4.20.0"` |
| `{N} is not divisible by {M}` | Model parallel dimension not divisible by world size | Use a world size that divides the model hidden dimensions evenly |
| `illegal path for recompilation` | Pass pipeline encountered unexpected recompilation | Ensure model is in eval mode and inputs are consistent |
| NCCL errors | torch.distributed not initialized | Initialize with `torchrun` or `torch.distributed.init_process_group()` |
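A single-process CPU illustration of the initialization that `torchrun` normally performs. Note the `gloo` backend stands in for NCCL here so the snippet runs without GPUs; real tensor parallelism needs NCCL and world size > 1.

```python
import os

import torch
import torch.distributed as dist

# torchrun normally sets these rendezvous variables for you.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29511")

dist.init_process_group(backend="gloo", rank=0, world_size=1)

t = torch.ones(3)
dist.all_reduce(t)  # sums across ranks; a no-op with world_size=1
world = dist.get_world_size()
dist.destroy_process_group()
```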
Compatibility Notes
- Training mode: Not yet supported. Models are forced into `.eval()` mode during parallelization (`api.py:116` carries a TODO to support training-time tracing).
- Meta device: Models are initialized on `torch.device("meta")` for memory-efficient loading, then moved to actual devices.
- Recompilation: The system tracks compilation times and caches parallel layers to handle torch.compile recompilation correctly.
- Sequence parallelism: Optional Megatron-style sequence parallelism available via `enable_sequence_parallel` config flag (default: False).
- Weight formats: Only `safetensors` format is supported for weight loading in the parallelization pipeline.
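The meta-device initialization noted above can be reproduced directly: modules built under `torch.device("meta")` carry shapes and dtypes but no storage, and `to_empty` later allocates real (uninitialized) storage, which is then filled from safetensors shards. The layer sizes below are arbitrary.

```python
import torch

with torch.device("meta"):
    layer = torch.nn.Linear(1024, 1024)  # no memory allocated for the weights

meta_before = layer.weight.is_meta  # shapes/dtypes exist, storage does not

layer = layer.to_empty(device="cpu")  # allocate real, uninitialized storage
```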
Related Pages
- Implementation:Huggingface_Optimum_Download_Model_From_HF
- Implementation:Huggingface_Optimum_MetaAwareMethodsPatcher
- Implementation:Huggingface_Optimum_Initialize_Parameter_Meta
- Implementation:Huggingface_Optimum_ParallelAxisSolverPass_Run
- Implementation:Huggingface_Optimum_ParallelLayerAnnotatePass_Run
- Implementation:Huggingface_Optimum_ParallelLayerReplacePass_Run
- Implementation:Huggingface_Optimum_InitializeOrLoadWeightsPass_Run