Environment:Huggingface Optimum Tensor Parallelization Environment
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Training, GPU_Acceleration |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Multi-GPU environment with `torch.distributed` process groups, `torch.compile`, Transformers >= 4.20.0, `safetensors`, and `huggingface_hub` for automatic tensor parallelism via PyTorch FX.
Description
This environment provides the infrastructure for automatic tensor parallelism using PyTorch FX graph transformations. It requires multiple GPUs with NCCL-based `torch.distributed` communication, PyTorch's `torch.compile` with fullgraph mode, and internal `torch.fx` APIs for graph manipulation. The parallelization system initializes models on a `meta` device for memory efficiency, then shards weights across devices. Weights are loaded from `safetensors` files via `huggingface_hub`.
Usage
Use this environment when running the Automatic Tensor Parallelization workflow. This requires a multi-GPU setup with an initialized `torch.distributed` process group. The parallelization system uses `torch.compile` to capture the computation graph and then applies parallel transformation passes.
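The capture step can be illustrated with a toy `torch.compile` backend. `inspect_backend` below is a hypothetical stand-in for optimum's parallelization backend: a backend receives the captured FX `GraphModule`, and optimum's real backend rewrites that graph with its parallelization passes before execution.

```python
import torch

captured = {}

def inspect_backend(gm: torch.fx.GraphModule, example_inputs):
    # A torch.compile backend receives the whole captured FX graph
    # (fullgraph=True forbids graph breaks). optimum's real backend
    # applies its transformation passes here before returning a callable.
    captured["num_nodes"] = len(list(gm.graph.nodes))
    return gm.forward  # run the captured graph unmodified

@torch.compile(fullgraph=True, backend=inspect_backend)
def scale_shift(x):
    return x * 2.0 + 1.0

out = scale_shift(torch.ones(4))
```

The backend is only invoked on the first call, when Dynamo traces the function into an FX graph.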
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | NCCL requires Linux for multi-GPU communication |
| Hardware | Multiple NVIDIA GPUs | World size > 1 required for parallelism |
| Interconnect | NVLink or PCIe | NCCL for inter-GPU communication |
| Disk | Sufficient for model weights | Safetensors files loaded from disk |
Dependencies
Required Packages
- `torch` >= 2.1.0 (torch.compile, torch.fx, torch.distributed)
- `transformers` >= 4.20.0 (FX features availability check)
- `safetensors` (weight loading from safetensors files)
- `huggingface_hub` (model downloading and weight map collection)
Required PyTorch Modules
- `torch.distributed` (ProcessGroup, all_reduce, all_gather, scatter)
- `torch.compile` with `fullgraph=True` mode
- `torch.fx` (GraphModule, Graph, Node)
- `torch._decomp.core_aten_decompositions` (ATen op decomposition)
- `torch._subclasses.functional_tensor` (FunctionalTensor, FunctionalTensorMode)
- `torch.fx.experimental.proxy_tensor` (ProxyTorchDispatchMode, track_tensor_tree)
Credentials
- `HF_TOKEN`: HuggingFace API token for downloading gated models.
- `HF_HUB_OFFLINE`: Set to `1` to skip downloads and use only local files (checked in `optimum/fx/parallelization/utils.py:385`).
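As a rough sketch (not optimum's exact parsing, which lives at the path above), the offline switch boils down to an environment check like this; `hub_offline` is a hypothetical helper:

```python
import os

def hub_offline() -> bool:
    # HF_HUB_OFFLINE set to a truthy value skips downloads and uses
    # only locally cached files; unset or "0" allows network access.
    return os.environ.get("HF_HUB_OFFLINE", "0").strip().lower() in {"1", "true", "yes", "on"}
```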
Quick Install
```shell
# Install required packages (quote the specifiers so ">=" is not treated as a shell redirect)
pip install optimum "torch>=2.1.0" "transformers>=4.20.0" safetensors huggingface_hub

# Launch with torchrun for multi-GPU
torchrun --nproc_per_node=NUM_GPUS your_script.py
```
Code Evidence
Transformers version check from `optimum/fx/utils.py:21-27`:
```python
_TRANSFORMERS_MIN_VERSION = version.parse("4.20.0.dev0")

transformers_version = version.parse(transformers.__version__)
_fx_features_available = (_TRANSFORMERS_MIN_VERSION.major, _TRANSFORMERS_MIN_VERSION.minor) <= (
    transformers_version.major,
    transformers_version.minor,
)
```
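The effect of that check can be reproduced with a small hypothetical helper (the real code uses `packaging.version`): only `(major, minor)` are compared, so patch and dev suffixes are ignored.

```python
def fx_features_available(installed: str, minimum: str = "4.20.0") -> bool:
    # Compare only (major, minor): "4.20.1" and "4.20.0.dev0"
    # are treated the same as "4.20.0".
    def major_minor(v: str) -> tuple:
        major, minor = v.split(".")[:2]
        return int(major), int(minor)

    return major_minor(minimum) <= major_minor(installed)
```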
FX features guard decorator from `optimum/fx/utils.py:34-43`:
```python
def check_if_available(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        if not are_fx_features_available():
            raise ImportError(
                f"Found an incompatible version of transformers. Found version "
                f"{transformers_version}, but only {_TRANSFORMERS_MIN_VERSION} "
                f"and above are supported."
            )
        return func(*args, **kwargs)

    return wrapper
```
torch.compile with fullgraph mode from `optimum/fx/parallelization/api.py:124`:
```python
model = torch.compile(model, fullgraph=True, backend=backend)
```
Distributed process group requirement from `optimum/fx/parallelization/core.py:144`:
```python
tp_group: dist.ProcessGroup
```
Divisibility constraint from `optimum/fx/parallelization/utils.py:40-45`:
```python
def ensure_divisibility(numerator: int, denominator: int) -> None:
    if numerator % denominator != 0:
        raise RuntimeError(
            f"{numerator} is not divisible by {denominator}, check if the parallel "
            "dimension of weight parameters is divisible by parallelism level"
            "(world size of tensor parallel group)"
        )
```
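In practice this check gates per-rank shard sizes; a hypothetical illustration (`shard_size` is not an optimum function):

```python
def shard_size(dim: int, world_size: int) -> int:
    # Column/row-parallel layers split one weight dimension evenly across
    # the tensor-parallel group; uneven splits are rejected up front.
    if dim % world_size != 0:
        raise RuntimeError(f"{dim} is not divisible by {world_size}")
    return dim // world_size
```

For example, a 4096-wide projection on 4 GPUs gives 1024 columns per rank, while `world_size=3` raises.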
Safetensors weight loading from `optimum/fx/parallelization/passes.py:486-492`:
```python
from safetensors import safe_open

with safe_open(ctx.weight_map[target.source], framework="pt", device="cpu") as fp:
    tensor_slice = fp.get_slice(target.source)
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `Found an incompatible version of transformers` | Transformers < 4.20.0 | `pip install -U "transformers>=4.20.0"` |
| `{N} is not divisible by {M}` | Model parallel dimension not divisible by world size | Use a world size that divides the model hidden dimensions evenly |
| `illegal path for recompilation` | Pass pipeline encountered unexpected recompilation | Ensure model is in eval mode and inputs are consistent |
| NCCL errors | torch.distributed not initialized | Initialize with `torchrun` or `torch.distributed.init_process_group()` |
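A single-process CPU illustration of the initialization that `torchrun` normally performs. Note the `gloo` backend stands in for NCCL here so the snippet runs without GPUs; real tensor parallelism needs NCCL and world size > 1.

```python
import os

import torch
import torch.distributed as dist

# torchrun normally sets these rendezvous variables for you.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29511")

dist.init_process_group(backend="gloo", rank=0, world_size=1)

t = torch.ones(3)
dist.all_reduce(t)  # sums across ranks; a no-op with world_size=1
world = dist.get_world_size()
dist.destroy_process_group()
```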
Compatibility Notes
- Training mode: Not yet supported. Models are forced into `.eval()` mode during parallelization (`api.py:116` carries a TODO to support training-time tracing).
- Meta device: Models are initialized on `torch.device("meta")` for memory-efficient loading, then moved to actual devices.
- Recompilation: The system tracks compilation times and caches parallel layers to handle torch.compile recompilation correctly.
- Sequence parallelism: Optional Megatron-style sequence parallelism available via `enable_sequence_parallel` config flag (default: False).
- Weight formats: Only `safetensors` format is supported for weight loading in the parallelization pipeline.
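The meta-device initialization noted above can be reproduced directly: modules built under `torch.device("meta")` carry shapes and dtypes but no storage, and `to_empty` later allocates real (uninitialized) storage, which is then filled from safetensors shards. The layer sizes below are arbitrary.

```python
import torch

with torch.device("meta"):
    layer = torch.nn.Linear(1024, 1024)  # no memory allocated for the weights

meta_before = layer.weight.is_meta  # shapes/dtypes exist, storage does not

layer = layer.to_empty(device="cpu")  # allocate real, uninitialized storage
```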
Related Pages
- Implementation:Huggingface_Optimum_Download_Model_From_HF
- Implementation:Huggingface_Optimum_MetaAwareMethodsPatcher
- Implementation:Huggingface_Optimum_Initialize_Parameter_Meta
- Implementation:Huggingface_Optimum_ParallelAxisSolverPass_Run
- Implementation:Huggingface_Optimum_ParallelLayerAnnotatePass_Run
- Implementation:Huggingface_Optimum_ParallelLayerReplacePass_Run
- Implementation:Huggingface_Optimum_InitializeOrLoadWeightsPass_Run