Principle: Hugging Face Transformers Tensor Parallel Model Loading
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Training, Model_Loading |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Tensor parallel model loading distributes a pretrained model's weight tensors across multiple devices at load time, so each device holds only its shard of each parallelized layer.
Description
Tensor Parallelism (TP) partitions the weight matrices of a neural network across multiple GPUs so that matrix multiplications are performed in parallel, with each GPU computing a portion of the result. When loading a pretrained model for tensor-parallel training, the weights must be sharded at load time rather than loaded in full and then redistributed, because large models may not fit into a single GPU's memory.
Tensor-parallel model loading in Hugging Face Transformers integrates directly into the `from_pretrained` API. When a `device_mesh` and `tp_plan="auto"` are provided, the loading pipeline:
- Detects the TP sub-mesh from the provided device mesh (extracting the `"tp"` dimension if the mesh is multi-dimensional).
- Initializes the distributed backend if it is not already initialized.
- Reads the model's predefined `_tp_plan`, which specifies how each layer should be parallelized (e.g., `"colwise"`, `"rowwise"`, `"packed_colwise"`).
- During weight loading, has each process load only its shard of the parallelized parameters, using the appropriate sharding function (`ColwiseParallel`, `RowwiseParallel`, etc.).
- Registers forward hooks on parallelized modules for the communication operations needed during the forward pass (all-reduce, all-gather, split, reduce-scatter).
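The per-rank shard selection in the fourth step can be sketched on CPU with plain arrays. This is an illustration of the slicing arithmetic, not the library's actual implementation; the function names are mine, and the `(out_features, in_features)` weight layout follows PyTorch's `nn.Linear` convention.

```python
import numpy as np

def colwise_shard(weight: np.ndarray, rank: int, world_size: int) -> np.ndarray:
    """Slice of `weight` that `rank` loads under column-wise parallelism
    (split along the output dimension; W is stored as (out, in))."""
    out_features = weight.shape[0]
    assert out_features % world_size == 0, "out_features must divide evenly"
    step = out_features // world_size
    return weight[rank * step:(rank + 1) * step, :]

def rowwise_shard(weight: np.ndarray, rank: int, world_size: int) -> np.ndarray:
    """Slice along the input dimension for row-wise parallelism."""
    in_features = weight.shape[1]
    assert in_features % world_size == 0
    step = in_features // world_size
    return weight[:, rank * step:(rank + 1) * step]

# Each of 4 ranks loads only a (2, 8) shard of an (8, 8) projection;
# concatenating the shards recovers the full weight.
full = np.arange(64, dtype=np.float32).reshape(8, 8)
shards = [colwise_shard(full, r, 4) for r in range(4)]
assert all(s.shape == (2, 8) for s in shards)
assert np.array_equal(np.concatenate(shards, axis=0), full)
```

Because each rank slices the checkpoint tensor directly, no rank ever materializes the full weight in device memory.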
This approach avoids the memory overhead of loading the full model on every device and eliminates the need for a separate distribution step after loading.
Usage
Use tensor-parallel model loading when:
- The model is too large to fit on a single GPU and must be split across multiple GPUs.
- You want to leverage Megatron-style tensor parallelism for faster training throughput.
- You are combining TP with other parallelism strategies (DP, CP) in a 3D parallel configuration.
- The model architecture supports a TP plan (most decoder-only LLMs in Transformers have a predefined `_tp_plan`).
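A back-of-envelope calculation makes the first point concrete. The numbers below are a generic estimate (weights only, ignoring activations, gradients, and optimizer state), not a measurement of any specific model.

```python
def per_gpu_weight_bytes(n_params: float, bytes_per_param: int, tp_degree: int) -> float:
    """Weight memory per GPU when parameters are evenly sharded across a
    tensor-parallel group. Excludes activations and optimizer state."""
    return n_params * bytes_per_param / tp_degree

GIB = 2**30

# A 70B-parameter model in bf16 (2 bytes/param):
full_gib = per_gpu_weight_bytes(70e9, 2, 1) / GIB  # ~130 GiB: exceeds one 80 GB GPU
tp8_gib = per_gpu_weight_bytes(70e9, 2, 8) / GIB   # ~16 GiB per GPU under TP=8
```

Even before optimizer state is counted, the weights alone rule out single-GPU loading, which is why sharding must happen at load time rather than after.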
Theoretical Basis
Tensor parallelism was formalized in the Megatron-LM paper (Shoeybi et al., 2019), which demonstrated how to partition transformer layers across GPUs:
- Column-wise parallelism (`ColwiseParallel`): The weight matrix is split along the output dimension. Each GPU computes a portion of the output features. The input is replicated (identity forward, all-reduce backward). The output is sharded across devices.
- Row-wise parallelism (`RowwiseParallel`): The weight matrix is split along the input dimension. Each GPU receives a portion of the input features. The partial outputs are all-reduced to produce the final result.
For a two-layer MLP with column-wise first layer and row-wise second layer, only one all-reduce is needed per forward pass (at the output of the second layer), and one all-reduce per backward pass (at the input of the first layer). This minimizes communication overhead.
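The single-all-reduce property of the colwise-then-rowwise MLP can be verified numerically. This sketch simulates two ranks in one process (the `sum` over partial outputs stands in for the all-reduce); it is a demonstration of the math, not distributed code.

```python
import numpy as np

rng = np.random.default_rng(0)
world = 2
x = rng.normal(size=(3, 8))        # (batch, d_model), replicated on every rank
W1 = rng.normal(size=(8, 16))      # first layer: column-parallel (outputs split)
W2 = rng.normal(size=(16, 8))      # second layer: row-parallel (inputs split)

ref = np.maximum(x @ W1, 0) @ W2   # single-device reference (ReLU nonlinearity)

# Per-rank computation: no communication until the final all-reduce, because
# the elementwise ReLU can be applied independently to each hidden shard.
partials = []
for r in range(world):
    W1_r = W1[:, r * 8:(r + 1) * 8]   # colwise shard of W1
    W2_r = W2[r * 8:(r + 1) * 8, :]   # matching rowwise shard of W2
    h_r = np.maximum(x @ W1_r, 0)     # sharded hidden activations, no comm
    partials.append(h_r @ W2_r)       # partial output

out = sum(partials)                   # the one all-reduce of the forward pass
assert np.allclose(out, ref)
```

Pairing a colwise layer with a rowwise layer this way is exactly why the predefined plans mark `up_proj`/`gate_proj` colwise and `down_proj` rowwise.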
The `tp_plan` in Hugging Face Transformers encodes this Megatron-style partitioning. It maps module names to sharding styles using wildcards (e.g., `"model.layers.*.self_attn.q_proj": "colwise"`), allowing the same plan to apply across all transformer layers.
Special cases handled:
- Packed weights (e.g., `gate_up_proj`): When multiple projections are fused into a single tensor, interleaved sharding ensures each GPU gets the correct portion of each projection.
- Embedding parallelism: Vocabulary embeddings can be sharded along the vocabulary dimension, with masked lookups to handle out-of-range tokens.