Principle: Hugging Face Transformers Tensor Parallel Model Loading
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Training, Model_Loading |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Tensor parallel model loading distributes a pretrained model's weight tensors across multiple devices at load time, so each device holds only its shard of each parallelized layer.
Description
Tensor Parallelism (TP) partitions the weight matrices of a neural network across multiple GPUs so that matrix multiplications are performed in parallel, with each GPU computing a portion of the result. When loading a pretrained model for tensor-parallel training, the weights must be sharded at load time rather than loaded in full and then redistributed, because large models may not fit into a single GPU's memory.
Tensor-parallel model loading in Hugging Face Transformers integrates directly into the `from_pretrained` API. When a `device_mesh` and `tp_plan="auto"` are provided, the loading pipeline:
- Detects the TP sub-mesh from the provided device mesh (extracting the `"tp"` dimension if the mesh is multi-dimensional).
- Initializes the distributed backend if it is not already initialized.
- Reads the model's predefined `_tp_plan`, which specifies how each layer should be parallelized (e.g., `"colwise"`, `"rowwise"`, `"packed_colwise"`).
- During weight loading, has each process load only its shard of the parallelized parameters, using the appropriate sharding function (`ColwiseParallel`, `RowwiseParallel`, etc.).
- Registers forward hooks on parallelized modules for the communication operations needed during the forward pass (all-reduce, all-gather, split, reduce-scatter).
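The per-rank shard selection in the fourth step can be sketched on CPU with plain arrays. This is an illustration of the slicing arithmetic, not the library's actual implementation; the function names are mine, and the `(out_features, in_features)` weight layout follows PyTorch's `nn.Linear` convention.

```python
import numpy as np

def colwise_shard(weight: np.ndarray, rank: int, world_size: int) -> np.ndarray:
    """Slice of `weight` that `rank` loads under column-wise parallelism
    (split along the output dimension; W is stored as (out, in))."""
    out_features = weight.shape[0]
    assert out_features % world_size == 0, "out_features must divide evenly"
    step = out_features // world_size
    return weight[rank * step:(rank + 1) * step, :]

def rowwise_shard(weight: np.ndarray, rank: int, world_size: int) -> np.ndarray:
    """Slice along the input dimension for row-wise parallelism."""
    in_features = weight.shape[1]
    assert in_features % world_size == 0
    step = in_features // world_size
    return weight[:, rank * step:(rank + 1) * step]

# Each of 4 ranks loads only a (2, 8) shard of an (8, 8) projection;
# concatenating the shards recovers the full weight.
full = np.arange(64, dtype=np.float32).reshape(8, 8)
shards = [colwise_shard(full, r, 4) for r in range(4)]
assert all(s.shape == (2, 8) for s in shards)
assert np.array_equal(np.concatenate(shards, axis=0), full)
```

Because each rank slices the checkpoint tensor directly, no rank ever materializes the full weight in device memory.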
This approach avoids the memory overhead of loading the full model on every device and eliminates the need for a separate distribution step after loading.
Usage
Use tensor-parallel model loading when:
- The model is too large to fit on a single GPU and must be split across multiple GPUs.
- You want to leverage Megatron-style tensor parallelism for faster training throughput.
- You are combining TP with other parallelism strategies (DP, CP) in a 3D parallel configuration.
- The model architecture supports a TP plan (most decoder-only LLMs in Transformers have a predefined `_tp_plan`).
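A back-of-envelope calculation makes the first point concrete. The numbers below are a generic estimate (weights only, ignoring activations, gradients, and optimizer state), not a measurement of any specific model.

```python
def per_gpu_weight_bytes(n_params: float, bytes_per_param: int, tp_degree: int) -> float:
    """Weight memory per GPU when parameters are evenly sharded across a
    tensor-parallel group. Excludes activations and optimizer state."""
    return n_params * bytes_per_param / tp_degree

GIB = 2**30

# A 70B-parameter model in bf16 (2 bytes/param):
full_gib = per_gpu_weight_bytes(70e9, 2, 1) / GIB  # ~130 GiB: exceeds one 80 GB GPU
tp8_gib = per_gpu_weight_bytes(70e9, 2, 8) / GIB   # ~16 GiB per GPU under TP=8
```

Even before optimizer state is counted, the weights alone rule out single-GPU loading, which is why sharding must happen at load time rather than after.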
Theoretical Basis
Tensor parallelism was formalized in the Megatron-LM paper (Shoeybi et al., 2019), which demonstrated how to partition transformer layers across GPUs:
- Column-wise parallelism (`ColwiseParallel`): The weight matrix is split along the output dimension. Each GPU computes a portion of the output features. The input is replicated (identity forward, all-reduce backward). The output is sharded across devices.
- Row-wise parallelism (`RowwiseParallel`): The weight matrix is split along the input dimension. Each GPU receives a portion of the input features. The partial outputs are all-reduced to produce the final result.
For a two-layer MLP with column-wise first layer and row-wise second layer, only one all-reduce is needed per forward pass (at the output of the second layer), and one all-reduce per backward pass (at the input of the first layer). This minimizes communication overhead.
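The single-all-reduce property of the colwise-then-rowwise MLP can be verified numerically. This sketch simulates two ranks in one process (the `sum` over partial outputs stands in for the all-reduce); it is a demonstration of the math, not distributed code.

```python
import numpy as np

rng = np.random.default_rng(0)
world = 2
x = rng.normal(size=(3, 8))        # (batch, d_model), replicated on every rank
W1 = rng.normal(size=(8, 16))      # first layer: column-parallel (outputs split)
W2 = rng.normal(size=(16, 8))      # second layer: row-parallel (inputs split)

ref = np.maximum(x @ W1, 0) @ W2   # single-device reference (ReLU nonlinearity)

# Per-rank computation: no communication until the final all-reduce, because
# the elementwise ReLU can be applied independently to each hidden shard.
partials = []
for r in range(world):
    W1_r = W1[:, r * 8:(r + 1) * 8]   # colwise shard of W1
    W2_r = W2[r * 8:(r + 1) * 8, :]   # matching rowwise shard of W2
    h_r = np.maximum(x @ W1_r, 0)     # sharded hidden activations, no comm
    partials.append(h_r @ W2_r)       # partial output

out = sum(partials)                   # the one all-reduce of the forward pass
assert np.allclose(out, ref)
```

Pairing a colwise layer with a rowwise layer this way is exactly why the predefined plans mark `up_proj`/`gate_proj` colwise and `down_proj` rowwise.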
The `tp_plan` in Hugging Face Transformers encodes this Megatron-style partitioning. It maps module names to sharding styles using wildcards (e.g., `"model.layers.*.self_attn.q_proj": "colwise"`), allowing the same plan to apply across all transformer layers.
Special cases handled:
- Packed weights (e.g., `gate_up_proj`): When multiple projections are fused into a single tensor, interleaved sharding ensures each GPU gets the correct portion of each projection.
- Embedding parallelism: Vocabulary embeddings can be sharded along the vocabulary dimension, with masked lookups to handle out-of-range tokens.