
Principle:Huggingface Optimum Parallel Layer Annotation

From Leeroopedia

Overview

The annotation pass classifies model layers into parallel categories (column-parallel, row-parallel, vocab-parallel) based on the parallel axis assignments produced by the axis solver.

Description

After axis solving determines which tensor dimensions are parallel, the annotation pass examines each Linear, Embedding, and CrossEntropy layer to determine its parallelization strategy. The classification rules are:

  • A Linear layer with parallel axis on the output dimension becomes "column-parallel" -- the weight matrix is split along its columns (output features), with each GPU holding a subset of output neurons.
  • A Linear layer with parallel axis on the input dimension becomes "row-parallel" -- the weight matrix is split along its rows (input features), with each GPU holding a full set of output neurons but only a subset of input connections.
  • An Embedding layer with parallel axis on the vocabulary dimension becomes "vocab-parallel" -- the embedding table is split across the vocabulary axis, with each GPU holding embeddings for a subset of token IDs.
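The classification rules above can be sketched as a small dispatch function. This is a minimal illustration of the logic, not the actual Optimum API; the `ParallelKind` enum, `classify` function, and its string arguments are assumed names:

```python
from enum import Enum

class ParallelKind(Enum):
    COLUMN = "column-parallel"   # weight split along output features
    ROW = "row-parallel"         # weight split along input features
    VOCAB = "vocab-parallel"     # embedding table split along vocabulary axis

def classify(layer_type, parallel_axis):
    """Map a layer type plus its solved parallel axis to a strategy.

    Illustrative only: `layer_type` and `parallel_axis` are hypothetical
    string tags standing in for the solver's annotations.
    """
    if layer_type == "Linear":
        if parallel_axis == "output":
            return ParallelKind.COLUMN
        if parallel_axis == "input":
            return ParallelKind.ROW
    if layer_type == "Embedding" and parallel_axis == "vocab":
        return ParallelKind.VOCAB
    return None  # no parallel axis: the layer stays replicated on every rank
```

A layer whose solved axes match none of these rules is left replicated, so every rank holds a full copy of its weights.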

The pass also determines whether each layer should gather its output (concatenate partial results from all ranks) or pass it partitioned to the next layer. This decision depends on whether the downstream consumer expects a full tensor or a partitioned one.
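The gather-or-pass-partitioned decision can be summarized as follows. The function and its string labels are a hypothetical sketch of the rule described above, not Optimum's real interface:

```python
def output_handling(kind, downstream_expects_full):
    """Decide what to do with a parallel layer's output.

    `kind` is the layer's classification; `downstream_expects_full` says
    whether the next consumer needs the complete tensor. Hypothetical
    names, for illustration only.
    """
    # A row-parallel layer produces partial sums over the input features,
    # so its output must be all-reduced; after that it is full everywhere.
    if kind == "row-parallel":
        return "all-reduce"
    # Column- and vocab-parallel outputs are partitioned along a feature
    # axis; gather (concatenate across ranks) only when needed downstream.
    if downstream_expects_full:
        return "all-gather"
    return "keep-partitioned"
```

Keeping the output partitioned is the cheap path: for example, a column-parallel projection feeding a row-parallel projection needs no communication in between, because the row-parallel consumer expects a partitioned input.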

Usage

This pass runs automatically after ParallelAxisSolverPass in the parallelization pipeline. It consumes the parallel axis annotations produced by the solver and produces layer-level classification annotations consumed by the replacement pass.

Theoretical Basis

Megatron-LM style tensor parallelism. The annotation strategy follows the patterns established by Megatron-LM:

Transformer Component        | Parallelization Pattern           | Rationale
Q, K, V projections          | Column-parallel (split output)    | Each GPU computes a subset of attention heads independently.
Attention output projection  | Row-parallel (split input)        | Input is already partitioned by heads; output is all-reduced.
MLP first projection         | Column-parallel (split output)    | Each GPU computes a subset of the hidden dimension.
MLP second projection        | Row-parallel (split input)        | Input is already partitioned; output is all-reduced.
Token embedding              | Vocab-parallel (split vocabulary) | Each GPU handles a subset of the vocabulary.

This pattern minimizes inter-GPU communication to one all-reduce per transformer layer (two total: one after attention, one after MLP). The all-reduce is performed at the row-parallel layer, which sums the partial results from all ranks.
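The column-parallel-then-row-parallel MLP pattern can be verified numerically with a two-rank simulation. NumPy arrays stand in for per-rank weight shards, and a plain addition stands in for the single all-reduce; all shapes here are illustrative:

```python
import numpy as np

# Two-rank tensor-parallel MLP: first projection column-parallel,
# second projection row-parallel. Shapes chosen for illustration.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))     # (batch, d_model)
W1 = rng.standard_normal((8, 16))   # first projection (column-split)
W2 = rng.standard_normal((16, 8))   # second projection (row-split)

relu = lambda t: np.maximum(t, 0.0)

# Rank r holds a column slice of W1 and the matching row slice of W2.
# The elementwise activation runs per-rank on the partitioned hidden
# state, so it needs no communication.
h0 = relu(x @ W1[:, :8])            # rank 0 partial hidden
h1 = relu(x @ W1[:, 8:])            # rank 1 partial hidden
partial0 = h0 @ W2[:8, :]           # rank 0 partial output (partial sum)
partial1 = h1 @ W2[8:, :]           # rank 1 partial output (partial sum)

# The one all-reduce after the row-parallel layer sums the partial sums.
out_parallel = partial0 + partial1
out_reference = relu(x @ W1) @ W2
assert np.allclose(out_parallel, out_reference)
```

The assertion confirms that summing the row-parallel partial outputs reproduces the unsharded computation exactly, which is why a single all-reduce per MLP (and likewise per attention block) suffices.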
