Principle:FMInference FlexLLMGen Distributed Pipeline Initialization

Knowledge Sources	FMInference_FlexLLMGen
Domains	Distributed Computing, Pipeline Parallelism
Last Updated	2026-02-09 12:00 GMT

Overview

Establishing pairwise communication groups in a ring topology enables efficient point-to-point tensor transfers between adjacent pipeline-parallel stages.

Description

When distributing a large transformer model across multiple devices using pipeline parallelism, each device holds a contiguous subset of the model's layers. During inference, hidden states must flow sequentially from one stage to the next. Rather than relying on collective operations (such as all-reduce or broadcast), pipeline parallelism requires only point-to-point communication between adjacent stages.

The initialization principle has two key phases:

Phase 1 -- Backend Selection: The distributed communication backend must match the physical interconnect. GPU-resident tensors benefit from NCCL, which exploits NVLink or PCIe for high-bandwidth transfers. CPU-resident tensors (used when offloading to host memory) require Gloo, which communicates over TCP/shared memory.
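As a minimal sketch of Phase 1, the mapping from tensor location to backend can be written as a small lookup. The names select_backend and comm_device are hypothetical and used only for illustration; "nccl" and "gloo" are the backend strings accepted by torch.distributed.

```python
def select_backend(comm_device: str) -> str:
    """Return the torch.distributed backend name for a tensor location.

    `comm_device` is a hypothetical parameter: "gpu" for GPU-resident
    tensors, "cpu" for tensors offloaded to host memory.
    """
    backends = {"gpu": "nccl", "cpu": "gloo"}
    try:
        return backends[comm_device]
    except KeyError:
        raise ValueError(f"unknown comm_device: {comm_device!r}") from None


# In a real launch the result would be handed to the process-group setup,
# for example:
#   torch.distributed.init_process_group(backend=select_backend("gpu"), ...)
```

Failing loudly on an unknown location is a design choice of this sketch: a silently wrong backend would only surface later as a failed send or recv.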

Phase 2 -- Pairwise Group Construction: For N pipeline stages numbered 0 through N-1, we create N process groups, each containing exactly two ranks: (i, (i+1) mod N). This forms a logical ring. Each rank stores two references: its predecessor group (the group where it appears as the successor) and its successor group (the group where it appears as the predecessor). These groups are used for all subsequent send and recv operations during generation.
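The construction in Phase 2 can be sketched as follows. The helper names ring_pairs and build_ring_groups are hypothetical, and the group factory is passed in as a parameter so the logic can be exercised without a running process group; in a real job it would presumably be torch.distributed.new_group, which every rank must call for every pair, in the same order, even for groups the rank does not belong to.

```python
from typing import Any, Callable, List, Optional, Tuple


def ring_pairs(num_stages: int) -> List[Tuple[int, int]]:
    """Return the N pairs (i, (i+1) mod N) that form the logical ring."""
    if num_stages < 2:
        raise ValueError("a ring needs at least two pipeline stages")
    return [(i, (i + 1) % num_stages) for i in range(num_stages)]


def build_ring_groups(
    rank: int,
    num_stages: int,
    new_group: Callable[[List[int]], Any],
) -> Tuple[Optional[Any], Optional[Any]]:
    """Create all N pairwise groups and keep the two that involve `rank`.

    Returns (pred_group, succ_group): the group shared with the previous
    stage and the group shared with the next stage.
    """
    pred_group = succ_group = None
    for pred, succ in ring_pairs(num_stages):
        # Every rank creates every group, in the same order.
        group = new_group([pred, succ])
        if rank == succ:
            pred_group = group  # rank is the successor in this pair
        if rank == pred:
            succ_group = group  # rank is the predecessor in this pair
    return pred_group, succ_group
```

With a stand-in factory such as `lambda ranks: tuple(ranks)`, rank 0 of a four-stage ring gets the groups (3, 0) and (0, 1), which shows the wrap-around pair being treated like any other.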

Usage

Apply this principle whenever partitioning a sequential model across multiple devices for pipeline-parallel inference. It is essential when the number of pipeline stages exceeds one and hidden-state tensors must be transferred between stages at every generation step.

Theoretical Basis

Pipeline Parallelism Communication Model

In pipeline parallelism, a model with L layers is divided into N stages. Stage s holds the contiguous layer range [l_s, l_{s+1}), with l_0 = 0 and l_N = L. Generation then proceeds one token per step:

For each generation step i:
    For each stage s = 0, 1, ..., N-1:
        h = stage_s.forward(h)
        if s < N-1:
            stage s sends h to stage s+1
            stage s+1 receives h from stage s
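To make the data flow concrete, one generation step can be simulated in a single process. This is only an illustration under simplifying assumptions: each stage is a plain function, the hidden state is a number, and the send/recv pair is modeled by handing the value to the next loop iteration while recording which stages exchanged data. The name pipeline_step is hypothetical.

```python
from typing import Callable, List, Tuple

Stage = Callable[[float], float]


def pipeline_step(stages: List[Stage], h: float) -> Tuple[float, List[Tuple[int, int]]]:
    """Run one generation step through all stages in order.

    Returns the final hidden state and the list of (sender, receiver)
    stage indices, one entry per point-to-point transfer.
    """
    transfers: List[Tuple[int, int]] = []
    last = len(stages) - 1
    for s, stage in enumerate(stages):
        h = stage(h)
        if s < last:
            # Stands in for send on stage s and the matching recv on stage s+1.
            transfers.append((s, s + 1))
    return h, transfers
```

For N stages the transfer list has N-1 entries, each naming exactly two stages, which is the property that makes pairwise groups sufficient.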

Ring Topology for Process Groups

The ring structure (s, (s+1) mod N) ensures:

Each send/recv operation involves exactly two processes, minimizing synchronization overhead.
The modular arithmetic handles the wrap-around case (stage N-1 to stage 0) uniformly, which is useful when batches cycle through multiple inner iterations.
No global barriers are needed during the generation loop; only pairwise synchronization occurs.

Backend Selection Criteria

Criterion	NCCL (GPU)	Gloo (CPU)
Tensor location	GPU memory	Host (CPU) memory
Interconnect	NVLink, PCIe, InfiniBand	TCP sockets, shared memory
Bandwidth	High (up to 600 GB/s per GPU over NVLink on A100-class hardware)	Moderate (network-bound)
Use case	Default for GPU-to-GPU	Offloading scenarios

Output Suppression

In multi-process environments, print statements executed on every rank produce interleaved, hard-to-read logs. Rank-gated output replaces the built-in print with a wrapper that displays a message only when it comes from rank 0 or when the caller explicitly forces it. Logs stay clean and readable, and per-rank debugging output remains available when needed.
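A minimal sketch of rank-gated output is shown below. The factory name make_rank_gated_print and the keyword force are assumptions chosen for illustration; the wrapper forwards everything else to the original print unchanged.

```python
import builtins


def make_rank_gated_print(rank: int, base_print=builtins.print):
    """Return a print replacement that is silent on non-zero ranks.

    `force=True` (a hypothetical keyword) overrides the gate so that any
    rank can emit a message when debugging.
    """
    def gated_print(*args, force: bool = False, **kwargs):
        if rank == 0 or force:
            base_print(*args, **kwargs)

    return gated_print


# To apply the gate process-wide, the wrapper can replace the built-in:
#   builtins.print = make_rank_gated_print(rank)
```

Capturing the original print as a default argument means the wrapper keeps working after the built-in has been replaced, since it never looks up builtins.print again.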

Related Pages

Implementation:FMInference_FlexLLMGen_Initialize_Distributed
