Principle:FMInference FlexLLMGen Distributed Pipeline Initialization
| Knowledge Sources | |
|---|---|
| Domains | Distributed Computing, Pipeline Parallelism |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Establishing pairwise communication groups in a ring topology enables efficient point-to-point tensor transfers between adjacent pipeline-parallel stages.
Description
When distributing a large transformer model across multiple devices using pipeline parallelism, each device holds a contiguous subset of the model's layers. During inference, hidden states must flow sequentially from one stage to the next. Rather than relying on collective operations (such as all-reduce or broadcast), pipeline parallelism requires only point-to-point communication between adjacent stages.
The initialization principle has two key phases:
Phase 1 -- Backend Selection: The distributed communication backend must match the physical interconnect. GPU-resident tensors benefit from NCCL, which exploits NVLink or PCIe for high-bandwidth transfers. CPU-resident tensors (used when offloading to host memory) require Gloo, which communicates over TCP/shared memory.
Phase 2 -- Pairwise Group Construction: For N pipeline stages numbered 0 through N-1, we create N process groups, each containing exactly two ranks: (i, (i+1) mod N). This forms a logical ring. Each rank stores two references: its predecessor group (the group where it appears as the successor) and its successor group (the group where it appears as the predecessor). These groups are used for all subsequent send and recv operations during generation.
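The pairing rule in Phase 2 can be sketched in plain Python, with no distributed runtime required. The helper names `ring_pairs` and `neighbor_pairs` are hypothetical and chosen for illustration; in a PyTorch setting, each pair would typically be handed to `torch.distributed.new_group(ranks=[...])`, which every process must call for every group, including groups it does not belong to.

```python
def ring_pairs(num_stages):
    """Return the N rank pairs (i, (i + 1) mod N) that form the logical ring."""
    return [(i, (i + 1) % num_stages) for i in range(num_stages)]


def neighbor_pairs(rank, num_stages):
    """Return (predecessor_pair, successor_pair) for one rank.

    predecessor_pair: the pair in which `rank` appears as the successor.
    successor_pair:   the pair in which `rank` appears as the predecessor.
    """
    pairs = ring_pairs(num_stages)
    pred_pair = pairs[(rank - 1) % num_stages]
    succ_pair = pairs[rank]
    return pred_pair, succ_pair


# Assumption: with PyTorch, each pair would become a process group, e.g.
#   groups = [torch.distributed.new_group(ranks=list(p)) for p in ring_pairs(N)]
# and each rank would keep groups[(rank - 1) % N] and groups[rank].
```

For four stages, rank 0 is paired with rank 3 on its predecessor side and with rank 1 on its successor side, which is the wrap-around case handled by the modulo.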
Usage
Apply this principle whenever partitioning a sequential model across multiple devices for pipeline-parallel inference. It is essential when the number of pipeline stages exceeds one and hidden-state tensors must be transferred between stages at every generation step.
Theoretical Basis
Pipeline Parallelism Communication Model
In pipeline parallelism, a model with L layers is divided into N stages. Stage s holds layers [l_s, l_{s+1}). The inference computation for a single token proceeds as:
    For each generation step i:
        For each stage s = 0, 1, ..., N-1:
            h = stage_s.forward(h)
            if s < N-1:
                send(h) to stage s+1
                recv(h) at stage s+1
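The loop above can be simulated in a single process to check the data flow. This is a minimal sketch, not the project's implementation: the even layer split in `partition_layers` is an assumption (the source does not say how boundaries are chosen), and the send/recv pair is replaced by a counter because no real communication happens here.

```python
def partition_layers(num_layers, num_stages):
    """Return boundaries l_0..l_N so that stage s holds layers [l_s, l_{s+1}).

    Assumption: layers are split as evenly as possible, with earlier stages
    taking one extra layer when the division is not exact.
    """
    base, extra = divmod(num_layers, num_stages)
    bounds = [0]
    for s in range(num_stages):
        bounds.append(bounds[-1] + base + (1 if s < extra else 0))
    return bounds


def run_pipeline_step(layers, num_stages, h):
    """Run one generation step through all stages; return (h, transfers)."""
    bounds = partition_layers(len(layers), num_stages)
    transfers = 0
    for s in range(num_stages):
        # stage_s.forward(h): apply this stage's contiguous slice of layers.
        for layer in layers[bounds[s]:bounds[s + 1]]:
            h = layer(h)
        if s < num_stages - 1:
            # Stand-in for send(h) on stage s and recv(h) on stage s + 1.
            transfers += 1
    return h, transfers


# Toy "layers": layer k adds k to the hidden state.
toy_layers = [lambda x, k=k: x + k for k in range(5)]
result, transfers = run_pipeline_step(toy_layers, num_stages=2, h=0)
```

With N stages, a single step performs N-1 forward transfers, and the result matches applying all layers sequentially on one device.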
Ring Topology for Process Groups
The ring structure (s, (s+1) mod N) ensures:
- Each send/recv operation involves exactly two processes, minimizing synchronization overhead.
- The modular arithmetic handles the wrap-around case (stage N-1 to stage 0) uniformly, which is useful when batches cycle through multiple inner iterations.
- No global barriers are needed during the generation loop; only pairwise synchronization occurs.
Backend Selection Criteria
| Criterion | NCCL (GPU) | Gloo (CPU) |
|---|---|---|
| Tensor location | GPU memory | Host (CPU) memory |
| Interconnect | NVLink, PCIe, InfiniBand | TCP sockets, shared memory |
| Bandwidth | High (e.g., up to 600 GB/s per GPU with third-generation NVLink) | Moderate (network-bound) |
| Use case | Default for GPU-to-GPU | Offloading scenarios |
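The table's tensor-location criterion can be expressed as a small selection helper. The function name `choose_backend` and the device-string convention (`"cuda"`, `"cuda:0"`, `"cpu"`) are assumptions made for illustration; the returned strings `"nccl"` and `"gloo"` are the backend names accepted by PyTorch's `torch.distributed.init_process_group`.

```python
def choose_backend(device):
    """Pick a communication backend from where the transferred tensors live.

    Assumption: `device` is a string such as "cuda", "cuda:0", or "cpu".
    """
    device_type = device.split(":")[0]
    if device_type == "cuda":
        return "nccl"  # GPU-resident tensors
    if device_type == "cpu":
        return "gloo"  # host-resident tensors, e.g. when offloading
    raise ValueError(f"unsupported device for pipeline communication: {device!r}")
```

Raising on unknown devices, rather than silently defaulting, surfaces a mismatch between backend and tensor location at initialization time instead of at the first transfer.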
Output Suppression
In multi-process environments, every rank executing print statements produces interleaved, hard-to-read logs. The principle of rank-gated output replaces the built-in print so that only rank 0 (or explicitly forced output) is displayed, maintaining clean, readable logs while still allowing per-rank debugging when needed.
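One way to realize rank-gated output is to wrap the built-in `print` in a closure that checks the rank. This is a sketch under assumptions: the factory name `make_rank_gated_print` and the `force` keyword are hypothetical, and the `base_print` parameter exists only so the wrapper can be exercised without touching global state.

```python
import builtins


def make_rank_gated_print(rank, base_print=builtins.print):
    """Return a print-like function that emits only on rank 0 or when forced."""

    def gated_print(*args, force=False, **kwargs):
        # `force=True` lets any rank print, which is useful for per-rank debugging.
        if rank == 0 or force:
            base_print(*args, **kwargs)

    return gated_print


# Installing the wrapper globally (assumption: done once, after the rank is known):
#   builtins.print = make_rank_gated_print(rank)
```

Once installed, ordinary `print(...)` calls on non-zero ranks become no-ops, while `print(..., force=True)` still reaches the log from any rank.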