Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:FMInference FlexLLMGen Distributed Pipeline Parallel Inference

From Leeroopedia


Knowledge Sources
Domains Distributed Computing, Pipeline Parallelism, LLM Inference
Last Updated 2026-02-09 12:00 GMT

Overview

Partitioning a large transformer model across multiple devices as a sequence of pipeline stages, with overlapped computation and communication, enables inference of models that exceed single-device memory.

Description

Pipeline-parallel inference divides a transformer's layers into contiguous groups assigned to separate devices (stages). Each stage processes its layers and forwards the resulting hidden states to the next stage. This approach addresses the fundamental memory bottleneck: a model with hundreds of billions of parameters cannot fit on a single GPU, but can be distributed so that each device holds only a fraction of the weights, caches, and activations.

The key design decisions in this architecture are:

1. Layer Partitioning: Given L transformer layers and N stages, each stage receives approximately L/N layers. The first stage also handles input embedding, and the last stage handles output embedding and token selection. The partition uses a greedy remainder strategy: the first L mod N stages receive one extra layer each.

2. Communication Pattern: Hidden states flow unidirectionally from stage 0 to stage N-1. At each generation step, each stage receives a hidden-state tensor from its predecessor, runs its local layers, and sends the result to its successor. This requires only pairwise send/recv, avoiding costly collective operations.

3. Deadlock Prevention: With synchronous communication, a naive approach where every rank sends then receives creates a circular dependency. The solution is asymmetric ordering: one designated rank (rank 0) receives first then sends, while all other ranks send first then receive. This breaks the cycle.

4. Overlapped Scheduling: To hide I/O latency (weight loading from CPU/disk, cache movement), computation and communication can be overlapped using CUDA streams. Three strategies exist:

  • No overlap: Sequential execution; simplest but slowest.
  • Single-batch overlap: Pipelining weight loads, cache loads, compute, and cache stores across consecutive layers.
  • Multi-batch overlap: Additional overlap across GPU micro-batches within each pipeline iteration.

5. Inner Iterations: Multiple micro-batches can flow through the pipeline in a single pass before synchronization. This improves pipeline utilization by reducing the fraction of time stages are idle (the "pipeline bubble").

Usage

Apply this principle when serving transformer models that are too large for a single GPU's memory, or when distributing inference workloads to achieve higher throughput across multiple GPUs. It is the core architectural pattern behind FlexLLMGen's distributed inference capability.

Theoretical Basis

Pipeline Bubble Analysis

For a pipeline with N stages and a single micro-batch, the pipeline bubble fraction is:

bubble_fraction = (N - 1) / N

With M inner iterations (micro-batches) flowing through before synchronization, the effective bubble fraction becomes:

bubble_fraction = (N - 1) / (N - 1 + M)

As M increases, the bubble overhead diminishes. FlexLLMGen defaults M = N (one inner iteration per stage), yielding a bubble fraction of (N-1)/(2N-1), approximately 50% for large N.

Communication Volume

At each generation step, the hidden state tensor transferred between stages has shape:

(batch_size, seq_len, hidden_dim)

For the prefill phase, seq_len = prompt_len (potentially large). For subsequent decoding steps, seq_len = 1 (much smaller). The asymmetry means communication cost is dominated by the prefill step.

Overlapped Execution Model

The overlapped schedule pipelines four operations per layer per micro-batch:

Time step t:   load_weight(j+1)  |  load_cache(j+1)  |  compute(j)  |  store_cache(j-1)
Time step t+1: load_weight(j+2)  |  load_cache(j+2)  |  compute(j+1) |  store_cache(j)

By issuing weight and cache loads on separate CUDA streams from the compute stream, memory transfers overlap with GPU computation, reducing total wall-clock time.

Asynchronous Communication

When async_comm is enabled, isend and irecv are used instead of blocking send and recv. The futures are collected and awaited after all micro-batches for a given step are dispatched. This allows computation on the current micro-batch to overlap with communication for the previous or next micro-batch.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment