Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:FMInference FlexLLMGen DistOptLM

From Leeroopedia


Knowledge Sources
Domains Distributed Computing, Pipeline Parallelism, LLM Inference
Last Updated 2026-02-09 12:00 GMT

Overview

Extends the single-GPU OptLM class to support distributed multi-GPU inference with pipeline parallelism, inter-stage hidden-state communication, and overlapped I/O scheduling.

Description

The DistOptLM class is a subclass of OptLM that partitions the OPT model's transformer layers across multiple pipeline stages (one per GPU). Each stage is responsible for a contiguous slice of the model: stage 0 additionally holds the InputEmbed layer, and the final stage holds the OutputEmbed layer. Intermediate stages hold only TransformerLayer (or separated SelfAttention + MLP) layers.

Key features include:

  • Layer partitioning: Layers are divided as evenly as possible across num_pipeline_stages using a greedy remainder strategy.
  • Inter-stage communication: send_hidden() and recv_hidden() transfer hidden-state tensors between adjacent ranks using PyTorch distributed send/recv (synchronous or asynchronous).
  • Deadlock avoidance: send_recv_hidden() uses a rank-dependent ordering -- rank 0 receives before sending, all other ranks send before receiving -- to prevent circular wait.
  • Generation loops: Three execution paths are provided: generation_loop_normal() (no overlap, for debugging), generation_loop_overlap_one_batch() (overlapped with a single GPU batch), and generation_loop_overlap_multi_batch() (overlapped with multiple GPU batches).
  • Inner iterations: The concept of num_inner_iterations allows multiple micro-batches to flow through the pipeline before synchronizing, improving pipeline utilization.

The companion function run_flexllmgen_dist() orchestrates the full benchmark workflow: it creates the execution environment, constructs the policy, initializes the model, runs warmup, then benchmarks generation and reports throughput/latency metrics. The function add_distributed_parser_arguments() extends the CLI argument parser with distributed-specific flags.

Usage

Use DistOptLM when running OPT inference across multiple GPUs with pipeline parallelism. It is instantiated inside run_flexllmgen_dist() or can be used directly when building custom distributed inference scripts. The module is invoked as a script with MPI or manual rank assignment.

Code Reference

Source Location

Signature

class DistOptLM(OptLM):
    def __init__(self, config, env, path, policy, pipeline_rank,
                 num_pipeline_stages, comm_device, num_inner_iterations=None,
                 async_comm=False):
        ...

    def generate(self, inputs, max_new_tokens=32, do_sample=False,
                 temperature=1.0, stop=None, debug_mode=None,
                 cut_gen_len=None, verbose=0):
        ...

def run_flexllmgen_dist(args):
    ...

def add_distributed_parser_arguments(parser):
    ...

Import

from flexllmgen.dist_flex_opt import DistOptLM, run_flexllmgen_dist, add_distributed_parser_arguments

I/O Contract

Inputs (DistOptLM.__init__)

Name Type Required Description
config OPT config object Yes Model configuration (hidden size, number of layers, vocab size, etc.).
env ExecutionEnv Yes Execution environment with GPU, CPU, and disk devices.
path str Yes Path to model weights (or DUMMY_WEIGHT sentinel for benchmarking).
policy Policy Yes Offloading and batching policy (gpu_batch_size, percent allocations, overlap, etc.).
pipeline_rank int Yes Rank of this stage in the pipeline (0-indexed).
num_pipeline_stages int Yes Total number of pipeline stages (equals world_size).
comm_device str Yes Communication device: "cpu" or "gpu".
num_inner_iterations int No Number of inner iterations per pipeline batch (defaults to num_pipeline_stages).
async_comm bool No Whether to use asynchronous isend/irecv (defaults to False).

Inputs (generate)

Name Type Required Description
inputs np.array or List[List[int]] Yes Input token IDs, shape (num_prompts, prompt_len).
max_new_tokens int No Number of tokens to generate (default 32).
do_sample bool No Whether to sample (default False, greedy decoding).
temperature float No Sampling temperature (default 1.0).
stop int No Stop token ID (not implemented, must be None).
debug_mode str No Debug mode string (optional).
cut_gen_len int No Truncated generation length for latency projection.
verbose int No Verbosity level (default 0).

Outputs

Name Type Description
output_ids np.ndarray Array of shape (num_prompts, prompt_len + gen_len) containing the input prompt followed by generated token IDs. Only the last pipeline rank has the complete output.

Usage Examples

# Typical invocation via command line (MPI-based):
# mpirun -np 4 python -m flexllmgen.dist_flex_opt \
#     --model facebook/opt-30b --path /weights/opt-30b \
#     --head-ip 192.168.1.1 --port 29500 --use-mpi \
#     --gpu-batch-size 4 --num-gpu-batches 2 \
#     --percent 20 80 0 100 0 100 --comm-device gpu

# Programmatic usage within a distributed launcher:
from flexllmgen.dist_flex_opt import DistOptLM
from flexllmgen.flex_opt import Policy
from flexllmgen.opt_config import get_opt_config

config = get_opt_config("facebook/opt-30b")
policy = Policy(gpu_batch_size=4, num_gpu_batches=2,
                w_gpu_percent=20, w_cpu_percent=80,
                cache_gpu_percent=0, cache_cpu_percent=100,
                act_gpu_percent=0, act_cpu_percent=100,
                overlap=True, sep_layer=False, pin_weight=True,
                cpu_cache_compute=False, attn_sparsity=1.0,
                compress_weight=False, comp_weight_config=None,
                compress_cache=False, comp_cache_config=None)

model = DistOptLM(config, env, "/weights/opt-30b", policy,
                  pipeline_rank=0, num_pipeline_stages=4,
                  comm_device="gpu")
output_ids = model.generate(inputs, max_new_tokens=32)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment