Implementation:FMInference FlexLLMGen DistOptLM

Knowledge Sources	FMInference_FlexLLMGen
Domains	Distributed Computing, Pipeline Parallelism, LLM Inference
Last Updated	2026-02-09 12:00 GMT

Overview

Extends the single-GPU OptLM class to support distributed multi-GPU inference with pipeline parallelism, inter-stage hidden-state communication, and overlapped I/O scheduling.

Description

The DistOptLM class is a subclass of OptLM that partitions the OPT model's transformer layers across multiple pipeline stages (one per GPU). Each stage is responsible for a contiguous slice of the model: stage 0 additionally holds the InputEmbed layer, and the final stage holds the OutputEmbed layer. Intermediate stages hold only TransformerLayer (or separated SelfAttention + MLP) layers.

Key features include:

Layer partitioning: Layers are divided as evenly as possible across num_pipeline_stages using a greedy remainder strategy.
Inter-stage communication: send_hidden() and recv_hidden() transfer hidden-state tensors between adjacent ranks using PyTorch distributed send/recv (synchronous or asynchronous).
Deadlock avoidance: send_recv_hidden() uses a rank-dependent ordering -- rank 0 receives before sending, all other ranks send before receiving -- to prevent circular wait.
Generation loops: Three execution paths are provided: generation_loop_normal() (no overlap, for debugging), generation_loop_overlap_one_batch() (overlapped with a single GPU batch), and generation_loop_overlap_multi_batch() (overlapped with multiple GPU batches).
Inner iterations: The concept of num_inner_iterations allows multiple micro-batches to flow through the pipeline before synchronizing, improving pipeline utilization.

The companion function run_flexllmgen_dist() orchestrates the full benchmark workflow: it creates the execution environment, constructs the policy, initializes the model, runs warmup, then benchmarks generation and reports throughput/latency metrics. The function add_distributed_parser_arguments() extends the CLI argument parser with distributed-specific flags.

Usage

Use DistOptLM when running OPT inference across multiple GPUs with pipeline parallelism. It is instantiated inside run_flexllmgen_dist() or can be used directly when building custom distributed inference scripts. The module is invoked as a script with MPI or manual rank assignment.

Code Reference

Source Location

Repository: FMInference_FlexLLMGen
File: flexllmgen/dist_flex_opt.py
Lines: 1-691

Signature

class DistOptLM(OptLM):
    def __init__(self, config, env, path, policy, pipeline_rank,
                 num_pipeline_stages, comm_device, num_inner_iterations=None,
                 async_comm=False):
        ...

    def generate(self, inputs, max_new_tokens=32, do_sample=False,
                 temperature=1.0, stop=None, debug_mode=None,
                 cut_gen_len=None, verbose=0):
        ...

def run_flexllmgen_dist(args):
    ...

def add_distributed_parser_arguments(parser):
    ...

Import

from flexllmgen.dist_flex_opt import DistOptLM, run_flexllmgen_dist, add_distributed_parser_arguments

I/O Contract

Inputs (DistOptLM.init)

Name	Type	Required	Description
config	OPT config object	Yes	Model configuration (hidden size, number of layers, vocab size, etc.).
env	ExecutionEnv	Yes	Execution environment with GPU, CPU, and disk devices.
path	str	Yes	Path to model weights (or DUMMY_WEIGHT sentinel for benchmarking).
policy	Policy	Yes	Offloading and batching policy (gpu_batch_size, percent allocations, overlap, etc.).
pipeline_rank	int	Yes	Rank of this stage in the pipeline (0-indexed).
num_pipeline_stages	int	Yes	Total number of pipeline stages (equals world_size).
comm_device	str	Yes	Communication device: "cpu" or "gpu".
num_inner_iterations	int	No	Number of inner iterations per pipeline batch (defaults to num_pipeline_stages).
async_comm	bool	No	Whether to use asynchronous isend/irecv (defaults to False).

Inputs (generate)

Name	Type	Required	Description
inputs	np.array or List[List[int]]	Yes	Input token IDs, shape (num_prompts, prompt_len).
max_new_tokens	int	No	Number of tokens to generate (default 32).
do_sample	bool	No	Whether to sample (default False, greedy decoding).
temperature	float	No	Sampling temperature (default 1.0).
stop	int	No	Stop token ID (not implemented, must be None).
debug_mode	str	No	Debug mode string (optional).
cut_gen_len	int	No	Truncated generation length for latency projection.
verbose	int	No	Verbosity level (default 0).

Outputs

Name	Type	Description
output_ids	np.ndarray	Array of shape (num_prompts, prompt_len + gen_len) containing the input prompt followed by generated token IDs. Only the last pipeline rank has the complete output.

Usage Examples

# Typical invocation via command line (MPI-based):
# mpirun -np 4 python -m flexllmgen.dist_flex_opt \
#     --model facebook/opt-30b --path /weights/opt-30b \
#     --head-ip 192.168.1.1 --port 29500 --use-mpi \
#     --gpu-batch-size 4 --num-gpu-batches 2 \
#     --percent 20 80 0 100 0 100 --comm-device gpu

# Programmatic usage within a distributed launcher:
from flexllmgen.dist_flex_opt import DistOptLM
from flexllmgen.flex_opt import Policy
from flexllmgen.opt_config import get_opt_config

config = get_opt_config("facebook/opt-30b")
policy = Policy(gpu_batch_size=4, num_gpu_batches=2,
                w_gpu_percent=20, w_cpu_percent=80,
                cache_gpu_percent=0, cache_cpu_percent=100,
                act_gpu_percent=0, act_cpu_percent=100,
                overlap=True, sep_layer=False, pin_weight=True,
                cpu_cache_compute=False, attn_sparsity=1.0,
                compress_weight=False, comp_weight_config=None,
                compress_cache=False, comp_cache_config=None)

model = DistOptLM(config, env, "/weights/opt-30b", policy,
                  pipeline_rank=0, num_pipeline_stages=4,
                  comm_device="gpu")
output_ids = model.generate(inputs, max_new_tokens=32)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment