Implementation:FMInference FlexLLMGen DistOptLM
| Knowledge Sources | |
|---|---|
| Domains | Distributed Computing, Pipeline Parallelism, LLM Inference |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Extends the single-GPU OptLM class to support distributed multi-GPU inference with pipeline parallelism, inter-stage hidden-state communication, and overlapped I/O scheduling.
Description
The DistOptLM class is a subclass of OptLM that partitions the OPT model's transformer layers across multiple pipeline stages (one per GPU). Each stage is responsible for a contiguous slice of the model: stage 0 additionally holds the InputEmbed layer, and the final stage holds the OutputEmbed layer. Intermediate stages hold only TransformerLayer (or separated SelfAttention + MLP) layers.
Key features include:
- Layer partitioning: Layers are divided as evenly as possible across num_pipeline_stages using a greedy remainder strategy.
- Inter-stage communication: send_hidden() and recv_hidden() transfer hidden-state tensors between adjacent ranks using PyTorch distributed send/recv (synchronous or asynchronous).
- Deadlock avoidance: send_recv_hidden() uses a rank-dependent ordering -- rank 0 receives before sending, all other ranks send before receiving -- to prevent circular wait.
- Generation loops: Three execution paths are provided: generation_loop_normal() (no overlap, for debugging), generation_loop_overlap_one_batch() (overlapped with a single GPU batch), and generation_loop_overlap_multi_batch() (overlapped with multiple GPU batches).
- Inner iterations: The concept of num_inner_iterations allows multiple micro-batches to flow through the pipeline before synchronizing, improving pipeline utilization.
The companion function run_flexllmgen_dist() orchestrates the full benchmark workflow: it creates the execution environment, constructs the policy, initializes the model, runs warmup, then benchmarks generation and reports throughput/latency metrics. The function add_distributed_parser_arguments() extends the CLI argument parser with distributed-specific flags.
Usage
Use DistOptLM when running OPT inference across multiple GPUs with pipeline parallelism. It is instantiated inside run_flexllmgen_dist() or can be used directly when building custom distributed inference scripts. The module is invoked as a script with MPI or manual rank assignment.
Code Reference
Source Location
- Repository: FMInference_FlexLLMGen
- File: flexllmgen/dist_flex_opt.py
- Lines: 1-691
Signature
class DistOptLM(OptLM):
def __init__(self, config, env, path, policy, pipeline_rank,
num_pipeline_stages, comm_device, num_inner_iterations=None,
async_comm=False):
...
def generate(self, inputs, max_new_tokens=32, do_sample=False,
temperature=1.0, stop=None, debug_mode=None,
cut_gen_len=None, verbose=0):
...
def run_flexllmgen_dist(args):
...
def add_distributed_parser_arguments(parser):
...
Import
from flexllmgen.dist_flex_opt import DistOptLM, run_flexllmgen_dist, add_distributed_parser_arguments
I/O Contract
Inputs (DistOptLM.__init__)
| Name | Type | Required | Description |
|---|---|---|---|
| config | OPT config object | Yes | Model configuration (hidden size, number of layers, vocab size, etc.). |
| env | ExecutionEnv | Yes | Execution environment with GPU, CPU, and disk devices. |
| path | str | Yes | Path to model weights (or DUMMY_WEIGHT sentinel for benchmarking). |
| policy | Policy | Yes | Offloading and batching policy (gpu_batch_size, percent allocations, overlap, etc.). |
| pipeline_rank | int | Yes | Rank of this stage in the pipeline (0-indexed). |
| num_pipeline_stages | int | Yes | Total number of pipeline stages (equals world_size). |
| comm_device | str | Yes | Communication device: "cpu" or "gpu". |
| num_inner_iterations | int | No | Number of inner iterations per pipeline batch (defaults to num_pipeline_stages). |
| async_comm | bool | No | Whether to use asynchronous isend/irecv (defaults to False). |
Inputs (generate)
| Name | Type | Required | Description |
|---|---|---|---|
| inputs | np.array or List[List[int]] | Yes | Input token IDs, shape (num_prompts, prompt_len). |
| max_new_tokens | int | No | Number of tokens to generate (default 32). |
| do_sample | bool | No | Whether to sample (default False, greedy decoding). |
| temperature | float | No | Sampling temperature (default 1.0). |
| stop | int | No | Stop token ID (not implemented, must be None). |
| debug_mode | str | No | Debug mode string (optional). |
| cut_gen_len | int | No | Truncated generation length for latency projection. |
| verbose | int | No | Verbosity level (default 0). |
Outputs
| Name | Type | Description |
|---|---|---|
| output_ids | np.ndarray | Array of shape (num_prompts, prompt_len + gen_len) containing the input prompt followed by generated token IDs. Only the last pipeline rank has the complete output. |
Usage Examples
# Typical invocation via command line (MPI-based):
# mpirun -np 4 python -m flexllmgen.dist_flex_opt \
# --model facebook/opt-30b --path /weights/opt-30b \
# --head-ip 192.168.1.1 --port 29500 --use-mpi \
# --gpu-batch-size 4 --num-gpu-batches 2 \
# --percent 20 80 0 100 0 100 --comm-device gpu
# Programmatic usage within a distributed launcher:
from flexllmgen.dist_flex_opt import DistOptLM
from flexllmgen.flex_opt import Policy
from flexllmgen.opt_config import get_opt_config
config = get_opt_config("facebook/opt-30b")
policy = Policy(gpu_batch_size=4, num_gpu_batches=2,
w_gpu_percent=20, w_cpu_percent=80,
cache_gpu_percent=0, cache_cpu_percent=100,
act_gpu_percent=0, act_cpu_percent=100,
overlap=True, sep_layer=False, pin_weight=True,
cpu_cache_compute=False, attn_sparsity=1.0,
compress_weight=False, comp_weight_config=None,
compress_cache=False, comp_cache_config=None)
model = DistOptLM(config, env, "/weights/opt-30b", policy,
pipeline_rank=0, num_pipeline_stages=4,
comm_device="gpu")
output_ids = model.generate(inputs, max_new_tokens=32)