Implementation:FMInference FlexLLMGen OptLM Init
Metadata
| Field | Value |
|---|---|
| Repo | FlexLLMGen |
Domains
- Inference_Optimization
- Model_Loading
Overview
Constructor for loading OPT models with three-tier (GPU/CPU/disk) memory offloading, provided by the FlexLLMGen library.
Description
OptLM.__init__ takes a model config (string or OptConfig), an ExecutionEnv, a weight path, and a Policy. It builds the layer list (InputEmbed + N TransformerLayers + OutputEmbed, or, with sep_layer=True, InputEmbed + N*(SelfAttention + MLP) + OutputEmbed). It allocates CUDA streams for asynchronous weight and cache loading and creates buffer arrays for cache_home, cache_read_buf, cache_write_buf, weight_read_buf, and attention_mask. Finally, it calls init_all_weights() to load weights from numpy files and distribute them across devices per the policy.
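The layer-list construction described above can be sketched as follows. This is an illustrative model of the structure only: build_layer_names and the string labels are hypothetical, not the library's actual classes.

```python
def build_layer_names(num_layers: int, sep_layer: bool) -> list:
    """Sketch of the layer list OptLM.__init__ builds (labels are illustrative)."""
    layers = ["InputEmbed"]
    if sep_layer:
        # sep_layer=True splits each transformer block into attention and MLP,
        # so they can be scheduled and offloaded independently.
        for _ in range(num_layers):
            layers += ["SelfAttention", "MLP"]
    else:
        layers += ["TransformerLayer"] * num_layers
    layers.append("OutputEmbed")
    return layers
```

For a model with N transformer blocks, this yields N + 2 entries with sep_layer=False and 2N + 2 entries with sep_layer=True.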
Usage
Instantiate after the ExecutionEnv and Policy are constructed. If weights are not found at the given path, they are downloaded automatically.
Code Reference
| Field | Value |
|---|---|
| Source | flexllmgen/flex_opt.py, Lines: 582-637 |
| Import | from flexllmgen.flex_opt import OptLM |
Signature:
class OptLM:
    def __init__(self,
                 config: Union[str, OptConfig],
                 env: ExecutionEnv,
                 path: str,
                 policy: Policy):
        """
        Args:
            config: Model config or name string (auto-resolved via get_opt_config)
            env: ExecutionEnv with GPU/CPU/disk device handles
            path: Directory containing numpy weight files
            policy: Policy controlling memory offloading
        """
I/O Contract
Inputs:
| Name | Type | Required | Description |
|---|---|---|---|
| config | Union[str, OptConfig] | Yes | Model config or name |
| env | ExecutionEnv | Yes | Hardware environment |
| path | str | Yes | Weight directory |
| policy | Policy | Yes | Offloading policy |
Outputs:
OptLM (model instance with layers distributed across GPU/CPU/disk, CUDA streams allocated, weights loaded)
Usage Examples
from flexllmgen.flex_opt import OptLM, Policy, CompressionConfig
from flexllmgen.utils import ExecutionEnv

env = ExecutionEnv.create("~/flexllmgen_offload_dir")
policy = Policy(gpu_batch_size=2, num_gpu_batches=1,
                w_gpu_percent=50, w_cpu_percent=30,
                cache_gpu_percent=50, cache_cpu_percent=30,
                act_gpu_percent=100, act_cpu_percent=0,
                overlap=True, sep_layer=True, pin_weight=True,
                cpu_cache_compute=False, attn_sparsity=1.0,
                compress_weight=False,
                comp_weight_config=CompressionConfig(num_bits=4, group_size=64,
                                                     group_dim=0, symmetric=False),
                compress_cache=False,
                comp_cache_config=CompressionConfig(num_bits=4, group_size=64,
                                                    group_dim=2, symmetric=False))
model = OptLM("facebook/opt-30b", env, "~/opt_weights", policy)