
Implementation:FMInference FlexLLMGen OptLM Init

From Leeroopedia


Metadata

Field  Value
Repo   FlexLLMGen

Domains

  • Inference_Optimization
  • Model_Loading

Overview

Concrete tool, provided by the FlexLLMGen library, for loading OPT models with three-tier (GPU/CPU/disk) memory offloading.

Description

OptLM.__init__ takes a model config (a name string or an OptConfig), an ExecutionEnv, a weight path, and a Policy. It builds the layer list: InputEmbed + N TransformerLayers + OutputEmbed, or, with sep_layer=True, InputEmbed + N x (SelfAttention + MLP) + OutputEmbed. It allocates CUDA streams for asynchronous weight and cache loading and creates buffer arrays for cache_home, cache_read_buf, cache_write_buf, weight_read_buf, and attention_mask. Finally, it calls init_all_weights() to load weights from numpy files and distribute them across devices according to the policy.
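The layer-list construction described above can be sketched in plain Python. This is a simplified illustration, not the library's actual code; the real layer classes live in flexllmgen.flex_opt:

```python
def build_layer_names(num_hidden_layers: int, sep_layer: bool) -> list:
    """Sketch of the layer list OptLM.__init__ assembles.

    With sep_layer=True, each transformer block is split into a
    SelfAttention layer and an MLP layer so the two halves can be
    loaded and offloaded independently.
    """
    layers = ["InputEmbed"]
    for _ in range(num_hidden_layers):
        if sep_layer:
            layers += ["SelfAttention", "MLP"]
        else:
            layers.append("TransformerLayer")
    layers.append("OutputEmbed")
    return layers

# 2 transformer blocks, split: 1 + 2*2 + 1 = 6 entries
print(build_layer_names(2, sep_layer=True))
```

Splitting attention and MLP doubles the number of schedulable units, which is what lets the policy overlap a layer's weight prefetch with the previous half-layer's compute.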

Usage

Create an OptLM instance only after the ExecutionEnv and Policy are ready. If the weights are not found at the given path, they are downloaded automatically.

Code Reference

Field   Value
Source  flexllmgen/flex_opt.py, Lines: 582-637
Import  from flexllmgen.flex_opt import OptLM

Signature:

class OptLM:
    def __init__(self,
                 config: Union[str, OptConfig],
                 env: ExecutionEnv,
                 path: str,
                 policy: Policy):
        """
        Args:
            config: Model config or name string (auto-resolved via get_opt_config)
            env: ExecutionEnv with GPU/CPU/disk device handles
            path: Directory containing numpy weight files
            policy: Policy controlling memory offloading
        """

I/O Contract

Inputs:

Name    Type                   Required  Description
config  Union[str, OptConfig]  Yes       Model config or name
env     ExecutionEnv           Yes       Hardware environment
path    str                    Yes       Weight directory
policy  Policy                 Yes       Offloading policy

Outputs:

OptLM (model instance with layers distributed across GPU/CPU/disk, CUDA streams allocated, weights loaded)

Usage Examples

from flexllmgen.flex_opt import OptLM, Policy, ExecutionEnv
from flexllmgen.compression import CompressionConfig

env = ExecutionEnv.create("~/flexllmgen_offload_dir")
policy = Policy(gpu_batch_size=2, num_gpu_batches=1,
                w_gpu_percent=50, w_cpu_percent=30,
                cache_gpu_percent=50, cache_cpu_percent=30,
                act_gpu_percent=100, act_cpu_percent=0,
                overlap=True, sep_layer=True, pin_weight=True,
                cpu_cache_compute=False, attn_sparsity=1.0,
                compress_weight=False,
                comp_weight_config=CompressionConfig(num_bits=4, group_size=64, group_dim=0, symmetric=False),
                compress_cache=False,
                comp_cache_config=CompressionConfig(num_bits=4, group_size=64, group_dim=2, symmetric=False))

model = OptLM("facebook/opt-30b", env, "~/opt_weights", policy)
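The placement percentages in the Policy above describe a three-way split: whatever share of the weights or KV cache is not assigned to GPU or CPU spills to disk. A minimal sketch of that arithmetic (FlexLLMGen derives the disk shares the same way; the helper name here is hypothetical):

```python
def disk_percent(gpu_percent: float, cpu_percent: float) -> float:
    """Remaining share that spills to disk in a GPU/CPU/disk split."""
    assert 0 <= gpu_percent + cpu_percent <= 100
    return 100 - gpu_percent - cpu_percent

# Policy above: w_gpu_percent=50, w_cpu_percent=30 -> 20% of weights on disk;
# likewise cache_gpu_percent=50, cache_cpu_percent=30 -> 20% of KV cache on disk.
print(disk_percent(50, 30))  # 20
```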
