Implementation:FMInference FlexLLMGen OptLM Init
Metadata
| Field | Value |
|---|---|
| Repo | FlexLLMGen |
Domains
- Inference_Optimization
- Model_Loading
Overview
Constructor for loading OPT models with three-tier (GPU/CPU/disk) memory offloading, provided by the FlexLLMGen library.
Description
OptLM.__init__ takes a model config (string or OptConfig), an ExecutionEnv, a weight path, and a Policy. It builds the layer list (InputEmbed + N TransformerLayers + OutputEmbed, or, with sep_layer=True, InputEmbed + N*(SelfAttention + MLP) + OutputEmbed). It allocates CUDA streams for asynchronous weight and cache loading and creates buffer arrays for cache_home, cache_read_buf, cache_write_buf, weight_read_buf, and attention_mask. Finally, it calls init_all_weights() to load weights from numpy files and distribute them across devices per the policy.
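The layer-list construction described above can be sketched as follows. This is an illustrative model of the structure only: build_layer_names and the string labels are hypothetical, not the library's actual classes.

```python
def build_layer_names(num_layers: int, sep_layer: bool) -> list:
    """Sketch of the layer list OptLM.__init__ builds (labels are illustrative)."""
    layers = ["InputEmbed"]
    if sep_layer:
        # sep_layer=True splits each transformer block into attention and MLP,
        # so they can be scheduled and offloaded independently.
        for _ in range(num_layers):
            layers += ["SelfAttention", "MLP"]
    else:
        layers += ["TransformerLayer"] * num_layers
    layers.append("OutputEmbed")
    return layers
```

For a model with N transformer blocks, this yields N + 2 entries with sep_layer=False and 2N + 2 entries with sep_layer=True.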
Usage
Instantiate after the ExecutionEnv and Policy are constructed. If weights are not found at the given path, they are downloaded automatically.
Code Reference
| Field | Value |
|---|---|
| Source | flexllmgen/flex_opt.py, Lines: 582-637 |
| Import | from flexllmgen.flex_opt import OptLM |
Signature:
class OptLM:
    def __init__(self,
                 config: Union[str, OptConfig],
                 env: ExecutionEnv,
                 path: str,
                 policy: Policy):
        """
        Args:
            config: Model config or name string (auto-resolved via get_opt_config)
            env: ExecutionEnv with GPU/CPU/disk device handles
            path: Directory containing numpy weight files
            policy: Policy controlling memory offloading
        """
I/O Contract
Inputs:
| Name | Type | Required | Description |
|---|---|---|---|
| config | Union[str, OptConfig] | Yes | Model config or name |
| env | ExecutionEnv | Yes | Hardware environment |
| path | str | Yes | Weight directory |
| policy | Policy | Yes | Offloading policy |
Outputs:
OptLM (model instance with layers distributed across GPU/CPU/disk, CUDA streams allocated, weights loaded)
Usage Examples
from flexllmgen.flex_opt import OptLM, Policy, CompressionConfig
from flexllmgen.utils import ExecutionEnv

env = ExecutionEnv.create("~/flexllmgen_offload_dir")
policy = Policy(gpu_batch_size=2, num_gpu_batches=1,
                w_gpu_percent=50, w_cpu_percent=30,
                cache_gpu_percent=50, cache_cpu_percent=30,
                act_gpu_percent=100, act_cpu_percent=0,
                overlap=True, sep_layer=True, pin_weight=True,
                cpu_cache_compute=False, attn_sparsity=1.0,
                compress_weight=False,
                comp_weight_config=CompressionConfig(num_bits=4, group_size=64,
                                                     group_dim=0, symmetric=False),
                compress_cache=False,
                comp_cache_config=CompressionConfig(num_bits=4, group_size=64,
                                                    group_dim=2, symmetric=False))
model = OptLM("facebook/opt-30b", env, "~/opt_weights", policy)