
Principle:FMInference FlexLLMGen Execution Environment Initialization

From Leeroopedia


Field          Value
Sources        Repo: FlexLLMGen
Domains        System_Initialization, Hardware_Abstraction
Last Updated   2026-02-09 00:00 GMT

Overview

A hardware abstraction pattern that creates a unified interface over GPU, CPU, and disk devices to support transparent tensor placement across a three-tier memory hierarchy.

Description

Before loading a model, the system must initialize device handles for each tier of the memory hierarchy:

  • GPU device -- A CUDA-capable GPU for compute-intensive operations and high-speed tensor storage.
  • CPU device -- System DRAM, with optional pinned (page-locked) memory for faster GPU transfers.
  • Disk device -- An NVMe SSD-backed directory for storing tensors that do not fit in GPU or CPU memory.

These device handles are bundled into an ExecutionEnv frozen dataclass that is passed to all model components. The environment provides a uniform interface for tensor allocation, loading, and storage regardless of which physical device backs the operation.
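The bundling described above can be sketched as a frozen dataclass. A minimal sketch, assuming the field layout the page describes (`gpu`, `cpu`, `disk`, `mixed`); `StubDevice` is a hypothetical stand-in for the real `TorchDevice` / `TorchDisk` handles:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class StubDevice:
    """Hypothetical stand-in for a real device handle (TorchDevice / TorchDisk)."""
    name: str

    def allocate(self, shape):
        # A real handle would return a tensor placed on this device;
        # here we just record where the allocation would land.
        return {"device": self.name, "shape": shape}

@dataclass(frozen=True)          # frozen: device assignments cannot change
class ExecutionEnv:
    gpu: Any = None
    cpu: Any = None
    disk: Any = None
    mixed: Any = None

env = ExecutionEnv(gpu=StubDevice("cuda:0"),
                   cpu=StubDevice("cpu"),
                   disk=StubDevice("/tmp/offload"))

buf = env.gpu.allocate((4, 4))   # same interface regardless of tier
# env.gpu = StubDevice("cuda:1") # would raise FrozenInstanceError
```

Because the dataclass is frozen, any attempt to reassign a device handle after construction raises `FrozenInstanceError`, which enforces the immutable-configuration property listed below.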

The key design properties of this pattern are:

  • Transparent placement -- Model components interact with device handles through a common interface, without needing to know the underlying hardware specifics.
  • Mixed-device support -- A special TorchMixedDevice handle enables tensors to be split across multiple devices simultaneously.
  • One-time initialization -- Device handles are created once at startup and reused throughout the inference session, so no per-operation setup cost is incurred.
  • Immutable configuration -- The frozen dataclass ensures device assignments cannot change during inference.

Usage

Always initialize ExecutionEnv as the first step before creating an OptLM model. The offload directory should be on a fast NVMe SSD for best performance. The environment object is required by both the model constructor and the inference loop.

Typical initialization order:

  • Create ExecutionEnv with an offload directory path.
  • Create a Policy specifying tensor placement percentages.
  • Resolve model configuration via get_opt_config().
  • Initialize OptLM with the environment, policy, and config.
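The four steps above can be traced with runnable stand-ins; each stub records when it runs, mirroring the real calls (`ExecutionEnv`, `Policy`, `get_opt_config`, `OptLM`) without requiring a GPU or the FlexLLMGen package. The stub names, arguments, and model string are hypothetical:

```python
calls = []

def create_env(offload_dir):        # stands in for ExecutionEnv creation
    calls.append("env")
    return {"offload_dir": offload_dir}

def make_policy(w_gpu, w_cpu):      # stands in for Policy(...); the real
    calls.append("policy")          # Policy takes many more placement fields
    return {"w_gpu_percent": w_gpu, "w_cpu_percent": w_cpu}

def get_config(name):               # stands in for get_opt_config(name)
    calls.append("config")
    return {"model": name}

class Model:                        # stands in for OptLM
    def __init__(self, config, env, policy):
        calls.append("model")
        self.config, self.env, self.policy = config, env, policy

# The order matters: the environment must exist before the model is built.
env = create_env("/scratch/offload")
policy = make_policy(w_gpu=20, w_cpu=80)
config = get_config("opt-6.7b")
model = Model(config, env, policy)
print(calls)  # ['env', 'policy', 'config', 'model']
```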

Theoretical Basis

The execution environment abstracts hardware into a linear three-tier device hierarchy:

GPU <-> CPU <-> Disk

Each device handle provides core operations:

  • TorchDevice (for GPU and CPU) -- Wraps PyTorch device semantics with allocate, load, and store operations.
  • TorchDisk (for disk) -- Provides file-backed tensor storage with memory-mapped I/O.
  • TorchMixedDevice -- Handles tensors that are split across multiple devices, coordinating transfers as needed.

Data flows between adjacent tiers: GPU transfers to/from CPU, and CPU transfers to/from disk. Direct GPU-to-disk transfers are not supported; data must pass through CPU as an intermediary. This hierarchical transfer model matches the physical hardware topology (PCIe for GPU-CPU, NVMe for CPU-disk).
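The adjacent-tier transfer rule can be sketched as a small routing helper: direct moves are allowed only between neighboring tiers, so a disk-to-GPU move is staged through CPU. The tier ordering follows the text; the helper itself is hypothetical, not part of FlexLLMGen:

```python
# Tier order matches the physical topology: PCIe links GPU and CPU,
# NVMe links CPU and disk.
TIERS = ["gpu", "cpu", "disk"]

def transfer_path(src, dst):
    """Return the chain of tiers a tensor crosses going from src to dst."""
    i, j = TIERS.index(src), TIERS.index(dst)
    if i <= j:
        return TIERS[i:j + 1]
    return TIERS[j:i + 1][::-1]   # reverse direction through the same chain

print(transfer_path("gpu", "cpu"))   # ['gpu', 'cpu']         (direct, PCIe)
print(transfer_path("disk", "gpu"))  # ['disk', 'cpu', 'gpu'] (staged via CPU)
```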
