
Environment:FMInference FlexLLMGen CUDA GPU

From Leeroopedia


Knowledge Sources
Domains Infrastructure, LLM_Inference
Last Updated 2026-02-09 12:00 GMT

Overview

A Linux environment with an NVIDIA CUDA-capable GPU, Python >= 3.7, PyTorch >= 1.12, and Transformers >= 4.24, used to run FlexLLMGen offloaded inference.

Description

FlexLLMGen requires an NVIDIA GPU with CUDA support as the primary compute device. The system uses a three-tier memory hierarchy (GPU, CPU, disk) where the GPU executes all tensor computations including attention, MLP, and embedding layers. CUDA streams are used for overlapping I/O with compute in the distributed backend. All GPU tensors use float16 precision. The system hardcodes device cuda:0 as the primary GPU device.
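The percent-based placement behind this three-tier hierarchy can be illustrated with a small, self-contained sketch. The function name and layout below are hypothetical, not FlexLLMGen's actual API; the real engine controls placement via the --percent flag:

```python
def split_across_tiers(num_elements: int, gpu_pct: int, cpu_pct: int) -> dict:
    """Partition a tensor's elements across the GPU/CPU/disk tiers.

    Mirrors the idea behind FlexLLMGen's percent-based placement:
    whatever does not fit in the first two tiers spills to disk.
    (Illustrative only; not the library's real partitioning code.)
    """
    assert 0 <= gpu_pct and 0 <= cpu_pct and gpu_pct + cpu_pct <= 100
    gpu_n = num_elements * gpu_pct // 100
    cpu_n = num_elements * cpu_pct // 100
    disk_n = num_elements - gpu_n - cpu_n
    return {"cuda:0": gpu_n, "cpu": cpu_n, "disk": disk_n}

# Example: keep 20% of the weights on GPU, 50% on CPU, spill 30% to disk.
placement = split_across_tiers(1_000_000, gpu_pct=20, cpu_pct=50)
```

Because all computation is pinned to cuda:0, tiles stored on CPU or disk are streamed to the GPU on demand, which is why the GPU remains mandatory even when 0% of the weights are GPU-resident.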

Usage

Use this environment for all FlexLLMGen inference workloads, including single-GPU offloaded inference, text completion, HELM benchmark evaluation, and data wrangling batch inference. The GPU is mandatory even when most weights are offloaded to CPU or disk, because all computation (attention, MLP, output embedding) runs on the GPU.

System Requirements

Category   | Requirement           | Notes
OS         | Linux                 | Tested on Ubuntu; NVMe scripts target the xfs filesystem
Hardware   | NVIDIA GPU with CUDA  | Minimum 16 GB VRAM for OPT-6.7B fully on GPU; T4 16 GB used in benchmarks
CPU Memory | 16 GB+ DRAM           | 90 GB+ needed for OPT-30B with CPU offloading; 208 GB used in benchmarks
Python     | >= 3.7                | Specified in pyproject.toml
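The VRAM and DRAM figures above follow from simple fp16 arithmetic: each parameter costs 2 bytes, so a model's weight footprint is roughly twice its parameter count. A back-of-the-envelope sketch (real usage adds the KV cache and activations on top):

```python
BYTES_PER_FP16 = 2

def weight_gb(num_params: float) -> float:
    """Approximate fp16 weight footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_FP16 / 1e9

opt_6_7b = weight_gb(6.7e9)   # ~13.4 GB -> fits a 16 GB T4 fully on GPU
opt_30b = weight_gb(30e9)     # ~60 GB   -> must offload; needs 90 GB+ CPU DRAM
```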

Dependencies

System Packages

  • NVIDIA GPU driver with CUDA support
  • xfs filesystem tools (for NVMe mount scripts)
  • lvm2 (for GCP RAID-0 NVMe striping)

Python Packages

  • torch >= 1.12
  • transformers >= 4.24
  • numpy
  • tqdm
  • pulp
  • attrs
  • huggingface_hub (for model weight downloading via snapshot_download)

Credentials

No API tokens are strictly required for the core inference engine. See Environment:FMInference_FlexLLMGen_HuggingFace_Access for model download credentials.

Quick Install

# Install FlexLLMGen and all core dependencies
pip install flexllmgen

# Or install the dependencies manually (quote the version specifiers so
# the shell does not treat ">" as an output redirect)
pip install "torch>=1.12" "transformers>=4.24" numpy tqdm pulp attrs huggingface_hub

Code Evidence

GPU device hardcoded in flexllmgen/utils.py:46:

gpu = TorchDevice("cuda:0")

Three-tier execution environment factory in flexllmgen/utils.py:42-49:

@classmethod
def create(cls, offload_dir):
    from flexllmgen.pytorch_backend import TorchDevice, TorchDisk, TorchMixedDevice
    gpu = TorchDevice("cuda:0")
    cpu = TorchDevice("cpu")
    disk = TorchDisk(offload_dir)
    return cls(gpu=gpu, cpu=cpu, disk=disk, mixed=TorchMixedDevice([gpu, cpu, disk]))

CUDA memory statistics in flexllmgen/pytorch_backend.py:589-597:

def mem_stats(self):
    if self.device_type == DeviceType.CUDA:
        cur_mem = torch.cuda.memory_allocated(self.dev)
        peak_mem = torch.cuda.max_memory_allocated(self.dev)
    elif self.device_type == DeviceType.CPU:
        cur_mem = cpu_mem_stats()
        peak_mem = 0
    else:
        raise NotImplementedError()

Python and package versions from pyproject.toml:10-17:

requires-python = ">=3.7"
dependencies = [
    "torch>=1.12", "transformers>=4.24",
    "numpy", "tqdm", "pulp", "attrs",
]

Common Errors

Error Message                               | Cause                                                | Solution
CUDA out of memory                          | GPU VRAM insufficient for model + cache + activations | Tune --percent to offload more to CPU/disk; use --compress-weight; use --pin-weight 0
RuntimeError: No CUDA GPUs are available    | No NVIDIA GPU detected                               | Ensure the NVIDIA driver and CUDA toolkit are installed
AssertionError on data.device == device.dev | Tensor on wrong device                               | Check that cuda:0 is available and not out of memory
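For the CUDA out of memory case, a quick way to pick a starting GPU weight percentage is to compute what fraction of the fp16 weights actually fits in free VRAM, leaving headroom for the KV cache and activations. The helper below is a hypothetical heuristic, not part of FlexLLMGen; treat its output as a starting point to tune from:

```python
def max_gpu_weight_percent(vram_gb: float, model_params: float,
                           headroom_gb: float = 4.0) -> int:
    """Largest whole percent of fp16 weights that fits in VRAM minus headroom.

    Hypothetical helper, not part of FlexLLMGen; the headroom covers the
    KV cache, activations, and CUDA allocator overhead.
    """
    weights_gb = model_params * 2 / 1e9          # fp16 = 2 bytes/param
    usable_gb = max(vram_gb - headroom_gb, 0.0)
    pct = int(usable_gb / weights_gb * 100)
    return min(pct, 100)

# 16 GB T4 with OPT-30B (~60 GB of weights): start around 20% on GPU.
pct = max_gpu_weight_percent(vram_gb=16, model_params=30e9)
```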

Compatibility Notes

  • Single GPU only: The core engine hardcodes cuda:0. Multi-GPU requires the separate dist_flex_opt.py distributed backend.
  • Distributed backend: Uses NCCL for GPU communication and Gloo for CPU communication via torch.distributed.
  • Float16 only: All GPU tensors and KV cache use np.float16. CPU attention workspace uses np.float32 for precision.
  • Pinned memory: CPU tensors default to pinned memory for faster CPU-GPU transfers. GPU tensors never use pinned memory.
  • Roadmap items: MacBook (M1/M2) and AMD GPU support are listed as planned but not yet implemented.
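The float16/float32 split in the notes above exists because float16 loses integer precision past 2048, so long accumulations (such as attention reductions) can silently stall. A small numpy demonstration (numpy is already a listed dependency):

```python
import numpy as np

# float16 has an 11-bit significand: 2049 is not representable, so
# adding 1 to 2048 rounds straight back to 2048 under round-to-even.
half = np.float16(2048) + np.float16(1)      # stays 2048.0 in fp16
single = np.float32(2048) + np.float32(1)    # 2049.0 in fp32

# This is why the CPU attention workspace accumulates in float32 even
# though weights and the KV cache are stored as float16.
```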

Related Pages

  • Environment:FMInference_FlexLLMGen_HuggingFace_Access (model download credentials)