Environment:FMInference FlexLLMGen CUDA GPU
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, LLM_Inference |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Linux environment with an NVIDIA CUDA-capable GPU, Python >= 3.7, PyTorch >= 1.12, and Transformers >= 4.24 for running FlexLLMGen offloaded inference.
Description
FlexLLMGen requires an NVIDIA GPU with CUDA support as the primary compute device. The system uses a three-tier memory hierarchy (GPU, CPU, disk) where the GPU executes all tensor computations including attention, MLP, and embedding layers. CUDA streams are used for overlapping I/O with compute in the distributed backend. All GPU tensors use float16 precision. The system hardcodes device cuda:0 as the primary GPU device.
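The percentage-based split across the three tiers can be illustrated with a small placement sketch. This is a hedged illustration of the offloading idea only, not FlexLLMGen's actual policy code; the function name and signature are invented for this example.

```python
# Illustrative sketch of percentage-based weight placement across the
# three-tier hierarchy (GPU, CPU, disk). Not FlexLLMGen's actual code.
def split_by_percent(num_bytes: int, gpu_pct: int, cpu_pct: int) -> tuple:
    """Return bytes placed on (gpu, cpu, disk); disk absorbs the remainder."""
    assert 0 <= gpu_pct and 0 <= cpu_pct and gpu_pct + cpu_pct <= 100
    gpu = num_bytes * gpu_pct // 100
    cpu = num_bytes * cpu_pct // 100
    return gpu, cpu, num_bytes - gpu - cpu

# Place 20% of a 13 GB weight buffer on GPU, 50% on CPU, the rest on disk.
print(split_by_percent(13_000_000_000, 20, 50))
# -> (2600000000, 6500000000, 3900000000)
```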
Usage
Use this environment for all FlexLLMGen inference workloads, including single-GPU offloaded inference, text completion, HELM benchmark evaluation, and data wrangling batch inference. The GPU is mandatory even when most weights are offloaded to CPU or disk, because all computation (attention, MLP, output embedding) runs on the GPU.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | Tested on Ubuntu; NVMe scripts target xfs filesystem |
| Hardware | NVIDIA GPU with CUDA | Minimum 16GB VRAM for OPT-6.7B fully on GPU; T4 16GB used in benchmarks |
| CPU Memory | 16GB+ DRAM | 90GB+ needed for OPT-30B with CPU offloading; 208GB used in benchmarks |
| Python | >= 3.7 | Specified in pyproject.toml |
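A minimal preflight check against the table above can be sketched in Python (thresholds are taken from this page; the helper name is illustrative):

```python
import sys
import importlib.util

def preflight() -> dict:
    """Check the software requirements listed in the table above (sketch)."""
    return {
        "python>=3.7": sys.version_info >= (3, 7),
        "torch installed": importlib.util.find_spec("torch") is not None,
        "transformers installed": importlib.util.find_spec("transformers") is not None,
    }

for check, ok in preflight().items():
    print(f"{check}: {'ok' if ok else 'MISSING'}")
```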
Dependencies
System Packages
- NVIDIA GPU driver with CUDA support
- xfs filesystem tools (for NVMe mount scripts)
- lvm2 (for GCP RAID-0 NVMe striping)
Python Packages
- torch >= 1.12
- transformers >= 4.24
- numpy
- tqdm
- pulp
- attrs
- huggingface_hub (for model weight downloading via snapshot_download)
Credentials
No API tokens are strictly required for the core inference engine. See Environment:FMInference_FlexLLMGen_HuggingFace_Access for model download credentials.
Quick Install
# Install FlexLLMGen and all core dependencies
pip install flexllmgen
# Or install the dependencies manually (quote the specifiers so the shell
# does not treat ">" as output redirection)
pip install "torch>=1.12" "transformers>=4.24" numpy tqdm pulp attrs huggingface_hub
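After either install path, a quick sanity check confirms that the core packages resolve in the current interpreter (a sketch; the loop and constant name below are illustrative):

```python
import importlib.util

# Core dependencies from pyproject.toml; verify each one is importable.
CORE_PACKAGES = ("torch", "transformers", "numpy", "tqdm", "pulp", "attrs")

def check_install() -> dict:
    """Map each core package name to whether it can be found."""
    return {pkg: importlib.util.find_spec(pkg) is not None for pkg in CORE_PACKAGES}

for pkg, found in check_install().items():
    print(f"{pkg}: {'found' if found else 'MISSING'}")
```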
Code Evidence
GPU device hardcoded in flexllmgen/utils.py:46:
gpu = TorchDevice("cuda:0")
Three-tier execution environment factory in flexllmgen/utils.py:42-49:
@classmethod
def create(cls, offload_dir):
    from flexllmgen.pytorch_backend import TorchDevice, TorchDisk, TorchMixedDevice
    gpu = TorchDevice("cuda:0")
    cpu = TorchDevice("cpu")
    disk = TorchDisk(offload_dir)
    return cls(gpu=gpu, cpu=cpu, disk=disk, mixed=TorchMixedDevice([gpu, cpu, disk]))
CUDA memory statistics in flexllmgen/pytorch_backend.py:589-597:
def mem_stats(self):
    if self.device_type == DeviceType.CUDA:
        cur_mem = torch.cuda.memory_allocated(self.dev)
        peak_mem = torch.cuda.max_memory_allocated(self.dev)
    elif self.device_type == DeviceType.CPU:
        cur_mem = cpu_mem_stats()
        peak_mem = 0
    else:
        raise NotImplementedError()
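The cpu_mem_stats() helper in the CPU branch is FlexLLMGen's own; a rough standard-library analogue can be sketched as follows (an approximation for illustration, not the library's implementation):

```python
import resource
import sys

def peak_rss_bytes() -> int:
    """Approximate this process's peak resident set size in bytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux but in bytes on macOS.
    return peak * 1024 if sys.platform.startswith("linux") else peak

print(peak_rss_bytes())
```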
Python and package versions from pyproject.toml:10-17:
requires-python = ">=3.7"
dependencies = [
"torch>=1.12", "transformers>=4.24",
"numpy", "tqdm", "pulp", "attrs",
]
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| CUDA out of memory | GPU VRAM insufficient for model + cache + activations | Tune --percent to offload more to CPU/disk; use --compress-weight; use --pin-weight 0 |
| RuntimeError: No CUDA GPUs are available | No NVIDIA GPU detected | Ensure NVIDIA driver and CUDA toolkit are installed |
| AssertionError on data.device == device.dev | Tensor on wrong device | Check that cuda:0 is available and not out of memory |
Compatibility Notes
- Single GPU only: The core engine hardcodes cuda:0. Multi-GPU requires the separate dist_flex_opt.py distributed backend.
- Distributed backend: Uses NCCL for GPU communication and Gloo for CPU communication via torch.distributed.
- Float16 only: All GPU tensors and KV cache use np.float16. The CPU attention workspace uses np.float32 for precision.
- Pinned memory: CPU tensors default to pinned memory for faster CPU-GPU transfers. GPU tensors never use pinned memory.
- Roadmap items: Macbook (M1/M2) and AMD GPU support are listed as planned but not yet implemented.