
Environment:FMInference FlexLLMGen CUDA GPU

From Leeroopedia


Knowledge Sources
Domains Infrastructure, LLM_Inference
Last Updated 2026-02-09 12:00 GMT

Overview

A Linux environment with an NVIDIA CUDA-capable GPU, Python >= 3.7, PyTorch >= 1.12, and Transformers >= 4.24, used to run FlexLLMGen offloaded inference.

Description

FlexLLMGen requires an NVIDIA GPU with CUDA support as the primary compute device. The system uses a three-tier memory hierarchy (GPU, CPU, disk) where the GPU executes all tensor computations including attention, MLP, and embedding layers. CUDA streams are used for overlapping I/O with compute in the distributed backend. All GPU tensors use float16 precision. The system hardcodes device cuda:0 as the primary GPU device.
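The percent-based placement behind this three-tier hierarchy can be illustrated with a small, self-contained sketch. The function name and layout below are hypothetical, not FlexLLMGen's actual API; the real engine controls placement via the --percent flag:

```python
def split_across_tiers(num_elements: int, gpu_pct: int, cpu_pct: int) -> dict:
    """Partition a tensor's elements across the GPU/CPU/disk tiers.

    Mirrors the idea behind FlexLLMGen's percent-based placement:
    whatever does not fit in the first two tiers spills to disk.
    (Illustrative only; not the library's real partitioning code.)
    """
    assert 0 <= gpu_pct and 0 <= cpu_pct and gpu_pct + cpu_pct <= 100
    gpu_n = num_elements * gpu_pct // 100
    cpu_n = num_elements * cpu_pct // 100
    disk_n = num_elements - gpu_n - cpu_n
    return {"cuda:0": gpu_n, "cpu": cpu_n, "disk": disk_n}

# Example: keep 20% of the weights on GPU, 50% on CPU, spill 30% to disk.
placement = split_across_tiers(1_000_000, gpu_pct=20, cpu_pct=50)
```

Because all computation is pinned to cuda:0, tiles stored on CPU or disk are streamed to the GPU on demand, which is why the GPU remains mandatory even when 0% of the weights are GPU-resident.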

Usage

Use this environment for all FlexLLMGen inference workloads, including single-GPU offloaded inference, text completion, HELM benchmark evaluation, and data wrangling batch inference. The GPU is mandatory even when most weights are offloaded to CPU or disk, because all computation (attention, MLP, output embedding) runs on the GPU.

System Requirements

Category   | Requirement           | Notes
OS         | Linux                 | Tested on Ubuntu; NVMe scripts target the xfs filesystem
Hardware   | NVIDIA GPU with CUDA  | Minimum 16 GB VRAM for OPT-6.7B fully on GPU; T4 16 GB used in benchmarks
CPU Memory | 16 GB+ DRAM           | 90 GB+ needed for OPT-30B with CPU offloading; 208 GB used in benchmarks
Python     | >= 3.7                | Specified in pyproject.toml
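The VRAM and DRAM figures above follow from simple fp16 arithmetic: each parameter costs 2 bytes, so a model's weight footprint is roughly twice its parameter count. A back-of-the-envelope sketch (real usage adds the KV cache and activations on top):

```python
BYTES_PER_FP16 = 2

def weight_gb(num_params: float) -> float:
    """Approximate fp16 weight footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_FP16 / 1e9

opt_6_7b = weight_gb(6.7e9)   # ~13.4 GB -> fits a 16 GB T4 fully on GPU
opt_30b = weight_gb(30e9)     # ~60 GB   -> must offload; needs 90 GB+ CPU DRAM
```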

Dependencies

System Packages

  • NVIDIA GPU driver with CUDA support
  • xfs filesystem tools (for NVMe mount scripts)
  • lvm2 (for GCP RAID-0 NVMe striping)

Python Packages

  • torch >= 1.12
  • transformers >= 4.24
  • numpy
  • tqdm
  • pulp
  • attrs
  • huggingface_hub (for model weight downloading via snapshot_download)

Credentials

No API tokens are strictly required for the core inference engine. See Environment:FMInference_FlexLLMGen_HuggingFace_Access for model download credentials.

Quick Install

# Install FlexLLMGen and all core dependencies
pip install flexllmgen

# Or install the dependencies manually (quote the version specifiers so
# the shell does not treat ">" as an output redirect)
pip install "torch>=1.12" "transformers>=4.24" numpy tqdm pulp attrs huggingface_hub

Code Evidence

GPU device hardcoded in flexllmgen/utils.py:46:

gpu = TorchDevice("cuda:0")

Three-tier execution environment factory in flexllmgen/utils.py:42-49:

@classmethod
def create(cls, offload_dir):
    from flexllmgen.pytorch_backend import TorchDevice, TorchDisk, TorchMixedDevice
    gpu = TorchDevice("cuda:0")
    cpu = TorchDevice("cpu")
    disk = TorchDisk(offload_dir)
    return cls(gpu=gpu, cpu=cpu, disk=disk, mixed=TorchMixedDevice([gpu, cpu, disk]))

CUDA memory statistics in flexllmgen/pytorch_backend.py:589-597:

def mem_stats(self):
    if self.device_type == DeviceType.CUDA:
        cur_mem = torch.cuda.memory_allocated(self.dev)
        peak_mem = torch.cuda.max_memory_allocated(self.dev)
    elif self.device_type == DeviceType.CPU:
        cur_mem = cpu_mem_stats()
        peak_mem = 0
    else:
        raise NotImplementedError()

Python and package versions from pyproject.toml:10-17:

requires-python = ">=3.7"
dependencies = [
    "torch>=1.12", "transformers>=4.24",
    "numpy", "tqdm", "pulp", "attrs",
]

Common Errors

Error Message                               | Cause                                                | Solution
CUDA out of memory                          | GPU VRAM insufficient for model + cache + activations | Tune --percent to offload more to CPU/disk; use --compress-weight; use --pin-weight 0
RuntimeError: No CUDA GPUs are available    | No NVIDIA GPU detected                               | Ensure the NVIDIA driver and CUDA toolkit are installed
AssertionError on data.device == device.dev | Tensor on wrong device                               | Check that cuda:0 is available and not out of memory
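For the CUDA out of memory case, a quick way to pick a starting GPU weight percentage is to compute what fraction of the fp16 weights actually fits in free VRAM, leaving headroom for the KV cache and activations. The helper below is a hypothetical heuristic, not part of FlexLLMGen; treat its output as a starting point to tune from:

```python
def max_gpu_weight_percent(vram_gb: float, model_params: float,
                           headroom_gb: float = 4.0) -> int:
    """Largest whole percent of fp16 weights that fits in VRAM minus headroom.

    Hypothetical helper, not part of FlexLLMGen; the headroom covers the
    KV cache, activations, and CUDA allocator overhead.
    """
    weights_gb = model_params * 2 / 1e9          # fp16 = 2 bytes/param
    usable_gb = max(vram_gb - headroom_gb, 0.0)
    pct = int(usable_gb / weights_gb * 100)
    return min(pct, 100)

# 16 GB T4 with OPT-30B (~60 GB of weights): start around 20% on GPU.
pct = max_gpu_weight_percent(vram_gb=16, model_params=30e9)
```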

Compatibility Notes

  • Single GPU only: The core engine hardcodes cuda:0. Multi-GPU requires the separate dist_flex_opt.py distributed backend.
  • Distributed backend: Uses NCCL for GPU communication and Gloo for CPU communication via torch.distributed.
  • Float16 only: All GPU tensors and KV cache use np.float16. CPU attention workspace uses np.float32 for precision.
  • Pinned memory: CPU tensors default to pinned memory for faster CPU-GPU transfers. GPU tensors never use pinned memory.
  • Roadmap items: MacBook (M1/M2) and AMD GPU support are listed as planned but not yet implemented.
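The float16/float32 split in the notes above exists because float16 loses integer precision past 2048, so long accumulations (such as attention reductions) can silently stall. A small numpy demonstration (numpy is already a listed dependency):

```python
import numpy as np

# float16 has an 11-bit significand: 2049 is not representable, so
# adding 1 to 2048 rounds straight back to 2048 under round-to-even.
half = np.float16(2048) + np.float16(1)      # stays 2048.0 in fp16
single = np.float32(2048) + np.float32(1)    # 2049.0 in fp32

# This is why the CPU attention workspace accumulates in float32 even
# though weights and the KV cache are stored as float16.
```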

Related Pages

  • Environment:FMInference_FlexLLMGen_HuggingFace_Access (model download credentials)