Heuristic: mlc-ai/mlc-llm OpenCL Memory Floor Workaround
| Knowledge Sources | |
|---|---|
| Domains | Debugging, Mobile, Memory_Management |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
Android OpenCL runtime under-reports available GPU memory; MLC-LLM applies a 5GB minimum floor to enable 7B/8B model deployment on capable devices.
Description
The OpenCL runtime on Android devices reports a `kTotalGlobalMemory` value that is significantly smaller than the actual available GPU memory. Without correction, this causes the engine memory estimator to conclude that there is insufficient memory for 7B/8B parameter models, even on devices that can run them. MLC-LLM works around this by applying a minimum floor of 5GB for all OpenCL devices, ensuring that the memory budget calculation uses a realistic value.
Usage
Be aware of this workaround when debugging memory estimation issues on Android, or when the engine reports more available memory than the device claims. This is expected behavior on OpenCL devices.
The Insight (Rule of Thumb)
- Action: No user action needed. The 5GB floor is applied automatically for OpenCL devices.
- Value: Minimum 5GB (`5 * 1024 * 1024 * 1024` bytes) reported GPU memory.
- Trade-off: May over-allocate memory on devices with less than 5GB actual GPU memory, potentially causing OOM. However, devices with less than 5GB GPU memory are unlikely to run 7B models regardless.
- Debugging tip: If a model fails on Android with OOM despite the memory floor, the device genuinely lacks sufficient memory. Try smaller models or more aggressive quantization.
Reasoning
The OpenCL standard does not guarantee that `CL_DEVICE_GLOBAL_MEM_SIZE` reports the full GPU memory available to a process. On Android with shared (unified) memory architectures, the reported value often reflects the OpenCL buffer allocation limit rather than total physical memory. Since MLC-LLM's memory estimator uses this value to determine KV cache capacity, an artificially low value would prevent the engine from utilizing available resources. The 5GB floor was chosen as the minimum viable memory for running 7B/8B quantized models.
```cpp
// From config.cc:28-41
uint64_t TotalDetectGlobalMemory(DLDevice device) {
  tvm::ffi::Any rv;
  DeviceAPI::Get(device)->GetAttr(device, DeviceAttrKind::kTotalGlobalMemory, &rv);
  int64_t gpu_size_bytes = rv.cast<int64_t>();
  // Since the memory size returned by the OpenCL runtime is smaller than the actual available
  // memory space, we set a best available space so that MLC LLM can run 7B or 8B models on Android
  // with OpenCL.
  if (device.device_type == kDLOpenCL) {
    int64_t min_size_bytes = 5LL * 1024 * 1024 * 1024;  // Minimum size is 5 GB
    gpu_size_bytes = std::max(gpu_size_bytes, min_size_bytes);
  }
  return gpu_size_bytes;
}
```