# Heuristic: mlc-ai/web-llm Low-Resource Model Selection
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Deployment, Mobile |
| Last Updated | 2026-02-14 22:00 GMT |
## Overview
Model selection strategy for VRAM-constrained devices using the `low_resource_required` flag, `vram_required_MB` field, and `maxStorageBufferBindingSize` detection.
## Description
WebLLM's model registry tags each model with three resource-related fields: `low_resource_required` (boolean indicating whether the model can run on mobile/limited devices), `vram_required_MB` (the model's VRAM requirement in megabytes), and `buffer_size_required_bytes` (required GPU buffer size). The engine also detects the device's `maxStorageBufferBindingSize` at runtime. When this limit is below 1 GB (common on mobile), only a handful of `-1k` context models work. Applications should use these fields to filter the model-selection UI and prevent users from attempting to load models that will fail with device-lost errors.
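As a concrete illustration, a selection UI can filter the registry on exactly these fields. The sketch below is a minimal, self-contained version: the `ModelRecordLike` interface mirrors the resource fields of WebLLM's `ModelRecord`, and the sample entries are illustrative stand-ins for the real registry.

```typescript
// Sketch: filter a WebLLM-style model list for a constrained device.
// ModelRecordLike mirrors web-llm's ModelRecord resource fields (assumption:
// you would apply this to prebuiltAppConfig.model_list or your own AppConfig).
interface ModelRecordLike {
  model_id: string;
  vram_required_MB?: number;
  low_resource_required?: boolean;
  buffer_size_required_bytes?: number;
}

const ONE_GB = 1 << 30;

function selectableModels(
  models: ModelRecordLike[],
  maxStorageBufferBindingSize: number,
): ModelRecordLike[] {
  // Desktop-class GPUs (limit >= 1 GB) can attempt any model.
  if (maxStorageBufferBindingSize >= ONE_GB) return models;
  // Constrained devices: keep only low-resource models whose declared
  // buffer requirement (when present) fits under the device limit.
  return models.filter(
    (m) =>
      m.low_resource_required === true &&
      (m.buffer_size_required_bytes === undefined ||
        m.buffer_size_required_bytes <= maxStorageBufferBindingSize),
  );
}

// Illustrative entries, not the real registry contents.
const demoList: ModelRecordLike[] = [
  { model_id: "Llama-3.1-8B-Instruct-q4f16_1-MLC", low_resource_required: false },
  { model_id: "Llama-3.2-1B-Instruct-q4f16_1-MLC-1k", low_resource_required: true },
];
```

Filtering up front keeps incompatible models out of the UI entirely, which is cheaper than recovering from a failed load.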
## Usage
Use this heuristic when building model-selection UIs, deploying to heterogeneous devices (desktop and mobile), or handling device-lost errors by suggesting fallback models.
## The Insight (Rule of Thumb)
- Action 1: Query `engine.getMaxStorageBufferBindingSize()` before model selection to detect device capabilities.
- Action 2: If `maxStorageBufferBindingSize < 1 GB`, filter models to only show those with `low_resource_required: true`.
- Action 3: After a `DeviceLostError`, suggest reloading with a smaller model or a `-1k` context variant.
- Value: Models tagged `low_resource_required: true` include Llama-3.2-1B variants (~880-1130 MB VRAM) and Llama-3.2-3B variants (~2200-2950 MB VRAM).
- Trade-off: Low-resource models have fewer parameters and smaller context windows, which reduces output quality and maximum conversation length.
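The actions above can be combined into a simple budget search: prefer the largest model that fits, and fall through the trade-off ladder otherwise. The candidate list below uses the VRAM figures quoted above; treat both the IDs and the numbers as illustrative.

```typescript
// Sketch: choose the most capable model that fits an estimated VRAM budget.
// The budget itself is an app-side estimate; WebGPU does not report VRAM.
interface Candidate {
  model_id: string;
  vram_required_MB: number;
}

function pickModel(candidates: Candidate[], budgetMB: number): Candidate | undefined {
  // Prefer larger models (better output quality) while staying under budget.
  return [...candidates]
    .sort((a, b) => b.vram_required_MB - a.vram_required_MB)
    .find((c) => c.vram_required_MB <= budgetMB);
}

// Illustrative ladder using the VRAM ranges quoted above.
const ladder: Candidate[] = [
  { model_id: "Llama-3.2-3B-Instruct-q4f16_1-MLC", vram_required_MB: 2950 },
  { model_id: "Llama-3.2-1B-Instruct-q4f16_1-MLC", vram_required_MB: 1130 },
  { model_id: "Llama-3.2-1B-Instruct-q4f16_1-MLC-1k", vram_required_MB: 880 },
];
```

Returning `undefined` when nothing fits lets the caller surface a clear "no compatible model" message instead of attempting a doomed load.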
## Reasoning
Mobile GPUs and integrated graphics typically have strict buffer size limits (128-256 MB `maxStorageBufferBindingSize`) compared to desktop GPUs (2-4+ GB). Model weights and KV cache are stored in GPU storage buffers, so a model requiring buffers larger than the device limit will fail at load time. The `low_resource_required` flag pre-computes this compatibility check. The `-1k` suffix models use 1024-token context windows instead of 4096, reducing KV cache allocation by ~75%.
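The ~75% figure follows directly from the KV cache growing linearly with context length. A back-of-envelope sizing makes this concrete; the layer/head/dimension values below are illustrative, not tied to any specific model.

```typescript
// Back-of-envelope KV-cache sizing. The cache stores one key and one value
// vector per layer per token, so size scales linearly with context length:
// shrinking 4096 -> 1024 tokens cuts the allocation by exactly 75%.
function kvCacheBytes(
  layers: number,
  kvHeads: number,
  headDim: number,
  contextLen: number,
  bytesPerElem: number,
): number {
  return 2 * layers * kvHeads * headDim * contextLen * bytesPerElem; // 2 = K and V
}

// Illustrative configuration: 32 layers, 8 KV heads, head dim 128, f16 (2 bytes).
const full = kvCacheBytes(32, 8, 128, 4096, 2); // 512 MiB
const small = kvCacheBytes(32, 8, 128, 1024, 2); // 128 MiB
```

With these illustrative numbers the full-context cache alone (512 MiB) would exceed a typical mobile 128-256 MB binding limit, which is exactly why the `-1k` variants exist.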
maxStorageBufferBindingSize detection from `src/engine.ts:1136-1162`:
```typescript
const maxStorageBufferBindingSize =
  gpuDetectOutput.device.limits.maxStorageBufferBindingSize;
const defaultMaxStorageBufferBindingSize = 1 << 30; // 1GB
if (maxStorageBufferBindingSize < defaultMaxStorageBufferBindingSize) {
  log.warn(
    `WARNING: the current maxStorageBufferBindingSize ` +
      `(${computeMB(maxStorageBufferBindingSize)}) ` +
      `may only work for a limited number of models, e.g.: \n` +
      `- Llama-3.1-8B-Instruct-q4f16_1-MLC-1k \n` +
      `- TinyLlama-1.1B-Chat-v0.4-q4f16_1-MLC-1k`,
  );
}
```
ModelRecord resource fields from `src/config.ts:248-264`:
```typescript
export interface ModelRecord {
  model: string;
  model_id: string;
  model_lib: string;
  overrides?: ChatOptions;
  vram_required_MB?: number;
  low_resource_required?: boolean;
  buffer_size_required_bytes?: number;
  required_features?: Array<string>;
  model_type?: ModelType;
}
```
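A per-record compatibility check against detected device limits can use these fields directly. In the sketch below, `availableVramMB` is an assumed app-side estimate (WebGPU exposes `maxStorageBufferBindingSize` but not total VRAM, so applications must approximate the VRAM budget themselves), and `RecordLike` mirrors the interface above.

```typescript
// Sketch: decide whether a single registry record fits the current device.
// availableVramMB is a hypothetical app-supplied estimate, not a WebGPU value.
interface DeviceLimits {
  maxStorageBufferBindingSize: number;
  availableVramMB: number;
}

interface RecordLike {
  model_id: string;
  vram_required_MB?: number;
  low_resource_required?: boolean;
  buffer_size_required_bytes?: number;
}

function isCompatible(rec: RecordLike, dev: DeviceLimits): boolean {
  // Reject when the declared buffer requirement exceeds the binding limit.
  if (
    rec.buffer_size_required_bytes !== undefined &&
    rec.buffer_size_required_bytes > dev.maxStorageBufferBindingSize
  ) {
    return false;
  }
  // Reject when the declared VRAM requirement exceeds the estimated budget.
  return !(rec.vram_required_MB !== undefined && rec.vram_required_MB > dev.availableVramMB);
}
```

Records that omit a field pass that check by default, matching the registry's use of optional fields.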
DeviceLostError guidance from `src/error.ts:267-273`:
```typescript
export class DeviceLostError extends Error {
  constructor() {
    super(
      "The WebGPU device was lost while loading the model. This issue often " +
        "occurs due to running out of memory (OOM). To resolve this, try " +
        "reloading with a model that has fewer parameters or uses a smaller " +
        "context length.",
    );
  }
}
```
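Acting on that guidance, an application can walk a fallback ladder of progressively smaller models when a load fails. In this sketch `tryLoad` is injected so the flow is testable without a GPU; in a real app it would wrap the engine's model-load call (e.g. a reload with the given model ID) and rethrow anything that is not a device-loss/OOM failure.

```typescript
// Sketch: recover from an OOM-style load failure by trying smaller models.
// tryLoad is a hypothetical injected loader; it resolves on success and
// rejects on failure (e.g. a DeviceLostError from the engine).
async function loadWithFallback(
  ladder: string[],
  tryLoad: (modelId: string) => Promise<void>,
): Promise<string> {
  let lastError: unknown;
  for (const modelId of ladder) {
    try {
      await tryLoad(modelId);
      return modelId; // loaded successfully
    } catch (err) {
      lastError = err; // device lost / OOM: fall through to a smaller model
    }
  }
  throw lastError; // nothing fit; surface the final failure to the caller
}
```

Ordering the ladder from most to least capable gives users the best model their device can actually hold, with `-1k` context variants as the last resort.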