# Heuristic: mlc-ai/web-llm Low-Resource Model Selection
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Deployment, Mobile |
| Last Updated | 2026-02-14 22:00 GMT |
## Overview
Model selection strategy for VRAM-constrained devices using the `low_resource_required` flag, `vram_required_MB` field, and `maxStorageBufferBindingSize` detection.
## Description
WebLLM's model registry tags each model with three resource-related fields: `low_resource_required` (boolean indicating whether the model can run on mobile/limited devices), `vram_required_MB` (the model's VRAM requirement in megabytes), and `buffer_size_required_bytes` (required GPU buffer size). The engine also detects the device's `maxStorageBufferBindingSize` at runtime. When this limit is below 1 GB (common on mobile), only a handful of `-1k` context models work. Applications should use these fields to filter the model-selection UI and prevent users from attempting to load models that will fail with device-lost errors.
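As a concrete illustration, a selection UI can filter the registry on exactly these fields. The sketch below is a minimal, self-contained version: the `ModelRecordLike` interface mirrors the resource fields of WebLLM's `ModelRecord`, and the sample entries are illustrative stand-ins for the real registry.

```typescript
// Sketch: filter a WebLLM-style model list for a constrained device.
// ModelRecordLike mirrors web-llm's ModelRecord resource fields (assumption:
// you would apply this to prebuiltAppConfig.model_list or your own AppConfig).
interface ModelRecordLike {
  model_id: string;
  vram_required_MB?: number;
  low_resource_required?: boolean;
  buffer_size_required_bytes?: number;
}

const ONE_GB = 1 << 30;

function selectableModels(
  models: ModelRecordLike[],
  maxStorageBufferBindingSize: number,
): ModelRecordLike[] {
  // Desktop-class GPUs (limit >= 1 GB) can attempt any model.
  if (maxStorageBufferBindingSize >= ONE_GB) return models;
  // Constrained devices: keep only low-resource models whose declared
  // buffer requirement (when present) fits under the device limit.
  return models.filter(
    (m) =>
      m.low_resource_required === true &&
      (m.buffer_size_required_bytes === undefined ||
        m.buffer_size_required_bytes <= maxStorageBufferBindingSize),
  );
}

// Illustrative entries, not the real registry contents.
const demoList: ModelRecordLike[] = [
  { model_id: "Llama-3.1-8B-Instruct-q4f16_1-MLC", low_resource_required: false },
  { model_id: "Llama-3.2-1B-Instruct-q4f16_1-MLC-1k", low_resource_required: true },
];
```

Filtering up front keeps incompatible models out of the UI entirely, which is cheaper than recovering from a failed load.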
## Usage
Use this heuristic when building model-selection UIs, deploying to heterogeneous devices (desktop and mobile), or handling device-lost errors by suggesting fallback models.
## The Insight (Rule of Thumb)
- Action 1: Query `engine.getMaxStorageBufferBindingSize()` before model selection to detect device capabilities.
- Action 2: If `maxStorageBufferBindingSize < 1 GB`, filter models to only show those with `low_resource_required: true`.
- Action 3: After a `DeviceLostError`, suggest reloading with a smaller model or a `-1k` context variant.
- Value: Models tagged `low_resource_required: true` include Llama-3.2-1B variants (~880-1130 MB VRAM) and Llama-3.2-3B variants (~2200-2950 MB VRAM).
- Trade-off: Low-resource models have fewer parameters and smaller context windows, which reduces output quality and maximum conversation length.
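The actions above can be combined into a simple budget search: prefer the largest model that fits, and fall through the trade-off ladder otherwise. The candidate list below uses the VRAM figures quoted above; treat both the IDs and the numbers as illustrative.

```typescript
// Sketch: choose the most capable model that fits an estimated VRAM budget.
// The budget itself is an app-side estimate; WebGPU does not report VRAM.
interface Candidate {
  model_id: string;
  vram_required_MB: number;
}

function pickModel(candidates: Candidate[], budgetMB: number): Candidate | undefined {
  // Prefer larger models (better output quality) while staying under budget.
  return [...candidates]
    .sort((a, b) => b.vram_required_MB - a.vram_required_MB)
    .find((c) => c.vram_required_MB <= budgetMB);
}

// Illustrative ladder using the VRAM ranges quoted above.
const ladder: Candidate[] = [
  { model_id: "Llama-3.2-3B-Instruct-q4f16_1-MLC", vram_required_MB: 2950 },
  { model_id: "Llama-3.2-1B-Instruct-q4f16_1-MLC", vram_required_MB: 1130 },
  { model_id: "Llama-3.2-1B-Instruct-q4f16_1-MLC-1k", vram_required_MB: 880 },
];
```

Returning `undefined` when nothing fits lets the caller surface a clear "no compatible model" message instead of attempting a doomed load.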
## Reasoning
Mobile GPUs and integrated graphics typically have strict buffer size limits (128-256 MB `maxStorageBufferBindingSize`) compared to desktop GPUs (2-4+ GB). Model weights and KV cache are stored in GPU storage buffers, so a model requiring buffers larger than the device limit will fail at load time. The `low_resource_required` flag pre-computes this compatibility check. The `-1k` suffix models use 1024-token context windows instead of 4096, reducing KV cache allocation by ~75%.
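The ~75% figure follows directly from the KV cache growing linearly with context length. A back-of-envelope sizing makes this concrete; the layer/head/dimension values below are illustrative, not tied to any specific model.

```typescript
// Back-of-envelope KV-cache sizing. The cache stores one key and one value
// vector per layer per token, so size scales linearly with context length:
// shrinking 4096 -> 1024 tokens cuts the allocation by exactly 75%.
function kvCacheBytes(
  layers: number,
  kvHeads: number,
  headDim: number,
  contextLen: number,
  bytesPerElem: number,
): number {
  return 2 * layers * kvHeads * headDim * contextLen * bytesPerElem; // 2 = K and V
}

// Illustrative configuration: 32 layers, 8 KV heads, head dim 128, f16 (2 bytes).
const full = kvCacheBytes(32, 8, 128, 4096, 2); // 512 MiB
const small = kvCacheBytes(32, 8, 128, 1024, 2); // 128 MiB
```

With these illustrative numbers the full-context cache alone (512 MiB) would exceed a typical mobile 128-256 MB binding limit, which is exactly why the `-1k` variants exist.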
maxStorageBufferBindingSize detection from `src/engine.ts:1136-1162`:
```typescript
const maxStorageBufferBindingSize =
  gpuDetectOutput.device.limits.maxStorageBufferBindingSize;
const defaultMaxStorageBufferBindingSize = 1 << 30; // 1GB
if (maxStorageBufferBindingSize < defaultMaxStorageBufferBindingSize) {
  log.warn(
    `WARNING: the current maxStorageBufferBindingSize ` +
      `(${computeMB(maxStorageBufferBindingSize)}) ` +
      `may only work for a limited number of models, e.g.: \n` +
      `- Llama-3.1-8B-Instruct-q4f16_1-MLC-1k \n` +
      `- TinyLlama-1.1B-Chat-v0.4-q4f16_1-MLC-1k`,
  );
}
```
ModelRecord resource fields from `src/config.ts:248-264`:
```typescript
export interface ModelRecord {
  model: string;
  model_id: string;
  model_lib: string;
  overrides?: ChatOptions;
  vram_required_MB?: number;
  low_resource_required?: boolean;
  buffer_size_required_bytes?: number;
  required_features?: Array<string>;
  model_type?: ModelType;
}
```
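A per-record compatibility check against detected device limits can use these fields directly. In the sketch below, `availableVramMB` is an assumed app-side estimate (WebGPU exposes `maxStorageBufferBindingSize` but not total VRAM, so applications must approximate the VRAM budget themselves), and `RecordLike` mirrors the interface above.

```typescript
// Sketch: decide whether a single registry record fits the current device.
// availableVramMB is a hypothetical app-supplied estimate, not a WebGPU value.
interface DeviceLimits {
  maxStorageBufferBindingSize: number;
  availableVramMB: number;
}

interface RecordLike {
  model_id: string;
  vram_required_MB?: number;
  low_resource_required?: boolean;
  buffer_size_required_bytes?: number;
}

function isCompatible(rec: RecordLike, dev: DeviceLimits): boolean {
  // Reject when the declared buffer requirement exceeds the binding limit.
  if (
    rec.buffer_size_required_bytes !== undefined &&
    rec.buffer_size_required_bytes > dev.maxStorageBufferBindingSize
  ) {
    return false;
  }
  // Reject when the declared VRAM requirement exceeds the estimated budget.
  return !(rec.vram_required_MB !== undefined && rec.vram_required_MB > dev.availableVramMB);
}
```

Records that omit a field pass that check by default, matching the registry's use of optional fields.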
DeviceLostError guidance from `src/error.ts:267-273`:
```typescript
export class DeviceLostError extends Error {
  constructor() {
    super(
      "The WebGPU device was lost while loading the model. This issue often " +
        "occurs due to running out of memory (OOM). To resolve this, try " +
        "reloading with a model that has fewer parameters or uses a smaller " +
        "context length.",
    );
  }
}
```
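Acting on that guidance, an application can walk a fallback ladder of progressively smaller models when a load fails. In this sketch `tryLoad` is injected so the flow is testable without a GPU; in a real app it would wrap the engine's model-load call (e.g. a reload with the given model ID) and rethrow anything that is not a device-loss/OOM failure.

```typescript
// Sketch: recover from an OOM-style load failure by trying smaller models.
// tryLoad is a hypothetical injected loader; it resolves on success and
// rejects on failure (e.g. a DeviceLostError from the engine).
async function loadWithFallback(
  ladder: string[],
  tryLoad: (modelId: string) => Promise<void>,
): Promise<string> {
  let lastError: unknown;
  for (const modelId of ladder) {
    try {
      await tryLoad(modelId);
      return modelId; // loaded successfully
    } catch (err) {
      lastError = err; // device lost / OOM: fall through to a smaller model
    }
  }
  throw lastError; // nothing fit; surface the final failure to the caller
}
```

Ordering the ladder from most to least capable gives users the best model their device can actually hold, with `-1k` context variants as the last resort.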