
Principle: @mlc-ai/web-llm Model Selection

From Leeroopedia

Overview

Model Selection is the technique of choosing pre-compiled machine learning models from a curated registry for browser-based inference using the WebGPU runtime. In the context of @mlc-ai/web-llm, model selection determines which model weights, compiled WASM libraries, and configuration overrides are loaded into the browser's GPU memory.

Description

Model selection in browser-based LLM inference involves choosing from a registry of pre-compiled models that are compatible with the WebGPU runtime. Each model entry in the registry specifies:

  • model -- the HuggingFace URL from which to download model weights
  • model_id -- a unique string identifier used throughout the application to reference the model
  • model_lib -- the URL of the compiled WASM library that contains the model's compute kernels
  • vram_required_MB -- the estimated video memory required to run the model
  • overrides -- optional partial configuration to override default values from mlc-chat-config.json
  • required_features -- WebGPU features the device must support (e.g., shader-f16)
  • model_type -- whether the model is an LLM, VLM (vision-language model), or embedding model
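The registry entry fields above can be sketched as a TypeScript interface. This is an illustrative approximation of the shape described in the list, not the library's actual type definition, and the sample entry (URLs and VRAM figure included) is hypothetical:

```typescript
// Approximate shape of a registry entry, based on the fields listed above.
interface ModelRecord {
  model: string;             // HuggingFace URL for the model weights
  model_id: string;          // unique identifier used by the application
  model_lib: string;         // URL of the compiled WASM kernel library
  vram_required_MB?: number; // estimated VRAM needed to run the model
  low_resource_required?: boolean;     // suitable for constrained devices
  overrides?: Record<string, unknown>; // partial mlc-chat-config.json overrides
  required_features?: string[];        // e.g. ["shader-f16"]
  model_type?: "LLM" | "VLM" | "embedding";
}

// Hypothetical registry entry, for illustration only.
const llama1b: ModelRecord = {
  model: "https://huggingface.co/mlc-ai/Llama-3.2-1B-Instruct-q4f16_1-MLC",
  model_id: "Llama-3.2-1B-Instruct-q4f16_1-MLC",
  model_lib:
    "https://example.com/Llama-3.2-1B-Instruct-q4f16_1-ctx4k_cs1k-webgpu.wasm",
  vram_required_MB: 879,
  required_features: ["shader-f16"],
  model_type: "LLM",
};
```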

The selection process ensures the chosen model can be loaded given the device's capabilities. Models that require shader-f16 will not work on devices lacking that feature. Models flagged with low_resource_required: true are intended for constrained devices such as mobile phones.
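A minimal sketch of that compatibility check, assuming the required_features field described above. The device's feature set is passed in as a plain Set so the function can be exercised with any data; in a browser it could be populated from a real WebGPU adapter:

```typescript
// Returns true if the device supports every feature the model requires.
// A model with no required_features is compatible with any device.
function isCompatible(
  required: string[] | undefined,
  deviceFeatures: Set<string>,
): boolean {
  return (required ?? []).every((feature) => deviceFeatures.has(feature));
}

// In a browser, the feature set could come from WebGPU, e.g.:
//   const adapter = await navigator.gpu.requestAdapter();
//   const features = new Set(adapter.features as Iterable<string>);
```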

The registry includes models from multiple families:

  • Llama (3.2, 3.1, 3, 2) in various quantization levels (q4f16_1, q4f32_1, q0f16)
  • Phi (3.5-mini, 3.5-vision, 3-mini, 2, 1.5)
  • Gemma (2-2b, 2-9b, 2-2b-jpn)
  • Qwen (3, 2.5, 2.5-Coder, 2.5-Math, 2)
  • Mistral and Hermes variants with function calling support
  • SmolLM2 for ultra-lightweight deployment
  • Embedding models such as Snowflake Arctic Embed

Usage

Use model selection when initializing a browser-based LLM application that must choose which model to load. Common scenarios include:

  • Presenting a list of available models to the user in a dropdown or configuration UI
  • Programmatically selecting a model based on the device's available VRAM
  • Filtering models by type (LLM, embedding, VLM) for specific use cases
  • Checking whether a model supports function calling before enabling tool-use features
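The VRAM-based scenario above can be sketched as a simple filter over registry entries: keep the models that fit an estimated VRAM budget, then pick the largest. The entry shape follows the registry fields described earlier, and the candidate list here is hypothetical sample data:

```typescript
// Minimal candidate shape for VRAM-based selection.
interface Candidate {
  model_id: string;
  vram_required_MB: number;
}

// Pick the largest model whose estimated VRAM fits the budget,
// or undefined if nothing fits.
function pickByVram(models: Candidate[], budgetMB: number): Candidate | undefined {
  return models
    .filter((m) => m.vram_required_MB <= budgetMB)
    .sort((a, b) => b.vram_required_MB - a.vram_required_MB)[0];
}

// Hypothetical candidates with illustrative VRAM figures.
const candidates: Candidate[] = [
  { model_id: "SmolLM2-360M-Instruct-q4f16_1", vram_required_MB: 380 },
  { model_id: "Llama-3.2-1B-Instruct-q4f16_1", vram_required_MB: 880 },
  { model_id: "Llama-3.1-8B-Instruct-q4f16_1", vram_required_MB: 5000 },
];
```

Preferring the largest model that fits is one reasonable policy; an application might instead prefer low_resource_required models on mobile devices.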

The model registry provides compatibility guarantees between model weights and WASM libraries. A given model_id is always paired with a known-compatible model_lib, preventing mismatches that would cause runtime failures.

Theoretical Basis

Model registries map model identifiers to their artifacts (weights, compiled libraries) and hardware requirements. The registry pattern decouples model discovery from model loading, allowing applications to present model choices to users or programmatically select based on device constraints.

Key considerations in model selection include:

  • Quantization tradeoffs -- Lower-bit quantizations (q4f16_1) require less VRAM but may reduce quality; higher-precision quantizations (q0f16, q0f32) require more memory but preserve model fidelity
  • Context window sizing -- Models can be configured with different context window sizes (e.g., 1024 vs 4096 tokens), trading memory for sequence length capacity
  • Feature requirements -- Some quantizations require shader-f16 support, which is not universally available across WebGPU implementations
  • Model library sharing -- Multiple models with the same architecture can share a single compiled WASM library, reducing download overhead when switching between models of the same family

The model_lib URL encodes the model architecture, quantization format, context window size, and chunk size in its filename (e.g., Llama-3.2-1B-Instruct-q4f16_1-ctx4k_cs1k-webgpu.wasm), making artifact compatibility explicit.
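Those encoded fields can be recovered with a small parser. This sketch assumes the "<model>-<quantization>-ctx<N>_cs<N>-webgpu.wasm" naming convention shown in the example above; filenames that deviate from it return null:

```typescript
// Decode architecture, quantization, context window, and chunk size
// from a model_lib filename following the convention above.
function parseModelLib(url: string) {
  const file = url.split("/").pop() ?? url;
  const match = file.match(
    /^(.*)-(q\d+f\d+(?:_\d+)?)-ctx(\w+?)_cs(\w+?)-webgpu\.wasm$/,
  );
  if (!match) return null;
  return {
    model: match[1],          // e.g. "Llama-3.2-1B-Instruct"
    quantization: match[2],   // e.g. "q4f16_1"
    contextWindow: match[3],  // e.g. "4k"
    chunkSize: match[4],      // e.g. "1k"
  };
}
```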
