Principle:Mlc ai Web llm Multi Model Routing
Overview
Multi-Model Routing is the pattern for routing inference requests to the correct model when multiple models are loaded concurrently in a single engine instance. This enables use cases such as running an LLM for chat and an embedding model for semantic search side by side within the same application.
Description
The web-llm engine supports loading multiple models simultaneously via reload(modelId: string[]). When multiple models are loaded, every API call must specify which model to target via the model field in the request. The routing logic resolves which pipeline, configuration, and state to use for each request.
Model Selection Logic
The core model selection is handled by getModelIdToUse() in src/support.ts. The function follows these rules (see the sketch after this list):
- No models loaded: throws ModelNotLoadedError
- Model specified in request: validates that it exists among the loaded models. If found, selects it; if not, throws SpecifiedModelNotFoundError
- Model not specified, single model loaded: automatically selects the only loaded model (no ambiguity)
- Model not specified, multiple models loaded: throws UnclearModelToUseError -- the caller must specify a model
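A minimal sketch of these rules in TypeScript, using plain Error objects in place of the library's dedicated error classes (the real implementation in src/support.ts differs in detail):

```typescript
// Sketch of the selection rules described above; the actual function in
// src/support.ts throws the library's own error classes instead of Error.
function getModelIdToUse(
  loadedModelIds: string[],
  requestModel: string | null | undefined,
  requestName: string, // e.g. "ChatCompletionRequest", used in error messages
): string {
  if (loadedModelIds.length === 0) {
    throw new Error(`ModelNotLoadedError: no model loaded for ${requestName}`);
  }
  if (requestModel) {
    // A model was specified: it must be one of the loaded models.
    if (loadedModelIds.includes(requestModel)) {
      return requestModel;
    }
    throw new Error(
      `SpecifiedModelNotFoundError: ${requestModel} not in [${loadedModelIds.join(", ")}]`,
    );
  }
  // No model specified: unambiguous only when exactly one model is loaded.
  if (loadedModelIds.length === 1) {
    return loadedModelIds[0];
  }
  throw new Error(
    `UnclearModelToUseError: multiple models loaded; set request.model for ${requestName}`,
  );
}
```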
This logic is invoked at multiple layers:
- Proxy layer (WebWorkerMLCEngine): before sending requests to the worker, to determine selectedModelId for generator keying
- Engine layer (MLCEngine): inside getModelStates() to select the correct pipeline and config
- Worker handler layer (WebWorkerMLCEngineHandler): implicitly, through the engine's own routing when processing requests
Per-Model State Isolation
Each loaded model maintains isolated state through several maps in MLCEngine:
| Map | Key | Value | Purpose |
|---|---|---|---|
| loadedModelIdToPipeline | string | LLMChatPipeline \| EmbeddingPipeline | The inference pipeline |
| loadedModelIdToChatConfig | string | ChatConfig | Model-specific configuration |
| loadedModelIdToModelType | string | ModelType | LLM vs. embedding |
| loadedModelIdToLock | string | CustomLock | Per-model request serialization |
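Conceptually, once a request has been routed to a selectedModelId, the engine looks up that model's pipeline and configuration in these maps. A minimal sketch with stand-in types (the real getModelStates() in MLCEngine also performs the model selection itself and uses the library's pipeline classes):

```typescript
// Stand-in types for illustration; the engine uses LLMChatPipeline,
// EmbeddingPipeline, and ChatConfig from the library.
type Pipeline = { modelType: "LLM" | "embedding" };
type ChatConfig = Record<string, unknown>;

// Sketch: resolve the per-model state for an already-selected model ID.
function lookupModelStates(
  selectedModelId: string,
  loadedModelIdToPipeline: Map<string, Pipeline>,
  loadedModelIdToChatConfig: Map<string, ChatConfig>,
): { pipeline: Pipeline; chatConfig: ChatConfig } {
  const pipeline = loadedModelIdToPipeline.get(selectedModelId);
  const chatConfig = loadedModelIdToChatConfig.get(selectedModelId);
  if (pipeline === undefined || chatConfig === undefined) {
    throw new Error(`No loaded state for model ${selectedModelId}`);
  }
  return { pipeline, chatConfig };
}
```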
In the worker handler, streaming generators are also maintained per-model:
| Map | Key | Value | Purpose |
|---|---|---|---|
| loadedModelIdToAsyncGenerator | string | AsyncGenerator<ChatCompletionChunk \| Completion> | Active streaming generator |
This per-model isolation ensures that:
- Requests to different models can be processed concurrently (different locks)
- Each model maintains its own KV cache, conversation state, and generation statistics
- Streaming from one model does not interfere with another model's generator
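For example, with the two models from the Usage section below loaded into one engine, a chat request and an embedding request can be issued concurrently; each acquires only its own model's lock:

```typescript
// The two requests target different models, so they use different per-model
// locks and pipelines and can be processed without blocking each other.
const [chatResponse, embedResponse] = await Promise.all([
  engine.chat.completions.create({
    messages: [{ role: "user", content: "Summarize this page." }],
    model: "Llama-3.1-8B-Instruct-q4f16_1-MLC",
  }),
  engine.embeddings.create({
    input: "Hello world",
    model: "snowflake-arctic-embed-s-q0f32-MLC",
  }),
]);
```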
Routing Through the Worker Boundary
When operating through a Web Worker, model routing happens on both sides of the boundary:
Main thread (proxy):
- Resolves selectedModelId from this.modelId[] and request.model
- Sends selectedModelId with streaming init messages
- Keys the proxy-side streaming on selectedModelId
Worker thread (handler):
- Receives the full modelId[] and chatOpts[] with each request
- Calls reloadIfUnmatched() to ensure state consistency
- Delegates to MLCEngine, which performs its own model selection internally
- Keys the worker-side generator map on selectedModelId
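A sketch of the proxy-side step, reusing the getModelIdToUse sketch from above; apart from selectedModelId, the message fields shown here are illustrative only (the real message types live in the library):

```typescript
// Main-thread proxy: resolve the target model before crossing the worker
// boundary so both sides key their streaming state on the same model ID.
// Field names other than selectedModelId are hypothetical.
function buildChatCompletionMessage(
  loadedModelIds: string[],
  request: { model?: string | null; stream?: boolean },
) {
  const selectedModelId = getModelIdToUse(
    loadedModelIds,
    request.model,
    "ChatCompletionRequest",
  );
  return {
    kind: request.stream ? "chatCompletionStreamInit" : "chatCompletionNonStreaming",
    content: { request, selectedModelId },
  };
}
```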
Model Type Enforcement
The engine enforces that the correct pipeline type is used for each API:
- chatCompletion() and completion() require LLMChatPipeline -- IncorrectPipelineLoadedError is thrown otherwise
- embedding() requires EmbeddingPipeline and ModelType.embedding -- EmbeddingUnsupportedModelError is thrown otherwise
This prevents accidental misuse such as sending chat requests to an embedding model.
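As an illustration of the kind of guard this implies (the real check lives inside MLCEngine and throws the library's error classes; the model-type map is the one from the table above):

```typescript
// Illustrative guard only: a chat request must resolve to a model that is
// backed by an LLM pipeline, not an embedding pipeline.
function assertChatCapable(
  selectedModelId: string,
  loadedModelIdToModelType: Map<string, "LLM" | "embedding">,
): void {
  if (loadedModelIdToModelType.get(selectedModelId) !== "LLM") {
    // The engine signals this case with IncorrectPipelineLoadedError.
    throw new Error(
      `IncorrectPipelineLoadedError: ${selectedModelId} is not backed by an LLM pipeline`,
    );
  }
}
```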
Usage
Loading multiple models:
```typescript
const engine = await CreateWebWorkerMLCEngine(
  worker,
  ["Llama-3.1-8B-Instruct-q4f16_1-MLC", "snowflake-arctic-embed-s-q0f32-MLC"]
);
```
Routing chat to the LLM:
```typescript
const chatResponse = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello!" }],
  model: "Llama-3.1-8B-Instruct-q4f16_1-MLC", // Required when multiple models loaded
});
```
Routing embedding to the embedding model:
```typescript
const embedResponse = await engine.embeddings.create({
  input: "Hello world",
  model: "snowflake-arctic-embed-s-q0f32-MLC", // Required when multiple models loaded
});
```
Single model (no routing needed):
```typescript
const engine = await CreateWebWorkerMLCEngine(
  worker,
  "Llama-3.1-8B-Instruct-q4f16_1-MLC"
);

// model field is optional when only one model is loaded
const response = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello!" }],
});
```
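Conversely, with the two-model engine from the first example, omitting the model field is ambiguous and fails fast rather than guessing:

```typescript
// With two models loaded and no model specified, routing cannot be resolved
// and the engine throws UnclearModelToUseError.
try {
  await engine.chat.completions.create({
    messages: [{ role: "user", content: "Hello!" }],
    // model intentionally omitted
  });
} catch (err) {
  console.error("Routing failed:", err); // UnclearModelToUseError
}
```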
Theoretical Basis
The multi-model routing pattern implements a form of virtual dispatch based on the model field in requests. Conceptually, the engine acts as a multiplexer: a single entry point that fans out to multiple isolated processing pipelines.
The routing function getModelIdToUse() implements the following decision tree:
```
getModelIdToUse(loadedModelIds, requestModel, requestName)
|
+-- loadedModelIds.length === 0
|     -> throw ModelNotLoadedError
|
+-- requestModel is specified
|     |
|     +-- requestModel in loadedModelIds
|     |     -> return requestModel
|     |
|     +-- requestModel NOT in loadedModelIds
|           -> throw SpecifiedModelNotFoundError
|
+-- requestModel is NOT specified
      |
      +-- loadedModelIds.length === 1
      |     -> return loadedModelIds[0]
      |
      +-- loadedModelIds.length > 1
            -> throw UnclearModelToUseError
```
Key constraints:
- Uniqueness: model IDs must be unique when loading multiple models (ReloadModelIdNotUniqueError is thrown otherwise)
- Sequential loading: models are loaded one at a time during reload() to avoid GPU memory conflicts
- Per-model locking: each model has its own CustomLock, so requests to different models do not block each other while requests to the same model are serialized (see the sketch below)
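CustomLock itself is internal to the library; purely as an illustration of per-model serialization, a minimal promise-chaining lock keyed by model ID behaves the same way:

```typescript
// Minimal illustration: each model ID gets its own promise chain, so two
// requests to the same model run one after the other while requests to
// different models proceed independently.
class SimpleLock {
  private tail: Promise<void> = Promise.resolve();

  runExclusive<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task);
    // Keep the chain alive even if the task rejects.
    this.tail = result.then(
      () => undefined,
      () => undefined,
    );
    return result;
  }
}

const locks = new Map<string, SimpleLock>();

function withModelLock<T>(modelId: string, task: () => Promise<T>): Promise<T> {
  if (!locks.has(modelId)) locks.set(modelId, new SimpleLock());
  return locks.get(modelId)!.runExclusive(task);
}
```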
Helper Functions for Per-Model Queries
Several engine methods accept an optional modelId parameter for querying model-specific state:
- getMessage(modelId?) -- get the current generated message for a specific model
- runtimeStatsText(modelId?) -- get runtime statistics for a specific model
- resetChat(keepStats?, modelId?) -- reset chat state for a specific model
- forwardTokensAndSample(inputIds, isPrefill, modelId?) -- low-level forwarding for a specific model
These all internally use getModelIdToUse() to resolve the target model.
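For example, with multiple models loaded, each helper can be pointed at one model explicitly (omitting modelId in that situation would raise UnclearModelToUseError):

```typescript
// Query and reset state for the LLM only; the embedding model is untouched.
const llmId = "Llama-3.1-8B-Instruct-q4f16_1-MLC";

console.log(await engine.runtimeStatsText(llmId)); // stats for this model only
console.log(await engine.getMessage(llmId));       // its latest generated message
await engine.resetChat(false, llmId);              // reset only this model's chat state
```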
Related Pages
- Implementation:Mlc_ai_Web_llm_Get_Message_Model_Routing -- Concrete implementation of model-routed helper functions
- Principle:Mlc_ai_Web_llm_Web_Worker_Engine_Handler -- Worker handler that maintains per-model generators
- Principle:Mlc_ai_Web_llm_Web_Worker_Engine_Proxy -- Proxy that performs client-side model selection
- Principle:Mlc_ai_Web_llm_Cross_Thread_Request_Forwarding -- How model selection integrates with request forwarding
- Principle:Mlc_ai_Web_llm_Cross_Thread_Streaming -- Per-model generator management for streaming