
Principle:Mlc ai Web llm Multi Model Routing

From Leeroopedia


Overview

Multi-Model Routing is the pattern that directs each inference request to the correct model when multiple models are loaded concurrently in a single engine instance. It enables use cases such as running an LLM for chat and an embedding model for semantic search within the same application, using a single engine.

Description

The web-llm engine supports loading multiple models simultaneously via reload(modelId: string[]). When multiple models are loaded, every API call must specify which model to target via the model field in the request. The routing logic resolves which pipeline, configuration, and state to use for each request.

Model Selection Logic

The core model selection is handled by getModelIdToUse() in src/support.ts. The function follows these rules:

  1. No models loaded: Throws ModelNotLoadedError
  2. Model specified in request: Validates it exists among loaded models. If found, selects it; if not, throws SpecifiedModelNotFoundError
  3. Model not specified, single model loaded: Automatically selects the only loaded model (no ambiguity)
  4. Model not specified, multiple models loaded: Throws UnclearModelToUseError -- the caller must specify
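The four rules above can be sketched as a standalone function. This is an illustrative reimplementation, not the actual code in src/support.ts: the real function's signature may differ, and plain Error is used here in place of the named error classes.

```typescript
// Illustrative sketch of the selection rules; the real getModelIdToUse()
// in src/support.ts throws dedicated error classes instead of plain Error.
function getModelIdToUse(
  loadedModelIds: string[],
  requestModel: string | undefined,
  requestName: string,
): string {
  // Rule 1: no models loaded at all.
  if (loadedModelIds.length === 0) {
    throw new Error(`ModelNotLoadedError: no model loaded for ${requestName}`);
  }
  // Rule 2: an explicit model must match a loaded one.
  if (requestModel !== undefined) {
    if (loadedModelIds.includes(requestModel)) return requestModel;
    throw new Error(
      `SpecifiedModelNotFoundError: ${requestModel} is not loaded`,
    );
  }
  // Rule 3: omission is fine when there is no ambiguity.
  if (loadedModelIds.length === 1) return loadedModelIds[0];
  // Rule 4: ambiguous -- the caller must specify.
  throw new Error(
    `UnclearModelToUseError: specify request.model for ${requestName}`,
  );
}
```

An explicit model field is always validated strictly; omitting it is tolerated only when there is exactly one candidate.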

This logic is invoked at multiple layers:

  • Proxy layer (WebWorkerMLCEngine): Before sending requests to the worker, to determine selectedModelId for generator keying
  • Engine layer (MLCEngine): Inside getModelStates() to select the correct pipeline and config
  • Worker handler layer (WebWorkerMLCEngineHandler): Implicitly through the engine's own routing when processing requests

Per-Model State Isolation

Each loaded model maintains isolated state through several maps in MLCEngine:

Map                        Key      Value                                Purpose
loadedModelIdToPipeline    string   LLMChatPipeline | EmbeddingPipeline  The inference pipeline
loadedModelIdToChatConfig  string   ChatConfig                           Model-specific configuration
loadedModelIdToModelType   string   ModelType                            LLM vs. embedding
loadedModelIdToLock        string   CustomLock                           Per-model request serialization

In the worker handler, streaming generators are also maintained per-model:

Map                            Key     Value                                             Purpose
loadedModelIdToAsyncGenerator  string  AsyncGenerator<ChatCompletionChunk | Completion>  Active streaming generator

This per-model isolation ensures that:

  • Requests to different models can be processed concurrently (different locks)
  • Each model maintains its own KV cache, conversation state, and generation statistics
  • Streaming from one model does not interfere with another model's generator
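The per-model locking behavior can be sketched with a minimal promise-chain lock. SimpleLock and lockFor below are illustrative stand-ins, assuming simpler semantics than web-llm's actual CustomLock and its loadedModelIdToLock map.

```typescript
// Minimal promise-chain lock: callers queue behind the previous holder.
class SimpleLock {
  private tail: Promise<void> = Promise.resolve();

  async runExclusive<T>(fn: () => Promise<T>): Promise<T> {
    const prev = this.tail;
    let release!: () => void;
    this.tail = new Promise<void>((resolve) => (release = resolve));
    await prev; // wait for the previous request on this model to finish
    try {
      return await fn();
    } finally {
      release();
    }
  }
}

// One lock per loaded model, mirroring the per-model state maps above.
const loadedModelIdToLock = new Map<string, SimpleLock>();

function lockFor(modelId: string): SimpleLock {
  let lock = loadedModelIdToLock.get(modelId);
  if (!lock) {
    lock = new SimpleLock();
    loadedModelIdToLock.set(modelId, lock);
  }
  return lock;
}
```

With this shape, two requests to the same model run back-to-back, while a request to a different model proceeds immediately on its own lock.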

Routing Through the Worker Boundary

When operating through a Web Worker, model routing happens on both sides of the boundary:

Main thread (proxy):

  • Resolves selectedModelId from this.modelId[] and request.model
  • Sends selectedModelId with streaming init messages
  • Keys the proxy-side streaming on selectedModelId

Worker thread (handler):

  • Receives the full modelId[] and chatOpts[] with each request
  • Calls reloadIfUnmatched() to ensure state consistency
  • Delegates to MLCEngine, which performs its own model selection internally
  • Keys the worker-side generator map on selectedModelId

Model Type Enforcement

The engine enforces that the correct pipeline type is used for each API:

  • chatCompletion() and completion() require LLMChatPipeline -- throws IncorrectPipelineLoadedError otherwise
  • embedding() requires EmbeddingPipeline and ModelType.embedding -- throws EmbeddingUnsupportedModelError otherwise

This prevents accidental misuse such as sending chat requests to an embedding model.
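A minimal sketch of that check, assuming a per-model type map as described in the state-isolation section. assertModelType is a hypothetical helper; the real engine throws IncorrectPipelineLoadedError or EmbeddingUnsupportedModelError rather than a plain Error.

```typescript
// Illustrative model-type registry, keyed like loadedModelIdToModelType.
enum ModelType {
  LLM,
  embedding,
}

const loadedModelIdToModelType = new Map<string, ModelType>([
  ["Llama-3.1-8B-Instruct-q4f16_1-MLC", ModelType.LLM],
  ["snowflake-arctic-embed-s-q0f32-MLC", ModelType.embedding],
]);

// Hypothetical guard run at the top of each API entry point.
function assertModelType(
  modelId: string,
  expected: ModelType,
  api: string,
): void {
  const actual = loadedModelIdToModelType.get(modelId);
  if (actual !== expected) {
    throw new Error(`${api}: model ${modelId} has the wrong pipeline type`);
  }
}
```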

Usage

Loading multiple models:

const engine = await CreateWebWorkerMLCEngine(
  worker,
  ["Llama-3.1-8B-Instruct-q4f16_1-MLC", "snowflake-arctic-embed-s-q0f32-MLC"]
);

Routing chat to the LLM:

const chatResponse = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello!" }],
  model: "Llama-3.1-8B-Instruct-q4f16_1-MLC",  // Required when multiple models loaded
});

Routing embedding to the embedding model:

const embedResponse = await engine.embeddings.create({
  input: "Hello world",
  model: "snowflake-arctic-embed-s-q0f32-MLC",  // Required when multiple models loaded
});

Single model (no routing needed):

const engine = await CreateWebWorkerMLCEngine(
  worker,
  "Llama-3.1-8B-Instruct-q4f16_1-MLC"
);

// model field is optional when only one model is loaded
const response = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello!" }],
});

Theoretical Basis

The multi-model routing pattern implements a form of virtual dispatch based on the model field in requests. Conceptually, the engine acts as a multiplexer: a single entry point that fans out to multiple isolated processing pipelines.

The routing function getModelIdToUse() implements the following decision tree:

getModelIdToUse(loadedModelIds, requestModel, requestName)
  |
  +-- loadedModelIds.length === 0
  |     -> throw ModelNotLoadedError
  |
  +-- requestModel is specified
  |     |
  |     +-- requestModel in loadedModelIds
  |     |     -> return requestModel
  |     |
  |     +-- requestModel NOT in loadedModelIds
  |           -> throw SpecifiedModelNotFoundError
  |
  +-- requestModel is NOT specified
        |
        +-- loadedModelIds.length === 1
        |     -> return loadedModelIds[0]
        |
        +-- loadedModelIds.length > 1
              -> throw UnclearModelToUseError

Key constraints:

  • Uniqueness: Model IDs must be unique when loading multiple models (ReloadModelIdNotUniqueError is thrown otherwise)
  • Sequential loading: Models are loaded one at a time during reload() to avoid GPU memory conflicts
  • Per-model locking: Each model has its own CustomLock so requests to different models do not block each other, but requests to the same model are serialized
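The uniqueness constraint amounts to a pre-flight check before any model is loaded. assertUniqueModelIds is an illustrative sketch, not the engine's actual validation code.

```typescript
// Reject duplicate IDs up front, mirroring ReloadModelIdNotUniqueError.
function assertUniqueModelIds(modelIds: string[]): void {
  if (new Set(modelIds).size !== modelIds.length) {
    throw new Error(
      "ReloadModelIdNotUniqueError: duplicate model IDs passed to reload()",
    );
  }
}
```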

Helper Functions for Per-Model Queries

Several engine methods accept an optional modelId parameter for querying model-specific state:

  • getMessage(modelId?) -- Get the current generated message for a specific model
  • runtimeStatsText(modelId?) -- Get runtime statistics for a specific model
  • resetChat(keepStats?, modelId?) -- Reset chat state for a specific model
  • forwardTokensAndSample(inputIds, isPrefill, modelId?) -- Low-level forwarding for a specific model

These all internally use getModelIdToUse() to resolve the target model.
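The shared resolution step behind these helpers can be sketched as follows. resolveModelId and the stats map are hypothetical stand-ins for the engine's internal use of getModelIdToUse() and its per-model state maps; real method signatures are as listed above.

```typescript
// Hypothetical resolver mirroring getModelIdToUse(): an omitted modelId is
// acceptable only when exactly one model is loaded.
function resolveModelId(loaded: string[], modelId?: string): string {
  if (loaded.length === 0) throw new Error("ModelNotLoadedError");
  if (modelId !== undefined) {
    if (!loaded.includes(modelId)) throw new Error("SpecifiedModelNotFoundError");
    return modelId;
  }
  if (loaded.length > 1) throw new Error("UnclearModelToUseError");
  return loaded[0];
}

// Illustrative per-model state; each helper indexes its map by the resolved ID.
const statsByModel = new Map<string, string>([
  ["model-a", "prefill: 100 tok/s"],
  ["model-b", "prefill: 80 tok/s"],
]);

function runtimeStatsText(loaded: string[], modelId?: string): string {
  return statsByModel.get(resolveModelId(loaded, modelId)) ?? "";
}
```

Because every helper funnels through the same resolver, the optional modelId parameter behaves identically across getMessage, runtimeStatsText, resetChat, and forwardTokensAndSample.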

Related Pages

Implementation:Mlc_ai_Web_llm_Get_Message_Model_Routing
