
Principle:Mlc ai Web llm Multi Model Routing

From Leeroopedia


Overview

Multi-Model Routing is the pattern that directs each inference request to the correct model when multiple models are loaded concurrently in a single engine instance. It enables use cases such as running an LLM for chat and an embedding model for semantic search within the same application, using a single engine.

Description

The web-llm engine supports loading multiple models simultaneously via reload(modelId: string[]). When multiple models are loaded, every API call must specify which model to target via the model field in the request. The routing logic resolves which pipeline, configuration, and state to use for each request.

Model Selection Logic

The core model selection is handled by getModelIdToUse() in src/support.ts. The function follows these rules:

  1. No models loaded: Throws ModelNotLoadedError
  2. Model specified in request: Validates it exists among loaded models. If found, selects it; if not, throws SpecifiedModelNotFoundError
  3. Model not specified, single model loaded: Automatically selects the only loaded model (no ambiguity)
  4. Model not specified, multiple models loaded: Throws UnclearModelToUseError -- the caller must specify
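The four rules above can be sketched as a standalone function. This is an illustrative reimplementation, not the actual code in src/support.ts: the real function's signature may differ, and plain Error is used here in place of the named error classes.

```typescript
// Illustrative sketch of the selection rules; the real getModelIdToUse()
// in src/support.ts throws dedicated error classes instead of plain Error.
function getModelIdToUse(
  loadedModelIds: string[],
  requestModel: string | undefined,
  requestName: string,
): string {
  // Rule 1: no models loaded at all.
  if (loadedModelIds.length === 0) {
    throw new Error(`ModelNotLoadedError: no model loaded for ${requestName}`);
  }
  // Rule 2: an explicit model must match a loaded one.
  if (requestModel !== undefined) {
    if (loadedModelIds.includes(requestModel)) return requestModel;
    throw new Error(
      `SpecifiedModelNotFoundError: ${requestModel} is not loaded`,
    );
  }
  // Rule 3: omission is fine when there is no ambiguity.
  if (loadedModelIds.length === 1) return loadedModelIds[0];
  // Rule 4: ambiguous -- the caller must specify.
  throw new Error(
    `UnclearModelToUseError: specify request.model for ${requestName}`,
  );
}
```

An explicit model field is always validated strictly; omitting it is tolerated only when there is exactly one candidate.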

This logic is invoked at multiple layers:

  • Proxy layer (WebWorkerMLCEngine): Before sending requests to the worker, to determine selectedModelId for generator keying
  • Engine layer (MLCEngine): Inside getModelStates() to select the correct pipeline and config
  • Worker handler layer (WebWorkerMLCEngineHandler): Implicitly through the engine's own routing when processing requests

Per-Model State Isolation

Each loaded model maintains isolated state through several maps in MLCEngine:

Map                        Key      Value                                Purpose
loadedModelIdToPipeline    string   LLMChatPipeline | EmbeddingPipeline  The inference pipeline
loadedModelIdToChatConfig  string   ChatConfig                           Model-specific configuration
loadedModelIdToModelType   string   ModelType                            LLM vs. embedding
loadedModelIdToLock        string   CustomLock                           Per-model request serialization

In the worker handler, streaming generators are also maintained per-model:

Map                            Key     Value                                             Purpose
loadedModelIdToAsyncGenerator  string  AsyncGenerator<ChatCompletionChunk | Completion>  Active streaming generator

This per-model isolation ensures that:

  • Requests to different models can be processed concurrently (different locks)
  • Each model maintains its own KV cache, conversation state, and generation statistics
  • Streaming from one model does not interfere with another model's generator
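The per-model locking behavior can be sketched with a minimal promise-chain lock. SimpleLock and lockFor below are illustrative stand-ins, assuming simpler semantics than web-llm's actual CustomLock and its loadedModelIdToLock map.

```typescript
// Minimal promise-chain lock: callers queue behind the previous holder.
class SimpleLock {
  private tail: Promise<void> = Promise.resolve();

  async runExclusive<T>(fn: () => Promise<T>): Promise<T> {
    const prev = this.tail;
    let release!: () => void;
    this.tail = new Promise<void>((resolve) => (release = resolve));
    await prev; // wait for the previous request on this model to finish
    try {
      return await fn();
    } finally {
      release();
    }
  }
}

// One lock per loaded model, mirroring the per-model state maps above.
const loadedModelIdToLock = new Map<string, SimpleLock>();

function lockFor(modelId: string): SimpleLock {
  let lock = loadedModelIdToLock.get(modelId);
  if (!lock) {
    lock = new SimpleLock();
    loadedModelIdToLock.set(modelId, lock);
  }
  return lock;
}
```

With this shape, two requests to the same model run back-to-back, while a request to a different model proceeds immediately on its own lock.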

Routing Through the Worker Boundary

When operating through a Web Worker, model routing happens on both sides of the boundary:

Main thread (proxy):

  • Resolves selectedModelId from this.modelId[] and request.model
  • Sends selectedModelId with streaming init messages
  • Keys the proxy-side streaming on selectedModelId

Worker thread (handler):

  • Receives the full modelId[] and chatOpts[] with each request
  • Calls reloadIfUnmatched() to ensure state consistency
  • Delegates to MLCEngine, which performs its own model selection internally
  • Keys the worker-side generator map on selectedModelId

Model Type Enforcement

The engine enforces that the correct pipeline type is used for each API:

  • chatCompletion() and completion() require LLMChatPipeline -- throws IncorrectPipelineLoadedError otherwise
  • embedding() requires EmbeddingPipeline and ModelType.embedding -- throws EmbeddingUnsupportedModelError otherwise

This prevents accidental misuse such as sending chat requests to an embedding model.
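A minimal sketch of that check, assuming a per-model type map as described in the state-isolation section. assertModelType is a hypothetical helper; the real engine throws IncorrectPipelineLoadedError or EmbeddingUnsupportedModelError rather than a plain Error.

```typescript
// Illustrative model-type registry, keyed like loadedModelIdToModelType.
enum ModelType {
  LLM,
  embedding,
}

const loadedModelIdToModelType = new Map<string, ModelType>([
  ["Llama-3.1-8B-Instruct-q4f16_1-MLC", ModelType.LLM],
  ["snowflake-arctic-embed-s-q0f32-MLC", ModelType.embedding],
]);

// Hypothetical guard run at the top of each API entry point.
function assertModelType(
  modelId: string,
  expected: ModelType,
  api: string,
): void {
  const actual = loadedModelIdToModelType.get(modelId);
  if (actual !== expected) {
    throw new Error(`${api}: model ${modelId} has the wrong pipeline type`);
  }
}
```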

Usage

Loading multiple models:

const engine = await CreateWebWorkerMLCEngine(
  worker,
  ["Llama-3.1-8B-Instruct-q4f16_1-MLC", "snowflake-arctic-embed-s-q0f32-MLC"]
);

Routing chat to the LLM:

const chatResponse = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello!" }],
  model: "Llama-3.1-8B-Instruct-q4f16_1-MLC",  // Required when multiple models loaded
});

Routing embedding to the embedding model:

const embedResponse = await engine.embeddings.create({
  input: "Hello world",
  model: "snowflake-arctic-embed-s-q0f32-MLC",  // Required when multiple models loaded
});

Single model (no routing needed):

const engine = await CreateWebWorkerMLCEngine(
  worker,
  "Llama-3.1-8B-Instruct-q4f16_1-MLC"
);

// model field is optional when only one model is loaded
const response = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello!" }],
});

Theoretical Basis

The multi-model routing pattern implements a form of virtual dispatch based on the model field in requests. Conceptually, the engine acts as a multiplexer: a single entry point that fans out to multiple isolated processing pipelines.

The routing function getModelIdToUse() implements the following decision tree:

getModelIdToUse(loadedModelIds, requestModel, requestName)
  |
  +-- loadedModelIds.length === 0
  |     -> throw ModelNotLoadedError
  |
  +-- requestModel is specified
  |     |
  |     +-- requestModel in loadedModelIds
  |     |     -> return requestModel
  |     |
  |     +-- requestModel NOT in loadedModelIds
  |           -> throw SpecifiedModelNotFoundError
  |
  +-- requestModel is NOT specified
        |
        +-- loadedModelIds.length === 1
        |     -> return loadedModelIds[0]
        |
        +-- loadedModelIds.length > 1
              -> throw UnclearModelToUseError

Key constraints:

  • Uniqueness: Model IDs must be unique when loading multiple models (ReloadModelIdNotUniqueError is thrown otherwise)
  • Sequential loading: Models are loaded one at a time during reload() to avoid GPU memory conflicts
  • Per-model locking: Each model has its own CustomLock so requests to different models do not block each other, but requests to the same model are serialized
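The uniqueness constraint amounts to a pre-flight check before any model is loaded. assertUniqueModelIds is an illustrative sketch, not the engine's actual validation code.

```typescript
// Reject duplicate IDs up front, mirroring ReloadModelIdNotUniqueError.
function assertUniqueModelIds(modelIds: string[]): void {
  if (new Set(modelIds).size !== modelIds.length) {
    throw new Error(
      "ReloadModelIdNotUniqueError: duplicate model IDs passed to reload()",
    );
  }
}
```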

Helper Functions for Per-Model Queries

Several engine methods accept an optional modelId parameter for querying model-specific state:

  • getMessage(modelId?) -- Get the current generated message for a specific model
  • runtimeStatsText(modelId?) -- Get runtime statistics for a specific model
  • resetChat(keepStats?, modelId?) -- Reset chat state for a specific model
  • forwardTokensAndSample(inputIds, isPrefill, modelId?) -- Low-level forwarding for a specific model

These all internally use getModelIdToUse() to resolve the target model.
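The shared resolution step behind these helpers can be sketched as follows. resolveModelId and the stats map are hypothetical stand-ins for the engine's internal use of getModelIdToUse() and its per-model state maps; real method signatures are as listed above.

```typescript
// Hypothetical resolver mirroring getModelIdToUse(): an omitted modelId is
// acceptable only when exactly one model is loaded.
function resolveModelId(loaded: string[], modelId?: string): string {
  if (loaded.length === 0) throw new Error("ModelNotLoadedError");
  if (modelId !== undefined) {
    if (!loaded.includes(modelId)) throw new Error("SpecifiedModelNotFoundError");
    return modelId;
  }
  if (loaded.length > 1) throw new Error("UnclearModelToUseError");
  return loaded[0];
}

// Illustrative per-model state; each helper indexes its map by the resolved ID.
const statsByModel = new Map<string, string>([
  ["model-a", "prefill: 100 tok/s"],
  ["model-b", "prefill: 80 tok/s"],
]);

function runtimeStatsText(loaded: string[], modelId?: string): string {
  return statsByModel.get(resolveModelId(loaded, modelId)) ?? "";
}
```

Because every helper funnels through the same resolver, the optional modelId parameter behaves identically across getMessage, runtimeStatsText, resetChat, and forwardTokensAndSample.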

Related Pages

Implementation:Mlc_ai_Web_llm_Get_Message_Model_Routing
