Principle:Mlc ai Web llm Embedding Model Selection

Overview

Embedding Model Selection is the technique of choosing, from the web-llm model registry, text embedding models that encode text into dense vector representations, rather than generative chat or vision-language models.

Description

In web-llm, every model available for loading is described by a ModelRecord entry in the prebuiltAppConfig.model_list array. Each ModelRecord carries an optional model_type field of type ModelType, which is an enum with three values:

  • ModelType.LLM -- standard large language models for chat and text generation (the default when model_type is omitted)
  • ModelType.embedding -- models compiled specifically for generating text embeddings
  • ModelType.VLM -- vision-language models that accept image inputs
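
As a quick illustration, the snippet below (a minimal sketch, not part of the library) groups the prebuilt registry by model_type, treating a missing model_type as ModelType.LLM per the default noted above:

import { prebuiltAppConfig, ModelType } from "@mlc-ai/web-llm";

// Group the prebuilt registry by model_type; a missing model_type is
// treated as ModelType.LLM, the default noted above.
const groups: Record<string, string[]> = { llm: [], embedding: [], vlm: [] };
for (const record of prebuiltAppConfig.model_list) {
  const type = record.model_type ?? ModelType.LLM;
  if (type === ModelType.embedding) groups.embedding.push(record.model_id);
  else if (type === ModelType.VLM) groups.vlm.push(record.model_id);
  else groups.llm.push(record.model_id);
}
console.log(groups);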

Embedding model selection involves identifying and choosing models whose model_type is set to ModelType.embedding. When the engine loads such a model via CreateMLCEngine() or engine.reload(), it instantiates an EmbeddingPipeline instead of the standard LLMChatPipeline. This pipeline is optimized for forward-only inference (no autoregressive decoding) and supports batched input processing.
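
A minimal sketch of the reload path follows; the model id used here is one of the registered embedding models listed later on this page:

import { MLCEngine } from "@mlc-ai/web-llm";

// Reusing one engine instance: reload() with an embedding-model id swaps in
// the embedding pipeline described above instead of the chat pipeline.
const engine = new MLCEngine();
await engine.reload("snowflake-arctic-embed-s-q0f32-MLC-b4");

// Forward-only inference: a single call returns the embedding vector.
const res = await engine.embeddings.create({ input: "pipeline check" });
console.log(res.data[0].embedding.length);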

The primary embedding models available in the prebuilt registry are variants of Snowflake Arctic Embed, a family of transformer-based encoders that produce dense vector representations. These models are available in different size/batch configurations:

  • snowflake-arctic-embed-m -- the medium-sized encoder (~109M parameters)
  • snowflake-arctic-embed-s -- the small-sized encoder (~33M parameters)

Each variant is offered with different maximum batch sizes (denoted by the -b4 or -b32 suffix), which control the trade-off between throughput and VRAM consumption.
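
If you stay on a low-memory -b4 build but still need to embed many texts, a simple client-side loop can keep each embeddings.create() call at or below the compiled batch size. The helper below is a usage sketch, not a library API:

import { MLCEngine } from "@mlc-ai/web-llm";

// Hypothetical helper: embed `texts` in chunks of `batchSize` so each call
// stays within the model's compiled max_batch_size (4 for -b4 builds).
async function embedInChunks(
  engine: MLCEngine,
  texts: string[],
  batchSize = 4,
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    const res = await engine.embeddings.create({
      input: texts.slice(i, i + batchSize),
    });
    vectors.push(...res.data.map((d) => d.embedding as number[]));
  }
  return vectors;
}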

Usage

Use embedding model selection when building:

  • Semantic search applications that require encoding queries and documents into vector space
  • Retrieval-Augmented Generation (RAG) pipelines where document retrieval precedes LLM-based answer synthesis
  • Text similarity computation for clustering, deduplication, or recommendation systems
  • In-browser knowledge base applications that must run entirely client-side without server infrastructure

To select an embedding model, filter prebuiltAppConfig.model_list for entries where model_type === ModelType.embedding, then pass the chosen model_id to CreateMLCEngine().
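
As an end-to-end sketch of the semantic-search use case above: embed a query and a few documents, then rank the documents by cosine similarity. The cosine helper and the sample texts are illustrative additions, not part of web-llm:

import { CreateMLCEngine, prebuiltAppConfig, ModelType } from "@mlc-ai/web-llm";

// Pick any registered embedding model (here, simply the first match).
const embeddingId = prebuiltAppConfig.model_list.find(
  (m) => m.model_type === ModelType.embedding,
)!.model_id;
const searchEngine = await CreateMLCEngine(embeddingId);

// Cosine similarity between two dense vectors (illustrative helper).
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

const docs = [
  "WebGPU runs compute shaders in the browser.",
  "Bananas are rich in potassium.",
];
const queryRes = await searchEngine.embeddings.create({ input: "in-browser GPU compute" });
const docRes = await searchEngine.embeddings.create({ input: docs });

const query = queryRes.data[0].embedding as number[];
const ranked = docs
  .map((text, i) => ({
    text,
    score: cosine(query, docRes.data[i].embedding as number[]),
  }))
  .sort((a, b) => b.score - a.score);
console.log("Best match:", ranked[0].text);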

Theoretical Basis

Model selection for embeddings involves understanding the trade-offs across three axes:

Model Size
Larger models (e.g., snowflake-arctic-embed-m, 109M parameters) generally produce higher quality embeddings with better semantic discrimination, but require more VRAM and have higher latency. Smaller models (e.g., snowflake-arctic-embed-s, 33M parameters) are faster and use less memory but may sacrifice some retrieval quality.
Batch Size
The -b4 variants are compiled with max_batch_size=4 and require roughly 239-539 MB of VRAM, depending on the encoder size. The -b32 variants are compiled with max_batch_size=32 and require roughly 1023-1408 MB of VRAM, but can process up to 32 inputs in a single GPU pass. A larger batch size is beneficial when embedding many documents at once; a smaller batch size is appropriate when memory is constrained (e.g., on mobile devices).
Context Window
All current embedding models are compiled with a context window of 512 tokens (ctx512). Inputs exceeding this length will cause an EmbeddingExceedContextWindowSizeError.
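
These axes can be combined into a simple selection heuristic. The function below is a sketch that assumes two inputs, a VRAM budget in MB and a flag for large-batch workloads; it is not part of the library:

import { prebuiltAppConfig, ModelType } from "@mlc-ai/web-llm";

// Hypothetical heuristic: among embedding models that fit the VRAM budget,
// keep only the requested batch-size variant, then prefer the medium
// encoder over the small one for better retrieval quality.
function pickEmbeddingModel(vramBudgetMB: number, needLargeBatches: boolean) {
  const candidates = prebuiltAppConfig.model_list.filter(
    (m) =>
      m.model_type === ModelType.embedding &&
      (m.vram_required_MB ?? Infinity) <= vramBudgetMB &&
      m.model_id.endsWith(needLargeBatches ? "-b32" : "-b4"),
  );
  return candidates.find((m) => m.model_id.includes("-embed-m-")) ?? candidates[0];
}

console.log(pickEmbeddingModel(600, false)?.model_id);
// With the VRAM figures listed below, this picks the medium -b4 variant (~539 MB).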

I/O Contract

Input:

  • A model_id string matching an entry in prebuiltAppConfig.model_list where model_type === ModelType.embedding

Output:

  • An initialized MLCEngine with an EmbeddingPipeline loaded, ready to accept embeddings.create() calls

Constraints:

  • The selected model must exist in the appConfig.model_list
  • The browser must support WebGPU
  • Sufficient VRAM must be available (see vram_required_MB on each ModelRecord)
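
A small pre-flight check covering these constraints, with navigator.gpu as the standard WebGPU entry point; the helper itself is illustrative only:

import { prebuiltAppConfig, ModelType } from "@mlc-ai/web-llm";

// Illustrative pre-flight check: the id must name a registered embedding
// model and the browser must expose WebGPU before attempting to load.
function describeLoadRequirements(modelId: string): string {
  const record = prebuiltAppConfig.model_list.find((m) => m.model_id === modelId);
  if (!record || record.model_type !== ModelType.embedding) {
    return `"${modelId}" is not a registered embedding model`;
  }
  if (typeof navigator === "undefined" || !("gpu" in navigator)) {
    return "WebGPU is not available in this environment";
  }
  return `OK: plan for about ${record.vram_required_MB} MB of VRAM`;
}

console.log(describeLoadRequirements("snowflake-arctic-embed-s-q0f32-MLC-b4"));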

Usage Examples

import {
  CreateMLCEngine,
  prebuiltAppConfig,
  ModelType,
} from "@mlc-ai/web-llm";

// 1. Discover available embedding models
const embeddingModels = prebuiltAppConfig.model_list.filter(
  (model) => model.model_type === ModelType.embedding,
);
console.log("Available embedding models:");
for (const m of embeddingModels) {
  console.log(`  ${m.model_id} (VRAM: ${m.vram_required_MB} MB)`);
}
// Output:
//   snowflake-arctic-embed-m-q0f32-MLC-b32 (VRAM: 1407.51 MB)
//   snowflake-arctic-embed-m-q0f32-MLC-b4 (VRAM: 539.4 MB)
//   snowflake-arctic-embed-s-q0f32-MLC-b32 (VRAM: 1022.82 MB)
//   snowflake-arctic-embed-s-q0f32-MLC-b4 (VRAM: 238.71 MB)

// 2. Select a model based on requirements
// For low-memory devices, use the small model with batch size 4
const selectedModelId = "snowflake-arctic-embed-s-q0f32-MLC-b4";

// 3. Load the embedding model
const engine = await CreateMLCEngine(selectedModelId, {
  initProgressCallback: (report) => {
    console.log(`Loading: ${report.text}`);
  },
});

// 4. Use the engine for embedding tasks
const result = await engine.embeddings.create({
  input: "Hello, world!",
});
console.log("Embedding dimension:", result.data[0].embedding.length);
// Selecting a high-throughput model for batch processing
const batchModelId = "snowflake-arctic-embed-m-q0f32-MLC-b32";
const batchEngine = await CreateMLCEngine(batchModelId);

// This model can process up to 32 inputs in a single GPU pass
const batchResult = await batchEngine.embeddings.create({
  input: Array.from({ length: 20 }, (_, i) => `Document number ${i}`),
});
console.log("Embedded", batchResult.data.length, "documents");
