Principle:Mlc_ai_Web_llm_Embedding_Model_Selection
Overview
Embedding Model Selection is the technique of choosing, from the web-llm model registry, text embedding models that encode text into dense vector representations, as opposed to generative chat or vision-language models.
Description
In web-llm, every model available for loading is described by a ModelRecord entry in the prebuiltAppConfig.model_list array. Each ModelRecord carries an optional model_type field of type ModelType, which is an enum with three values:
- ModelType.LLM -- standard large language models for chat and text generation (the default when model_type is omitted)
- ModelType.embedding -- models compiled specifically for generating text embeddings
- ModelType.VLM -- vision-language models that accept image inputs
Embedding model selection involves identifying and choosing models whose model_type is set to ModelType.embedding. When the engine loads such a model via CreateMLCEngine() or engine.reload(), it instantiates an EmbeddingPipeline instead of the standard LLMChatPipeline. This pipeline is optimized for forward-only inference (no autoregressive decoding) and supports batched input processing.
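The dispatch can be pictured with a minimal sketch. This is a simplification, not web-llm's actual code: the real logic lives inside the engine's reload path, and the enum below merely mirrors the three ModelType values, with every non-embedding type falling through to the chat pipeline here.

```typescript
// Simplified mirror of web-llm's ModelType enum.
enum ModelType {
  LLM,
  embedding,
  VLM,
}

// Conceptual dispatch: embedding models get a forward-only pipeline,
// everything else (LLM, VLM) is handled by the chat pipeline in this sketch.
function pipelineFor(modelType: ModelType = ModelType.LLM): string {
  return modelType === ModelType.embedding
    ? "EmbeddingPipeline"
    : "LLMChatPipeline";
}

console.log(pipelineFor(ModelType.embedding)); // "EmbeddingPipeline"
```

The default parameter reflects that an omitted model_type is treated as a standard LLM.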
The primary embedding models available in the prebuilt registry are variants of Snowflake Arctic Embed, a family of transformer-based encoders that produce dense vector representations. These models are available in different size/batch configurations:
- snowflake-arctic-embed-m -- the medium-sized encoder (~109M parameters)
- snowflake-arctic-embed-s -- the small-sized encoder (~33M parameters)
Each variant is offered with different maximum batch sizes (denoted by the -b4 or -b32 suffix), which control the trade-off between throughput and VRAM consumption.
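One way to navigate this trade-off is to pick the largest-batch variant that fits a VRAM budget. The helper below is a hypothetical sketch, not a web-llm API; the hard-coded entries mirror the four prebuilt variants, with VRAM figures taken from the registry.

```typescript
// Hypothetical subset of the fields a ModelRecord carries; the real record
// has many more (model URL, quantization, overrides, etc.).
interface EmbeddingVariant {
  model_id: string;
  vram_required_MB: number;
  max_batch_size: number;
}

const embeddingVariants: EmbeddingVariant[] = [
  { model_id: "snowflake-arctic-embed-m-q0f32-MLC-b32", vram_required_MB: 1407.51, max_batch_size: 32 },
  { model_id: "snowflake-arctic-embed-m-q0f32-MLC-b4", vram_required_MB: 539.4, max_batch_size: 4 },
  { model_id: "snowflake-arctic-embed-s-q0f32-MLC-b32", vram_required_MB: 1022.82, max_batch_size: 32 },
  { model_id: "snowflake-arctic-embed-s-q0f32-MLC-b4", vram_required_MB: 238.71, max_batch_size: 4 },
];

// Pick the variant with the largest batch size (then largest model) that
// still fits within the given VRAM budget; undefined if nothing fits.
function pickVariant(budgetMB: number): EmbeddingVariant | undefined {
  return embeddingVariants
    .filter((m) => m.vram_required_MB <= budgetMB)
    .sort(
      (a, b) =>
        b.max_batch_size - a.max_batch_size ||
        b.vram_required_MB - a.vram_required_MB,
    )[0];
}

console.log(pickVariant(600)?.model_id); // snowflake-arctic-embed-m-q0f32-MLC-b4
```

A 600 MB budget rules out both -b32 variants, so the helper falls back to the medium -b4 model.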
Usage
Use embedding model selection when building:
- Semantic search applications that require encoding queries and documents into vector space
- Retrieval-Augmented Generation (RAG) pipelines where document retrieval precedes LLM-based answer synthesis
- Text similarity computation for clustering, deduplication, or recommendation systems
- In-browser knowledge base applications that must run entirely client-side without server infrastructure
To select an embedding model, filter prebuiltAppConfig.model_list for entries where model_type === ModelType.embedding, then pass the chosen model_id to CreateMLCEngine().
Theoretical Basis
Model selection for embeddings involves understanding the trade-offs across three axes:
- Model Size -- Larger models (e.g., snowflake-arctic-embed-m, 109M parameters) generally produce higher-quality embeddings with better semantic discrimination, but require more VRAM and have higher latency. Smaller models (e.g., snowflake-arctic-embed-s, 33M parameters) are faster and use less memory but may sacrifice some retrieval quality.
- Batch Size -- The -b4 variants are compiled with max_batch_size=4, consuming as little as ~239-539 MB of VRAM. The -b32 variants are compiled with max_batch_size=32, requiring ~1023-1408 MB of VRAM, but can process up to 32 inputs in a single GPU pass. A larger batch size is beneficial when embedding many documents at once; a smaller batch size is appropriate when memory is constrained (e.g., on mobile devices).
- Context Window -- All current embedding models are compiled with a context window of 512 tokens (ctx512). Inputs exceeding this length cause an EmbeddingExceedContextWindowSizeError.
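Over-long inputs can be split client-side before embedding. The helper below is a hypothetical illustration: counting whitespace-separated words only approximates the model tokenizer's token count (subword tokenizers usually emit more tokens than words), so a real guard should measure length with the actual tokenizer.

```typescript
// ctx512: the compiled context window of the prebuilt embedding models.
const MAX_TOKENS = 512;

// Crude chunker: splits text into pieces of at most maxTokens "tokens",
// where a token is approximated by a whitespace-separated word.
function chunkForEmbedding(text: string, maxTokens = MAX_TOKENS): string[] {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += maxTokens) {
    chunks.push(words.slice(i, i + maxTokens).join(" "));
  }
  return chunks;
}

// A 1000-word document becomes two chunks (512 + 488 words).
console.log(chunkForEmbedding("word ".repeat(1000)).length); // 2
```

Each chunk can then be passed to embeddings.create() individually or as part of a batched input.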
I/O Contract
Input:
- A model_id string matching an entry in prebuiltAppConfig.model_list where model_type === ModelType.embedding

Output:
- An initialized MLCEngine with an EmbeddingPipeline loaded, ready to accept embeddings.create() calls

Constraints:
- The selected model must exist in the appConfig.model_list
- The browser must support WebGPU
- Sufficient VRAM must be available (see vram_required_MB on each ModelRecord)
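These constraints can be checked up front before calling CreateMLCEngine(). The sketch below is hypothetical (canLoad is not a web-llm API); in a browser, the WebGPU flag would typically come from `typeof navigator !== "undefined" && !!navigator.gpu`, and the VRAM budget is an estimate the application supplies.

```typescript
// Minimal shape of the ModelRecord fields this check relies on.
interface RecordLike {
  model_id: string;
  vram_required_MB?: number;
}

// Pre-flight check: the record must exist, WebGPU must be available, and
// the record's VRAM requirement must fit within the available budget.
function canLoad(
  record: RecordLike | undefined,
  availableVramMB: number,
  hasWebGPU: boolean,
): boolean {
  if (record === undefined || !hasWebGPU) return false;
  // Treat a missing VRAM figure as unknown and fail conservatively.
  if (record.vram_required_MB === undefined) return false;
  return record.vram_required_MB <= availableVramMB;
}

const rec: RecordLike = {
  model_id: "snowflake-arctic-embed-s-q0f32-MLC-b4",
  vram_required_MB: 238.71,
};
console.log(canLoad(rec, 512, true)); // true
```

Failing fast here gives a clearer error than letting model initialization fail mid-download.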
Usage Examples
```typescript
import {
  CreateMLCEngine,
  prebuiltAppConfig,
  ModelType,
} from "@mlc-ai/web-llm";

// 1. Discover available embedding models
const embeddingModels = prebuiltAppConfig.model_list.filter(
  (model) => model.model_type === ModelType.embedding,
);

console.log("Available embedding models:");
for (const m of embeddingModels) {
  console.log(`  ${m.model_id} (VRAM: ${m.vram_required_MB} MB)`);
}
// Output:
//   snowflake-arctic-embed-m-q0f32-MLC-b32 (VRAM: 1407.51 MB)
//   snowflake-arctic-embed-m-q0f32-MLC-b4 (VRAM: 539.4 MB)
//   snowflake-arctic-embed-s-q0f32-MLC-b32 (VRAM: 1022.82 MB)
//   snowflake-arctic-embed-s-q0f32-MLC-b4 (VRAM: 238.71 MB)

// 2. Select a model based on requirements
// For low-memory devices, use the small model with batch size 4
const selectedModelId = "snowflake-arctic-embed-s-q0f32-MLC-b4";

// 3. Load the embedding model
const engine = await CreateMLCEngine(selectedModelId, {
  initProgressCallback: (report) => {
    console.log(`Loading: ${report.text}`);
  },
});

// 4. Use the engine for embedding tasks
const result = await engine.embeddings.create({
  input: "Hello, world!",
});
console.log("Embedding dimension:", result.data[0].embedding.length);
```

```typescript
// Selecting a high-throughput model for batch processing
const batchModelId = "snowflake-arctic-embed-m-q0f32-MLC-b32";
const batchEngine = await CreateMLCEngine(batchModelId);

// This model can process up to 32 inputs in a single GPU pass
const batchResult = await batchEngine.embeddings.create({
  input: Array.from({ length: 20 }, (_, i) => `Document number ${i}`),
});
console.log("Embedded", batchResult.data.length, "documents");
```
Related Pages
- Implementation:Mlc_ai_Web_llm_Embedding_Model_Config
- Principle:Mlc_ai_Web_llm_Text_Embedding_Generation -- generating embeddings once a model is loaded
- Principle:Mlc_ai_Web_llm_Embedding_Input_Formatting -- proper input formatting for embedding models
- Principle:Mlc_ai_Web_llm_RAG_Pipeline -- using embedding models in Retrieval-Augmented Generation