Workflow:Mlc ai Web llm Text Embeddings And RAG

Knowledge Sources	web-llm WebLLM Docs
Domains	LLMs, WebGPU, Embeddings, RAG, Semantic_Search
Last Updated	2026-02-14 22:00 GMT

Overview

End-to-end process for generating text embeddings in the browser and using them for semantic search or retrieval-augmented generation (RAG) pipelines, entirely client-side.

Description

This workflow covers web-llm's embedding model support, which runs embedding inference directly in the browser using WebGPU. The engine loads a dedicated embedding model (such as Snowflake Arctic Embed) and exposes an OpenAI-compatible embeddings.create() API. Embeddings can be used standalone for semantic similarity scoring or combined with an LLM in a single engine for full RAG pipelines. The workflow also demonstrates integration with LangChain.js for vector store operations and retrieval chains, all running client-side without any server.

Usage

Use this workflow when you need to compute text embeddings in the browser for semantic search, document similarity, clustering, or retrieval-augmented generation. It is particularly valuable for privacy-sensitive applications where documents should not leave the user's device, and for offline-capable applications that need semantic understanding without network access once the model weights have been downloaded and cached.

Execution Steps

Step 1: Select an Embedding Model

Choose an embedding model from the web-llm registry. Embedding models are distinct from chat models and are identified by their model ID (e.g., "snowflake-arctic-embed-m-q0f32-MLC-b4"). The batch size suffix (e.g., "-b4") indicates the maximum number of inputs that can be processed in a single forward pass. Larger batch sizes consume more memory but process multiple inputs faster.

Key considerations:

Embedding models produce fixed-dimension vector representations of text
The batch size suffix determines maximum simultaneous inputs per forward pass
Inputs exceeding the batch size are automatically processed in multiple passes
Larger batch sizes (e.g., b32) use more memory but are more efficient for bulk embedding
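
The relationship between the batch size suffix and the number of forward passes can be sketched as follows. Both helpers are hypothetical and not part of the web-llm API; they only assume that the suffix has the form "-b<N>" at the end of the model ID, as in the example above.

```typescript
// Hypothetical helpers for illustration; web-llm does not export these.

/** Reads the trailing "-b<N>" batch-size suffix from a model ID, if present. */
function parseBatchSize(modelId: string): number | undefined {
  const match = /-b(\d+)$/.exec(modelId);
  const digits = match?.[1];
  return digits === undefined ? undefined : Number.parseInt(digits, 10);
}

/** Number of forward passes needed to embed `inputCount` texts. */
function forwardPasses(inputCount: number, batchSize: number): number {
  if (batchSize <= 0) throw new Error("batchSize must be positive");
  return Math.ceil(inputCount / batchSize);
}

// With a "-b4" model, 10 inputs need ceil(10 / 4) = 3 passes.
const exampleBatchSize = parseBatchSize("snowflake-arctic-embed-m-q0f32-MLC-b4");
const examplePasses = forwardPasses(10, exampleBatchSize ?? 1);
```

For bulk indexing of many documents, a larger batch variant reduces the pass count at the cost of memory; for single-query lookups the smaller variant is usually sufficient.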

Step 2: Create the Engine

Initialize the MLCEngine with the embedding model using CreateMLCEngine. For RAG pipelines that need both embeddings and chat completion, pass an array of model IDs containing both the embedding model and an LLM. The engine handles loading and managing multiple models simultaneously.

Single model:

CreateMLCEngine(embeddingModelId, config) for embeddings only

Multi-model (for RAG):

CreateMLCEngine([embeddingModelId, llmModelId], config) loads both models
Each model is addressed by passing its model ID in the model field of subsequent API calls
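
The two call shapes can be wrapped in one small loader. This is a sketch under assumptions: EngineFactory is a structural stand-in that mirrors only the (modelId or modelId array, config) call shape described above, and loadEngine is a hypothetical helper. The initProgressCallback field is assumed to receive a report object with a text property.

```typescript
// Structural stand-ins (assumptions), so the sketch runs without the library.
type ProgressReport = { text: string };
type EngineConfig = { initProgressCallback?: (report: ProgressReport) => void };
type EngineFactory<E> = (
  modelId: string | string[],
  config?: EngineConfig,
) => Promise<E>;

/** Loads the embedding model alone, or together with an LLM for RAG. */
async function loadEngine<E>(
  createEngine: EngineFactory<E>,
  embeddingModelId: string,
  llmModelId?: string,
  onProgress: (text: string) => void = () => {},
): Promise<E> {
  // A single string loads one model; an array loads several into one engine.
  const models =
    llmModelId === undefined ? embeddingModelId : [embeddingModelId, llmModelId];
  return createEngine(models, {
    initProgressCallback: (report) => onProgress(report.text),
  });
}
```

In a browser application, the createEngine argument would be CreateMLCEngine imported from the web-llm package; injecting it as a parameter keeps the helper testable without WebGPU.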

Step 3: Format Input Documents

Prepare the text inputs according to the embedding model's expected format. Some models require specific formatting with special tokens (e.g., "[CLS] text [SEP]") and query prefixes (e.g., "Represent this sentence for searching relevant passages:"). Consult the specific model's documentation for formatting requirements.

Key considerations:

Document formatting varies by model (some need CLS/SEP tokens, some do not)
Query inputs may need a task-specific prefix for asymmetric retrieval
Batch all inputs into a single array for efficient processing
The engine handles tokenization internally after receiving formatted strings
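
A minimal sketch of this formatting step, using the special tokens and query prefix quoted above. Whether a given model needs them is model-specific, so the constant and the helper names here are illustrative assumptions rather than a fixed recipe.

```typescript
// Prefix taken from the example above; other models may need a different one
// or none at all.
const QUERY_PREFIX = "Represent this sentence for searching relevant passages: ";

/** Wraps a document in the [CLS] ... [SEP] markers some models expect. */
function formatDocument(text: string): string {
  return `[CLS] ${text.trim()} [SEP]`;
}

/** Adds the retrieval prefix to a query, then wraps it like a document. */
function formatQuery(query: string): string {
  return formatDocument(QUERY_PREFIX + query.trim());
}

/** Builds one input array: all documents first, the query last. */
function buildEmbeddingInputs(documents: string[], query: string): string[] {
  return documents.map((doc) => formatDocument(doc)).concat(formatQuery(query));
}
```

Keeping documents and the query in one array lets a single embeddings call cover both, with the query embedding at the last index.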

Step 4: Generate Embeddings

Call engine.embeddings.create() with the input texts. The API follows the OpenAI embeddings format, accepting a string or array of strings and returning embedding vectors. The response includes the embedding vectors, model name, and usage statistics (total tokens processed).

What happens:

Inputs are tokenized and batched according to the model's max batch size
The embedding model performs forward inference via WebGPU
Mean pooling is applied across token positions to produce fixed-size vectors
L2 normalization is applied to produce unit vectors
The response contains data[i].embedding arrays for each input
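
The call and the response handling might look like the sketch below. The interfaces are structural assumptions that mirror only the response fields named above (data[i].embedding, model, usage); the usage.total_tokens field name follows the OpenAI format and is assumed here. They are not the library's own type definitions, and embedTexts is a hypothetical helper.

```typescript
// Assumed minimal shapes, so the sketch is self-contained.
interface EmbeddingResponseLike {
  data: { embedding: number[] }[];
  model: string;
  usage: { total_tokens: number };
}
interface EmbeddingEngineLike {
  embeddings: {
    create(request: {
      input: string | string[];
      model?: string;
    }): Promise<EmbeddingResponseLike>;
  };
}

/** Embeds the inputs and unpacks vectors plus basic bookkeeping. */
async function embedTexts(
  engine: EmbeddingEngineLike,
  inputs: string[],
  model?: string,
): Promise<{ vectors: number[][]; model: string; totalTokens: number }> {
  // Only send the model field when one was given (needed for multi-model engines).
  const request = model === undefined ? { input: inputs } : { input: inputs, model };
  const response = await engine.embeddings.create(request);
  return {
    vectors: response.data.map((item) => item.embedding),
    model: response.model,
    totalTokens: response.usage.total_tokens,
  };
}
```

The returned vectors are in the same order as the inputs, so vectors[i] corresponds to inputs[i].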

Step 5: Compute Similarity or Build Vector Store

Use the embedding vectors for downstream tasks. For semantic search, compute cosine similarity between query and document embeddings. For integration with LangChain.js, wrap the engine in an EmbeddingsInterface adapter and use it with MemoryVectorStore or other vector store implementations.

Key considerations:

Because the embedding vectors are L2-normalized, cosine similarity reduces to a plain dot product and can be computed directly on them
LangChain.js integration requires implementing the EmbeddingsInterface (embedQuery and embedDocuments methods)
MemoryVectorStore provides in-memory similarity search with no external dependencies
Similarity results can be used for document retrieval, ranking, or classification
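
Both halves of this step can be sketched without any dependencies. The first part computes cosine similarity and ranks documents against a query. The second part is a hypothetical adapter class exposing the two methods named above (embedQuery and embedDocuments). It takes an embed function as an assumption, so it does not depend on a particular engine, and it is not LangChain.js code itself.

```typescript
/** Cosine similarity; for unit-length vectors this equals the dot product. */
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector lengths differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}

/** Returns document indices ordered from most to least similar to the query. */
function rankBySimilarity(
  query: number[],
  documents: number[][],
): { index: number; score: number }[] {
  return documents
    .map((vector, index) => ({ index, score: cosineSimilarity(query, vector) }))
    .sort((left, right) => right.score - left.score);
}

type EmbedFn = (texts: string[]) => Promise<number[][]>;

/** Hypothetical adapter with the embedQuery / embedDocuments method pair. */
class EngineEmbeddings {
  private readonly embed: EmbedFn;
  constructor(embed: EmbedFn) {
    this.embed = embed;
  }
  embedDocuments(documents: string[]): Promise<number[][]> {
    return this.embed(documents);
  }
  async embedQuery(query: string): Promise<number[]> {
    const vectors = await this.embed([query]);
    const first = vectors[0];
    if (first === undefined) throw new Error("no embedding returned");
    return first;
  }
}
```

In practice the embed function would call engine.embeddings.create and map the response to its data[i].embedding arrays; an instance of the adapter is then intended to be handed to a vector store such as MemoryVectorStore.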

Step 6: Build RAG Pipeline (Optional)

For retrieval-augmented generation, combine the embedding model with an LLM in a single engine. Use the embedding model to retrieve relevant documents from the vector store, format them into a context-augmented prompt, and send the prompt to the LLM for answer generation. LangChain.js RunnableSequence provides a composable API for this pipeline.

What happens:

A retriever is created from the vector store using asRetriever()
A prompt template combines retrieved context with the user's question
The chain retrieves documents, formats them, and produces the final prompt
The LLM generates an answer grounded in the retrieved context
Both models run entirely in the browser via the same engine
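
The same retrieve, format, and generate flow can also be written by hand, which makes the data path explicit. Note that this sketch swaps the LangChain.js RunnableSequence and vector store for plain functions and an in-memory dot-product search. RagEngineLike is an assumed structural type covering only the OpenAI-style fields used here, the prompt wording is illustrative, and the model-specific input formatting from Step 3 is omitted for brevity.

```typescript
// Assumed minimal engine shape: one embeddings endpoint, one chat endpoint.
interface RagEngineLike {
  embeddings: {
    create(request: { input: string[]; model: string }): Promise<{
      data: { embedding: number[] }[];
    }>;
  };
  chat: {
    completions: {
      create(request: {
        model: string;
        messages: { role: "system" | "user"; content: string }[];
      }): Promise<{ choices: { message: { content: string | null } }[] }>;
    };
  };
}

function dotProduct(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
}

/** Combines retrieved context with the user's question. */
function buildRagPrompt(contextDocs: string[], question: string): string {
  return [
    "Answer the question using only the context below.",
    "",
    "Context:",
    ...contextDocs.map((doc) => `- ${doc}`),
    "",
    `Question: ${question}`,
  ].join("\n");
}

async function answerWithContext(
  engine: RagEngineLike,
  modelIds: { embedding: string; llm: string },
  documents: string[],
  question: string,
  topK = 2,
): Promise<string> {
  // Embed documents and question together; the question is the last input.
  const embedded = await engine.embeddings.create({
    input: documents.concat(question),
    model: modelIds.embedding,
  });
  const vectors = embedded.data.map((item) => item.embedding);
  const queryVector = vectors[documents.length];
  if (queryVector === undefined) throw new Error("missing query embedding");

  // Rank by dot product (equal to cosine similarity for unit vectors).
  const retrieved = documents
    .map((text, i) => ({ text, score: dotProduct(queryVector, vectors[i] ?? []) }))
    .sort((left, right) => right.score - left.score)
    .slice(0, topK)
    .map((entry) => entry.text);

  const completion = await engine.chat.completions.create({
    model: modelIds.llm,
    messages: [{ role: "user", content: buildRagPrompt(retrieved, question) }],
  });
  return completion.choices[0]?.message.content ?? "";
}
```

For larger corpora, document embeddings would normally be computed once and stored rather than recomputed per question as this sketch does.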
