Workflow:Mlc ai Web llm Text Embeddings And RAG
| Knowledge Sources | |
|---|---|
| Domains | LLMs, WebGPU, Embeddings, RAG, Semantic_Search |
| Last Updated | 2026-02-14 22:00 GMT |
Overview
End-to-end process for generating text embeddings in the browser and using them for semantic search or retrieval-augmented generation (RAG) pipelines, entirely client-side.
Description
This workflow covers web-llm's embedding model support, which runs embedding inference directly in the browser using WebGPU. The engine loads a dedicated embedding model (such as Snowflake Arctic Embed) and exposes an OpenAI-compatible embeddings.create() API. Embeddings can be used standalone for semantic similarity scoring or combined with an LLM in a single engine for full RAG pipelines. The workflow also demonstrates integration with LangChain.js for vector store operations and retrieval chains, all running client-side without any server.
Usage
Execute this workflow when you need to compute text embeddings in the browser for semantic search, document similarity, clustering, or retrieval-augmented generation. This is particularly valuable for privacy-sensitive applications where documents should not leave the user's device, or for offline-capable applications that need semantic understanding without network access.
Execution Steps
Step 1: Select an Embedding Model
Choose an embedding model from the web-llm registry. Embedding models are distinct from chat models and are identified by their model ID (e.g., "snowflake-arctic-embed-m-q0f32-MLC-b4"). The batch size suffix (e.g., "-b4") indicates the maximum number of inputs that can be processed in a single forward pass. Larger batch sizes consume more memory but process multiple inputs faster.
Key considerations:
- Embedding models produce fixed-dimension vector representations of text
- The batch size suffix determines maximum simultaneous inputs per forward pass
- Inputs exceeding the batch size are automatically processed in multiple passes
- Larger batch sizes (e.g., b32) use more memory but are more efficient for bulk embedding
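The relationship between the batch size suffix and the number of forward passes can be sketched as follows. The helpers parseBatchSize and forwardPasses are hypothetical names introduced for illustration; they are not part of web-llm, and the sketch assumes the suffix always has the form "-b" followed by digits at the end of the model ID.

```typescript
// Hypothetical helpers: web-llm does not export these. They only illustrate
// how the "-b<N>" suffix relates to the number of forward passes.

/** Extracts the batch size from a model ID such as "...-MLC-b4"; null if absent. */
function parseBatchSize(modelId: string): number | null {
  const match = /-b(\d+)$/.exec(modelId);
  return match ? Number(match[1]) : null;
}

/** Number of forward passes needed to embed `inputCount` texts. */
function forwardPasses(inputCount: number, batchSize: number): number {
  if (batchSize <= 0) throw new Error("batchSize must be positive");
  return Math.ceil(inputCount / batchSize);
}

const modelId = "snowflake-arctic-embed-m-q0f32-MLC-b4";
const batchSize = parseBatchSize(modelId);
// Ten inputs with a batch size of 4 are split into batches of 4, 4, and 2.
const passes = batchSize === null ? null : forwardPasses(10, batchSize);
```

With a batch size of 4, ten inputs need three passes; a b32 variant would handle the same ten inputs in one pass at the cost of more memory.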
Step 2: Create the Engine
Initialize the MLCEngine with the embedding model using CreateMLCEngine. For RAG pipelines that need both embeddings and chat completion, pass an array of model IDs containing both the embedding model and an LLM. The engine handles loading and managing multiple models simultaneously.
Single model:
- CreateMLCEngine(embeddingModelId, config) for embeddings only
Multi-model (for RAG):
- CreateMLCEngine([embeddingModelId, llmModelId], config) loads both models
- With multiple models loaded, each request selects its target by passing that model's ID in the model field
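A minimal sketch of the two loading modes is shown below. To keep it self-contained, the engine factory is passed in as a parameter; in an application this would be CreateMLCEngine imported from "@mlc-ai/web-llm". The names createRagEngine, EngineFactory, and onProgress are assumptions made for this sketch, and the config type lists only the progress callback.

```typescript
// Structural stand-ins so the sketch type-checks without the package.
// In an application, pass the real factory instead:
//   import { CreateMLCEngine } from "@mlc-ai/web-llm";
type ProgressReport = { progress: number; timeElapsed: number; text: string };
type EngineConfig = { initProgressCallback?: (report: ProgressReport) => void };
type EngineFactory<E> = (
  modelId: string | string[],
  config?: EngineConfig,
) => Promise<E>;

/**
 * Loads the embedding model alone, or the embedding model plus an LLM when
 * `llmModelId` is given (the multi-model form used for RAG).
 */
async function createRagEngine<E>(
  createEngine: EngineFactory<E>,
  embeddingModelId: string,
  llmModelId?: string,
  onProgress?: (text: string) => void,
): Promise<E> {
  // A single ID loads one model; an array loads several into one engine.
  const modelIds = llmModelId ? [embeddingModelId, llmModelId] : embeddingModelId;
  return createEngine(modelIds, {
    initProgressCallback: (report) => onProgress?.(report.text),
  });
}
```

Model downloads can be large, so forwarding the progress text to the UI is usually worthwhile on first load.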
Step 3: Format Input Documents
Prepare the text inputs according to the embedding model's expected format. Some models require specific formatting with special tokens (e.g., "[CLS] text [SEP]") and query prefixes (e.g., "Represent this sentence for searching relevant passages:"). Consult the specific model's documentation for formatting requirements.
Key considerations:
- Document formatting varies by model (some need CLS/SEP tokens, some do not)
- Query inputs may need a task-specific prefix for asymmetric retrieval
- Batch all inputs into a single array for efficient processing
- The engine handles tokenization internally after receiving formatted strings
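The formatting described above can be sketched as two small helpers. This assumes a model that expects the "[CLS] text [SEP]" wrapping and the query prefix quoted earlier; formatDocument and formatQuery are hypothetical names, and the trailing space after the prefix is an assumption added to separate it from the query text. Other models may need different formatting or none.

```typescript
// Prefix and [CLS]/[SEP] wrapping as described above. Both are
// model-specific; check the model's documentation before reusing them.
const QUERY_PREFIX = "Represent this sentence for searching relevant passages: ";

/** Wraps a document in the special tokens; no prefix is added. */
function formatDocument(text: string): string {
  return `[CLS] ${text.trim()} [SEP]`;
}

/** Wraps a query in the special tokens and prepends the retrieval prefix. */
function formatQuery(text: string): string {
  return `[CLS] ${QUERY_PREFIX}${text.trim()} [SEP]`;
}

// Batch everything into arrays so the engine can process inputs together.
const documents = ["The Data Cloud!", "Mexico City of Course!"].map((t) => formatDocument(t));
const queries = ["what is snowflake?"].map((t) => formatQuery(t));
```

Only queries receive the prefix, which is what makes the retrieval asymmetric: documents and queries are embedded with different framing.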
Step 4: Generate Embeddings
Call engine.embeddings.create() with the input texts. The API follows the OpenAI embeddings format: it accepts a string or an array of strings and returns one embedding vector per input. The response includes the embedding vectors, the model name, and usage statistics (token counts for the processed inputs).
What happens:
- Inputs are tokenized and batched according to the model's max batch size
- The embedding model performs forward inference via WebGPU
- Mean pooling is applied across token positions to produce fixed-size vectors
- L2 normalization is applied to produce unit vectors
- The response contains data[i].embedding arrays for each input
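A minimal sketch of the call and of reading the response follows. The interfaces are structural stand-ins that mirror the OpenAI-style shape described above so the example is self-contained; they are not web-llm's own type definitions, and embedTexts is a hypothetical helper name.

```typescript
// Minimal structural types mirroring the OpenAI-style request and response.
interface EmbeddingItem {
  index: number;
  embedding: number[];
}
interface EmbeddingResponse {
  data: EmbeddingItem[];
  model: string;
  usage: { prompt_tokens: number; total_tokens: number };
}
interface EmbeddingEngineLike {
  embeddings: {
    create(request: { input: string | string[]; model?: string }): Promise<EmbeddingResponse>;
  };
}

/** Embeds `texts` and returns one vector per input, in input order. */
async function embedTexts(
  engine: EmbeddingEngineLike,
  texts: string[],
  model?: string,
): Promise<number[][]> {
  if (texts.length === 0) return [];
  // Pass `model` only when given; it is needed when several models are loaded.
  const request = model === undefined ? { input: texts } : { input: texts, model };
  const reply = await engine.embeddings.create(request);
  // Sort by index defensively so vectors line up with the inputs.
  return [...reply.data]
    .sort((a, b) => a.index - b.index)
    .map((item) => item.embedding);
}
```

The caller does not split inputs by batch size here, since the engine is described as handling that internally.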
Step 5: Compute Similarity or Build Vector Store
Use the embedding vectors for downstream tasks. For semantic search, compute cosine similarity between query and document embeddings. For integration with LangChain.js, wrap the engine in an EmbeddingsInterface adapter and use it with MemoryVectorStore or other vector store implementations.
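For the semantic search case, a minimal sketch of cosine similarity and ranking is given below. The function names cosineSimilarity and rankBySimilarity are illustrative. The similarity function normalizes explicitly, so it also works on vectors that are not unit length.

```typescript
/** Cosine similarity; for unit-length vectors this equals the dot product. */
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  // Treat a zero vector as dissimilar to everything instead of returning NaN.
  return denominator === 0 ? 0 : dot / denominator;
}

/** Returns document indices with their scores, best match first. */
function rankBySimilarity(
  query: number[],
  documentVectors: number[][],
  topK: number = documentVectors.length,
): { index: number; score: number }[] {
  return documentVectors
    .map((vector, index) => ({ index, score: cosineSimilarity(query, vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

The returned indices refer back to the original document array, so the caller can look up the matching text for display or for building a prompt.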
Key considerations:
- Because the embedding vectors are L2-normalized (Step 4), cosine similarity between two of them reduces to a plain dot product
- LangChain.js integration requires implementing the EmbeddingsInterface (embedQuery and embedDocuments methods)
- MemoryVectorStore provides in-memory similarity search with no external dependencies
- Similarity results can be used for document retrieval, ranking, or classification
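The adapter mentioned above can be sketched as a small class with the two required methods. To stay self-contained, the sketch does not import LangChain.js; in an application the class would be declared as implementing EmbeddingsInterface from "@langchain/core/embeddings". The engine type is a structural stand-in, and the two formatter arguments are hypothetical hooks for the model-specific formatting from Step 3.

```typescript
// Minimal stand-in for the engine; not web-llm's own type definitions.
interface EmbeddingsBackend {
  embeddings: {
    create(request: { input: string[]; model: string }): Promise<{
      data: { index: number; embedding: number[] }[];
    }>;
  };
}

type Formatter = (text: string) => string;

/** Exposes embedQuery and embedDocuments on top of engine.embeddings.create(). */
class WebLLMEmbeddings {
  private readonly engine: EmbeddingsBackend;
  private readonly modelId: string;
  private readonly formatQuery: Formatter;
  private readonly formatDocument: Formatter;

  constructor(
    engine: EmbeddingsBackend,
    modelId: string,
    formatQuery: Formatter = (t) => t,
    formatDocument: Formatter = (t) => t,
  ) {
    this.engine = engine;
    this.modelId = modelId;
    this.formatQuery = formatQuery;
    this.formatDocument = formatDocument;
  }

  private async embed(texts: string[]): Promise<number[][]> {
    const reply = await this.engine.embeddings.create({ input: texts, model: this.modelId });
    // Keep vectors aligned with their inputs.
    return [...reply.data].sort((a, b) => a.index - b.index).map((d) => d.embedding);
  }

  async embedQuery(text: string): Promise<number[]> {
    const [vector] = await this.embed([this.formatQuery(text)]);
    if (!vector) throw new Error("no embedding returned");
    return vector;
  }

  async embedDocuments(texts: string[]): Promise<number[][]> {
    if (texts.length === 0) return [];
    return this.embed(texts.map((t) => this.formatDocument(t)));
  }
}
```

An instance can then be handed to a vector store, for example as the embeddings argument of MemoryVectorStore.fromTexts(texts, metadatas, embeddings); the import path for MemoryVectorStore depends on the LangChain.js version in use.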
Step 6: Build RAG Pipeline (Optional)
For retrieval-augmented generation, combine the embedding model with an LLM in a single engine. Use the embedding model to retrieve relevant documents from the vector store, format them into a context-augmented prompt, and send the prompt to the LLM for answer generation. LangChain.js RunnableSequence provides a composable API for this pipeline.
What happens:
- A retriever is created from the vector store using asRetriever()
- A prompt template combines retrieved context with the user's question
- The chain retrieves documents, formats them, and produces the final prompt
- The LLM generates an answer grounded in the retrieved context
- Both models run entirely in the browser via the same engine
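The retrieve, format, and generate flow can also be sketched without LangChain.js, which makes the individual steps explicit. This is an illustration under stated assumptions, not the workflow's LangChain-based implementation: RagEngineLike is a structural stand-in for the engine's OpenAI-style endpoints, answerWithRag and buildRagPrompt are hypothetical names, the prompt wording is invented for the example, and documents are embedded on every call instead of being stored in a vector store.

```typescript
// Structural stand-ins for the engine's OpenAI-style endpoints.
interface RagEngineLike {
  embeddings: {
    create(request: { input: string[]; model: string }): Promise<{
      data: { index: number; embedding: number[] }[];
    }>;
  };
  chat: {
    completions: {
      create(request: {
        messages: { role: "system" | "user"; content: string }[];
        model: string;
      }): Promise<{ choices: { message: { content: string | null } }[] }>;
    };
  };
}

interface RagOptions {
  question: string;
  documents: string[];
  embeddingModelId: string;
  llmModelId: string;
  topK?: number;
}

/** Combines retrieved passages and the question into one prompt. */
function buildRagPrompt(context: string[], question: string): string {
  return [
    "Answer the question based only on the following context:",
    ...context,
    "",
    `Question: ${question}`,
  ].join("\n");
}

async function answerWithRag(engine: RagEngineLike, options: RagOptions): Promise<string> {
  const { question, documents, embeddingModelId, llmModelId, topK = 2 } = options;
  // Model-specific input formatting (Step 3) is omitted for brevity.
  const reply = await engine.embeddings.create({
    input: [question, ...documents],
    model: embeddingModelId,
  });
  const vectors = [...reply.data].sort((a, b) => a.index - b.index).map((d) => d.embedding);
  const [queryVector, ...documentVectors] = vectors;
  if (!queryVector) throw new Error("no embeddings returned");
  // Dot product equals cosine similarity when the vectors are unit length.
  const dot = (a: number[], b: number[]) => a.reduce((sum, v, i) => sum + v * b[i], 0);
  const retrieved = documentVectors
    .map((vector, index) => ({ index, score: dot(queryVector, vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((hit) => documents[hit.index]);
  // Both requests go to the same engine; the model field selects the model.
  const completion = await engine.chat.completions.create({
    messages: [{ role: "user", content: buildRagPrompt(retrieved, question) }],
    model: llmModelId,
  });
  return completion.choices[0]?.message.content ?? "";
}
```

In the LangChain.js version, the retriever returned by asRetriever() replaces the manual ranking and a prompt template replaces buildRagPrompt, but the data flow is the same.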