Principle: mlc-ai web-llm RAG Pipeline

From Leeroopedia

Overview

A RAG (Retrieval-Augmented Generation) pipeline combines text retrieval with language model generation to produce answers grounded in relevant source documents. In web-llm, this pattern runs entirely in the browser: a single engine loads both an embedding model and a generative LLM, using the embedding model for document retrieval and the LLM for answer synthesis.

Description

Retrieval-Augmented Generation addresses a fundamental limitation of standalone LLMs: their knowledge is fixed at training time and they may hallucinate when asked about unfamiliar topics. By retrieving relevant passages from a document corpus before generation, RAG grounds the LLM's response in actual source material.

The Five-Stage RAG Pipeline in web-llm

Stage 1
Multi-Model Loading
Load both an embedding model and a generative LLM into a single MLCEngine. The engine supports loading multiple models via CreateMLCEngine([embeddingModelId, llmModelId]). Models are loaded sequentially and each gets its own pipeline, configuration, and concurrency lock.
Stage 2
Document Indexing
Generate embeddings for all documents in the knowledge base using the embedding model. Store the resulting vectors in a vector store (e.g., LangChain MemoryVectorStore or a custom in-memory index). This step is done once per corpus and can be performed incrementally.
Stage 3
Query Embedding
When a user asks a question, generate an embedding for the query using the same embedding model (with appropriate query prefix formatting).
Stage 4
Retrieval
Perform similarity search to find the top-k most relevant documents. The query embedding is compared against all document embeddings using cosine similarity (dot product for L2-normalized vectors).
Stage 5
Augmented Generation
Construct a prompt that includes the retrieved documents as context, followed by the user's question. Pass this augmented prompt to the LLM via engine.chat.completions.create(), specifying the LLM's model ID. The LLM generates a response grounded in the retrieved context.
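The similarity math behind Stages 3 and 4 can be sketched in a few lines of TypeScript. The helper names (`l2Normalize`, `cosineSimilarity`) are illustrative, not part of the web-llm API; the point is that cosine similarity reduces to a plain dot product once vectors are L2-normalized:

```typescript
// Normalize a vector to unit length so that dot product equals cosine similarity.
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return v.map((x) => x / norm);
}

// Plain dot product of two equal-length vectors.
function dot(a: number[], b: number[]): number {
  return a.reduce((s, x, i) => s + x * b[i], 0);
}

// Cosine similarity: dot product of the normalized vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  return dot(l2Normalize(a), l2Normalize(b));
}
```

Embedding models typically emit already-normalized vectors, in which case the retrieval step can skip normalization and compare raw dot products directly.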

Key Architectural Properties

Fully Client-Side
The entire RAG pipeline runs in the browser. No server is required for inference, retrieval, or generation. This provides strong privacy guarantees since no data leaves the user's device.
Multi-Model Engine
web-llm's MLCEngine supports loading multiple models simultaneously. Each model is assigned its own pipeline type (EmbeddingPipeline or LLMChatPipeline) based on its model_type. API calls specify which model to use via the model parameter.
OpenAI-Compatible API
Both the embedding and chat completion APIs follow OpenAI's interface conventions, making it straightforward to port RAG pipelines from server-side OpenAI code to in-browser web-llm code.
Memory Constraints
Running both an embedding model and an LLM simultaneously requires sufficient GPU VRAM. The total VRAM requirement is approximately the sum of both models' vram_required_MB values. Typical combinations:
  • snowflake-arctic-embed-s-q0f32-MLC-b4 (239 MB) + Llama-3.2-1B-Instruct-q4f32_1-MLC (1129 MB) ≈ 1.4 GB
  • snowflake-arctic-embed-m-q0f32-MLC-b4 (539 MB) + gemma-2-2b-it-q4f32_1-MLC-1k (1751 MB) ≈ 2.3 GB
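As a sanity check on the arithmetic above, the additive estimate can be expressed as a small helper. The helper name and interface are hypothetical (web-llm's model records expose the per-model figure as `vram_required_MB`):

```typescript
// Hypothetical record shape for a model's VRAM requirement.
interface ModelVram {
  modelId: string;
  vramRequiredMB: number;
}

// Estimate total VRAM needed to co-load several models: the sum of
// each model's individual requirement, per the rule of thumb above.
function totalVramMB(models: ModelVram[]): number {
  return models.reduce((sum, m) => sum + m.vramRequiredMB, 0);
}
```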

Theoretical Basis

Retrieval-Augmented Generation Framework

RAG was introduced by Lewis et al. (2020) as a method to combine parametric knowledge (in the LLM's weights) with non-parametric knowledge (in a retrieval index). The generative model is conditioned on retrieved documents:

P(answer | question) = P(answer | question, retrieved_docs)
                     ≈ P_LLM(answer | prompt(question, top_k_docs))

where prompt(question, top_k_docs) constructs a text prompt containing the retrieved context.

Prompt Construction

The standard RAG prompt template places retrieved context before the question:

Answer the question based only on the following context:
{context}

Question: {question}

The instruction "based only on the following context" discourages the LLM from relying on its parametric knowledge, reducing hallucination for questions covered by the retrieved documents.
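The template maps directly to a small string-building function. This is a sketch; the function name is illustrative:

```typescript
// Build the standard RAG prompt: retrieved context first, then the question.
function buildRagPrompt(context: string, question: string): string {
  return (
    `Answer the question based only on the following context:\n` +
    `${context}\n\nQuestion: ${question}`
  );
}
```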

Retrieval Quality Impact

The quality of the final answer depends critically on retrieval quality:

  • High recall ensures the relevant passage is included in the context
  • High precision avoids diluting the context with irrelevant passages
  • The parameter k (number of retrieved documents) trades off recall against context length: larger k increases recall but may exceed the LLM's context window or dilute attention
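The role of k is easiest to see in the selection step itself. The sketch below ranks documents by dot-product score and keeps the k best; the helper name is hypothetical, and embeddings are assumed pre-normalized so dot product stands in for cosine similarity:

```typescript
// Score every document against the query vector and return the k highest-scoring texts.
function topKByDotProduct(
  queryVec: number[],
  docVecs: number[][],
  docs: string[],
  k: number,
): string[] {
  const scored = docs.map((doc, i) => ({
    doc,
    score: docVecs[i].reduce((s, x, j) => s + x * queryVec[j], 0),
  }));
  scored.sort((a, b) => b.score - a.score); // highest similarity first
  return scored.slice(0, k).map((s) => s.doc);
}
```

Raising k here admits lower-scoring (and likely less relevant) documents into the context, which is exactly the recall-versus-precision trade-off described above.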

I/O Contract

Input:

  • A corpus of document texts to be indexed
  • A user question/query string
  • Model IDs for the embedding model and the LLM

Output:

  • A generated text answer grounded in the retrieved documents

Intermediate Artifacts:

  • Document embeddings stored in a vector store
  • Query embedding (a single vector)
  • Retrieved top-k documents (as text)
  • An augmented prompt combining context and question

Constraints:

  • Both models must be loadable in available GPU VRAM
  • The augmented prompt (context + question) must fit within the LLM's context window
  • The embedding model and LLM must be specified by their respective model_id values when calling APIs
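The context-window constraint can be guarded with a rough pre-flight check before calling the LLM. The ~4-characters-per-token ratio used here is a common rule of thumb for English text, not a web-llm guarantee; a real implementation would use the model's tokenizer:

```typescript
// Rough guard that an augmented prompt fits within the LLM's context window.
// Assumes ~4 characters per token for English text (heuristic, not exact).
function fitsContextWindow(prompt: string, contextWindowTokens: number): boolean {
  const estimatedTokens = Math.ceil(prompt.length / 4);
  return estimatedTokens <= contextWindowTokens;
}
```

If the check fails, typical remedies are lowering k, chunking documents more finely, or choosing a model variant with a larger context window.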

Usage Examples

import * as webllm from "@mlc-ai/web-llm";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import type { EmbeddingsInterface } from "@langchain/core/embeddings";

// LangChain adapter for web-llm embeddings
class WebLLMEmbeddings implements EmbeddingsInterface {
  engine: webllm.MLCEngineInterface;
  modelId: string;
  constructor(engine: webllm.MLCEngineInterface, modelId: string) {
    this.engine = engine;
    this.modelId = modelId;
  }

  async embedQuery(text: string): Promise<number[]> {
    const reply = await this.engine.embeddings.create({
      input: [text],
      model: this.modelId,
    });
    return reply.data[0].embedding;
  }

  async embedDocuments(texts: string[]): Promise<number[][]> {
    const reply = await this.engine.embeddings.create({
      input: texts,
      model: this.modelId,
    });
    return reply.data.map((d) => d.embedding);
  }
}

// Stage 1: Load both models
const embeddingModelId = "snowflake-arctic-embed-m-q0f32-MLC-b4";
const llmModelId = "gemma-2-2b-it-q4f32_1-MLC-1k";
const engine = await webllm.CreateMLCEngine(
  [embeddingModelId, llmModelId],
  {
    initProgressCallback: (report) => console.log(report.text),
    logLevel: "INFO",
  },
);

// Stage 2: Index documents
const vectorStore = await MemoryVectorStore.fromTexts(
  ["mitochondria is the powerhouse of the cell"],
  [{ id: 1 }],
  new WebLLMEmbeddings(engine, embeddingModelId),
);

// Stage 3 & 4: Query and retrieve
const retriever = vectorStore.asRetriever();
const relevantDocs = await retriever.invoke(
  "What is the powerhouse of the cell?",
);
const context = relevantDocs.map((d) => d.pageContent).join("\n");

// Stage 5: Generate answer with retrieved context
const prompt = `Answer the question based only on the following context:
${context}

Question: What is the powerhouse of the cell?`;

const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: prompt }],
  model: llmModelId,
});

console.log(reply.choices[0].message.content);
// Expected: "The powerhouse of the cell is the mitochondria."

The second example below is a separate, standalone program that implements the same pipeline without the LangChain dependency:

import * as webllm from "@mlc-ai/web-llm";

// Minimal RAG without LangChain dependency
const embModelId = "snowflake-arctic-embed-s-q0f32-MLC-b4";
const llmId = "Llama-3.2-1B-Instruct-q4f32_1-MLC";
const engine = await webllm.CreateMLCEngine([embModelId, llmId]);

const QUERY_PREFIX =
  "Represent this sentence for searching relevant passages: ";

// Build a simple document index
const documents = [
  "WebGPU provides low-level GPU access from the browser.",
  "WebAssembly enables near-native code execution in browsers.",
  "The Fetch API provides a modern interface for making HTTP requests.",
];
const formattedDocs = documents.map((d) => `[CLS] ${d} [SEP]`);
const docEmb = await engine.embeddings.create({
  input: formattedDocs,
  model: embModelId,
});

// Compute dot product similarity
function dotProduct(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Retrieve top-k documents for a query
async function retrieve(query: string, k: number): Promise<string[]> {
  const formatted = `[CLS] ${QUERY_PREFIX}${query} [SEP]`;
  const qEmb = await engine.embeddings.create({
    input: formatted,
    model: embModelId,
  });
  const qVec = qEmb.data[0].embedding;

  const scores = documents.map((doc, i) => ({
    doc,
    score: dotProduct(qVec, docEmb.data[i].embedding),
  }));
  scores.sort((a, b) => b.score - a.score);
  return scores.slice(0, k).map((s) => s.doc);
}

// Run RAG
const question = "How can I access the GPU from a web browser?";
const topDocs = await retrieve(question, 2);
const context = topDocs.join("\n- ");

const answer = await engine.chat.completions.create({
  messages: [
    {
      role: "user",
      content: `Based on the following information:\n- ${context}\n\nAnswer: ${question}`,
    },
  ],
  model: llmId,
});
console.log(answer.choices[0].message.content);
