
Principle: mlc-ai web-llm Embedding Input Formatting

From Leeroopedia


Overview

Embedding Input Formatting is the pattern of preparing text inputs with model-specific prefixes and special tokens to optimize embedding quality for different retrieval tasks. Different embedding models require different formatting conventions, and within a single model, query inputs and document inputs may require asymmetric formatting.

Description

Embedding models are trained with specific input formatting conventions that must be followed at inference time to achieve optimal retrieval quality. Deviation from these conventions degrades the semantic quality of the produced vectors and reduces retrieval accuracy.

Asymmetric Retrieval Formatting

In information retrieval, queries and documents occupy different roles:

  • Queries are short, question-like inputs that express an information need
  • Documents (or passages) are longer chunks of text that potentially contain the answer

Many modern embedding models are trained with an asymmetric dual-encoder architecture, where queries and documents are embedded using different input prefixes. This allows the model to learn distinct representations for the "seeking" and "providing" roles, improving retrieval quality.

Snowflake Arctic Embed Formatting

The Snowflake Arctic Embed models (the primary embedding models in web-llm) follow these formatting conventions:

  • Query inputs: must be prefixed with "Represent this sentence for searching relevant passages: "
  • Document inputs: require no prefix; the raw text is used directly
  • Special tokens: both queries and documents should be wrapped with [CLS] and [SEP] tokens: "[CLS] {text} [SEP]"

These conventions are derived from the Snowflake Arctic Embed model training procedure and are demonstrated in the official web-llm embeddings example.

Why Formatting Matters

Incorrect formatting leads to measurable degradation in retrieval metrics. Specifically:

  • Omitting the query prefix causes query embeddings to be positioned in the same region of the vector space as documents, reducing the model's ability to distinguish between the two roles
  • Applying the query prefix to documents shifts document embeddings away from their natural positions, degrading document-to-document similarity
  • Omitting special tokens may cause the model to produce suboptimal CLS token representations, since the model expects the CLS token to be present for sentence-level pooling
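Because these failure modes are silent (the embedding call succeeds either way), a lightweight pre-flight check can help. The helpers below are our own illustration, not part of web-llm; they simply verify that a string matches the Snowflake Arctic Embed conventions described above before it is sent for embedding:

```typescript
// Hypothetical sanity-check helpers for Snowflake Arctic Embed formatting.
const QUERY_PREFIX = "Represent this sentence for searching relevant passages: ";

function isFormattedQuery(text: string): boolean {
  // A well-formed query is wrapped in [CLS]/[SEP] and carries the query prefix.
  return text.startsWith(`[CLS] ${QUERY_PREFIX}`) && text.endsWith(" [SEP]");
}

function isFormattedDocument(text: string): boolean {
  // A well-formed document is wrapped in [CLS]/[SEP] with no query prefix.
  return (
    text.startsWith("[CLS] ") &&
    text.endsWith(" [SEP]") &&
    !text.includes(QUERY_PREFIX)
  );
}
```

Running such checks in development catches the two most common mistakes: a query missing its prefix, and a document that accidentally received one.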

Theoretical Basis

Dual-Encoder Training

Asymmetric embedding models are trained with a contrastive learning objective:

loss = -log(exp(sim(q_embed, d_pos_embed) / tau) /
            sum_d(exp(sim(q_embed, d_embed) / tau)))

where:

  • q_embed is the query embedding (with query prefix applied)
  • d_pos_embed is the positive document embedding (without prefix)
  • the denominator sum ranges over d_pos_embed and all negative document embeddings d_neg_embed
  • tau is a temperature parameter
  • sim is typically dot product or cosine similarity

The training process teaches the model to place queries near their relevant documents in vector space, conditioned on the presence of the query prefix. Removing the prefix at inference time disrupts this learned mapping.
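The objective above can be sketched numerically. The function below is illustrative only (names are ours, not from any training codebase); it computes the standard InfoNCE loss for one query against a positive document and a set of negatives, with the positive included in the denominator:

```typescript
// Dot product of two equal-length vectors.
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

// Illustrative InfoNCE contrastive loss for one (query, positive, negatives) triple.
function infoNCELoss(
  qEmbed: number[],
  dPosEmbed: number[],
  dNegEmbeds: number[][],
  tau = 0.05,
): number {
  const posScore = Math.exp(dot(qEmbed, dPosEmbed) / tau);
  const negScores = dNegEmbeds.map((d) => Math.exp(dot(qEmbed, d) / tau));
  // The denominator includes the positive as well as all negatives.
  const denom = posScore + negScores.reduce((sum, v) => sum + v, 0);
  return -Math.log(posScore / denom);
}
```

The loss is minimized when the query is much more similar to its positive document than to any negative, which is exactly the geometry the prefix conditions.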

CLS Token Semantics

The [CLS] token (token ID 101 in most BERT-based tokenizers) is a special token whose final hidden state is designed to capture the aggregate meaning of the entire input sequence. The [SEP] token (token ID 102) marks the end of the input. The Snowflake Arctic Embed model produces its embedding from the CLS token position (index 0), making the presence of this token structurally important.
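CLS pooling itself is simple. The sketch below assumes the model exposes per-token final hidden states as an array of vectors (web-llm handles this internally; the helper is purely illustrative) and shows why the token at position 0 matters:

```typescript
// Illustrative CLS pooling: the sequence embedding is the final hidden
// state at position 0 (the [CLS] token), L2-normalized.
function clsPool(hiddenStates: number[][]): number[] {
  const cls = hiddenStates[0]; // [CLS] sits at index 0
  const norm = Math.sqrt(cls.reduce((sum, v) => sum + v * v, 0));
  return cls.map((v) => v / norm);
}
```

If the input is not wrapped with [CLS], position 0 holds an ordinary content token instead, and the pooled vector no longer reflects the aggregate sentence representation the model was trained to produce there.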

I/O Contract

Input (Query):

  • Raw query text: "what is snowflake?"
  • Formatted query: "[CLS] Represent this sentence for searching relevant passages: what is snowflake? [SEP]"

Input (Document):

  • Raw document text: "The Data Cloud!"
  • Formatted document: "[CLS] The Data Cloud! [SEP]"

Output:

  • The formatted strings are passed to engine.embeddings.create() for vectorization

Constraints:

  • The query prefix is specific to Snowflake Arctic Embed models; other embedding models may require different prefixes or no prefix at all
  • The total formatted input (including prefix and special tokens) must not exceed the model's context window size (512 tokens)
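A rough pre-flight guard can enforce the second constraint before embedding. The heuristic below approximates token count by whitespace-separated words with a safety margin; an exact check would need the model's own BERT tokenizer, so this is a conservative sketch, not the real subword count:

```typescript
// Rough pre-flight length check against the 512-token context window.
// Word count only approximates the BERT subword count, so a margin is applied.
const MAX_TOKENS = 512;

function fitsContextWindow(formatted: string, margin = 0.75): boolean {
  const approxTokens = formatted.trim().split(/\s+/).length;
  // Treat each word as potentially expanding into multiple subword tokens.
  return approxTokens <= MAX_TOKENS * margin;
}
```

Inputs that fail this check should be chunked before formatting, since silent truncation by the tokenizer would drop content from the embedding.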

Usage Examples

import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine("snowflake-arctic-embed-m-q0f32-MLC-b4");

// Define the query prefix for Snowflake Arctic Embed
const QUERY_PREFIX = "Represent this sentence for searching relevant passages: ";

// Format queries with prefix and special tokens
function formatQuery(text: string): string {
  return `[CLS] ${QUERY_PREFIX}${text} [SEP]`;
}

// Format documents with special tokens only (no prefix)
function formatDocument(text: string): string {
  return `[CLS] ${text} [SEP]`;
}

// Prepare a batch of documents
const rawDocuments = [
  "The Data Cloud!",
  "Mexico City of Course!",
  "WebGPU enables in-browser ML inference.",
];
const formattedDocuments = rawDocuments.map(formatDocument);

// Prepare queries
const rawQueries = [
  "what is snowflake?",
  "Where can I get the best tacos?",
];
const formattedQueries = rawQueries.map(formatQuery);

// Embed documents and queries separately
const docEmbeddings = await engine.embeddings.create({
  input: formattedDocuments,
});
const queryEmbeddings = await engine.embeddings.create({
  input: formattedQueries,
});

console.log("Document embeddings:", docEmbeddings.data.length);  // 3
console.log("Query embeddings:", queryEmbeddings.data.length);    // 2

import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine("snowflake-arctic-embed-m-q0f32-MLC-b4");

// Helper class that encapsulates formatting logic for Snowflake models
class SnowflakeFormatter {
  private static readonly QUERY_PREFIX =
    "Represent this sentence for searching relevant passages: ";

  static formatForRetrieval(
    texts: string[],
    role: "query" | "document",
  ): string[] {
    return texts.map((text) => {
      if (role === "query") {
        return `[CLS] ${SnowflakeFormatter.QUERY_PREFIX}${text} [SEP]`;
      } else {
        return `[CLS] ${text} [SEP]`;
      }
    });
  }

  // For symmetric tasks (e.g., sentence similarity), use document formatting
  // for all inputs since there is no query/document distinction
  static formatForSimilarity(texts: string[]): string[] {
    return texts.map((text) => `[CLS] ${text} [SEP]`);
  }
}

// Asymmetric retrieval usage
const queries = SnowflakeFormatter.formatForRetrieval(
  ["What is WebGPU?"],
  "query",
);
const docs = SnowflakeFormatter.formatForRetrieval(
  ["WebGPU is a modern graphics API for the web."],
  "document",
);

const qEmb = await engine.embeddings.create({ input: queries });
const dEmb = await engine.embeddings.create({ input: docs });

// Compute dot product similarity
const dotProduct = qEmb.data[0].embedding.reduce(
  (sum, val, i) => sum + val * dEmb.data[0].embedding[i],
  0,
);
console.log("Similarity:", dotProduct);
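The dot product above equals cosine similarity only when both embeddings are L2-normalized. If normalization is not guaranteed, an explicit cosine helper (ours, not part of the web-llm API) is the safer comparison:

```typescript
// Cosine similarity between two equal-length vectors: normalizes both
// magnitudes, so the result is in [-1, 1] regardless of vector scale.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

For example, `cosineSimilarity(qEmb.data[0].embedding, dEmb.data[0].embedding)` ranks documents identically to the dot product when embeddings are unit-length, but stays correct when they are not.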
