
Principle:Mlc ai Web llm Text Embedding Generation

From Leeroopedia

Overview

Text Embedding Generation is the process of encoding text sequences into fixed-dimensional dense vector representations using a neural embedding model. In web-llm, this is accomplished through the EmbeddingPipeline, which handles tokenization, batched GPU inference, and result extraction to produce vectors suitable for semantic similarity computation.

Description

Text embedding generation in web-llm follows a multi-stage pipeline that transforms raw text into dense vectors:

Stage 1: Tokenization and Input Normalization

The EmbeddingPipeline.embedStep() method accepts four input types:

  • A single string
  • An array of strings (batch of texts)
  • An array of numbers (pre-tokenized single input)
  • An array of arrays of numbers (pre-tokenized batch)

All inputs are converted to a uniform Array<Array<number>> representation, where each inner array is a sequence of token IDs produced by the model's tokenizer.
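The normalization step above can be sketched as follows. This is a minimal illustration, not the actual web-llm implementation; in particular, `tokenize` is a placeholder standing in for the model's real tokenizer.

```typescript
// The four accepted input shapes, normalized to Array<Array<number>>.
type EmbedInput = string | string[] | number[] | number[][];

// Placeholder tokenizer: a real pipeline calls the model's tokenizer here.
function tokenize(text: string): number[] {
  return Array.from(text).map((ch) => ch.charCodeAt(0));
}

function normalizeInput(input: EmbedInput): number[][] {
  if (typeof input === "string") {
    return [tokenize(input)]; // single string -> batch of one
  }
  if (input.length > 0 && typeof input[0] === "string") {
    return (input as string[]).map(tokenize); // batch of strings
  }
  if (input.length > 0 && Array.isArray(input[0])) {
    return input as number[][]; // pre-tokenized batch, already uniform
  }
  // Pre-tokenized single input (an empty array also lands here and is
  // rejected later by the validation stage).
  return [input as number[]];
}
```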

Stage 2: Validation

Each tokenized input is checked against the model's context window size (typically 512 tokens for Arctic Embed models). Inputs that exceed this limit trigger an EmbeddingExceedContextWindowSizeError. Empty inputs trigger an EmbeddingInputEmptyError.
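The validation stage can be sketched like this. The error class names follow the ones this page describes, but the constructors and messages here are simplified stand-ins for the real web-llm classes.

```typescript
class EmbeddingInputEmptyError extends Error {}
class EmbeddingExceedContextWindowSizeError extends Error {}

// Reject empty sequences and sequences longer than the context window
// (512 tokens for the Arctic Embed models).
function validateTokenized(inputs: number[][], contextWindowSize = 512): void {
  for (const tokens of inputs) {
    if (tokens.length === 0) {
      throw new EmbeddingInputEmptyError("Embedding input must not be empty.");
    }
    if (tokens.length > contextWindowSize) {
      throw new EmbeddingExceedContextWindowSizeError(
        `Input of ${tokens.length} tokens exceeds the ` +
          `${contextWindowSize}-token context window.`
      );
    }
  }
}
```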

Stage 3: Batched Inference

The tokenized inputs are split into sub-batches based on the model's compiled maxBatchSize. For each sub-batch:

  1. Padding: Shorter sequences are padded with zeros to match the longest sequence in the batch. An attention mask is constructed: 1 for real tokens, 0 for padding.
  2. GPU Transfer: The padded input and attention mask are transferred to GPU as 2D int32 NDArrays of shape [batchSize, maxInputSize].
  3. Forward Pass: The model's prefill function is called with the input tensor, attention mask, and model parameters, producing logits of shape [batchSize, maxInputSize, hidden_size].
  4. Result Extraction: For each input in the batch, only the first token's output vector (position [i, 0, :]) is extracted as the embedding. This is the [CLS] token representation, which in the Snowflake Arctic Embed architecture serves as the pooled sentence representation.
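The sub-batch splitting and padding steps (1 above) can be sketched on the CPU side; the GPU transfer and forward pass are omitted. This is an illustration of the technique, not web-llm's actual code.

```typescript
// Split tokenized inputs into sub-batches of at most maxBatchSize.
function splitIntoSubBatches(inputs: number[][], maxBatchSize: number): number[][][] {
  const out: number[][][] = [];
  for (let i = 0; i < inputs.length; i += maxBatchSize) {
    out.push(inputs.slice(i, i + maxBatchSize));
  }
  return out;
}

// Pad each sequence with zeros to the longest length in the sub-batch and
// build the parallel attention mask (1 = real token, 0 = padding).
function padBatch(batch: number[][]): { input: number[][]; mask: number[][] } {
  const maxLen = Math.max(...batch.map((seq) => seq.length));
  const input = batch.map((seq) =>
    seq.concat(new Array(maxLen - seq.length).fill(0))
  );
  const mask = batch.map((seq) =>
    new Array(seq.length).fill(1).concat(new Array(maxLen - seq.length).fill(0))
  );
  return { input, mask };
}
```

With a model compiled at maxBatchSize 4, ten inputs yield three sub-batches of sizes 4, 4, and 2, matching the batching behavior shown in the usage examples below.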

Stage 4: Response Assembly

The engine wraps the raw embedding vectors into an OpenAI-compatible CreateEmbeddingResponse, including usage statistics (token counts and prefill throughput).
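A sketch of the assembly step: the field names follow the OpenAI embeddings response schema, and the `extra.prefill_tokens_per_s` field matches the usage output shown in the examples below; the function itself is illustrative, not web-llm's internal code.

```typescript
interface EmbeddingDatum {
  object: "embedding";
  index: number;
  embedding: number[];
}

// Wrap raw vectors in an OpenAI-compatible response with usage statistics.
function assembleResponse(
  vectors: number[][],
  model: string,
  promptTokens: number,
  prefillTokensPerSec: number
) {
  return {
    object: "list" as const,
    model,
    data: vectors.map(
      (embedding, index): EmbeddingDatum => ({ object: "embedding", index, embedding })
    ),
    usage: {
      prompt_tokens: promptTokens,
      total_tokens: promptTokens, // embedding models consume no completion tokens
      extra: { prefill_tokens_per_s: prefillTokensPerSec },
    },
  };
}
```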

Theoretical Basis

Transformer Encoding

The embedding model passes each input sequence through a transformer encoder stack. Every token receives a contextualized representation that incorporates information from all other (non-padded) tokens via self-attention. The attention mask ensures padding tokens do not contribute to the representations of real tokens.

CLS Token Pooling

The Snowflake Arctic Embed models use the first token position (the [CLS] token) as the aggregate sentence representation. The EmbeddingPipeline extracts the hidden state at position 0 for each input:

embedding_i = hidden_states[i, 0, :]   // shape: [hidden_size]

This is a design choice of the Snowflake models. Other embedding architectures may use mean pooling (averaging all non-padding token representations):

mean_pooling(hidden_states, mask) = sum(hidden_states * mask) / sum(mask)
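Both pooling strategies can be sketched on a single input's hidden states (shape `[seqLen][hiddenSize]`) together with its attention mask; this is a minimal illustration of the formulas above, not library code.

```typescript
// CLS pooling: take the hidden state at position 0 (the [CLS] token),
// as the Snowflake Arctic Embed models do.
function clsPool(hiddenStates: number[][]): number[] {
  return hiddenStates[0];
}

// Mean pooling: average the hidden states of non-padding positions only.
function meanPool(hiddenStates: number[][], mask: number[]): number[] {
  const hiddenSize = hiddenStates[0].length;
  const sum = new Array(hiddenSize).fill(0);
  let count = 0;
  hiddenStates.forEach((vec, t) => {
    if (mask[t] === 1) {
      count += 1;
      vec.forEach((v, d) => { sum[d] += v; });
    }
  });
  return sum.map((v) => v / count);
}
```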

L2 Normalization

Most embedding models (including Snowflake Arctic Embed) produce L2-normalized output vectors. A normalized vector has unit length:

v_norm = v / ||v||_2
||v_norm||_2 = 1.0

L2 normalization is significant because for unit-length vectors, cosine similarity reduces to a simple dot product:

cosine_similarity(a, b) = dot(a, b) / (||a||_2 * ||b||_2)
                        = dot(a_norm, b_norm)    // when a, b are L2-normalized

This property enables efficient similarity computation: comparing normalized embeddings needs only a dot product per pair, with no per-pair norm computations or divisions.
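The equivalence above can be verified directly; this is a small self-contained sketch, independent of web-llm.

```typescript
// Scale a vector to unit L2 length.
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return v.map((x) => x / norm);
}

function dot(a: number[], b: number[]): number {
  return a.reduce((s, x, i) => s + x * b[i], 0);
}

// Full cosine similarity: dot product divided by both norms.
function cosineSimilarity(a: number[], b: number[]): number {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}
```

For L2-normalized vectors, `dot(l2Normalize(a), l2Normalize(b))` and `cosineSimilarity(a, b)` return the same value, which is why nearest-neighbor search over normalized embeddings can use plain dot products.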

Embedding Dimensionality

The output dimensionality (hidden_size) is determined by the model architecture:

  • snowflake-arctic-embed-m: 768 dimensions
  • snowflake-arctic-embed-s: 384 dimensions

I/O Contract

Input:

  • A string, array of strings, array of numbers (token IDs), or array of arrays of numbers (batch of token ID sequences)
  • Each text/token sequence must be non-empty and must not exceed the model's context window size (512 tokens for the Arctic Embed models)

Output:

  • An Array<Array<number>> where each inner array is a dense vector of length hidden_size
  • The number of output vectors equals the number of input texts/sequences

Errors:

  • EmbeddingInputEmptyError -- raised if any input is an empty string or empty array
  • EmbeddingExceedContextWindowSizeError -- raised if any tokenized input exceeds contextWindowSize

Usage Examples

import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Load an embedding model
const engine = await CreateMLCEngine("snowflake-arctic-embed-m-q0f32-MLC-b4", {
  initProgressCallback: (report) => console.log(report.text),
});

// Generate embedding for a single text
const singleResult = await engine.embeddings.create({
  input: "The quick brown fox jumps over the lazy dog.",
});
const embedding = singleResult.data[0].embedding;
console.log("Embedding dimension:", embedding.length);  // 768
console.log("First 5 values:", embedding.slice(0, 5));

// Generate embeddings for a batch of texts
const batchResult = await engine.embeddings.create({
  input: [
    "Machine learning enables computers to learn from data.",
    "Deep neural networks have revolutionized AI.",
    "Natural language processing understands human text.",
  ],
});
console.log("Number of embeddings:", batchResult.data.length);  // 3
console.log("Usage:", batchResult.usage);
// {
//   prompt_tokens: 28,
//   total_tokens: 28,
//   extra: { prefill_tokens_per_s: 1450.32 }
// }

import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine("snowflake-arctic-embed-m-q0f32-MLC-b4");

// Generate embeddings using pre-tokenized input
// This is useful when you have already tokenized your text
const preTokenizedResult = await engine.embeddings.create({
  input: [[101, 2023, 2003, 1037, 3231, 102]],  // Pre-tokenized token IDs
});
console.log("Embedding from tokens:", preTokenizedResult.data[0].embedding.length);

// Demonstrate that batch sizes exceeding maxBatchSize are automatically split
// With a -b4 model, 10 inputs are processed in 3 forward passes (4+4+2)
const largeResult = await engine.embeddings.create({
  input: Array.from({ length: 10 }, (_, i) => `Sentence number ${i + 1}`),
});
console.log("Processed", largeResult.data.length, "inputs");  // 10
