Principle:Mlc_ai_Web_llm_Text_Embedding_Generation
Overview
Text Embedding Generation is the process of encoding text sequences into fixed-dimensional dense vector representations using a neural embedding model. In web-llm, this is accomplished through the EmbeddingPipeline, which handles tokenization, batched GPU inference, and result extraction to produce vectors suitable for semantic similarity computation.
Description
Text embedding generation in web-llm follows a multi-stage pipeline that transforms raw text into dense vectors:
Stage 1: Tokenization and Input Normalization
The EmbeddingPipeline.embedStep() method accepts four input types:
- A single string
- An array of strings (batch of texts)
- An array of numbers (pre-tokenized single input)
- An array of arrays of numbers (pre-tokenized batch)
All inputs are converted to a uniform Array<Array<number>> representation, where each inner array is a sequence of token IDs produced by the model's tokenizer.
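The normalization described above can be sketched as follows. This is an illustrative reconstruction, not the actual EmbeddingPipeline code; the `tokenize` callback stands in for the model's real tokenizer.

```typescript
// The four accepted input shapes, as described in Stage 1.
type EmbeddingInput = string | string[] | number[] | number[][];

// Hypothetical sketch: convert any accepted input shape into a uniform
// batch of token-ID sequences (Array<Array<number>>).
function normalizeInput(
  input: EmbeddingInput,
  tokenize: (text: string) => number[],
): number[][] {
  if (typeof input === "string") {
    return [tokenize(input)]; // single string -> batch of one
  }
  if (input.length === 0) {
    return [];
  }
  if (typeof input[0] === "string") {
    return (input as string[]).map(tokenize); // batch of texts
  }
  if (typeof input[0] === "number") {
    return [input as number[]]; // pre-tokenized single input
  }
  return input as number[][]; // pre-tokenized batch
}
```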
Stage 2: Validation
Each tokenized input is checked against the model's context window size (typically 512 tokens for Arctic Embed models). Inputs that exceed this limit trigger an EmbeddingExceedContextWindowSizeError. Empty inputs trigger an EmbeddingInputEmptyError.
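A minimal sketch of this validation stage, assuming plain `Error` objects in place of web-llm's dedicated error classes (the error names follow the description above):

```typescript
// Hypothetical validation sketch: reject empty inputs and inputs longer
// than the model's context window (512 for the Arctic Embed models).
function validateInputs(
  tokenized: number[][],
  contextWindowSize: number,
): void {
  for (const tokens of tokenized) {
    if (tokens.length === 0) {
      // Corresponds to EmbeddingInputEmptyError
      throw new Error("EmbeddingInputEmptyError: input must be non-empty");
    }
    if (tokens.length > contextWindowSize) {
      // Corresponds to EmbeddingExceedContextWindowSizeError
      throw new Error(
        `EmbeddingExceedContextWindowSizeError: ${tokens.length} > ${contextWindowSize}`,
      );
    }
  }
}
```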
Stage 3: Batched Inference
The tokenized inputs are split into sub-batches based on the model's compiled maxBatchSize. For each sub-batch:
- Padding: Shorter sequences are padded with zeros to match the longest sequence in the batch. An attention mask is constructed: 1 for real tokens, 0 for padding.
- GPU Transfer: The padded input and attention mask are transferred to GPU as 2D int32 NDArrays of shape [batchSize, maxInputSize].
- Forward Pass: The model's prefill function is called with the input tensor, attention mask, and model parameters, producing logits of shape [batchSize, maxInputSize, hidden_size].
- Result Extraction: For each input in the batch, only the first token's output vector (position [i, 0, :]) is extracted as the embedding. This is the [CLS] token representation, which in the Snowflake Arctic Embed architecture serves as the pooled sentence representation.
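The padding and mask construction above can be sketched in plain TypeScript. The `padBatch` helper is hypothetical; the real pipeline builds int32 NDArrays on the GPU rather than nested JavaScript arrays.

```typescript
// Illustrative sketch of the padding step: right-pad every sequence with
// zeros to the batch's maximum length, and build an attention mask that
// is 1 for real tokens and 0 for padding.
function padBatch(batch: number[][]): {
  input: number[][];
  mask: number[][];
} {
  const maxLen = Math.max(...batch.map((seq) => seq.length));
  const input = batch.map((seq) => [
    ...seq,
    ...Array(maxLen - seq.length).fill(0),
  ]);
  const mask = batch.map((seq) => [
    ...Array(seq.length).fill(1),
    ...Array(maxLen - seq.length).fill(0),
  ]);
  return { input, mask };
}
```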
Stage 4: Response Assembly
The engine wraps the raw embedding vectors into an OpenAI-compatible CreateEmbeddingResponse, including usage statistics (token counts and prefill throughput).
Theoretical Basis
Transformer Encoding
The embedding model passes each input sequence through a transformer encoder stack. Every token receives a contextualized representation that incorporates information from all other (non-padded) tokens via self-attention. The attention mask ensures padding tokens do not contribute to the representations of real tokens.
CLS Token Pooling
The Snowflake Arctic Embed models use the first token position (the [CLS] token) as the aggregate sentence representation. The EmbeddingPipeline extracts the hidden state at position 0 for each input:
embedding_i = hidden_states[i, 0, :] // shape: [hidden_size]
This is a design choice of the Snowflake models. Other embedding architectures may use mean pooling (averaging all non-padding token representations):
mean_pooling(hidden_states, mask) = sum(hidden_states * mask) / sum(mask)
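For contrast with the CLS pooling that EmbeddingPipeline actually performs, mean pooling can be sketched as below. This is illustrative only, with CPU-side nested arrays standing in for GPU tensors.

```typescript
// Mean pooling sketch: average the hidden states of real tokens only,
// using the attention mask (1 = real token, 0 = padding) to exclude
// padding positions, per the formula above.
// hidden: [seqLen][hiddenSize], mask: [seqLen]
function meanPool(hidden: number[][], mask: number[]): number[] {
  const hiddenSize = hidden[0].length;
  const out: number[] = new Array(hiddenSize).fill(0);
  let count = 0;
  for (let t = 0; t < hidden.length; t++) {
    if (mask[t] === 0) continue; // skip padding positions
    count++;
    for (let d = 0; d < hiddenSize; d++) {
      out[d] += hidden[t][d];
    }
  }
  return out.map((x) => x / count);
}
```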
L2 Normalization
Most embedding models (including Snowflake Arctic Embed) produce L2-normalized output vectors. A normalized vector has unit length:
v_norm = v / ||v||_2
||v_norm||_2 = 1.0
L2 normalization is significant because for unit-length vectors, cosine similarity reduces to a simple dot product:
cosine_similarity(a, b) = dot(a, b) / (||a||_2 * ||b||_2)
= dot(a_norm, b_norm) // when a, b are L2-normalized
This property enables efficient similarity computation, since dot products are computationally cheaper than full cosine similarity calculations.
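The equivalence above can be verified with a small sketch (function names here are illustrative, not part of the web-llm API):

```typescript
// Dot product of two equal-length vectors.
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Scale a vector to unit L2 length.
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(dot(v, v));
  return v.map((x) => x / norm);
}

// Full cosine similarity, with explicit norm computations.
function cosineSimilarity(a: number[], b: number[]): number {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}
// For L2-normalized inputs, dot(aN, bN) equals cosineSimilarity(a, b),
// so the two square roots and the division can be skipped entirely.
```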
Embedding Dimensionality
The output dimensionality (hidden_size) is determined by the model architecture:
- snowflake-arctic-embed-m: 768 dimensions
- snowflake-arctic-embed-s: 384 dimensions
I/O Contract
Input:
- A string, array of strings, array of numbers (token IDs), or array of arrays of numbers (batch of token ID sequences)
- Each text/token sequence must be non-empty and must not exceed the model's context window size (512 tokens for the Arctic Embed models)
Output:
- An Array<Array<number>> where each inner array is a dense vector of length hidden_size
- The number of output vectors equals the number of input texts/sequences
Errors:
- EmbeddingInputEmptyError -- raised if any input is an empty string or empty array
- EmbeddingExceedContextWindowSizeError -- raised if any tokenized input exceeds contextWindowSize
Usage Examples
import { CreateMLCEngine } from "@mlc-ai/web-llm";
// Load an embedding model
const engine = await CreateMLCEngine("snowflake-arctic-embed-m-q0f32-MLC-b4", {
initProgressCallback: (report) => console.log(report.text),
});
// Generate embedding for a single text
const singleResult = await engine.embeddings.create({
input: "The quick brown fox jumps over the lazy dog.",
});
const embedding = singleResult.data[0].embedding;
console.log("Embedding dimension:", embedding.length); // 768
console.log("First 5 values:", embedding.slice(0, 5));
// Generate embeddings for a batch of texts
const batchResult = await engine.embeddings.create({
input: [
"Machine learning enables computers to learn from data.",
"Deep neural networks have revolutionized AI.",
"Natural language processing understands human text.",
],
});
console.log("Number of embeddings:", batchResult.data.length); // 3
console.log("Usage:", batchResult.usage);
// {
// prompt_tokens: 28,
// total_tokens: 28,
// extra: { prefill_tokens_per_s: 1450.32 }
// }
import { CreateMLCEngine } from "@mlc-ai/web-llm";
const engine = await CreateMLCEngine("snowflake-arctic-embed-m-q0f32-MLC-b4");
// Generate embeddings using pre-tokenized input
// This is useful when you have already tokenized your text
const preTokenizedResult = await engine.embeddings.create({
input: [[101, 2023, 2003, 1037, 3231, 102]], // Pre-tokenized token IDs
});
console.log("Embedding from tokens:", preTokenizedResult.data[0].embedding.length);
// Demonstrate that batch sizes exceeding maxBatchSize are automatically split
// With a -b4 model, 10 inputs are processed in 3 forward passes (4+4+2)
const largeResult = await engine.embeddings.create({
input: Array.from({ length: 10 }, (_, i) => `Sentence number ${i + 1}`),
});
console.log("Processed", largeResult.data.length, "inputs"); // 10
Related Pages
- Implementation:Mlc_ai_Web_llm_Embeddings_Create -- implementation of the embedding generation pipeline
- Principle:Mlc_ai_Web_llm_Embedding_Model_Selection -- selecting which embedding model to load
- Principle:Mlc_ai_Web_llm_Embedding_Input_Formatting -- preparing inputs with model-specific prefixes
- Principle:Mlc_ai_Web_llm_Cosine_Similarity_Search -- using generated embeddings for retrieval