Implementation:Mlc_ai_Web_llm_Embeddings_Create
Overview
This page documents the concrete API surface and internal implementation for generating text embeddings in web-llm. It covers three layers: the Embeddings proxy class (OpenAI-compatible API), the MLCEngine.embedding() method (orchestration), and the EmbeddingPipeline.embedStep() method (GPU inference).
Code Reference
Layer 1: Embeddings Proxy Class
Defined in src/openai_api_protocols/embedding.ts at lines 25-38:
export class Embeddings {
  private engine: MLCEngineInterface;

  constructor(engine: MLCEngineInterface) {
    this.engine = engine;
  }

  /**
   * Creates an embedding vector representing the input text.
   */
  create(request: EmbeddingCreateParams): Promise<CreateEmbeddingResponse> {
    return this.engine.embedding(request);
  }
}
This proxy class is exposed as engine.embeddings and simply delegates to engine.embedding().
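The delegation pattern can be sketched in isolation. The sketch below mirrors the snippet above, but the engine class is a stand-in (the real initialization site in src/engine.ts is not shown on this page):

```typescript
// Sketch of the facade/delegation pattern: the engine exposes an
// Embeddings object whose create() simply forwards to engine.embedding().
// FakeEngine is a stand-in for MLCEngine, used only for illustration.
interface EmbeddingRequest {
  input: string | string[];
}
interface EmbeddingResponse {
  object: "list";
}

class Embeddings {
  constructor(
    private engine: { embedding(r: EmbeddingRequest): Promise<EmbeddingResponse> },
  ) {}
  create(request: EmbeddingRequest): Promise<EmbeddingResponse> {
    return this.engine.embedding(request); // pure delegation, no extra logic
  }
}

class FakeEngine {
  // Exposed as engine.embeddings, mirroring the real engine's wiring.
  embeddings = new Embeddings(this);
  async embedding(_request: EmbeddingRequest): Promise<EmbeddingResponse> {
    return { object: "list" };
  }
}
```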
Layer 2: MLCEngine.embedding()
Defined in src/engine.ts at lines 1084-1130:
async embedding(
  request: EmbeddingCreateParams,
): Promise<CreateEmbeddingResponse> {
  // 0. Preprocess inputs
  const [selectedModelId, selectedPipeline] = this.getEmbeddingStates(
    "EmbeddingCreateParams",
    request.model,
  );
  API.postInitAndCheckFieldsEmbedding(request, selectedModelId);
  // 0.5 Block wait until this pipeline finishes all previous requests
  const lock = this.loadedModelIdToLock.get(selectedModelId)!;
  await lock.acquire();
  try {
    // 1. Call EmbeddingPipeline to get embeddings
    const embedResult: Array<Array<number>> =
      await selectedPipeline.embedStep(request.input);
    // 2. Prepare response
    const batchSize = embedResult.length;
    const data: Array<Embedding> = [];
    for (let i = 0; i < batchSize; i++) {
      const curEmbedding: Embedding = {
        embedding: embedResult[i],
        index: i,
        object: "embedding",
      };
      data.push(curEmbedding);
    }
    return {
      data: data,
      model: selectedModelId,
      object: "list",
      usage: {
        prompt_tokens: selectedPipeline.getCurRoundEmbedTotalTokens(),
        total_tokens: selectedPipeline.getCurRoundEmbedTotalTokens(),
        extra: {
          prefill_tokens_per_s:
            selectedPipeline.getCurRoundEmbedTokensPerSec(),
        },
      },
    };
  } finally {
    await lock.release();
  }
}
Key behaviors:
- Model resolution: getEmbeddingStates() resolves which loaded embedding model to use, verifying it is indeed an EmbeddingPipeline
- Concurrency lock: a per-model CustomLock ensures only one request is processed at a time
- Usage stats: token counts and throughput are captured from the pipeline's performance counters
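The per-model lock behaves like a promise-based mutex: acquire() resolves only after all earlier holders have released. A minimal sketch of such a lock (illustrative only; web-llm's actual CustomLock implementation may differ):

```typescript
// Minimal async mutex sketch: acquire() waits until all earlier holders
// release; release() wakes the next waiter in FIFO order.
// Illustrative stand-in for web-llm's CustomLock, not its actual code.
class SimpleLock {
  private queue: Array<() => void> = [];
  private locked = false;

  async acquire(): Promise<void> {
    if (!this.locked) {
      this.locked = true; // fast path: lock was free
      return;
    }
    // Lock held: park this caller until a release() hands it over.
    await new Promise<void>((resolve) => this.queue.push(resolve));
  }

  async release(): Promise<void> {
    const next = this.queue.shift();
    if (next) {
      next(); // hand the lock directly to the next waiter
    } else {
      this.locked = false; // no waiters: lock becomes free
    }
  }
}
```

With this shape, the try/finally in embedding() guarantees the lock is released even when embedStep() throws, so a failed request cannot deadlock later ones.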
Layer 3: EmbeddingPipeline.embedStep()
Defined in src/embedding.ts at lines 95-250. This is the core GPU inference method. Its logic proceeds in ordered steps:
- Input normalization: convert all input types to Array<Array<number>> (tokenized sequences)
- Context window validation: each sequence must not exceed contextWindowSize
- Batch loop: for each sub-batch (up to maxBatchSize sequences):
  - Compute maxInputSize for the current batch
  - Pad inputs with zeros and create an attention mask (1 for real tokens, 0 for padding)
  - Transfer to GPU as int32 NDArrays
  - Call prefill(inputNDArray, maskNDArray, params) to get logits of shape [batchSize, maxInputSize, hidden_size]
  - Extract [i, 0, :] for each input (the first token's hidden state)
- Compute
- Return: Array<Array<number>> containing all embedding vectors
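The padding and attention-mask construction inside the batch loop can be sketched in plain TypeScript. This is a simplified CPU-side sketch; the real embedStep() builds int32 NDArrays on the GPU via tvmjs, and the function name here is hypothetical:

```typescript
// Pad a batch of token sequences to a common length and build the
// attention mask (1 = real token, 0 = padding), mirroring the prep work
// embedStep() does before transferring data to the GPU.
// Simplified sketch; the actual code operates on tvmjs NDArrays.
function padBatch(batch: Array<Array<number>>): {
  input: number[][];
  mask: number[][];
  maxInputSize: number;
} {
  // Longest sequence in this sub-batch determines the padded width.
  const maxInputSize = Math.max(...batch.map((seq) => seq.length));
  const input = batch.map((seq) => [
    ...seq,
    ...Array(maxInputSize - seq.length).fill(0), // zero-pad to maxInputSize
  ]);
  const mask = batch.map((seq) => [
    ...Array(seq.length).fill(1), // 1 for real tokens
    ...Array(maxInputSize - seq.length).fill(0), // 0 for padding
  ]);
  return { input, mask, maxInputSize };
}
```

The mask lets the model's attention ignore the zero padding, so sequences of different lengths can share one GPU call.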
Request and Response Interfaces
// Request
export interface EmbeddingCreateParams {
  input: string | Array<string> | Array<number> | Array<Array<number>>;
  model?: string | null;
  encoding_format?: "float" | "base64"; // Only "float" supported
  dimensions?: number; // Not supported
  user?: string; // Not supported
}

// Response
export interface CreateEmbeddingResponse {
  data: Array<Embedding>;
  model: string;
  object: "list";
  usage: CreateEmbeddingResponse.Usage;
}

export interface Embedding {
  embedding: Array<number>;
  index: number;
  object: "embedding";
}

export namespace CreateEmbeddingResponse {
  export interface Usage {
    prompt_tokens: number;
    total_tokens: number;
    extra: {
      prefill_tokens_per_s: number;
    };
  }
}
Validation Function
Defined in src/openai_api_protocols/embedding.ts at lines 159-198:
export function postInitAndCheckFields(
  request: EmbeddingCreateParams,
  currentModelId: string,
): void {
  // 1. Check unsupported fields (dimensions, user)
  // 2. Reject encoding_format "base64"
  // 3. Validate input is not empty (string, array)
}
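Expanding the three commented steps, the checks could look roughly as follows. This is a sketch, not the library's actual code; the error class names are taken from the error cases shown later on this page, but here they are raised as plain Errors for illustration:

```typescript
// Illustrative sketch of the three validation steps; the real function
// in src/openai_api_protocols/embedding.ts may differ in detail.
function checkEmbeddingRequest(request: {
  input: string | string[] | number[] | number[][];
  encoding_format?: "float" | "base64";
  dimensions?: number;
  user?: string;
}): void {
  // 1. Reject unsupported fields (dimensions, user)
  if (request.dimensions !== undefined || request.user !== undefined) {
    throw new Error("UnsupportedFieldsError");
  }
  // 2. Only "float" encoding is supported
  if (request.encoding_format === "base64") {
    throw new Error("EmbeddingUnsupportedEncodingFormatError");
  }
  // 3. Input must be non-empty: neither an empty string/array,
  //    nor an array containing an empty string or empty sequence.
  if (
    request.input.length === 0 ||
    (Array.isArray(request.input) &&
      request.input.some(
        (item) => typeof item !== "number" && item.length === 0,
      ))
  ) {
    throw new Error("EmbeddingInputEmptyError");
  }
}
```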
I/O Contract
Import:
import {
  CreateMLCEngine,
  EmbeddingCreateParams,
  CreateEmbeddingResponse,
  Embedding,
} from "@mlc-ai/web-llm";
Method signature:
engine.embeddings.create(request: EmbeddingCreateParams): Promise<CreateEmbeddingResponse>
Input constraints:
- input is required and must be non-empty
- model is optional when only one model is loaded; required when multiple models are loaded
- encoding_format must be "float" or omitted (defaults to "float")
- dimensions and user are not supported and will throw UnsupportedFieldsError
Output guarantees:
- data.length equals the number of input texts/sequences
- Each data[i].index equals i
- data[i].embedding is a number[] of length hidden_size
- usage.prompt_tokens === usage.total_tokens (no completion tokens for embeddings)
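These guarantees can be checked mechanically against any response-shaped object. The helper below is hypothetical (not part of the web-llm API) and only encodes the guarantees listed above:

```typescript
// Hypothetical helper: verify the output guarantees listed above hold
// for a response-shaped object. Not part of web-llm; illustration only.
function checkResponseInvariants(
  response: {
    data: Array<{ embedding: number[]; index: number }>;
    usage: { prompt_tokens: number; total_tokens: number };
  },
  numInputs: number,
  hiddenSize: number,
): boolean {
  return (
    // one Embedding per input
    response.data.length === numInputs &&
    // data[i].index === i
    response.data.every((d, i) => d.index === i) &&
    // every vector has length hidden_size
    response.data.every((d) => d.embedding.length === hiddenSize) &&
    // embeddings produce no completion tokens
    response.usage.prompt_tokens === response.usage.total_tokens
  );
}
```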
Usage Examples
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine("snowflake-arctic-embed-m-q0f32-MLC-b4", {
  initProgressCallback: (report) => console.log(report.text),
});

// Single string input
const result1 = await engine.embeddings.create({
  input: "What is machine learning?",
});
console.log(result1.data[0].embedding.length); // 768
console.log(result1.usage.prompt_tokens); // e.g. 6

// Array of strings input (batch)
const result2 = await engine.embeddings.create({
  input: [
    "First document about machine learning.",
    "Second document about deep learning.",
    "Third document about natural language processing.",
  ],
});
console.log(result2.data.length); // 3
console.log(result2.data[0].embedding.length); // 768
console.log(result2.data[2].index); // 2
console.log(result2.usage.extra.prefill_tokens_per_s); // e.g. 1280.5

// Pre-tokenized input
const result3 = await engine.embeddings.create({
  input: [101, 7592, 2088, 102], // Token IDs for "Hello world"
});
console.log(result3.data[0].embedding.length); // 768

// Explicit model selection (useful when multiple models loaded)
const result4 = await engine.embeddings.create({
  input: "Query text",
  model: "snowflake-arctic-embed-m-q0f32-MLC-b4",
});
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine("snowflake-arctic-embed-m-q0f32-MLC-b4");

// Demonstrating error cases

// ERROR: Empty input
try {
  await engine.embeddings.create({ input: "" });
} catch (e) {
  console.error(e); // EmbeddingInputEmptyError
}

// ERROR: Unsupported encoding format
try {
  await engine.embeddings.create({
    input: "test",
    encoding_format: "base64",
  });
} catch (e) {
  console.error(e); // EmbeddingUnsupportedEncodingFormatError
}

// ERROR: Unsupported fields
try {
  await engine.embeddings.create({
    input: "test",
    dimensions: 256,
  });
} catch (e) {
  console.error(e); // UnsupportedFieldsError
}
Related Pages
- Principle:Mlc_ai_Web_llm_Text_Embedding_Generation
- Implementation:Mlc_ai_Web_llm_Embedding_Model_Config -- model configuration and registry entries
- Implementation:Mlc_ai_Web_llm_Embedding_Input_Format -- input formatting patterns for embedding models
- Implementation:Mlc_ai_Web_llm_Cosine_Similarity_Vector_Store -- using embeddings for similarity search
- Environment:Mlc_ai_Web_llm_WebGPU_Browser_Runtime