Workflow:Mlc ai Web llm Basic Chat Completion
| Knowledge Sources | |
|---|---|
| Domains | LLMs, WebGPU, Browser_Inference |
| Last Updated | 2026-02-14 22:00 GMT |
Overview
End-to-end process for running large language model chat completion directly in the browser using WebGPU, with an OpenAI-compatible API.
Description
This workflow covers the fundamental use case of web-llm: loading a pre-compiled LLM into the browser and performing chat completion inference. The engine downloads model weights and WASM runtime artifacts (caching them in the browser for subsequent loads), initializes the TVM WebGPU runtime, and exposes an OpenAI-compatible chat.completions API. Both non-streaming (full response at once) and streaming (delta chunks via AsyncGenerator) modes are supported. The workflow also covers generation parameters such as temperature, top-p, max tokens, logit bias, and logprobs.
Usage
Execute this workflow when you need to run a conversational LLM entirely in the user's browser without any server-side inference infrastructure. This is the starting point for any web-llm integration: you have a web application and want to add local, private LLM chat capabilities using WebGPU acceleration.
Execution Steps
Step 1: Select a Model
Choose a model from the web-llm pre-built model registry. The registry contains many pre-configured models (Llama, Phi, Gemma, Qwen, Mistral, SmolLM, etc.), each with an associated WASM library URL and VRAM requirement. Each model entry specifies the HuggingFace URL, model ID, WASM library path, and optional overrides such as context window size.
Key considerations:
- Check the model's VRAM requirement against the user's GPU capacity
- Models are identified by their model ID string (e.g., "Llama-3.1-8B-Instruct-q4f32_1-MLC")
- Quantized variants (e.g., q4f16_1, q4f32_1, as seen in the model ID suffix) reduce memory requirements at the cost of some output quality
- Custom models can be registered via an AppConfig object with model_list entries
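The VRAM check above can be sketched as a simple filter over registry entries. This is a minimal illustration, not the library's own selection logic: the `ModelEntry` interface is a local stand-in whose field names (`model_id`, `vram_required_MB`) are assumed to mirror web-llm's model records, the sample entries and their VRAM figures are made up, and the budget value has to be estimated by the application because browsers do not report free GPU memory directly.

```typescript
// Local stand-in for a registry entry. Field names are assumed to mirror
// web-llm's model records; verify them against the installed version.
interface ModelEntry {
  model_id: string;
  vram_required_MB?: number;
}

// Return the IDs of models that fit within the budget, largest first.
// Entries without a stated VRAM requirement are skipped.
function modelsWithinBudget(registry: ModelEntry[], budgetMB: number): string[] {
  return registry
    .filter((m) => m.vram_required_MB !== undefined && m.vram_required_MB <= budgetMB)
    .sort((a, b) => (b.vram_required_MB ?? 0) - (a.vram_required_MB ?? 0))
    .map((m) => m.model_id);
}

// Hypothetical registry with invented VRAM figures, for illustration only.
const sampleRegistry: ModelEntry[] = [
  { model_id: "example-8B-q4f32_1-MLC", vram_required_MB: 6000 },
  { model_id: "example-1B-q4f16_1-MLC", vram_required_MB: 900 },
  { model_id: "example-3B-q4f16_1-MLC", vram_required_MB: 2300 },
  { model_id: "example-unknown-MLC" },
];

const candidates = modelsWithinBudget(sampleRegistry, 3000);
console.log(candidates);
// ["example-3B-q4f16_1-MLC", "example-1B-q4f16_1-MLC"]
```

In a real application the registry argument would come from the library's pre-built configuration or from a custom AppConfig's model_list.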
Step 2: Create the Engine
Initialize the MLCEngine using the CreateMLCEngine factory function. This triggers the full model loading pipeline: checking the browser cache for existing artifacts, downloading missing WASM libraries and model weights from the configured URLs, and setting up the TVM WebGPU runtime. An optional progress callback (the initProgressCallback field of the engine config) lets the application display loading status to the user.
What happens:
- The factory function resolves the model ID against the model registry
- WASM runtime library and model weight shards are fetched (or loaded from Cache API / IndexedDB)
- The TVM runtime is initialized with WebGPU backend
- An LLMChatPipeline is created for inference
- The returned engine implements the full MLCEngineInterface
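The loading step might be wrapped as follows. To keep the sketch runnable without a GPU, the factory is passed in as a parameter; in a browser application it would be `CreateMLCEngine` from `@mlc-ai/web-llm`. The `ProgressReport` and `EngineConfig` interfaces, and the helpers `formatProgress` and `loadEngine`, are local illustrations; the report fields (`progress`, `timeElapsed`, `text`) are assumed to match the library's progress report type.

```typescript
// Assumed shape of a loading progress report: progress in [0, 1],
// elapsed time, and a human-readable status text.
interface ProgressReport {
  progress: number;
  timeElapsed: number;
  text: string;
}

// Only the config field used in this sketch.
interface EngineConfig {
  initProgressCallback?: (report: ProgressReport) => void;
}

// Factory signature; CreateMLCEngine is assumed to be compatible with it.
type EngineFactory<E> = (modelId: string, config?: EngineConfig) => Promise<E>;

// Turn a progress report into a short status line for the UI.
function formatProgress(report: ProgressReport): string {
  const percent = Math.round(report.progress * 100);
  return `[${percent}%] ${report.text}`;
}

// Load a model and forward formatted progress lines to the caller.
async function loadEngine<E>(
  createEngine: EngineFactory<E>,
  modelId: string,
  onStatus: (line: string) => void,
): Promise<E> {
  return createEngine(modelId, {
    initProgressCallback: (report) => onStatus(formatProgress(report)),
  });
}

// Possible browser usage, assuming the package is installed:
//   import { CreateMLCEngine } from "@mlc-ai/web-llm";
//   const engine = await loadEngine(
//     CreateMLCEngine,
//     "Llama-3.1-8B-Instruct-q4f32_1-MLC",
//     (line) => console.log(line),
//   );
```

Injecting the factory also makes the loading UI testable with a fake engine, since the first real load can involve a large download.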
Step 3: Configure the Request
Build a ChatCompletionRequest object following the OpenAI Chat Completion API format. The request includes the conversation messages array (system, user, assistant roles), generation parameters, and optional features like logprobs or logit bias.
Key considerations:
- Messages follow the OpenAI format: array of objects with role and content fields
- Set stream: true for streaming mode; omit the field or set it to false for non-streaming mode
- Optional parameters include temperature, top_p, max_tokens, n (number of completions), presence_penalty, frequency_penalty
- Logit bias can steer generation by boosting or suppressing specific token IDs
- Set logprobs: true, optionally with top_logprobs (the number of alternative tokens to report per position), to inspect token probabilities
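A request along these lines could be assembled with a small helper. The `ChatRequest` interface below is a local subset of the OpenAI-style fields named in this step, not the library's own type, and `buildRequest` is a hypothetical helper. The check that `top_logprobs` needs `logprobs: true` follows the OpenAI convention and is assumed to apply here as well.

```typescript
type ChatRole = "system" | "user" | "assistant";

interface ChatMessage {
  role: ChatRole;
  content: string;
}

// Local subset of the OpenAI-style request fields described above.
interface ChatRequest {
  messages: ChatMessage[];
  stream?: boolean;
  stream_options?: { include_usage: boolean };
  temperature?: number;
  top_p?: number;
  max_tokens?: number;
  logit_bias?: Record<string, number>; // token ID (as string) -> bias
  logprobs?: boolean;
  top_logprobs?: number;
}

// Build a request with a system prompt, one user turn, and optional parameters.
function buildRequest(
  systemPrompt: string,
  userText: string,
  options: Omit<ChatRequest, "messages"> = {},
): ChatRequest {
  if (options.top_logprobs !== undefined && !options.logprobs) {
    throw new Error("top_logprobs requires logprobs: true");
  }
  return {
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: userText },
    ],
    ...options,
  };
}

const request = buildRequest(
  "You are a helpful assistant.",
  "Name three uses of WebGPU.",
  {
    stream: true,
    stream_options: { include_usage: true },
    temperature: 0.7,
    top_p: 0.95,
    max_tokens: 256,
  },
);
console.log(JSON.stringify(request, null, 2));
```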
Step 4: Execute Inference
Call engine.chat.completions.create() with the request object. In non-streaming mode, this returns a complete ChatCompletion response object. In streaming mode, it returns an AsyncGenerator that yields ChatCompletionChunk objects as tokens are generated.
Non-streaming mode:
- Returns a single ChatCompletion object with choices array
- Each choice contains the full message content and finish reason
- Usage statistics (prompt tokens, completion tokens) are included
Streaming mode:
- Returns an AsyncGenerator of ChatCompletionChunk objects
- Each chunk contains a delta with incremental content
- Usage statistics are included in the final chunk when stream_options.include_usage is true
- Iterate with a for await loop to process chunks as they arrive
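The streaming loop can be sketched as below. The `Chunk` and `TokenUsage` interfaces are minimal local shapes assumed to follow the OpenAI streaming format, and `fakeStream` is a stand-in generator so the example runs without a model; in an application the iterable would be the value returned by engine.chat.completions.create() with stream: true. The sketch guards against a chunk with an empty choices array, since the usage-only final chunk may have no choices.

```typescript
interface TokenUsage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
}

// Minimal chunk shape, assumed to follow the OpenAI streaming format.
interface Chunk {
  choices: { delta: { content?: string | null } }[];
  usage?: TokenUsage | null;
}

// Consume a stream, invoking onToken for each text delta as it arrives,
// and return the accumulated text plus usage if any chunk reported it.
async function consumeStream(
  chunks: AsyncIterable<Chunk>,
  onToken: (text: string) => void,
): Promise<{ text: string; usage: TokenUsage | undefined }> {
  let text = "";
  let usage: TokenUsage | undefined;
  for await (const chunk of chunks) {
    // A usage-only chunk may carry no choices, so index defensively.
    const piece = chunk.choices[0]?.delta.content ?? "";
    if (piece) {
      text += piece;
      onToken(piece);
    }
    if (chunk.usage) usage = chunk.usage;
  }
  return { text, usage };
}

// Stand-in for the engine's stream, with made-up content and token counts.
async function* fakeStream(): AsyncGenerator<Chunk> {
  yield { choices: [{ delta: { content: "Hel" } }] };
  yield { choices: [{ delta: { content: "lo" } }] };
  yield { choices: [], usage: { prompt_tokens: 5, completion_tokens: 2, total_tokens: 7 } };
}

consumeStream(fakeStream(), (t) => console.log("delta:", t)).then((result) =>
  console.log("full text:", result.text),
);
```

Passing a callback for each delta lets the UI update incrementally while the function still returns the complete text at the end.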
Step 5: Process the Response
Extract the generated text from the response and present it to the user. For non-streaming responses, read the message content directly from the choices array. For streaming responses, concatenate the delta content from each chunk. Optionally inspect usage statistics for performance monitoring.
Key considerations:
- Non-streaming: access via reply.choices[0].message.content
- Streaming: accumulate chunk.choices[0].delta.content across iterations
- After streaming completes, await engine.getMessage() returns the full concatenated message
- Usage object reports prompt_tokens, completion_tokens, and total_tokens
- The engine can be reused for multiple requests without reloading
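For the non-streaming case, response handling might look like the sketch below. `ChatReply` is a local minimal shape assumed to follow the OpenAI response format, `summarizeReply` is a hypothetical helper, and the sample reply is invented. Treating a finish reason of "length" as a sign that the output hit the token limit follows the OpenAI convention.

```typescript
// Minimal non-streaming response shape, assumed to follow the OpenAI format.
interface ChatReply {
  choices: {
    message: { content: string | null };
    finish_reason: string | null;
  }[];
  usage?: {
    prompt_tokens: number;
    completion_tokens: number;
    total_tokens: number;
  };
}

interface ReplySummary {
  text: string;
  finishReason: string | null;
  totalTokens: number | undefined;
  truncated: boolean;
}

// Extract the first choice's text and a few fields useful for monitoring.
function summarizeReply(reply: ChatReply): ReplySummary {
  const first = reply.choices[0];
  if (!first) throw new Error("response contained no choices");
  return {
    text: first.message.content ?? "",
    finishReason: first.finish_reason,
    totalTokens: reply.usage?.total_tokens,
    // "length" conventionally means generation stopped at the token limit.
    truncated: first.finish_reason === "length",
  };
}

// Invented reply for illustration; real values come from the engine.
const sampleReply: ChatReply = {
  choices: [{ message: { content: "WebGPU runs compute shaders." }, finish_reason: "stop" }],
  usage: { prompt_tokens: 12, completion_tokens: 6, total_tokens: 18 },
};

const summary = summarizeReply(sampleReply);
console.log(summary.text, summary.totalTokens);
```

Because the engine stays loaded, the same helper can be applied to each subsequent reply from the same engine instance.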