Workflow:Mlc ai Web llm Basic Chat Completion

Knowledge Sources	web-llm WebLLM Docs Basic Usage Guide
Domains	LLMs, WebGPU, Browser_Inference
Last Updated	2026-02-14 22:00 GMT

Overview

End-to-end process for running large language model chat completion directly in the browser using WebGPU, with an OpenAI-compatible API.

Description

This workflow covers the fundamental use case of web-llm: loading a pre-compiled LLM into the browser and performing chat completion inference. The engine downloads model weights and WASM runtime artifacts (caching them in the browser for subsequent loads), initializes the TVM WebGPU runtime, and exposes an OpenAI-compatible chat.completions API. Both non-streaming (full response at once) and streaming (delta chunks via AsyncGenerator) modes are supported. The workflow also covers generation parameters such as temperature, top-p, max tokens, logit bias, and logprobs.

Usage

Execute this workflow when you need to run a conversational LLM entirely in the user's browser without any server-side inference infrastructure. This is the starting point for any web-llm integration: you have a web application and want to add local, private LLM chat capabilities using WebGPU acceleration.

Execution Steps

Step 1: Select a Model

Choose a model from the web-llm pre-built model registry. The registry contains hundreds of pre-configured models (Llama, Phi, Gemma, Qwen, Mistral, SmolLM, etc.) with associated WASM library URLs and VRAM requirements. Each model entry specifies the Hugging Face URL of the weights, the model ID, the WASM library path, and optional overrides such as the context window size.

Key considerations:

Check the model's VRAM requirement against the user's GPU capacity
Models are identified by their model ID string (e.g., "Llama-3.1-8B-Instruct-q4f32_1-MLC")
Quantized variants (e.g., q4f16_1, q4f32_1) reduce memory requirements at the cost of some quality
Custom models can be registered via an AppConfig object with model_list entries
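
The VRAM check above can be sketched as a filter over registry entries. This is a minimal illustration, assuming each entry carries a model_id and an optional vram_required_MB field (believed to match web-llm's ModelRecord, but worth verifying against the installed version). The ModelRecordLike type, the modelsWithinBudget helper, the sample entries, and their VRAM figures are all hypothetical stand-ins for the real model list.

```typescript
// Structural stand-in for a registry entry. Field names are assumed to
// match web-llm's ModelRecord and should be checked against your version.
interface ModelRecordLike {
  model_id: string;
  vram_required_MB?: number;
}

// Keep only models whose stated VRAM requirement fits the budget,
// smallest first. Entries without a stated requirement are skipped.
function modelsWithinBudget(
  modelList: ModelRecordLike[],
  budgetMB: number,
): ModelRecordLike[] {
  return modelList
    .filter(
      (m) => m.vram_required_MB !== undefined && m.vram_required_MB <= budgetMB,
    )
    .sort((a, b) => (a.vram_required_MB ?? 0) - (b.vram_required_MB ?? 0));
}

// Hypothetical entries with made-up VRAM figures, for illustration only.
const sampleList: ModelRecordLike[] = [
  { model_id: "example-large-q4f32_1-MLC", vram_required_MB: 6000 },
  { model_id: "example-small-q4f16_1-MLC", vram_required_MB: 900 },
  { model_id: "example-unknown-MLC" },
];

const candidates: string[] = modelsWithinBudget(sampleList, 2048).map(
  (m) => m.model_id,
);
console.log(candidates);
```

The budget is supplied by the caller because the browser does not report exact GPU memory. In practice it is an estimate, so a conservative value is the safer choice.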

Step 2: Create the Engine

Initialize the MLCEngine using the CreateMLCEngine factory function. This triggers the full model loading pipeline: checking the browser cache for existing artifacts, downloading missing WASM libraries and model weights from the configured URLs, and setting up the TVM WebGPU runtime. A progress callback (the initProgressCallback option of the engine config) can be provided to display loading status to the user.

What happens:

The factory function resolves the model ID against the model registry
WASM runtime library and model weight shards are fetched (or loaded from Cache API / IndexedDB)
The TVM runtime is initialized with WebGPU backend
An LLMChatPipeline is created for inference
The returned engine implements the full MLCEngineInterface
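
The loading step can be sketched with the factory passed in as a parameter, which keeps the example independent of the library and easy to exercise with a stub. The ProgressReportLike shape (progress, timeElapsed, text) and the reduced CreateEngineFn signature are assumptions modeled on web-llm's InitProgressReport and CreateMLCEngine. The formatProgress and loadEngine helpers are hypothetical names introduced here.

```typescript
// Shape of a loading progress report. Assumed to mirror web-llm's
// InitProgressReport: progress in [0, 1], elapsed seconds, status text.
interface ProgressReportLike {
  progress: number;
  timeElapsed: number;
  text: string;
}

// Reduced engine config containing only the option used in this sketch.
interface EngineConfigLike {
  initProgressCallback?: (report: ProgressReportLike) => void;
}

// Assumed signature of the engine factory, cut down to what is needed here.
type CreateEngineFn<E> = (
  modelId: string,
  config?: EngineConfigLike,
) => Promise<E>;

// Turn a progress report into a short status line for the UI.
function formatProgress(report: ProgressReportLike): string {
  const percent = Math.round(report.progress * 100);
  return `${percent}% (${report.timeElapsed}s) ${report.text}`;
}

// Load a model through the injected factory, forwarding each progress
// report to the caller as a formatted status line.
async function loadEngine<E>(
  createEngine: CreateEngineFn<E>,
  modelId: string,
  onStatus: (line: string) => void,
): Promise<E> {
  return createEngine(modelId, {
    initProgressCallback: (report) => onStatus(formatProgress(report)),
  });
}
```

In an application, createEngine would be CreateMLCEngine imported from @mlc-ai/web-llm, and onStatus would update a progress element in the page.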

Step 3: Configure the Request

Build a ChatCompletionRequest object following the OpenAI Chat Completion API format. The request includes the conversation messages array (system, user, assistant roles), generation parameters, and optional features like logprobs or logit bias.

Key considerations:

Messages follow the OpenAI format: array of objects with role and content fields
Set stream: true for streaming mode; omit it or set it to false for non-streaming mode
Optional parameters include temperature, top_p, max_tokens, n (number of completions), presence_penalty, frequency_penalty
Logit bias can steer generation by boosting or suppressing specific token IDs
Set logprobs: true, optionally with top_logprobs (the number of most likely alternatives to return per token), to inspect token probabilities
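
A request following these considerations can be assembled as a plain object. The sketch below declares a local ChatRequestLike type covering only the fields named in this step, rather than importing the library's own request type. The buildRequest helper and its default values (temperature 0.7, max_tokens 256) are hypothetical choices for illustration, not library defaults.

```typescript
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

// Subset of the OpenAI-style request fields named in this step.
interface ChatRequestLike {
  messages: ChatMessage[];
  stream?: boolean;
  stream_options?: { include_usage: boolean };
  temperature?: number;
  top_p?: number;
  max_tokens?: number;
  // Keys are token IDs, which are specific to the model's tokenizer.
  logit_bias?: Record<string, number>;
  logprobs?: boolean;
  top_logprobs?: number;
}

// Build a two-message request. The defaults are arbitrary illustration
// values and can be replaced through `overrides`.
function buildRequest(
  systemPrompt: string,
  userPrompt: string,
  overrides: Partial<Omit<ChatRequestLike, "messages">> = {},
): ChatRequestLike {
  return {
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: userPrompt },
    ],
    temperature: 0.7,
    max_tokens: 256,
    ...overrides,
  };
}

// Streaming request that also asks for usage statistics and logprobs.
const streamingRequest: ChatRequestLike = buildRequest(
  "You are a helpful assistant.",
  "Name three uses of WebGPU.",
  {
    stream: true,
    stream_options: { include_usage: true },
    logprobs: true,
    top_logprobs: 2,
  },
);
console.log(JSON.stringify(streamingRequest, null, 2));
```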

Step 4: Execute Inference

Call engine.chat.completions.create() with the request object. In non-streaming mode, this returns a complete ChatCompletion response object. In streaming mode, it returns an AsyncGenerator that yields ChatCompletionChunk objects as tokens are generated.

Non-streaming mode:

Returns a single ChatCompletion object with choices array
Each choice contains the full message content and finish reason
Usage statistics (prompt tokens, completion tokens) are included

Streaming mode:

Returns an AsyncGenerator of ChatCompletionChunk objects
Each chunk contains a delta with incremental content
Usage statistics are included in the final chunk when stream_options.include_usage is true
Iterate with a for await loop to process chunks as they arrive
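
The streaming loop described above can be sketched against any async iterable of chunks. The ChunkLike and UsageLike types are local structural stand-ins for ChatCompletionChunk, and collectStream and mockStream are hypothetical names. The sketch assumes, following the OpenAI convention, that the usage-bearing final chunk may arrive with an empty choices array, so the delta is read defensively.

```typescript
interface UsageLike {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
}

// Structural stand-in for a streamed chunk.
interface ChunkLike {
  choices: Array<{ delta: { content?: string | null } }>;
  usage?: UsageLike | null;
}

// Consume a chunk stream: forward each text delta as it arrives,
// accumulate the full text, and remember usage statistics if a chunk has them.
async function collectStream(
  chunks: AsyncIterable<ChunkLike>,
  onDelta: (text: string) => void = () => {},
): Promise<{ text: string; usage: UsageLike | undefined }> {
  let text = "";
  let usage: UsageLike | undefined;
  for await (const chunk of chunks) {
    // The usage chunk may have no choices, hence the optional chaining.
    const piece = chunk.choices[0]?.delta?.content ?? "";
    if (piece) {
      text += piece;
      onDelta(piece);
    }
    if (chunk.usage) {
      usage = chunk.usage;
    }
  }
  return { text, usage };
}

// Mock stream standing in for the engine's output, with made-up contents.
async function* mockStream(): AsyncGenerator<ChunkLike> {
  yield { choices: [{ delta: { content: "Hello" } }] };
  yield { choices: [{ delta: { content: ", world" } }] };
  yield {
    choices: [],
    usage: { prompt_tokens: 5, completion_tokens: 3, total_tokens: 8 },
  };
}

collectStream(mockStream(), (piece) => console.log("delta:", piece)).then(
  (result) => console.log("full text:", result.text, "usage:", result.usage),
);
```

With a real engine, the chunks argument would be the value returned by awaiting engine.chat.completions.create() on a request with stream: true and stream_options.include_usage set to true.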

Step 5: Process the Response

Extract the generated text from the response and present it to the user. For non-streaming responses, read the message content directly from the choices array. For streaming responses, concatenate the delta content from each chunk. Optionally inspect usage statistics for performance monitoring.

Key considerations:

Non-streaming: access via reply.choices[0].message.content
Streaming: accumulate chunk.choices[0].delta.content across iterations
engine.getMessage() returns a promise that resolves to the full concatenated message once streaming completes
Usage object reports prompt_tokens, completion_tokens, and total_tokens
The engine can be reused for multiple requests without reloading
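
For the non-streaming case, reading the reply and its usage statistics can be sketched as follows. CompletionLike is a local structural stand-in for the ChatCompletion response, summarizeReply is a hypothetical helper, and the sample reply and token counts are made up for illustration.

```typescript
// Structural stand-in for a non-streaming response.
interface CompletionLike {
  choices: Array<{
    message: { content: string | null };
    finish_reason: string | null;
  }>;
  usage?: {
    prompt_tokens: number;
    completion_tokens: number;
    total_tokens: number;
  };
}

interface ReplySummary {
  text: string;
  finishReason: string | null;
  usageLine: string;
}

// Pull the first choice's text and finish reason out of a reply and
// format the usage statistics as a single line for logging.
function summarizeReply(reply: CompletionLike): ReplySummary {
  const first = reply.choices[0];
  const text = first?.message.content ?? "";
  const finishReason = first?.finish_reason ?? null;
  const u = reply.usage;
  const usageLine = u
    ? `prompt=${u.prompt_tokens} completion=${u.completion_tokens} total=${u.total_tokens}`
    : "usage unavailable";
  return { text, finishReason, usageLine };
}

// Made-up reply used only to exercise the helper.
const sampleReply: CompletionLike = {
  choices: [{ message: { content: "Paris." }, finish_reason: "stop" }],
  usage: { prompt_tokens: 12, completion_tokens: 2, total_tokens: 14 },
};

const summary: ReplySummary = summarizeReply(sampleReply);
console.log(summary.text);
console.log(summary.usageLine);
```

Returning an empty string for a missing or null content field keeps the caller's rendering code free of null checks. Callers that need to distinguish an empty reply from a missing one can inspect finishReason instead.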
