Workflow:Mlc ai Web llm Basic Chat Completion

From Leeroopedia
Knowledge Sources
Domains: LLMs, WebGPU, Browser_Inference
Last Updated: 2026-02-14 22:00 GMT

Overview

End-to-end process for running large language model chat completion directly in the browser using WebGPU, with an OpenAI-compatible API.

Description

This workflow covers the fundamental use case of web-llm: loading a pre-compiled LLM into the browser and performing chat completion inference. The engine downloads model weights and WASM runtime artifacts (caching them in the browser for subsequent loads), initializes the TVM WebGPU runtime, and exposes an OpenAI-compatible chat.completions API. Both non-streaming (full response at once) and streaming (delta chunks via AsyncGenerator) modes are supported. The workflow also covers generation parameters such as temperature, top-p, max tokens, logit bias, and logprobs.

Usage

Execute this workflow when you need to run a conversational LLM entirely in the user's browser without any server-side inference infrastructure. This is the starting point for any web-llm integration: you have a web application and want to add local, private LLM chat capabilities using WebGPU acceleration.

Execution Steps

Step 1: Select a Model

Choose a model from the web-llm pre-built model registry. The registry contains many pre-configured models (Llama, Phi, Gemma, Qwen, Mistral, SmolLM, etc.), each in several quantized variants. Each model entry specifies the model ID, the Hugging Face URL of the weights, the WASM library URL, an estimated VRAM requirement, and optional overrides such as context window size.

Key considerations:

  • Check the model's VRAM requirement against the user's GPU capacity
  • Models are identified by their model ID string (e.g., "Llama-3.1-8B-Instruct-q4f32_1-MLC")
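  • Quantized variants (e.g., q4f16_1, q4f32_1) reduce memory requirements at the cost of some quality; the f16 variants are smaller than their f32 counterparts but require a GPU and browser that support 16-bit floats in WebGPU shaders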
  • Custom models can be registered via an AppConfig object with model_list entries
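The VRAM check above can be sketched as a simple filter over registry entries. This is a minimal illustration, not the library's own selection logic: the `ModelEntry` interface is a local stand-in that mirrors the fields the sketch reads (`model_id`, `vram_required_MB`, `low_resource_required`, as understood from web-llm's model records), the sample entries and their VRAM figures are invented, and the budget is an application-chosen number because the browser does not report GPU memory directly.

```typescript
// Local stand-in for the registry fields this sketch reads.
// In an app, entries would come from prebuiltAppConfig.model_list
// (exported by @mlc-ai/web-llm) rather than the sample array below.
interface ModelEntry {
  model_id: string;
  vram_required_MB?: number;
  low_resource_required?: boolean;
}

// Return the IDs of models whose declared VRAM need fits the budget,
// largest first, so the caller can try the most capable model that fits.
// Entries without a declared VRAM figure are skipped.
function modelsWithinBudget(models: ModelEntry[], budgetMB: number): string[] {
  return models
    .filter((m) => m.vram_required_MB !== undefined && m.vram_required_MB <= budgetMB)
    .sort((a, b) => (b.vram_required_MB ?? 0) - (a.vram_required_MB ?? 0))
    .map((m) => m.model_id);
}

// Hypothetical entries; IDs and VRAM figures are made up for illustration.
const sampleRegistry: ModelEntry[] = [
  { model_id: "example-8B-q4f32_1-MLC", vram_required_MB: 6000 },
  { model_id: "example-1B-q4f16_1-MLC", vram_required_MB: 900, low_resource_required: true },
  { model_id: "example-3B-q4f16_1-MLC", vram_required_MB: 2300 },
];

console.log(modelsWithinBudget(sampleRegistry, 4000));
```

Sorting largest-first is a design choice for this sketch: an application might instead prefer the smallest model to shorten the first download.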

Step 2: Create the Engine

Initialize the MLCEngine using the CreateMLCEngine factory function. This triggers the full model loading pipeline: checking the browser cache for existing artifacts, downloading missing WASM libraries and model weights from the configured URLs, and setting up the TVM WebGPU runtime. A progress callback function can be provided to display loading status to the user.

What happens:

  • The factory function resolves the model ID against the model registry
  • WASM runtime library and model weight shards are fetched (or loaded from Cache API / IndexedDB)
  • The TVM runtime is initialized with WebGPU backend
  • An LLMChatPipeline is created for inference
  • The returned engine implements the full MLCEngineInterface
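Engine creation with a progress callback might look like the sketch below. To keep it runnable without the package or a GPU, the factory is passed in as a parameter with a structural type; in an application, `CreateMLCEngine` imported from `@mlc-ai/web-llm` should fit that slot. The `InitProgressReport` interface is a local mirror of the report shape as understood from the library, and `formatProgress` and `loadEngine` are hypothetical helper names.

```typescript
// Local mirror of the report passed to initProgressCallback.
interface InitProgressReport {
  progress: number; // assumed to be a fraction between 0 and 1
  timeElapsed: number; // elapsed loading time
  text: string; // human-readable status message
}

// Structural type for the factory so the sketch runs without the package.
// In an app, pass CreateMLCEngine from "@mlc-ai/web-llm" here.
type EngineFactory<E> = (
  modelId: string,
  engineConfig: { initProgressCallback: (report: InitProgressReport) => void },
) => Promise<E>;

// Turn a progress report into a single status line for the UI.
function formatProgress(report: InitProgressReport): string {
  const percent = Math.round(report.progress * 100);
  return `[${percent}%] ${report.text}`;
}

// Create the engine and forward every progress report to onStatus.
async function loadEngine<E>(
  create: EngineFactory<E>,
  modelId: string,
  onStatus: (line: string) => void,
): Promise<E> {
  return create(modelId, {
    initProgressCallback: (report) => onStatus(formatProgress(report)),
  });
}
```

A typical call would be `loadEngine(CreateMLCEngine, "Llama-3.1-8B-Instruct-q4f32_1-MLC", (line) => statusElement.textContent = line)`, where `statusElement` is whatever element the page uses to show loading status.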

Step 3: Configure the Request

Build a ChatCompletionRequest object following the OpenAI Chat Completion API format. The request includes the conversation messages array (system, user, assistant roles), generation parameters, and optional features like logprobs or logit bias.

Key considerations:

  • Messages follow the OpenAI format: array of objects with role and content fields
  • Set stream: true for streaming mode; omit the field or set it to false for non-streaming mode
  • Optional parameters include temperature, top_p, max_tokens, n (number of completions), presence_penalty, frequency_penalty
  • Logit bias can steer generation by boosting or suppressing specific token IDs
  • Set logprobs: true and top_logprobs for token probability inspection
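A request following these considerations could be assembled as below. The types are a local subset of the OpenAI-style request fields named above, not the library's own `ChatCompletionRequest` type. `buildRequest` and `suppressTokens` are hypothetical helpers, and the parameter values are arbitrary examples. The sketch assumes the OpenAI convention that a `logit_bias` of -100 effectively bans a token.

```typescript
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

// Local subset of the OpenAI-style request fields discussed in this step.
interface ChatRequest {
  messages: ChatMessage[];
  stream?: boolean;
  stream_options?: { include_usage: boolean };
  temperature?: number;
  top_p?: number;
  max_tokens?: number;
  logit_bias?: Record<string, number>;
  logprobs?: boolean;
  top_logprobs?: number;
}

// Build a two-message request; the sampling values are arbitrary examples.
function buildRequest(systemPrompt: string, userText: string, streaming: boolean): ChatRequest {
  const request: ChatRequest = {
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: userText },
    ],
    temperature: 0.7,
    top_p: 0.95,
    max_tokens: 256,
    logprobs: true,
    top_logprobs: 2,
  };
  if (streaming) {
    request.stream = true;
    // Ask for usage statistics in the final chunk of the stream.
    request.stream_options = { include_usage: true };
  }
  return request;
}

// Return a copy of the request that strongly suppresses the given token IDs.
// Token IDs are tokenizer-specific; callers must look them up for their model.
function suppressTokens(request: ChatRequest, tokenIds: number[]): ChatRequest {
  const bias: Record<string, number> = { ...(request.logit_bias ?? {}) };
  for (const id of tokenIds) bias[String(id)] = -100;
  return { ...request, logit_bias: bias };
}
```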

Step 4: Execute Inference

Call engine.chat.completions.create() with the request object. In non-streaming mode, this returns a complete ChatCompletion response object. In streaming mode, it returns an AsyncGenerator that yields ChatCompletionChunk objects as tokens are generated.

Non-streaming mode:

  • Returns a single ChatCompletion object with choices array
  • Each choice contains the full message content and finish reason
  • Usage statistics (prompt tokens, completion tokens) are included
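A non-streaming call can be sketched as follows. The `NonStreamingEngine` interface is a structural stand-in covering only the one method used here, so the sketch runs against a fake engine; it is not the library's `MLCEngineInterface`. The helper name `ask` is hypothetical, and treating a `finish_reason` of "length" as truncation assumes the OpenAI convention.

```typescript
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
}

// Local subset of the non-streaming response shape.
interface Completion {
  choices: {
    message: { role: string; content: string | null };
    finish_reason: string | null;
  }[];
  usage?: Usage;
}

// Structural stand-in for the part of the engine this sketch calls.
interface NonStreamingEngine {
  chat: {
    completions: {
      create(request: {
        messages: { role: string; content: string }[];
        stream?: false;
      }): Promise<Completion>;
    };
  };
}

// Send one user message and return the text, a truncation flag, and usage.
async function ask(
  engine: NonStreamingEngine,
  question: string,
): Promise<{ text: string; truncated: boolean; usage: Usage | undefined }> {
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: question }],
  });
  const choice = reply.choices[0];
  if (!choice) throw new Error("response contained no choices");
  return {
    text: choice.message.content ?? "",
    // "length" is assumed to mean the max_tokens limit cut the reply short.
    truncated: choice.finish_reason === "length",
    usage: reply.usage,
  };
}
```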

Streaming mode:

  • Returns an AsyncGenerator of ChatCompletionChunk objects
  • Each chunk contains a delta with incremental content
  • Usage statistics are included in the final chunk when stream_options.include_usage is true
  • Iterate with a for await loop to process chunks as they arrive
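The streaming loop described above can be sketched as a small collector. It accepts any `AsyncIterable` of chunks, so it can be exercised with a fake generator; in an application the iterable would be the value returned by `engine.chat.completions.create()` with `stream: true`. The `Chunk` interface is a local subset of the chunk shape and `collectStream` is a hypothetical helper name. The optional chaining guards against a chunk with an empty `choices` array, which is how the usage-only final chunk looks in the OpenAI format.

```typescript
// Local subset of the streaming chunk shape.
interface Chunk {
  choices: { delta: { content?: string | null } }[];
  usage?: { prompt_tokens: number; completion_tokens: number; total_tokens: number } | null;
}

// Consume a chunk stream: forward each text piece to onDelta as it arrives,
// and return the concatenated text plus usage statistics if any were sent.
async function collectStream(
  chunks: AsyncIterable<Chunk>,
  onDelta: (piece: string) => void,
): Promise<{ text: string; usage: Chunk["usage"] }> {
  let text = "";
  let usage: Chunk["usage"] = undefined;
  for await (const chunk of chunks) {
    // A chunk may carry no choices (for example, a usage-only final chunk).
    const piece = chunk.choices[0]?.delta?.content ?? "";
    if (piece) {
      text += piece;
      onDelta(piece);
    }
    if (chunk.usage) usage = chunk.usage;
  }
  return { text, usage };
}
```

Passing a callback such as `(piece) => { output.textContent += piece; }`, with `output` being a page element of the application's choosing, renders the reply incrementally while the returned `text` holds the full message.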

Step 5: Process the Response

Extract the generated text from the response and present it to the user. For non-streaming responses, read the message content directly from the choices array. For streaming responses, concatenate the delta content from each chunk. Optionally inspect usage statistics for performance monitoring.

Key considerations:

  • Non-streaming: access via reply.choices[0].message.content
  • Streaming: accumulate chunk.choices[0].delta.content across iterations
  • engine.getMessage() is asynchronous; awaiting it after streaming completes returns the full concatenated message
  • Usage object reports prompt_tokens, completion_tokens, and total_tokens
  • The engine can be reused for multiple requests without reloading
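Reusing the engine for a multi-turn conversation mostly comes down to carrying the message history forward. The sketch below assumes the OpenAI-style convention in which each request carries the full history; `Turn`, `Reply`, `recordExchange`, and `messagesForNextTurn` are hypothetical names for illustration, and the types are local subsets of the response shape.

```typescript
interface Turn {
  role: "system" | "user" | "assistant";
  content: string;
}

// Local subset of the non-streaming response shape.
interface Reply {
  choices: { message: { content: string | null } }[];
}

// Return a new history with the user's message and the model's answer appended.
// The input array is left untouched.
function recordExchange(history: Turn[], userText: string, reply: Reply): Turn[] {
  const answer = reply.choices[0]?.message.content ?? "";
  return [
    ...history,
    { role: "user", content: userText },
    { role: "assistant", content: answer },
  ];
}

// Build the messages array for the next request on the same engine.
function messagesForNextTurn(history: Turn[], nextUserText: string): Turn[] {
  return [...history, { role: "user", content: nextUserText }];
}
```

For streamed replies, the same history update applies with the accumulated delta text in place of `reply.choices[0].message.content`.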
