Principle:Ggml org Ggml WebGPU Computation
| Field | Value |
|---|---|
| sources | GGML, WebGPU Specification, WGSL Specification |
| domains | GPU, Web, WebGPU |
| last_updated | 2026-02-10 |
Overview
WebGPU Computation is the principle of enabling GPU-accelerated tensor inference in web browsers and native applications via the WebGPU API, using WGSL (WebGPU Shading Language) compute shaders for portable, sandboxed GPU access.
Description
WebGPU is a modern graphics and compute API designed as the successor to WebGL. Unlike WebGL (which is based on OpenGL ES and has no compute shaders), WebGPU provides explicit compute shader support, making it suitable for general-purpose GPU computation, including neural network inference. WebGPU is available both in web browsers (Chrome, Firefox, Safari) via JavaScript/WebAssembly and in native applications via implementations like Dawn (Google) and wgpu (a Rust implementation that Firefox uses).
The GGML WebGPU backend enables inference directly in web browsers, bringing large language model inference to any device with a modern browser and GPU -- without requiring native installations, GPU driver management, or platform-specific builds.
WGSL Compute Shaders
WebGPU compute shaders are written in WGSL (WebGPU Shading Language), a new shading language designed specifically for WebGPU. WGSL has a Rust-inspired syntax with explicit type annotations and memory qualifiers. The GGML backend stores WGSL shader source in the wgsl-shaders/ directory and compiles them at runtime via the WebGPU API.
Key WGSL features used by the backend:
- @compute @workgroup_size(x, y, z) -- Declares a compute shader entry point with specified workgroup dimensions
- storage buffers -- Read/write access to GPU-side tensor data
- workgroup (shared) memory -- Fast shared memory within a workgroup for cooperative algorithms
- subgroup operations -- Efficient reductions within a GPU subgroup (a wavefront or warp), on devices that support them
Shader Library Architecture
The backend uses a preprocessor (pre_wgsl.hpp) that performs compile-time processing of WGSL shader sources, and a shader library (ggml-webgpu-shader-lib.hpp) that manages the collection of compiled shaders. This allows efficient shader variant management for different quantization types and operation configurations.
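To make the idea of shader variants concrete, the following is a minimal sketch of template expansion plus a variant cache. It is not the API of pre_wgsl.hpp or ggml-webgpu-shader-lib.hpp: the `{{NAME}}` placeholder syntax, `expand_wgsl_template`, and `ShaderVariantCache` are all hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <unordered_map>

// Hypothetical helper (not the pre_wgsl.hpp API): replaces every "{{NAME}}"
// placeholder in a WGSL template with the value registered for NAME.
std::string expand_wgsl_template(const std::string & tmpl,
                                 const std::map<std::string, std::string> & defs) {
    std::string out = tmpl;
    for (const auto & def : defs) {
        assert(!def.first.empty());
        const std::string token = "{{" + def.first + "}}";
        std::size_t pos = 0;
        while ((pos = out.find(token, pos)) != std::string::npos) {
            out.replace(pos, token.size(), def.second);
            pos += def.second.size();  // continue after the inserted text
        }
    }
    return out;
}

// Hypothetical cache keyed by "<op>:<type>", so the source for each variant
// is generated once and reused on later lookups.
class ShaderVariantCache {
public:
    const std::string & get(const std::string & op, const std::string & type,
                            const std::string & tmpl) {
        const std::string key = op + ":" + type;
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            std::map<std::string, std::string> defs;
            defs["TYPE"] = type;
            it = cache_.emplace(key, expand_wgsl_template(tmpl, defs)).first;
            ++misses_;
        }
        return it->second;
    }
    std::size_t misses() const { return misses_; }

private:
    std::unordered_map<std::string, std::string> cache_;
    std::size_t misses_ = 0;
};
```

In this sketch, requesting the same operation and type twice generates the source only once. A real shader library would more likely cache the compiled pipeline than the source string.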
Device and Pipeline Management
The WebGPU execution model involves:
- wgpu::Device -- Represents the GPU device, obtained via adapter request
- wgpu::ComputePipeline -- A compiled compute pipeline (shader + layout)
- wgpu::BindGroup -- Binds GPU buffers to shader resource slots
- wgpu::CommandEncoder -- Records GPU commands (dispatches, copies)
- wgpu::Queue -- Submits encoded command buffers for execution
Emscripten Integration
When compiled for the web via Emscripten, the backend uses the Emscripten WebGPU bindings, which map C++ WebGPU calls to the browser's JavaScript WebGPU API. This enables the same C++ codebase to run both natively and in-browser.
Usage
Apply WebGPU computation when:
- Running inference in a web browser (Chrome, Firefox, or Safari with WebGPU support)
- Building cross-platform applications that need GPU acceleration without platform-specific GPU APIs
- Deploying privacy-sensitive workloads where model inference must happen client-side in the browser
- Distributing an application without a native installation (just a web page)
WebGPU is particularly valuable for:
- Web-based AI applications -- Chatbots, text generation, and other LLM applications running entirely in-browser
- Edge inference -- Running models on end-user devices without server infrastructure
- Prototyping -- Quick iteration without compiling native GPU backends
Limitations to consider:
- WebGPU's sandboxed model adds overhead (for example, API validation and shader bounds checks) compared to native Vulkan or Metal
- Browser implementations may have buffer size limits and other restrictions
- Not all quantization types or operations may be supported
- Performance may vary across browsers and GPU vendors
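As an illustration of planning around buffer size limits, the sketch below splits a large allocation into ranges that each fit within a binding-size limit. It assumes the limit and offset alignment are passed in by the caller; the usage note uses what I believe are the WebGPU specification's default values (128 MiB for `maxStorageBufferBindingSize`, 256 bytes for `minStorageBufferOffsetAlignment`), but real code should query the device. `BufferRange` and `split_for_binding` are hypothetical names, and nothing here is claimed to match how the GGML backend handles large tensors.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one bindable slice of a larger buffer.
struct BufferRange {
    uint64_t offset;
    uint64_t size;
};

// Splits `total_bytes` into ranges no larger than `max_binding_bytes`, with
// every range starting at a multiple of `offset_alignment`.
std::vector<BufferRange> split_for_binding(uint64_t total_bytes,
                                           uint64_t max_binding_bytes,
                                           uint64_t offset_alignment) {
    assert(offset_alignment > 0 && max_binding_bytes >= offset_alignment);
    // Round the chunk size down to the alignment so each chunk starts aligned.
    const uint64_t chunk = (max_binding_bytes / offset_alignment) * offset_alignment;
    std::vector<BufferRange> ranges;
    for (uint64_t off = 0; off < total_bytes; off += chunk) {
        const uint64_t remaining = total_bytes - off;
        ranges.push_back({off, remaining < chunk ? remaining : chunk});
    }
    return ranges;
}
```

For example, a 300 MiB buffer with a 128 MiB limit and 256-byte alignment yields three ranges of 128, 128, and 44 MiB.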
Theoretical Basis
The WebGPU execution model for GGML tensor operations:
Initialization:
    1. Request adapter:
        wgpu::Adapter adapter = instance.requestAdapter(options)
        -- Browser selects appropriate GPU (integrated or discrete)
    2. Request device:
        wgpu::Device device = adapter.requestDevice(descriptor)
        -- descriptor specifies required limits (max buffer size, workgroup size, etc.)
    3. Compile shader modules:
        For each WGSL shader source:
            wgpu::ShaderModule module = device.createShaderModule({code: wgsl_source})
    4. Create compute pipelines:
        For each operation variant:
            wgpu::ComputePipeline pipeline = device.createComputePipeline({
                compute: {module: shader_module, entryPoint: "main"}
            })
    5. Allocate GPU buffers:
        wgpu::Buffer buf = device.createBuffer({
            size: tensor_size,
            usage: Storage | CopyDst | CopySrc
        })
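One detail the buffer allocation step glosses over is size alignment: WebGPU buffer-to-buffer copies require sizes that are multiples of 4 bytes, so a tensor whose byte size is not (three f16 values, for instance) needs padding before it can be copied whole. A minimal sketch, with `align_up` and `padded_buffer_size` as hypothetical helper names:

```cpp
#include <cassert>
#include <cstdint>

// Rounds `size` up to the next multiple of `alignment`, which must be a
// power of two.
uint64_t align_up(uint64_t size, uint64_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size + alignment - 1) & ~(alignment - 1);
}

// Hypothetical helper: byte size to request for a tensor buffer so that
// whole-buffer copies satisfy the 4-byte copy-size rule.
uint64_t padded_buffer_size(uint64_t tensor_bytes) {
    return align_up(tensor_bytes, 4);
}
```

The same `align_up` helper can be reused for offset alignment, for example rounding a binding offset up to the device's storage-buffer offset alignment.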
Graph Execution:
    1. Create command encoder:
        wgpu::CommandEncoder encoder = device.createCommandEncoder()
    2. For each node in the computation graph:
        a. Create bind group:
            wgpu::BindGroup bindGroup = device.createBindGroup({
                entries: [
                    {binding: 0, resource: {buffer: src0_buffer}},
                    {binding: 1, resource: {buffer: src1_buffer}},
                    {binding: 2, resource: {buffer: dst_buffer}},
                    {binding: 3, resource: {buffer: params_buffer}}
                ]
            })
        b. Begin compute pass:
            wgpu::ComputePassEncoder pass = encoder.beginComputePass()
        c. Dispatch:
            pass.setPipeline(pipeline)
            pass.setBindGroup(0, bindGroup)
            pass.dispatchWorkgroups(
                ceil(N / workgroup_x),
                ceil(M / workgroup_y),
                batch_size
            )
            pass.end()
    3. Record readback copy (if needed):
        -- A storage buffer cannot be mapped directly, so copy the result into
        -- a staging buffer created with usage MapRead | CopyDst
        encoder.copyBufferToBuffer(dst_buffer, 0, staging_buffer, 0, size)
    4. Submit:
        wgpu::CommandBuffer commands = encoder.finish()
        device.getQueue().submit(1, &commands)
    5. Readback (if needed):
        -- Map the staging buffer for reading
        staging_buffer.mapAsync(wgpu::MapMode::Read, 0, size, callback)
        -- In callback: memcpy from getMappedRange() to host buffer, then
        staging_buffer.unmap()
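The workgroup counts in the dispatch step need care in integer arithmetic: `ceil(N / workgroup_x)` truncates before rounding if `N / workgroup_x` is evaluated as an integer division. The sketch below computes the counts with a ceiling division and checks them against a per-dimension limit. The limit is a parameter; the WebGPU default for `maxComputeWorkgroupsPerDimension` is, to my knowledge, 65535, but the device should be queried. `DispatchSize`, `ceil_div`, and `compute_dispatch` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Workgroup counts to pass to dispatchWorkgroups(x, y, z).
struct DispatchSize {
    uint32_t x, y, z;
};

// Integer ceiling division: the smallest q such that q * d >= n.
uint32_t ceil_div(uint32_t n, uint32_t d) {
    assert(d > 0);
    return n / d + (n % d != 0 ? 1u : 0u);
}

// Fills `out` for an output of `rows` x `cols` elements per batch entry.
// Columns map to x and rows to y. Returns false if any count exceeds
// `max_per_dim`, in which case the work would have to be split across
// several dispatches.
bool compute_dispatch(uint32_t rows, uint32_t cols, uint32_t batch,
                      uint32_t wg_x, uint32_t wg_y, uint32_t max_per_dim,
                      DispatchSize & out) {
    out.x = ceil_div(cols, wg_x);
    out.y = ceil_div(rows, wg_y);
    out.z = batch;
    return out.x <= max_per_dim && out.y <= max_per_dim && out.z <= max_per_dim;
}
```

With a 16 x 16 workgroup, a 100 x 70 output and a batch of 2 gives a dispatch of (5, 7, 2). The partially filled edge workgroups are why a shader usually needs bounds checks.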
WGSL Shader (conceptual matrix multiply):
    // Conceptual only: assumes M, N, and K are multiples of 16, so bounds checks are omitted.
    const TILE_K : u32 = 16u;       // tile depth along K, equal to the workgroup width
    const TILE_SIZE : u32 = 256u;   // 16 x 16 elements per tile
    @group(0) @binding(0) var<storage, read> A : array<f32>;
    @group(0) @binding(1) var<storage, read> B : array<f32>;
    @group(0) @binding(2) var<storage, read_write> C : array<f32>;
    struct Params { M: u32, N: u32, K: u32 }
    @group(0) @binding(3) var<uniform> params : Params;
    var<workgroup> tile_A : array<f32, TILE_SIZE>;
    var<workgroup> tile_B : array<f32, TILE_SIZE>;
    @compute @workgroup_size(16, 16)
    fn main(@builtin(global_invocation_id) gid : vec3<u32>,
            @builtin(local_invocation_id) lid : vec3<u32>) {
        let row = gid.y;
        let col = gid.x;
        var sum : f32 = 0.0;
        for (var k : u32 = 0u; k < params.K; k = k + TILE_K) {
            // Cooperative tile loading: each invocation loads one element of A and one of B
            tile_A[lid.y * TILE_K + lid.x] = A[row * params.K + k + lid.x];
            tile_B[lid.y * TILE_K + lid.x] = B[(k + lid.y) * params.N + col];
            workgroupBarrier();
            // Accumulate the partial dot product over this tile
            for (var t : u32 = 0u; t < TILE_K; t = t + 1u) {
                sum = sum + tile_A[lid.y * TILE_K + t] * tile_B[t * TILE_K + lid.x];
            }
            // Wait until every invocation is done reading before the tiles are overwritten
            workgroupBarrier();
        }
        C[row * params.N + col] = sum;
    }
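When validating a shader like the one above, a CPU reference that accumulates in the same K-tiled order is useful for comparing results. The following is a minimal sketch under the same layout assumptions (row-major A of size M x K, B of size K x N); `matmul_tiled` is a hypothetical name and is not part of GGML.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for C = A * B with row-major A (M x K) and B (K x N).
// The K dimension is processed in blocks of `tile`, mirroring the outer
// loop of the shader; unlike the shader, a partial final tile is handled.
std::vector<float> matmul_tiled(const std::vector<float> & A,
                                const std::vector<float> & B,
                                std::size_t M, std::size_t N, std::size_t K,
                                std::size_t tile) {
    assert(A.size() == M * K && B.size() == K * N && tile > 0);
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t k0 = 0; k0 < K; k0 += tile) {
        const std::size_t k1 = std::min(k0 + tile, K);
        for (std::size_t row = 0; row < M; ++row) {
            for (std::size_t col = 0; col < N; ++col) {
                float sum = 0.0f;
                // Partial dot product over the current K tile.
                for (std::size_t k = k0; k < k1; ++k) {
                    sum += A[row * K + k] * B[k * N + col];
                }
                C[row * N + col] += sum;
            }
        }
    }
    return C;
}
```

Floating-point summation order differs slightly between this reference and a GPU run, so comparisons on real data should use a tolerance rather than exact equality.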
Related Pages
- Implementation:Ggml_org_Ggml_Webgpu_backend
- Ggml_org_Ggml_Webgpu_backend -- The backend implementation that applies this principle
- Ggml_org_Ggml_Vulkan_GPU_Computation -- Native GPU compute alternative using Vulkan
- Ggml_org_Ggml_OpenCL_GPU_Computation -- Alternative cross-platform GPU compute using OpenCL
- Ggml_org_Ggml_CPU_Compute_Engine -- CPU fallback used alongside WebGPU