Implementation:Ggml org Ggml Webgpu shader lib
| File Name | src/ggml-webgpu/ggml-webgpu-shader-lib.hpp |
| Repository | ggml-org/ggml |
| Lines | 537 |
| Language | C++ |
| Domain Tags | GPU_Computing, Shader_Management, WebGPU |
| Status | Active |
| Last Updated | 2025-05-15 12:00 GMT |
| Knowledge Sources | ggml-org/ggml repository |
Overview
ggml-webgpu-shader-lib.hpp is the shader library for the WebGPU backend. It provides pipeline key types, shader processing infrastructure, and workgroup configuration logic for all supported operations. It acts as the shader specialization layer: it selects a shader variant based on operation parameters and hardware capabilities, so each operation runs with a configuration suited to the device.
Description
The file defines a pipeline key struct for each operation category, along with a matching hash function so the key can be used in a std::unordered_map. Each key captures the parameters that differentiate shader variants:
- Flash Attention -- ggml_webgpu_flash_attn_pipeline_key encodes KV type, head dimensions (QK and V), KV direct access, mask presence, sink tokens, and logit softcap
- Generic Operations -- ggml_webgpu_generic_shader_lib_context for standard operations
- Pad, Argsort, Set-Rows, Unary, Binary -- Specialized key types for each operation category
The ggml_webgpu_processed_shader struct holds the processed WGSL code, variant name, and decision parameters. Decision structs encode runtime choices like tile sizes (q_tile, kv_tile), workgroup sizes, and subgroup matrix dimensions.
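As a rough illustration of how a type-erased decisions pointer can be read back, the sketch below uses local stand-in structs that mirror the signatures listed on this page. The names processed_shader, flash_attn_decisions, and flash_attn_decisions_of are hypothetical and exist only for this example; the real header may manage ownership of the decisions object differently.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Local stand-in mirroring ggml_webgpu_flash_attn_shader_decisions (hypothetical name).
struct flash_attn_decisions {
    uint32_t q_tile  = 0;
    uint32_t kv_tile = 0;
    uint32_t wg_size = 0;
};

// Local stand-in mirroring ggml_webgpu_processed_shader (hypothetical name).
struct processed_shader {
    std::string wgsl;       // processed WGSL source
    std::string variant;    // variant name
    void *      decisions;  // type-erased; the consumer must know the concrete type
};

// Hypothetical helper: recover the flash-attention decisions from a processed shader.
// The caller is responsible for only using it on flash-attention shaders.
inline const flash_attn_decisions & flash_attn_decisions_of(const processed_shader & s) {
    assert(s.decisions != nullptr);
    return *static_cast<const flash_attn_decisions *>(s.decisions);
}
```

Because the pointer carries no type information, the cast is only safe when the shader was produced by the matching operation; that pairing is a convention the caller has to uphold.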
Key constants include:
- GGML_WEBGPU_FLASH_ATTN_PREFERRED_KV_SG_TILES = 8
- GGML_WEBGPU_FLASH_ATTN_PREFERRED_WG_SIZE = 128
- GGML_WEBGPU_KV_SEQ_PAD = 256 (matches GGML_PAD in llama-context.cpp)
- GGML_WEBGPU_ARGSORT_MERGE_MAX_WG_SIZE = 512
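To make the padding constant concrete, here is a minimal sketch of rounding a KV sequence length up to a multiple of 256. It assumes the constant is used as a round-up granularity; KV_SEQ_PAD and pad_to are illustrative names rather than identifiers from the header, and the bit trick requires a power-of-two granularity.

```cpp
#include <cassert>
#include <cstdint>

// Value taken from the constant list above; the name is local to this example.
constexpr uint32_t KV_SEQ_PAD = 256;

// Round x up to the next multiple of n. Only valid when n is a power of two.
inline uint32_t pad_to(uint32_t x, uint32_t n) {
    assert(n != 0 && (n & (n - 1)) == 0);  // power-of-two precondition
    return (x + n - 1) & ~(n - 1);
}
```

With this helper, a KV length of 257 pads to 512, while a length that is already a multiple of 256 is left unchanged.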
Usage
This library is used internally by ggml-webgpu.cpp to configure and cache shader pipelines.
#include "ggml-webgpu-shader-lib.hpp"
// Create a flash attention pipeline key
ggml_webgpu_flash_attn_pipeline_key key = {
.kv_type = GGML_TYPE_F16,
.head_dim_qk = 128,
.head_dim_v = 128,
.kv_direct = false,
.has_mask = true,
};
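// Compute workgroup memory requirements. q_tile and kv_tile are the tile sizes
// chosen for this variant (see ggml_webgpu_flash_attn_shader_decisions).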
size_t wg_mem = ggml_webgpu_flash_attn_wg_mem_bytes(q_tile, kv_tile,
key.head_dim_qk, key.head_dim_v, key.has_mask, key.kv_direct);
Code Reference
Source Location
| Repository | File | Lines |
|---|---|---|
| ggml-org/ggml | src/ggml-webgpu/ggml-webgpu-shader-lib.hpp | 537 |
Key Signatures
struct ggml_webgpu_processed_shader {
std::string wgsl;
std::string variant;
void * decisions;
};
struct ggml_webgpu_flash_attn_pipeline_key {
ggml_type kv_type;
uint32_t head_dim_qk;
uint32_t head_dim_v;
bool kv_direct, has_mask, has_sinks, uses_logit_softcap;
};
struct ggml_webgpu_flash_attn_shader_decisions {
uint32_t q_tile = 0;
uint32_t kv_tile = 0;
uint32_t wg_size = 0;
};
inline size_t ggml_webgpu_flash_attn_wg_mem_bytes(uint32_t q_tile, uint32_t kv_tile,
uint32_t head_dim_qk, uint32_t head_dim_v, bool has_mask, bool kv_direct);
template <typename T> inline void ggml_webgpu_hash_combine(size_t & seed, const T & value);
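The signature of ggml_webgpu_hash_combine suggests a seed-mixing helper that key hash functors call once per field. The sketch below shows one common way to write such a helper, using the Boost-style mixing step. The function body, demo_key, and demo_key_hash are assumptions made for illustration; the actual header may use a different mixing formula and its keys carry more fields.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>

// Boost-style combine step (assumed; the body in the real header may differ).
template <typename T>
inline void hash_combine(std::size_t & seed, const T & value) {
    seed ^= std::hash<T>{}(value) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Simplified, hypothetical key holding a subset of the flash-attention fields.
struct demo_key {
    uint32_t head_dim_qk;
    uint32_t head_dim_v;
    bool     has_mask;

    // Equality is required alongside the hash for use in std::unordered_map.
    bool operator==(const demo_key & o) const {
        return head_dim_qk == o.head_dim_qk && head_dim_v == o.head_dim_v && has_mask == o.has_mask;
    }
};

// Hash functor: fold every field that distinguishes a shader variant into the seed.
struct demo_key_hash {
    std::size_t operator()(const demo_key & k) const {
        std::size_t seed = 0;
        hash_combine(seed, k.head_dim_qk);
        hash_combine(seed, k.head_dim_v);
        hash_combine(seed, k.has_mask);
        return seed;
    }
};
```

Any field compared in operator== should also be fed to the hash; otherwise two keys that select different variants would share a bucket more often than necessary.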
I/O Contract
Inputs
- Pipeline keys -- Operation parameters that determine shader variant selection
- Hardware capabilities -- Subgroup size, workgroup memory limits, max subgroup size
Outputs
- Processed shaders -- WGSL code with appropriate macro substitutions and tiling decisions
- Shader decisions -- Optimal tile sizes and workgroup configurations
Usage Examples
Hash-based pipeline caching:
// Pipeline keys are hashable for unordered_map caching
std::unordered_map<ggml_webgpu_flash_attn_pipeline_key,
wgpu::ComputePipeline,
ggml_webgpu_flash_attn_pipeline_key_hash> fa_pipeline_cache;
// Lookup or create pipeline for given key
auto it = fa_pipeline_cache.find(key);
if (it == fa_pipeline_cache.end()) {
// Create and cache new pipeline variant
}
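The lookup-or-create pattern above can be sketched end to end without a GPU by substituting a plain struct for wgpu::ComputePipeline. Everything in this example (cache_key, cache_key_hash, fake_pipeline, pipeline_cache) is a hypothetical stand-in, and the real backend would compile a WGSL variant where this code builds a label.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical, reduced key: two fields are enough to show the pattern.
struct cache_key {
    uint32_t head_dim_qk;
    bool     has_mask;
    bool operator==(const cache_key & o) const {
        return head_dim_qk == o.head_dim_qk && has_mask == o.has_mask;
    }
};

struct cache_key_hash {
    std::size_t operator()(const cache_key & k) const {
        std::size_t seed = std::hash<uint32_t>{}(k.head_dim_qk);
        seed ^= std::hash<bool>{}(k.has_mask) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }
};

// Stand-in for wgpu::ComputePipeline so the example runs without WebGPU.
struct fake_pipeline {
    std::string label;
};

struct pipeline_cache {
    std::unordered_map<cache_key, fake_pipeline, cache_key_hash> entries;
    int builds = 0;  // counts how many variants were actually created

    const fake_pipeline & get_or_create(const cache_key & key) {
        auto it = entries.find(key);
        if (it == entries.end()) {
            // Cache miss: build the variant once and store it under its key.
            ++builds;
            fake_pipeline p{"hd" + std::to_string(key.head_dim_qk) + (key.has_mask ? "_mask" : "")};
            it = entries.emplace(key, std::move(p)).first;
        }
        return it->second;
    }
};
```

Repeated requests with the same key return the stored entry, so the expensive creation step runs once per distinct variant.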
Related Pages
Implements Principle
Related Implementations
- Implementation:Ggml_org_Ggml_Webgpu_backend -- Main backend using this shader library
- Implementation:Ggml_org_Ggml_Webgpu_wgsl_preprocessor -- WGSL preprocessor for shader compilation