Implementation:Ggml org Ggml Cpu wasm quants

From Leeroopedia


Metadata

Field              Value
Page Type          Implementation (Architecture-Specific SIMD)
Knowledge Sources  GGML
Domains            ML_Infrastructure, Tensor_Computing, SIMD_Optimization
Last Updated       2025-05-15 12:00 GMT

Overview

WebAssembly SIMD128-optimized quantization, dequantization, and dot product routines for GGML quantized tensor formats, enabling fast inference in web browsers and WASM runtimes.

Description

arch/wasm/quants.c implements WebAssembly SIMD128-specific acceleration for GGML quantization operations, targeting web browsers and WASM runtimes that support the SIMD128 proposal.

The implementation uses the wasm_simd128.h intrinsics API, which operates on 128-bit v128_t vectors. A v128_t carries no lane type of its own; the floating-point paths treat it as four float32 lanes:

  • wasm_v128_load -- load 128 bits from memory
  • wasm_f32x4_abs / wasm_f32x4_max -- absolute value and element-wise maximum
  • wasm_f32x4_mul / wasm_f32x4_splat -- multiply and broadcast scalar
  • wasm_i32x4_trunc_sat_f32x4 -- saturating float-to-integer truncation
  • wasm_f32x4_extract_lane / wasm_i32x4_extract_lane -- extract scalar from lane

The quantization pattern follows the standard approach used across all architectures: load eight groups of four floats (one 32-value block), find the block's maximum absolute value via tree reduction, compute a scale factor, multiply by the inverse scale, convert to integer, and store the packed results. Notably, the WASM implementation converts with wasm_i32x4_trunc_sat_f32x4, which truncates toward zero instead of rounding to nearest, requiring the scale to be pre-adjusted.
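The steps above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the code from quants.c: sketch_block_q8, sketch_quantize_row, and SKETCH_QK are hypothetical names, the scale is stored as a plain float for readability, and no scale pre-adjustment is applied. The SIMD path uses only the intrinsics listed above; the scalar path shows the same arithmetic for builds without SIMD128.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__wasm_simd128__)
#include <wasm_simd128.h>
#endif

#define SKETCH_QK 32 /* values per block: eight groups of four floats */

/* Hypothetical simplified block; the scale is kept as float for readability. */
typedef struct {
    float  d;              /* per-block scale */
    int8_t qs[SKETCH_QK];  /* quantized values */
} sketch_block_q8;

static void sketch_quantize_row(const float *x, sketch_block_q8 *y, int64_t k) {
    assert(k % SKETCH_QK == 0);
    const int64_t nb = k / SKETCH_QK;

    for (int64_t i = 0; i < nb; i++) {
        const float *xb = x + i * SKETCH_QK;
        float amax = 0.0f;

#if defined(__wasm_simd128__)
        v128_t srcv[8];
        v128_t amaxv[8];
        for (int j = 0; j < 8; j++) {
            srcv[j]  = wasm_v128_load(xb + 4 * j);
            amaxv[j] = wasm_f32x4_abs(srcv[j]);
        }
        /* Tree reduction: 8 vectors -> 4 -> 2 -> 1. */
        for (int j = 0; j < 4; j++) amaxv[2 * j] = wasm_f32x4_max(amaxv[2 * j], amaxv[2 * j + 1]);
        for (int j = 0; j < 2; j++) amaxv[4 * j] = wasm_f32x4_max(amaxv[4 * j], amaxv[4 * j + 2]);
        amaxv[0] = wasm_f32x4_max(amaxv[0], amaxv[4]);
        /* Horizontal maximum over the four remaining lanes. */
        const float l0 = wasm_f32x4_extract_lane(amaxv[0], 0);
        const float l1 = wasm_f32x4_extract_lane(amaxv[0], 1);
        const float l2 = wasm_f32x4_extract_lane(amaxv[0], 2);
        const float l3 = wasm_f32x4_extract_lane(amaxv[0], 3);
        const float m01 = l0 > l1 ? l0 : l1;
        const float m23 = l2 > l3 ? l2 : l3;
        amax = m01 > m23 ? m01 : m23;
#else
        for (int j = 0; j < SKETCH_QK; j++) {
            const float a = xb[j] < 0.0f ? -xb[j] : xb[j];
            if (a > amax) amax = a;
        }
#endif

        /* Scale maps the largest magnitude to 127; id is its inverse. */
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        y[i].d = d;

#if defined(__wasm_simd128__)
        const v128_t idv = wasm_f32x4_splat(id);
        for (int j = 0; j < 8; j++) {
            /* Saturating conversion that truncates toward zero. */
            const v128_t vi = wasm_i32x4_trunc_sat_f32x4(wasm_f32x4_mul(srcv[j], idv));
            y[i].qs[4 * j + 0] = (int8_t) wasm_i32x4_extract_lane(vi, 0);
            y[i].qs[4 * j + 1] = (int8_t) wasm_i32x4_extract_lane(vi, 1);
            y[i].qs[4 * j + 2] = (int8_t) wasm_i32x4_extract_lane(vi, 2);
            y[i].qs[4 * j + 3] = (int8_t) wasm_i32x4_extract_lane(vi, 3);
        }
#else
        for (int j = 0; j < SKETCH_QK; j++) {
            y[i].qs[j] = (int8_t) (xb[j] * id); /* C cast truncates toward zero */
        }
#endif
    }
}
```

Because the conversion truncates, a value that scales to 63.5 is stored as 63 rather than 64, and the block maximum may land on 126 or 127 depending on floating-point rounding of the inverse scale.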

Precomputed bit-expansion tables (table_b2b_0, table_b2b_1) support sub-byte format unpacking. All SIMD paths are guarded by #if defined(__wasm_simd128__) and fall back to scalar reference implementations otherwise.
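To illustrate what a bit-expansion table does, the sketch below builds two 256-entry tables at runtime. It rests on an assumption not stated on this page: that each input bit b expands to one output byte holding b << 4 in the first table and (!b) << 4 in the second, with bit 0 landing in the least significant byte. The names sketch_b2b_0, sketch_b2b_1, and sketch_init_b2b are hypothetical; the real tables are precomputed constants.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical runtime-built stand-ins for the precomputed tables. */
static uint64_t sketch_b2b_0[256]; /* bit b  -> byte ( b) << 4 (assumed layout) */
static uint64_t sketch_b2b_1[256]; /* bit b  -> byte (!b) << 4 (assumed layout) */

static void sketch_init_b2b(void) {
    for (int v = 0; v < 256; v++) {
        uint64_t set = 0;
        uint64_t clear = 0;
        for (int bit = 0; bit < 8; bit++) {
            const uint64_t b = (uint64_t) ((v >> bit) & 1);
            /* Each source bit becomes one byte of the 64-bit result. */
            set   |= (b << 4)       << (8 * bit);
            clear |= ((b ^ 1) << 4) << (8 * bit);
        }
        sketch_b2b_0[v] = set;
        sketch_b2b_1[v] = clear;
    }
}
```

A single table lookup then turns one packed byte of high bits into eight byte-wide values that can be combined with eight unpacked nibbles, which is why such tables suit formats that store an extra bit per element separately.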

Usage

This file is compiled when GGML is built as a WebAssembly module (e.g., using Emscripten) with SIMD128 support enabled. It powers web-based LLM inference applications running in browsers or Node.js environments.
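One way to confirm which path a build actually took is to test the same __wasm_simd128__ macro that guards the SIMD code. The helper names below (sketch_quants_backend, sketch_sum4, SKETCH_HAVE_SIMD128) are hypothetical and only demonstrate the guard-plus-fallback shape; they are not part of GGML.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__wasm_simd128__)
#include <wasm_simd128.h>
#define SKETCH_HAVE_SIMD128 1
#else
#define SKETCH_HAVE_SIMD128 0
#endif

/* Hypothetical helper: reports which path this translation unit was built with. */
static const char *sketch_quants_backend(void) {
    return SKETCH_HAVE_SIMD128 ? "wasm-simd128" : "scalar";
}

/* Sums four floats through SIMD128 lanes when available, else in plain C. */
static float sketch_sum4(const float *p) {
    assert(p != NULL);
#if SKETCH_HAVE_SIMD128
    const v128_t v = wasm_v128_load(p);
    return wasm_f32x4_extract_lane(v, 0) + wasm_f32x4_extract_lane(v, 1)
         + wasm_f32x4_extract_lane(v, 2) + wasm_f32x4_extract_lane(v, 3);
#else
    return p[0] + p[1] + p[2] + p[3];
#endif
}
```

Built with -msimd128 the helper reports "wasm-simd128"; without that flag, or with a native compiler, it reports "scalar" and the fallback branch is compiled instead.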

Code Reference

Source Location

GGML repo, file: src/ggml-cpu/arch/wasm/quants.c (1221 lines).

Key Signatures

void quantize_row_q8_0(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);

void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs,
    const void * GGML_RESTRICT vx, size_t bx,
    const void * GGML_RESTRICT vy, size_t by, int nrc);

Import

#include "ggml-quants.h"
#include "ggml-cpu.h"
#include "simd-mappings.h"

I/O Contract

Inputs (Quantization)

Parameter  Type           Description
x          const float *  Source array of floating-point values to be quantized.
k          int64_t        Number of elements to quantize. Must be a multiple of the block size.

Outputs (Quantization)

Output  Type    Description
vy      void *  Destination buffer for the quantized block data.
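Since k must be a multiple of the block size, the destination buffer size follows directly from the block count. The sketch below assumes the q8_0 layout of a 16-bit scale followed by 32 signed bytes; sketch_block_q8_0, SKETCH_QK8_0, and sketch_q8_0_row_bytes are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_QK8_0 32 /* assumed block size for q8_0 */

/* Hypothetical mirror of a q8_0 block: fp16 scale bits plus 32 quantized values. */
typedef struct {
    uint16_t d;
    int8_t   qs[SKETCH_QK8_0];
} sketch_block_q8_0;

/* Bytes needed in vy to hold k quantized elements. */
static size_t sketch_q8_0_row_bytes(int64_t k) {
    assert(k >= 0 && k % SKETCH_QK8_0 == 0);
    return (size_t) (k / SKETCH_QK8_0) * sizeof(sketch_block_q8_0);
}
```

Under these assumptions a 256-element row occupies 8 blocks of 34 bytes each.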

Inputs (Dot Product)

Parameter  Type          Description
n          int           Number of elements in each input vector.
vx         const void *  Pointer to quantized weight data.
vy         const void *  Pointer to quantized activation data.
nrc        int           Number of rows to compute simultaneously.

Outputs (Dot Product)

Output  Type     Description
s       float *  Destination for the computed dot product result(s).
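As a scalar reference for what the dot product computes, the sketch below multiplies one row of 4-bit weight blocks by one row of 8-bit activation blocks. It assumes the usual q4_0 packing (low nibble holds element j, high nibble holds element j + 16, both offset by 8) and stores scales as float instead of fp16. The types sketch_q4 and sketch_q8 and the function sketch_vec_dot_q4_q8 are hypothetical, and the stride and nrc parameters of the real signature are omitted.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_QK 32 /* elements per block */

/* Hypothetical simplified blocks with float scales. */
typedef struct { float d; uint8_t qs[SKETCH_QK / 2]; } sketch_q4; /* two 4-bit values per byte */
typedef struct { float d; int8_t  qs[SKETCH_QK];     } sketch_q8;

static void sketch_vec_dot_q4_q8(int n, float *s, const sketch_q4 *x, const sketch_q8 *y) {
    assert(n % SKETCH_QK == 0);
    const int nb = n / SKETCH_QK;
    float sumf = 0.0f;

    for (int i = 0; i < nb; i++) {
        int32_t sumi = 0;
        for (int j = 0; j < SKETCH_QK / 2; j++) {
            /* Unpack two weights and remove the +8 storage offset. */
            const int v0 = (x[i].qs[j] & 0x0F) - 8;
            const int v1 = (x[i].qs[j] >> 4) - 8;
            sumi += v0 * y[i].qs[j] + v1 * y[i].qs[j + SKETCH_QK / 2];
        }
        /* Integer block sum is rescaled by both block scales. */
        sumf += (float) sumi * x[i].d * y[i].d;
    }
    *s = sumf;
}
```

Accumulating in integers within a block and applying the two scales once per block is the part a SIMD path accelerates; the result written to s is the same quantity.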

Usage Examples

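// Quantize a row using WebAssembly SIMD128
// (compiled via Emscripten with -msimd128)
float input[256] = {0};                     // placeholder; fill with real activation values
block_q8_0 activation_blocks[256 / QK8_0];  // one block per QK8_0 input values

quantize_row_q8_0(input, activation_blocks, 256);

// Compute quantized dot product in a WASM runtime.
// weight_blocks is a zero-initialized placeholder here; in practice it holds
// q4_0-quantized weights for the same 256 elements.
block_q4_0 weight_blocks[256 / QK4_0] = {0};
float result;
ggml_vec_dot_q4_0_q8_0(256, &result, sizeof(result),
    weight_blocks, sizeof(block_q4_0),
    activation_blocks, sizeof(block_q8_0), 1);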
