Implementation:Ggml org Ggml Cpu x86 quants

Metadata

Field	Value
Page Type	Implementation (Architecture-Specific SIMD)
Knowledge Sources	GGML
Domains	ML_Infrastructure, Tensor_Computing, SIMD_Optimization
Last Updated	2025-05-15 12:00 GMT

Overview

This file provides x86 SIMD-optimized (SSE/SSSE3, AVX, AVX2, AVX-512) quantization, dequantization, and dot product routines for the GGML quantized tensor formats. It is the most feature-rich architecture-specific quantization implementation in the codebase.

Description

arch/x86/quants.c is the largest and most comprehensive quantization implementation in the GGML codebase, providing tiered SIMD support across the full range of x86 vector extensions: 128-bit SSE/SSSE3, 256-bit AVX/AVX2, and 512-bit AVX-512.

The file defines extensive SIMD helper functions that serve as building blocks for the quantization and dot product kernels:

Integer multiply-accumulate:

mul_sum_i8_pairs (SSSE3) -- int8 multiply-add using _mm_sign_epi8 and _mm_maddubs_epi16, both of which are SSSE3 instructions
mul_add_epi8 (AVX2) -- 256-bit equivalent using _mm256_sign_epi8 and _mm256_maddubs_epi16
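
The sign-then-maddubs pattern used by both helpers can be modeled in scalar C. This is a sketch, not the GGML code: the function names are hypothetical, and each function models a single lane rather than a full vector. The maddubs instructions multiply unsigned bytes by signed bytes, so a signed-by-signed product is obtained by passing |x| as the unsigned operand and y with the sign of x applied as the signed operand.

```c
#include <stdint.h>

/* Scalar model of one 16-bit lane of a maddubs instruction:
 * two unsigned bytes times two signed bytes, summed with signed saturation. */
static int16_t maddubs_lane_model(const uint8_t a[2], const int8_t b[2]) {
    int32_t sum = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
    if (sum >  32767) sum =  32767;
    if (sum < -32768) sum = -32768;
    return (int16_t)sum;
}

/* Scalar model of the byte sign instruction:
 * v negated when s < 0, zero when s == 0, v unchanged when s > 0. */
static int8_t sign_model(int8_t v, int8_t s) {
    if (s == 0) return 0;
    return (int8_t)(s < 0 ? -v : v);
}

/* Signed dot product of one int8 pair, routed through the
 * unsigned-by-signed multiply model above. */
static int16_t mul_add_pair_model(const int8_t x[2], const int8_t y[2]) {
    uint8_t ax[2];
    int8_t  sy[2];
    for (int i = 0; i < 2; ++i) {
        ax[i] = (uint8_t)sign_model(x[i], x[i]); /* |x[i]| */
        sy[i] = sign_model(y[i], x[i]);          /* y[i] carrying the sign of x[i] */
    }
    return maddubs_lane_model(ax, sy);
}
```

Because |x| * (sign(x) * y) equals x * y, the result matches a plain signed multiply-add as long as the pairwise sum stays within the int16 range.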

Horizontal reductions:

hsum_float_8 (AVX) -- reduce 8 floats to a scalar via extract-and-add
hsum_i32_8 (AVX) -- reduce 8 int32s to a scalar
hsum_i32_4 (SSE) -- reduce 4 int32s to a scalar
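
The extract-and-add reduction can be pictured as repeatedly folding the upper half of the vector onto the lower half. The following scalar sketch models that pattern for 8 lanes; the function names are hypothetical and the code is illustrative rather than a copy of the GGML helpers.

```c
#include <stdint.h>

/* Fold 8 floats to one: upper four lanes onto the lower four,
 * then upper two onto lower two, then the final pair. */
static float hsum_float_8_model(const float v[8]) {
    float q[4], p[2];
    for (int i = 0; i < 4; ++i) q[i] = v[i] + v[i + 4];
    for (int i = 0; i < 2; ++i) p[i] = q[i] + q[i + 2];
    return p[0] + p[1];
}

/* Same folding order for 8 int32 lanes. */
static int32_t hsum_i32_8_model(const int32_t v[8]) {
    int32_t q[4], p[2];
    for (int i = 0; i < 4; ++i) q[i] = v[i] + v[i + 4];
    for (int i = 0; i < 2; ++i) p[i] = q[i] + q[i + 2];
    return p[0] + p[1];
}
```

For floats, this tree order differs from a left-to-right loop, so results can differ in the last bits from a naive scalar sum.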

Data unpacking:

bytes_from_bits_32 (AVX2) -- expand 32 bits to 32 bytes of 0x00/0xFF masks
bytes_from_nibbles_32 (AVX2) -- expand 16 bytes of packed 4-bit values to 32 bytes
sum_i16_pairs_float (AVX2) -- pairwise add int16 values and convert to float
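
A scalar model may help clarify what the two expansion helpers produce. The sketch below uses hypothetical names, and the output ordering (bit i maps to byte i; low nibbles fill the first 16 bytes and high nibbles the last 16) is an assumption made for illustration.

```c
#include <stdint.h>

/* Expand 32 packed bits into 32 bytes: 0xFF where the bit is set, 0x00 otherwise. */
static void bytes_from_bits_32_model(uint32_t bits, uint8_t out[32]) {
    for (int i = 0; i < 32; ++i) {
        out[i] = ((bits >> i) & 1u) ? 0xFF : 0x00;
    }
}

/* Expand 16 bytes of packed 4-bit values into 32 bytes, one value per byte.
 * Assumed layout: low nibbles go to out[0..15], high nibbles to out[16..31]. */
static void bytes_from_nibbles_32_model(const uint8_t in[16], uint8_t out[32]) {
    for (int i = 0; i < 16; ++i) {
        out[i]      = in[i] & 0x0F;
        out[i + 16] = in[i] >> 4;
    }
}
```

The byte masks from the first helper are convenient for restoring a separately stored high bit, and the second helper turns packed 4-bit quants into bytes that the integer multiply-accumulate helpers can consume.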

The quantization functions (quantize_row_q8_0, quantize_row_q8_1, quantize_row_q8_K) use progressively wider SIMD widths based on the available extensions, with the widest available path selected at compile time.
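
As a point of reference for what the SIMD paths compute, here is a minimal scalar sketch of 8-bit block quantization with one scale per 32-element block. The struct and function names are hypothetical, and the scale is kept as a float for simplicity (GGML stores it in half precision), so this is not byte-compatible with block_q8_0.

```c
#include <assert.h>
#include <stdint.h>

#define QK_SKETCH 32  /* elements per block, matching the q8_0 block size */

/* Hypothetical simplified block: one float scale plus 32 signed 8-bit quants. */
typedef struct {
    float  d;
    int8_t qs[QK_SKETCH];
} block_q8_sketch;

static void quantize_row_q8_sketch(const float *x, block_q8_sketch *y, int64_t k) {
    assert(k % QK_SKETCH == 0);
    const int64_t nb = k / QK_SKETCH;
    for (int64_t b = 0; b < nb; ++b) {
        const float *xb = x + b * QK_SKETCH;

        /* Largest absolute value in the block determines the scale. */
        float amax = 0.0f;
        for (int j = 0; j < QK_SKETCH; ++j) {
            const float a = xb[j] < 0.0f ? -xb[j] : xb[j];
            if (a > amax) amax = a;
        }
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        y[b].d = d;

        /* Scale into [-127, 127] and round half away from zero. */
        for (int j = 0; j < QK_SKETCH; ++j) {
            const float v = xb[j] * id;
            y[b].qs[j] = (int8_t)(v >= 0.0f ? v + 0.5f : v - 0.5f);
        }
    }
}
```

A vectorized version performs the same three steps per block (absolute maximum, scale, round and narrow), only on several lanes at once.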

The dot product functions cover all GGML quantization formats: the standard types (q4_0, q4_1, q5_0, q5_1, q8_0), the K-quants (q2_K through q6_K), the ternary types (tq1_0, tq2_0), the importance-quantization types (iq2_xxs, iq2_xs, iq2_s, iq3_xxs, iq3_s, iq4_nl, iq4_xs, iq1_s, iq1_m), and the MXFP4 microscaling format (mxfp4).
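
To illustrate the arithmetic these kernels accelerate, the following scalar sketch computes a dot product between 4-bit weight blocks and 8-bit activation blocks. All names are hypothetical, scales are plain floats rather than half precision, and the nibble layout (element j in the low nibble of byte j, element j + 16 in the high nibble, values offset by 8) is stated here as an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define QK_DOT 32  /* elements per block */

/* Hypothetical simplified blocks with float scales. */
typedef struct { float d; uint8_t qs[QK_DOT / 2]; } q4_blk_sketch;
typedef struct { float d; int8_t  qs[QK_DOT];     } q8_blk_sketch;

static float vec_dot_q4_q8_sketch(int n, const q4_blk_sketch *x, const q8_blk_sketch *y) {
    assert(n % QK_DOT == 0);
    float sum = 0.0f;
    for (int b = 0; b < n / QK_DOT; ++b) {
        /* Integer accumulation inside the block. */
        int32_t isum = 0;
        for (int j = 0; j < QK_DOT / 2; ++j) {
            const int lo = (x[b].qs[j] & 0x0F) - 8;  /* element j      */
            const int hi = (x[b].qs[j] >> 4)   - 8;  /* element j + 16 */
            isum += lo * y[b].qs[j] + hi * y[b].qs[j + QK_DOT / 2];
        }
        /* One floating-point multiply per block applies both scales. */
        sum += (float)isum * x[b].d * y[b].d;
    }
    return sum;
}
```

Keeping the inner loop in integers and applying the two scales once per block is what makes the int8 multiply-accumulate helpers effective.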

SIMD paths are tiered with compile-time guards: __SSSE3__, __AVX__, __AVX2__, and __AVX512F__. Each successive tier uses wider vectors and more advanced instructions.
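
The tiering can be sketched as a preprocessor cascade that tests the widest extension first. The function below is hypothetical and only reports which tier the current compiler flags would enable; the real kernels may order or combine their guards differently.

```c
/* Report the widest SIMD tier enabled by the compiler flags in effect.
 * The macros are the predefined feature macros set by GCC and Clang. */
static const char *quants_simd_tier_sketch(void) {
#if defined(__AVX512F__)
    return "AVX-512";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSSE3__)
    return "SSSE3";
#else
    return "scalar";
#endif
}
```

Compiling the same source with, for example, -mavx2 versus -mavx512f changes which branch survives preprocessing, so one file can yield several differently optimized objects.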

Usage

This file is compiled as part of the GGML CPU backend when targeting x86-64 platforms. Multiple variants may be compiled with different SIMD flags (e.g., one with AVX2, another with AVX-512) and the appropriate variant is selected at runtime by the feature detection in cpu-feats.cpp.
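
The runtime side of this arrangement can be pictured as choosing the most capable compiled variant whose required features the CPU supports. The sketch below is purely illustrative: the feature flags, struct, and function are hypothetical and do not reflect the actual interface of cpu-feats.cpp.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits. */
enum {
    FEAT_SSSE3   = 1 << 0,
    FEAT_AVX     = 1 << 1,
    FEAT_AVX2    = 1 << 2,
    FEAT_AVX512F = 1 << 3
};

/* Hypothetical description of one compiled variant. */
typedef struct {
    const char *name;
    uint32_t    required;  /* features the variant was compiled to use */
} backend_variant_sketch;

/* Return the index of the last usable variant, assuming the table is
 * ordered from least to most capable. Returns -1 if none is usable. */
static int pick_variant_sketch(const backend_variant_sketch *v, size_t n,
                               uint32_t cpu_features) {
    int best = -1;
    for (size_t i = 0; i < n; ++i) {
        if ((v[i].required & ~cpu_features) == 0) {
            best = (int)i;
        }
    }
    return best;
}
```

A variant is rejected as soon as it needs any feature the CPU lacks, which prevents executing instructions the hardware cannot decode.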

Code Reference

Source Location

GGML repo, file: src/ggml-cpu/arch/x86/quants.c (3820 lines).

Key Signatures

// SIMD helpers
static inline __m128i mul_sum_i8_pairs(const __m128i x, const __m128i y);
static inline float hsum_float_8(const __m256 x);
static inline __m256i bytes_from_nibbles_32(const uint8_t * rsi);
static inline __m256i mul_add_epi8(const __m256i x, const __m256i y);

// Quantization
void quantize_row_q8_0(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);

// Dot products (representative subset)
void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs,
    const void * GGML_RESTRICT vx, size_t bx,
    const void * GGML_RESTRICT vy, size_t by, int nrc);
void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs,
    const void * GGML_RESTRICT vx, size_t bx,
    const void * GGML_RESTRICT vy, size_t by, int nrc);

Import

#include "ggml-quants.h"
#include "ggml-cpu.h"
#include "simd-mappings.h"

I/O Contract

Inputs (Quantization)

Parameter	Type	Description
`x`	`const float *`	Source array of floating-point values to be quantized. Must contain at least `k` elements.
`k`	`int64_t`	Number of elements to quantize. Must be a multiple of the block size (32 for q8_0/q8_1, 256 for q8_K).

Outputs (Quantization)

Output	Type	Description
`vy`	`void *`	Destination buffer for the quantized block data.

Inputs (Dot Product)

Parameter	Type	Description
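`n`	`int`	Number of elements in each input vector. Must be a multiple of the block size.
`bs`	`size_t`	Stride between result rows in `s`; relevant only when `nrc` > 1.
`vx`	`const void *`	Pointer to quantized weight data.
`bx`	`size_t`	Stride between rows of `vx`; relevant only when `nrc` > 1.
`vy`	`const void *`	Pointer to quantized activation data.
`by`	`size_t`	Stride between rows of `vy`; relevant only when `nrc` > 1.
`nrc`	`int`	Number of result rows to compute per call.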

Outputs (Dot Product)

Output	Type	Description
`s`	`float *`	Destination for the computed dot product result(s).

Usage Examples

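// Quantize a row of activations. The SIMD path used is the widest one
// enabled at compile time for this translation unit.
float input[256] = {0};  // fill with real activation values in practice
block_q8_0 activation_blocks[256 / QK8_0];

quantize_row_q8_0(input, activation_blocks, 256);

// Compute a quantized dot product against q4_0 weights.
// weight_blocks is assumed to point to 256 / QK4_0 previously
// quantized block_q4_0 entries.
float result;
ggml_vec_dot_q4_0_q8_0(256, &result, sizeof(result),
    weight_blocks, sizeof(block_q4_0),
    activation_blocks, sizeof(block_q8_0), 1);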

Related Pages

Principle:Ggml_org_Ggml_Architecture_Specific_SIMD_Quantization
Implementation:Ggml_org_Ggml_Cpu_x86_cpu_feats -- x86 CPU feature detection and backend scoring
Implementation:Ggml_org_Ggml_Cpu_x86_repack -- x86 matrix repacking and GEMM/GEMV kernels
Implementation:Ggml_org_Ggml_Cpu_arm_quants -- ARM NEON equivalent
Implementation:Ggml_org_Ggml_Cpu_loongarch_quants -- LoongArch LSX equivalent
Implementation:Ggml_org_Ggml_Cpu_powerpc_quants -- PowerPC VSX equivalent
Implementation:Ggml_org_Ggml_Cpu_riscv_quants -- RISC-V RVV equivalent
Implementation:Ggml_org_Ggml_Cpu_s390_quants -- s390x VXE equivalent
Implementation:Ggml_org_Ggml_Cpu_wasm_quants -- WebAssembly SIMD128 equivalent
