Implementation:Ggml_org_Ggml_Cpu_x86_repack

Metadata

Field	Value
Page Type	Implementation (Architecture-Specific SIMD)
Knowledge Sources	GGML
Domains	ML_Infrastructure, Tensor_Computing, SIMD_Optimization
Last Updated	2025-05-15 12:00 GMT

Overview

x86 AVX/AVX2/AVX-512 optimized kernels for matrix repacking, quantized GEMM, and quantized GEMV on interleaved block formats. At 6307 lines, it is the largest repack implementation in the codebase.

Description

arch/x86/repack.cpp provides x86 SIMD-optimized repacking and matrix multiplication kernels across several instruction set tiers (AVX, AVX2, AVX-512).

The file begins with FP16-to-FP32 loading macros that adapt to available instruction set extensions:

With F16C (hardware FP16 conversion):

GGML_F32Cx8_LOAD -- loads 8 FP16 values and converts to FP32 via _mm256_cvtph_ps
GGML_F32Cx8_REPEAT_LOAD -- loads 4 FP16 values and repeats them
GGML_F32Cx8_REARRANGE_LOAD -- loads with byte shuffle rearrangement
GGML_F32Cx8x2_LOAD (AVX-512) -- loads 16 FP16 values into a 512-bit register
GGML_F32Cx16_REPEAT_LOAD (AVX-512) -- loads and repeats 4 FP16 values across 16 lanes

Without F16C (software fallback):

Scalar conversion loops that manually convert each FP16 element via GGML_CPU_FP16_TO_FP32
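
The two tiers above can be illustrated with a minimal sketch. The names fp16_to_fp32_scalar and load8_fp16 are hypothetical and are not GGML macros; the scalar path is a generic IEEE 754 half-to-float conversion and is only assumed to behave like the routine behind GGML_CPU_FP16_TO_FP32.

```cpp
#include <cstdint>
#include <cstring>
#if defined(__F16C__)
#include <immintrin.h>
#endif

// Hypothetical scalar conversion: IEEE 754 binary16 bits -> binary32 value.
static float fp16_to_fp32_scalar(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    const uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0) {
        if (mant == 0) {
            bits = sign;  // signed zero
        } else {
            // Subnormal half: shift the mantissa until its implicit bit appears.
            int shift = 0;
            while ((mant & 0x400u) == 0) { mant <<= 1; ++shift; }
            mant &= 0x3FFu;
            bits = sign | ((uint32_t)(113 - shift) << 23) | (mant << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (mant << 13);  // infinity or NaN
    } else {
        bits = sign | ((exp + 112u) << 23) | (mant << 13);  // rebias 15 -> 127
    }
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Convert 8 consecutive FP16 values, using F16C only when the compiler enables it.
static void load8_fp16(const uint16_t * src, float * dst) {
#if defined(__F16C__)
    // Same intrinsic pair as the GGML_F32Cx8_LOAD definition quoted on this page.
    _mm256_storeu_ps(dst, _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *) src)));
#else
    for (int i = 0; i < 8; ++i) dst[i] = fp16_to_fp32_scalar(src[i]);
#endif
}
```

Building with a flag such as -mf16c selects the intrinsic path; without it, the same call falls back to the scalar loop, so callers do not need to know which tier is active.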

These loading macros feed into the following kernel categories:

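Matrix repacking functions (ggml_quantize_mat_q8_0_4x8, ggml_quantize_mat_q8_K_4x8) quantize four rows at a time into interleaved block formats, storing 8 consecutive elements from each row in turn, so that the SIMD matrix multiplication kernels can load data for several rows with one contiguous read.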
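
The 4x8 interleaving can be sketched in scalar code. This is an illustration under stated assumptions, not the GGML implementation: BlockQ8x4 and quantize_rows_4x8 are hypothetical names, the scale is kept as float (GGML stores FP16 scales), and the layout is assumed to be 8 elements from row 0, then 8 from row 1, and so on through row 3 before moving to the next 8 columns.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

constexpr int kBlock = 32;  // elements per q8_0 block (QK8_0 in GGML)
constexpr int kRows  = 4;   // rows interleaved into one output block
constexpr int kChunk = 8;   // consecutive elements taken from one row

// Hypothetical stand-in for an interleaved block; scales are float for simplicity.
struct BlockQ8x4 {
    float  d[kRows];
    int8_t qs[kRows * kBlock];
};

// Quantize 4 rows of k floats (row-major in x) into k / kBlock interleaved blocks.
static std::vector<BlockQ8x4> quantize_rows_4x8(const float * x, int64_t k) {
    std::vector<BlockQ8x4> out(k / kBlock);
    for (int64_t b = 0; b < k / kBlock; ++b) {
        float inv[kRows];
        for (int r = 0; r < kRows; ++r) {
            // Per-row scale: map the largest magnitude in the block to 127.
            const float * src = x + r * k + b * kBlock;
            float amax = 0.0f;
            for (int i = 0; i < kBlock; ++i) amax = std::fmax(amax, std::fabs(src[i]));
            const float d = amax / 127.0f;
            out[b].d[r] = d;
            inv[r] = d != 0.0f ? 1.0f / d : 0.0f;
        }
        for (int j = 0; j < kRows * kBlock; ++j) {
            // Decode the interleaved position j into (row, column within block).
            const int row = (j / kChunk) % kRows;
            const int col = (j / (kRows * kChunk)) * kChunk + j % kChunk;
            const float v = x[row * k + b * kBlock + col] * inv[row];
            out[b].qs[j] = (int8_t) std::lround(v);
        }
    }
    return out;
}
```

With this layout, bytes 0..7 of qs come from row 0, bytes 8..15 from row 1, and byte 32 is element 8 of row 0 again.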

GEMV kernels (ggml_gemv_q4_0_8x8_q8_0, ggml_gemv_q4_K_8x8_q8_K, ggml_gemv_iq4_nl_8x8_q8_0, ggml_gemv_q2_K_8x8_q8_K) perform matrix-vector products on interleaved quantized weight blocks.

GEMM kernels (ggml_gemm_q4_0_8x8_q8_0, ggml_gemm_q4_K_8x8_q8_K, ggml_gemm_iq4_nl_8x8_q8_0, ggml_gemm_q2_K_8x8_q8_K) perform matrix-matrix products for batched inference.
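
To make the role of the interleaved layout concrete, the following scalar reference sketches a GEMV over weight blocks that interleave 8 output columns in chunks of 8 elements. It is a simplification under stated assumptions: WeightBlockx8, ActBlock, and gemv_ref are hypothetical names, the weights are stored as 8-bit values with float scales, and the nibble unpacking and FP16 scales that the q4_0 and q4_K kernels need are omitted.

```cpp
#include <cstdint>
#include <vector>

constexpr int kQK   = 32;  // elements per quantization block
constexpr int kCols = 8;   // output columns interleaved per weight block
constexpr int kLen  = 8;   // consecutive elements stored per column

// Hypothetical simplified blocks: 8-bit weights and float scales.
struct WeightBlockx8 { float d[kCols]; int8_t qs[kCols * kQK]; };
struct ActBlock      { float d;        int8_t qs[kQK]; };

// s[c] = dot(activation vector, weight column c) for nc columns.
// n is the inner dimension; vx holds nc / kCols column groups,
// each made of n / kQK interleaved blocks.
static void gemv_ref(int n, float * s, const WeightBlockx8 * vx,
                     const ActBlock * vy, int nc) {
    const int nb = n / kQK;
    for (int x = 0; x < nc / kCols; ++x) {
        float sum[kCols] = {0};
        const WeightBlockx8 * w = vx + (int64_t) x * nb;
        for (int b = 0; b < nb; ++b) {
            for (int k = 0; k < kQK / kLen; ++k) {
                for (int j = 0; j < kCols; ++j) {
                    // Integer dot product of one 8-element chunk.
                    int32_t acc = 0;
                    for (int i = 0; i < kLen; ++i) {
                        acc += w[b].qs[(k * kCols + j) * kLen + i] * vy[b].qs[k * kLen + i];
                    }
                    // Apply the weight and activation scales once per chunk.
                    sum[j] += acc * w[b].d[j] * vy[b].d;
                }
            }
        }
        for (int j = 0; j < kCols; ++j) s[x * kCols + j] = sum[j];
    }
}
```

Because the 8 columns of a group sit next to each other in memory, a SIMD kernel can load one activation chunk and multiply it against all 8 columns without gathering; a GEMM variant repeats the same inner loops for several activation rows at once.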

According to this page's source, the x86 repack implementation supports more quantization format combinations than other architectures. SIMD paths are selected at compile time by preprocessor guards on __AVX__, __AVX2__, __AVX512F__, and __F16C__, so the kernels that are built depend on the instruction sets enabled for the translation unit.
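
The compile-time tiering can be sketched as follows. The functions simd_tier and has_f16c are hypothetical helpers written for illustration; only the four predefined macros come from the text above.

```cpp
#include <string>

// Report the widest SIMD tier enabled by the compiler flags for this file.
// The checks run from widest to narrowest so that the first match wins.
static std::string simd_tier() {
#if defined(__AVX512F__)
    return "avx512f";
#elif defined(__AVX2__)
    return "avx2";
#elif defined(__AVX__)
    return "avx";
#else
    return "scalar";
#endif
}

// F16C is checked separately because it only affects the FP16 loading macros.
static bool has_f16c() {
#if defined(__F16C__)
    return true;
#else
    return false;
#endif
}
```

Since the selection happens in the preprocessor, a binary built with one set of flags contains only that tier; runtime dispatch between tiers needs separately compiled variants.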

Usage

This file is compiled as part of the GGML CPU backend when targeting x86-64 platforms. The repacking functions are called during weight preparation, and the GEMV/GEMM kernels are invoked during inference when the scheduler selects the interleaved matrix multiplication path.

Code Reference

Source Location

GGML repo, file: src/ggml-cpu/arch/x86/repack.cpp (6307 lines).

Key Signatures

// FP16 loading macros (with F16C)
#define GGML_F32Cx8_LOAD(x)     _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(x)))
#define GGML_F32Cx8x2_LOAD(x, y) _mm512_cvtph_ps(...)  // AVX-512

// Repacking
void ggml_quantize_mat_q8_0_4x8(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
void ggml_quantize_mat_q8_K_4x8(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);

// GEMV
void ggml_gemv_q4_0_8x8_q8_0(int n, float * GGML_RESTRICT s, size_t bs,
    const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc);
void ggml_gemv_q4_K_8x8_q8_K(int n, float * GGML_RESTRICT s, size_t bs,
    const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc);

// GEMM
void ggml_gemm_q4_0_8x8_q8_0(int n, float * GGML_RESTRICT s, size_t bs,
    const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc);
void ggml_gemm_q4_K_8x8_q8_K(int n, float * GGML_RESTRICT s, size_t bs,
    const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc);

Import

#include "ggml-common.h"
#include "ggml-backend-impl.h"
#include "ggml-cpu.h"
#include "../../repack.h"

I/O Contract

Inputs (Repacking)

Parameter	Type	Description
`x`	`const float *`	Source floating-point data laid out as consecutive rows of `k` elements. Each call reads four rows and interleaves them.
`k`	`int64_t`	Number of elements per row. Must be a multiple of the block size (QK8_0 = 32 for q8_0, QK_K = 256 for q8_K).

Outputs (Repacking)

Output	Type	Description
`vy`	`void *`	Destination buffer for interleaved quantized blocks.

Inputs (GEMV / GEMM)

Parameter	Type	Description
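`n`	`int`	Inner dimension size (number of elements per dot product).
`bs`	`size_t`	Stride between consecutive output rows of `s`. The kernels index `s` as a `float` array, so the stride is counted in elements rather than bytes.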
`vx`	`const void *`	Pointer to the interleaved quantized weight matrix.
`vy`	`const void *`	Pointer to the quantized activation vector or matrix.
`nr`	`int`	Number of rows in the output.
`nc`	`int`	Number of columns in the output.

Outputs (GEMV / GEMM)

Output	Type	Description
`s`	`float *`	Destination buffer for the floating-point result matrix/vector.

Usage Examples

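// Repack four rows of 256 floats into interleaved q8_0 blocks (x86 AVX path)
float rows[4 * 256];                 // 4 rows, 256 elements each, row-major
block_q8_0x4 repacked[256 / QK8_0];  // one interleaved block per 32 elements

ggml_quantize_mat_q8_0_4x8(rows, repacked, 256);

// Perform GEMM with interleaved weights on x86.
// interleaved_weights and quantized_activations are placeholders for buffers
// that already hold repacked q4_0 weights and interleaved q8_0 activations.
// bs is assumed to be the output row stride in float elements, not bytes.
float output[8 * 8];
ggml_gemm_q4_0_8x8_q8_0(256, output, 8,
    interleaved_weights, quantized_activations, 8, 8);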

Related Pages

Principle:Ggml_org_Ggml_Architecture_Specific_SIMD_Quantization
Implementation:Ggml_org_Ggml_Cpu_x86_quants -- x86 quantization and dot product routines
Implementation:Ggml_org_Ggml_Cpu_x86_cpu_feats -- x86 CPU feature detection and backend scoring
Implementation:Ggml_org_Ggml_Cpu_arm_repack -- ARM NEON equivalent
Implementation:Ggml_org_Ggml_Cpu_riscv_repack -- RISC-V RVV equivalent
