Implementation:Ggml org Ggml Cpu x86 repack
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation (Architecture-Specific SIMD) |
| Knowledge Sources | GGML |
| Domains | ML_Infrastructure, Tensor_Computing, SIMD_Optimization |
| Last Updated | 2025-05-15 12:00 GMT |
Overview
x86 kernels optimized for AVX, AVX2, and AVX-512 that cover matrix repacking, quantized GEMM, and quantized GEMV over interleaved block formats. At 6307 lines, this is the largest repack implementation in the codebase.
Description
arch/x86/repack.cpp is the largest repack file in the GGML codebase. It provides x86 matrix multiplication kernels with separate code paths for several SIMD tiers (AVX, AVX2, AVX-512).
The file begins with FP16-to-FP32 loading macros that adapt to available instruction set extensions:
With F16C (hardware FP16 conversion):
- GGML_F32Cx8_LOAD -- loads 8 FP16 values and converts to FP32 via _mm256_cvtph_ps
- GGML_F32Cx8_REPEAT_LOAD -- loads 4 FP16 values and repeats them
- GGML_F32Cx8_REARRANGE_LOAD -- loads with byte shuffle rearrangement
- GGML_F32Cx8x2_LOAD (AVX-512) -- loads 16 FP16 values into a 512-bit register
- GGML_F32Cx16_REPEAT_LOAD (AVX-512) -- loads and repeats 4 FP16 values across 16 lanes
Without F16C (software fallback):
- Scalar conversion loops that manually convert each FP16 element via GGML_CPU_FP16_TO_FP32
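The two loading strategies can be sketched as a single helper that picks the hardware path when F16C is available and falls back to a scalar loop otherwise. This is a minimal illustration, not the file's actual macros: `load_f16x8_as_f32` and `half_to_float` are hypothetical names, and `half_to_float` is only a stand-in for the role played by GGML_CPU_FP16_TO_FP32.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#if defined(__F16C__) && defined(__AVX__)
#include <immintrin.h>
#endif

// Scalar IEEE 754 half -> float conversion (hypothetical stand-in for the
// library's conversion macro; handles zero, subnormals, inf and NaN).
inline float half_to_float(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    const uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t       mant = h & 0x3FFu;
    uint32_t       bits;
    if (exp == 0) {
        if (mant == 0) {
            bits = sign;                                  // signed zero
        } else {
            int shift = 0;                                // normalize a subnormal
            while ((mant & 0x400u) == 0) { mant <<= 1; ++shift; }
            mant &= 0x3FFu;
            bits = sign | ((uint32_t)(127 - 15 + 1 - shift) << 23) | (mant << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (mant << 13);         // inf or NaN
    } else {
        bits = sign | ((exp + (127 - 15)) << 23) | (mant << 13);
    }
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Hypothetical helper: converts 8 FP16 values to 8 FP32 values.
inline void load_f16x8_as_f32(const uint16_t * src, float * dst) {
    assert(src != nullptr && dst != nullptr);
#if defined(__F16C__) && defined(__AVX__)
    // Hardware path: one 128-bit load, one conversion instruction.
    const __m128i h = _mm_loadu_si128((const __m128i *) src);
    _mm256_storeu_ps(dst, _mm256_cvtph_ps(h));
#else
    // Software fallback: convert each element individually.
    for (int i = 0; i < 8; ++i) dst[i] = half_to_float(src[i]);
#endif
}
```

Both paths produce the same values, so callers do not need to know which one was compiled in.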
These loading macros feed into the following kernel categories:
Matrix repacking functions (ggml_quantize_mat_q8_0_4x8, ggml_quantize_mat_q8_K_4x8) quantize multiple rows simultaneously into interleaved block formats optimized for SIMD matrix multiplication.
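As a rough scalar illustration of what "quantize multiple rows simultaneously into an interleaved block" means, the sketch below quantizes 4 rows to 8-bit values and stores them in groups of 8 values per row, cycling through the rows. It is a sketch under stated assumptions: `BlockQ8x4Sketch` and `quantize_mat_q8_0_4x8_sketch` are hypothetical names, the scales are kept as `float` for readability (not the storage type of the real block), and the interleave order is assumed to be 8 values from row 0, then 8 from row 1, and so on.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int QK         = 32;  // elements per quantization block
constexpr int ROWS       = 4;   // rows quantized together
constexpr int INTERLEAVE = 8;   // consecutive values taken from one row at a time

// Hypothetical stand-in for an interleaved 4-row q8_0 block.
struct BlockQ8x4Sketch {
    float  d[ROWS];         // one scale per row (float here for simplicity)
    int8_t qs[QK * ROWS];   // quantized values, interleaved across rows
};

// x holds ROWS consecutive rows of k floats; y receives k / QK blocks.
void quantize_mat_q8_0_4x8_sketch(const float * x, BlockQ8x4Sketch * y, int64_t k) {
    assert(k % QK == 0);
    const int64_t nb = k / QK;
    for (int64_t i = 0; i < nb; ++i) {
        float inv[ROWS];
        // Per-row scale: map the largest magnitude in the block to 127.
        for (int r = 0; r < ROWS; ++r) {
            float amax = 0.0f;
            for (int j = 0; j < QK; ++j) {
                amax = std::fmax(amax, std::fabs(x[r * k + i * QK + j]));
            }
            const float d = amax / 127.0f;
            y[i].d[r] = d;
            inv[r]    = d != 0.0f ? 1.0f / d : 0.0f;
        }
        // Interleaved store: output position j maps to (row, column).
        for (int j = 0; j < QK * ROWS; ++j) {
            const int group = j / (ROWS * INTERLEAVE);
            const int r     = (j % (ROWS * INTERLEAVE)) / INTERLEAVE;
            const int col   = group * INTERLEAVE + j % INTERLEAVE;
            y[i].qs[j] = (int8_t) std::lround(x[r * k + i * QK + col] * inv[r]);
        }
    }
}
```

With this layout, a SIMD kernel can load 8 values for each of the 4 rows with contiguous reads instead of gathering from 4 separate row buffers.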
GEMV kernels (ggml_gemv_q4_0_8x8_q8_0, ggml_gemv_q4_K_8x8_q8_K, ggml_gemv_iq4_nl_8x8_q8_0, ggml_gemv_q2_K_8x8_q8_K) perform matrix-vector products on interleaved quantized weight blocks.
GEMM kernels (ggml_gemm_q4_0_8x8_q8_0, ggml_gemm_q4_K_8x8_q8_K, ggml_gemm_iq4_nl_8x8_q8_0, ggml_gemm_q2_K_8x8_q8_K) perform matrix-matrix products for batched inference.
The x86 repack implementation supports more quantization format combinations than other architectures, reflecting the maturity of x86 SIMD optimization in GGML. SIMD paths are tiered with guards for __AVX__, __AVX2__, __AVX512F__, and __F16C__.
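Tiering by preprocessor guard can be sketched as follows. The snippet only reports which tier the current translation unit was compiled for; `SimdTier`, `compiled_simd_tier`, `compiled_with_f16c`, and `f32_lanes` are hypothetical names used for illustration, and the actual file selects whole kernel bodies rather than an enum value.

```cpp
#include <cassert>

// Hypothetical enumeration of the tiers named by the guards.
enum class SimdTier { Scalar, Avx, Avx2, Avx512f };

// Widest tier enabled at compile time, checked from widest to narrowest.
constexpr SimdTier compiled_simd_tier() {
#if defined(__AVX512F__)
    return SimdTier::Avx512f;
#elif defined(__AVX2__)
    return SimdTier::Avx2;
#elif defined(__AVX__)
    return SimdTier::Avx;
#else
    return SimdTier::Scalar;
#endif
}

// F16C is checked independently because it only affects FP16 loads.
constexpr bool compiled_with_f16c() {
#if defined(__F16C__)
    return true;
#else
    return false;
#endif
}

// Number of 32-bit float lanes in one vector register for a tier.
int f32_lanes(SimdTier t) {
    switch (t) {
        case SimdTier::Avx512f: return 16;  // 512-bit registers
        case SimdTier::Avx2:
        case SimdTier::Avx:     return 8;   // 256-bit registers
        case SimdTier::Scalar:  return 1;
    }
    assert(false && "unknown tier");
    return 0;
}
```

Checking the widest extension first means a build with AVX-512 enabled never falls into the narrower paths, while a build without any of the macros still compiles through the scalar branch.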
Usage
This file is compiled as part of the GGML CPU backend when targeting x86-64 platforms. The repacking functions are called during weight preparation, and the GEMV/GEMM kernels are invoked during inference when the scheduler selects the interleaved matrix multiplication path.
Code Reference
Source Location
GGML repo, file: src/ggml-cpu/arch/x86/repack.cpp (6307 lines).
Key Signatures
// FP16 loading macros (with F16C)
#define GGML_F32Cx8_LOAD(x) _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(x)))
#define GGML_F32Cx8x2_LOAD(x, y) _mm512_cvtph_ps(...) // AVX-512
// Repacking
void ggml_quantize_mat_q8_0_4x8(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
void ggml_quantize_mat_q8_K_4x8(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
// GEMV
void ggml_gemv_q4_0_8x8_q8_0(int n, float * GGML_RESTRICT s, size_t bs,
const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc);
void ggml_gemv_q4_K_8x8_q8_K(int n, float * GGML_RESTRICT s, size_t bs,
const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc);
// GEMM
void ggml_gemm_q4_0_8x8_q8_0(int n, float * GGML_RESTRICT s, size_t bs,
const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc);
void ggml_gemm_q4_K_8x8_q8_K(int n, float * GGML_RESTRICT s, size_t bs,
const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc);
Import
#include "ggml-common.h"
#include "ggml-backend-impl.h"
#include "ggml-cpu.h"
#include "../../repack.h"
I/O Contract
Inputs (Repacking)
| Parameter | Type | Description |
|---|---|---|
| x | const float * | Source floating-point data laid out as consecutive rows. The function reads multiple rows simultaneously. |
| k | int64_t | Number of elements per row. Must be a multiple of the block size. |
Outputs (Repacking)
| Output | Type | Description |
|---|---|---|
| vy | void * | Destination buffer for interleaved quantized blocks. |
Inputs (GEMV / GEMM)
| Parameter | Type | Description |
|---|---|---|
| n | int | Inner dimension size (number of elements per dot product). |
| vx | const void * | Pointer to the interleaved quantized weight matrix. |
| vy | const void * | Pointer to the quantized activation vector or matrix. |
| nr | int | Number of rows in the output. |
| nc | int | Number of columns in the output. |
Outputs (GEMV / GEMM)
| Output | Type | Description |
|---|---|---|
| s | float * | Destination buffer for the floating-point result matrix/vector. |
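To make the relationship between n, nr, nc, bs, and s concrete, here is a plain floating-point reference that writes its output tile by tile. It is a sketch under assumptions rather than the kernels' actual code: `gemm_tiled_reference` is a hypothetical name, both operands are ordinary row-major float arrays instead of quantized blocks, the 4-row by 8-column tile shape is assumed from the 8x8 kernel names and 4-row activation blocks, and bs is assumed to be a row stride counted in float elements (not bytes), based on how the generic kernels appear to index s.

```cpp
#include <cassert>
#include <cstddef>

constexpr int TILE_ROWS = 4;  // activation rows handled together (assumed)
constexpr int TILE_COLS = 8;  // weight columns handled together (assumed)

// s[row * bs + col] = dot(a[row], w[col]) over n elements.
// a: nr rows of n floats, w: nc rows of n floats, bs: output row stride in floats.
void gemm_tiled_reference(int n, float * s, size_t bs,
                          const float * w, const float * a, int nr, int nc) {
    assert(nr % TILE_ROWS == 0 && nc % TILE_COLS == 0);
    assert(bs >= (size_t) nc);
    for (int y = 0; y < nr / TILE_ROWS; ++y) {
        for (int x = 0; x < nc / TILE_COLS; ++x) {
            // One output tile of TILE_ROWS x TILE_COLS results.
            for (int m = 0; m < TILE_ROWS; ++m) {
                for (int j = 0; j < TILE_COLS; ++j) {
                    const int row = y * TILE_ROWS + m;
                    const int col = x * TILE_COLS + j;
                    float sum = 0.0f;
                    for (int i = 0; i < n; ++i) {
                        sum += a[row * n + i] * w[col * n + i];
                    }
                    s[row * bs + col] = sum;
                }
            }
        }
    }
}
```

Under this reading, passing a bs larger than nc leaves padding between output rows untouched, which is why the stride is a separate parameter from nc.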
Usage Examples
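// Quantize 4 rows of 256 floats into interleaved q8_0 blocks for x86 AVX
float rows[4 * 256];                  // 4 consecutive rows, filled by the caller
block_q8_0x4 repacked[256 / QK8_0];   // one interleaved block per QK8_0 columns
ggml_quantize_mat_q8_0_4x8(rows, repacked, 256);
// Perform GEMM with interleaved weights on x86 (n = 256, 8 x 8 output).
// interleaved_weights and quantized_activations are assumed to have been
// prepared beforehand (repacked q4_0 weights, interleaved q8_0 activations).
// bs is passed here as a row stride in float elements rather than bytes.
float output[8 * 8];
ggml_gemm_q4_0_8x8_q8_0(256, output, 8,
    interleaved_weights, quantized_activations, 8, 8);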
Related Pages
- Principle:Ggml_org_Ggml_Architecture_Specific_SIMD_Quantization
- Implementation:Ggml_org_Ggml_Cpu_x86_quants -- x86 quantization and dot product routines
- Implementation:Ggml_org_Ggml_Cpu_x86_cpu_feats -- x86 CPU feature detection and backend scoring
- Implementation:Ggml_org_Ggml_Cpu_arm_repack -- ARM NEON equivalent
- Implementation:Ggml_org_Ggml_Cpu_riscv_repack -- RISC-V RVV equivalent