Implementation:Ggml org Ggml Cpu x86 repack

From Leeroopedia


Metadata

Field | Value
Page Type | Implementation (Architecture-Specific SIMD)
Knowledge Sources | GGML
Domains | ML_Infrastructure, Tensor_Computing, SIMD_Optimization
Last Updated | 2025-05-15 12:00 GMT

Overview

This file implements x86 AVX/AVX2/AVX-512 optimized matrix repacking plus quantized GEMM and GEMV kernels for interleaved block formats. At 6307 lines, it is the most comprehensive repack implementation in the codebase.

Description

arch/x86/repack.cpp is the largest repack file in the GGML codebase, providing comprehensive x86 SIMD-optimized matrix multiplication kernels across multiple SIMD tiers.

The file begins with FP16-to-FP32 loading macros that adapt to available instruction set extensions:

With F16C (hardware FP16 conversion):

  • GGML_F32Cx8_LOAD -- loads 8 FP16 values and converts to FP32 via _mm256_cvtph_ps
  • GGML_F32Cx8_REPEAT_LOAD -- loads 4 FP16 values and repeats them
  • GGML_F32Cx8_REARRANGE_LOAD -- loads with byte shuffle rearrangement
  • GGML_F32Cx8x2_LOAD (AVX-512) -- loads 16 FP16 values into a 512-bit register
  • GGML_F32Cx16_REPEAT_LOAD (AVX-512) -- loads and repeats 4 FP16 values across 16 lanes

Without F16C (software fallback):

  • Scalar conversion loops that manually convert each FP16 element via GGML_CPU_FP16_TO_FP32
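The two loading paths above can be illustrated with a minimal sketch. It is not the GGML implementation: `fp16_to_fp32_scalar` and `load8_fp16_as_fp32` are hypothetical names, and the scalar branch only stands in for what a software fallback has to do when `__F16C__` is not defined. The F16C branch uses the same intrinsic pattern as `GGML_F32Cx8_LOAD`.

```cpp
#include <cstdint>
#include <cstring>
#if defined(__F16C__)
#include <immintrin.h>
#endif

// Hypothetical helper (not a GGML function): convert one IEEE 754 binary16
// bit pattern to a float using integer operations only.
static float fp16_to_fp32_scalar(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    const uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant       = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0) {
        if (mant == 0) {
            bits = sign;                                   // signed zero
        } else {
            int e = -1;                                    // subnormal: renormalize
            do { mant <<= 1; ++e; } while ((mant & 0x400u) == 0);
            mant &= 0x3FFu;
            bits = sign | ((uint32_t)(112 - e) << 23) | (mant << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (mant << 13);          // infinity or NaN
    } else {
        bits = sign | ((exp + 112u) << 23) | (mant << 13); // normal: rebias 15 -> 127
    }
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Hypothetical helper: load 8 FP16 values and widen them to FP32.
static void load8_fp16_as_fp32(const uint16_t * src, float * dst) {
#if defined(__F16C__)
    // Hardware path: one 128-bit load, one conversion instruction.
    _mm256_storeu_ps(dst, _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *) src)));
#else
    // Software path: convert each element individually.
    for (int i = 0; i < 8; ++i) {
        dst[i] = fp16_to_fp32_scalar(src[i]);
    }
#endif
}
```

Both branches produce the same eight floats for finite inputs; only the instruction count differs, which is why the macros are selected per build rather than per call.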

These loading macros feed into the following kernel categories:

Matrix repacking functions (ggml_quantize_mat_q8_0_4x8, ggml_quantize_mat_q8_K_4x8) quantize multiple rows simultaneously into interleaved block formats optimized for SIMD matrix multiplication.
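As a rough illustration of what quantizing four rows into an interleaved block means, here is a scalar sketch. It is written under assumptions: `BlockQ8_0x4Sketch` and `quantize_rows_4x8_sketch` are hypothetical names, the scales are stored as `float` for simplicity (the real block type may store them differently), and the 8-element interleave order is inferred from the `4x8` suffix rather than copied from the source.

```cpp
#include <cmath>
#include <cstdint>

constexpr int kBlock = 32;  // elements per Q8_0 block (QK8_0 in GGML)
constexpr int kRows  = 4;   // rows quantized together
constexpr int kChunk = 8;   // consecutive elements taken from one row at a time

// Hypothetical layout for this sketch: one scale per row, then interleaved quants.
struct BlockQ8_0x4Sketch {
    float  d[kRows];
    int8_t qs[kBlock * kRows];
};

// x holds kRows consecutive rows of k floats; k must be a multiple of kBlock.
void quantize_rows_4x8_sketch(const float * x, BlockQ8_0x4Sketch * y, int64_t k) {
    const int64_t nb = k / kBlock;
    for (int64_t b = 0; b < nb; ++b) {
        float inv[kRows];
        for (int r = 0; r < kRows; ++r) {
            const float * src = x + r * k + b * kBlock;
            float amax = 0.0f;  // largest magnitude in this row's block
            for (int i = 0; i < kBlock; ++i) {
                amax = std::fmax(amax, std::fabs(src[i]));
            }
            const float d = amax / 127.0f;
            y[b].d[r] = d;
            inv[r] = d != 0.0f ? 1.0f / d : 0.0f;
        }
        // Output order: 8 quants from row 0, 8 from row 1, ..., then the next 8 of row 0.
        for (int j = 0; j < kBlock * kRows; ++j) {
            const int chunk  = j / (kRows * kChunk);
            const int row    = (j % (kRows * kChunk)) / kChunk;
            const int within = j % kChunk;
            const float v = x[row * k + b * kBlock + chunk * kChunk + within] * inv[row];
            y[b].qs[j] = (int8_t) std::lround(v);
        }
    }
}
```

Interleaving in 8-byte chunks lets a SIMD kernel fetch matching slices of four rows with one contiguous load instead of four strided ones.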

GEMV kernels (ggml_gemv_q4_0_8x8_q8_0, ggml_gemv_q4_K_8x8_q8_K, ggml_gemv_iq4_nl_8x8_q8_0, ggml_gemv_q2_K_8x8_q8_K) perform matrix-vector products on interleaved quantized weight blocks.

GEMM kernels (ggml_gemm_q4_0_8x8_q8_0, ggml_gemm_q4_K_8x8_q8_K, ggml_gemm_iq4_nl_8x8_q8_0, ggml_gemm_q2_K_8x8_q8_K) perform matrix-matrix products for batched inference.
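To make the interleaved product concrete, the following scalar sketch computes a matrix-vector product over groups of 8 weight columns. It is deliberately simplified and should be read as an assumption-laden illustration: weights are kept as plain `int8_t` rather than packed 4-bit values, scales are `float`, and the struct and function names are hypothetical. A GEMM over the same layout would repeat the inner computation for each activation row.

```cpp
#include <cstdint>

constexpr int kQK    = 32;  // elements per quantized block
constexpr int kCols  = 8;   // weight columns interleaved into one block
constexpr int kChunk = 8;   // consecutive elements stored per column

// Hypothetical simplified layouts for this sketch.
struct WeightBlockx8Sketch { float d[kCols]; int8_t qs[kQK * kCols]; };
struct ActBlockSketch      { float d;        int8_t qs[kQK];         };

// n: inner dimension (multiple of kQK); nc: output columns (multiple of kCols).
// vx holds nc / kCols groups of n / kQK blocks; vy holds n / kQK activation blocks.
void gemv_interleaved_sketch(int n, float * s,
                             const WeightBlockx8Sketch * vx,
                             const ActBlockSketch * vy, int nc) {
    const int nb = n / kQK;
    for (int g = 0; g < nc / kCols; ++g) {
        float sumf[kCols] = {0.0f};
        for (int b = 0; b < nb; ++b) {
            const WeightBlockx8Sketch & w = vx[g * nb + b];
            const ActBlockSketch & a = vy[b];
            for (int c = 0; c < kCols; ++c) {
                int32_t sumi = 0;  // integer dot product for one column
                for (int chunk = 0; chunk < kQK / kChunk; ++chunk) {
                    for (int i = 0; i < kChunk; ++i) {
                        sumi += w.qs[(chunk * kCols + c) * kChunk + i] *
                                a.qs[chunk * kChunk + i];
                    }
                }
                sumf[c] += (float) sumi * w.d[c] * a.d;  // apply both scales
            }
        }
        for (int c = 0; c < kCols; ++c) {
            s[g * kCols + c] = sumf[c];
        }
    }
}
```

In a SIMD version the loop over `c` is the natural candidate for vectorization, since the eight columns' chunks sit next to each other in memory.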

The x86 repack implementation supports more quantization format combinations than other architectures, reflecting the maturity of x86 SIMD optimization in GGML. SIMD paths are tiered behind compile-time preprocessor guards for __AVX__, __AVX2__, __AVX512F__, and __F16C__.
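The guard-based tiering can be sketched with a much smaller example than the real kernels. The function below is hypothetical (`sum_f32_tiered` does not exist in GGML) and uses only the `__AVX__` tier; it is meant to show the pattern of a vector path compiled in when the macro is defined, with a scalar loop that handles the remainder or the whole input.

```cpp
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper: sum n floats, using 8-wide AVX adds when available.
float sum_f32_tiered(const float * x, std::size_t n) {
    std::size_t i = 0;
    float total = 0.0f;
#if defined(__AVX__)
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));
    }
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);        // spill the 8 partial sums
    for (int l = 0; l < 8; ++l) {
        total += lanes[l];
    }
#endif
    for (; i < n; ++i) {                 // scalar tail, or full scalar fallback
        total += x[i];
    }
    return total;
}
```

Because the selection happens in the preprocessor, a binary built without the corresponding compiler flags contains only the scalar loop.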

Usage

This file is compiled as part of the GGML CPU backend when targeting x86-64 platforms. The repacking functions are called during weight preparation, and the GEMV/GEMM kernels are invoked during inference when the scheduler selects the interleaved matrix multiplication path.

Code Reference

Source Location

GGML repo, file: src/ggml-cpu/arch/x86/repack.cpp (6307 lines).

Key Signatures

// FP16 loading macros (with F16C)
#define GGML_F32Cx8_LOAD(x)     _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(x)))
#define GGML_F32Cx8x2_LOAD(x, y) _mm512_cvtph_ps(...)  // AVX-512

// Repacking
void ggml_quantize_mat_q8_0_4x8(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
void ggml_quantize_mat_q8_K_4x8(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);

// GEMV
void ggml_gemv_q4_0_8x8_q8_0(int n, float * GGML_RESTRICT s, size_t bs,
    const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc);
void ggml_gemv_q4_K_8x8_q8_K(int n, float * GGML_RESTRICT s, size_t bs,
    const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc);

// GEMM
void ggml_gemm_q4_0_8x8_q8_0(int n, float * GGML_RESTRICT s, size_t bs,
    const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc);
void ggml_gemm_q4_K_8x8_q8_K(int n, float * GGML_RESTRICT s, size_t bs,
    const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc);

Import

#include "ggml-common.h"
#include "ggml-backend-impl.h"
#include "ggml-cpu.h"
#include "../../repack.h"

I/O Contract

Inputs (Repacking)

Parameter | Type | Description
x | const float * | Source floating-point data laid out as consecutive rows. The function reads multiple rows simultaneously.
k | int64_t | Number of elements per row. Must be a multiple of the block size.

Outputs (Repacking)

Output | Type | Description
vy | void * | Destination buffer for interleaved quantized blocks.
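Sizing the destination buffer follows from the contract above: each group of 4 rows yields one interleaved block per block-size run of elements. The helper below is a hypothetical sketch for the q8_0 case, with the block size restated as a local constant (32, matching GGML's QK8_0) instead of being taken from the GGML headers.

```cpp
#include <cstdint>

constexpr int64_t kQK8_0        = 32;  // elements per q8_0 block
constexpr int64_t kRowsPerGroup = 4;   // rows packed into one interleaved block

// Hypothetical helper: number of 4-row interleaved blocks needed for
// nrows rows of k elements. Returns -1 when the shape cannot be repacked.
int64_t interleaved_q8_0x4_block_count(int64_t nrows, int64_t k) {
    if (nrows <= 0 || k <= 0 ||
        nrows % kRowsPerGroup != 0 || k % kQK8_0 != 0) {
        return -1;
    }
    return (nrows / kRowsPerGroup) * (k / kQK8_0);
}
```

For 4 rows of 256 elements this gives 8 blocks, which matches an array declared with 256 / QK8_0 entries.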

Inputs (GEMV / GEMM)

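Parameter | Type | Description
n | int | Inner dimension size (number of elements per dot product).
bs | size_t | Row stride of the output buffer s. The kernels use it as a multiplier when indexing s, so it is assumed to be counted in float elements rather than bytes.
vx | const void * | Pointer to the interleaved quantized weight matrix.
vy | const void * | Pointer to the quantized activation vector or matrix.
nr | int | Number of rows in the output.
nc | int | Number of columns in the output.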

Outputs (GEMV / GEMM)

Output | Type | Description
s | float * | Destination buffer for the floating-point result matrix/vector.

Usage Examples

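// Repack 4 rows of 256 floats into interleaved q8_0 format for x86 AVX
float rows[4 * 256] = {0};            // 4 consecutive rows, 256 elements each
block_q8_0x4 repacked[256 / QK8_0];   // one interleaved block per QK8_0 elements, covering all 4 rows

ggml_quantize_mat_q8_0_4x8(rows, repacked, 256);

// Perform GEMM with interleaved weights on x86.
// interleaved_weights and quantized_activations are placeholders for buffers
// prepared elsewhere (repacked q4_0 weights and quantized q8_0 activations).
float output[8 * 8];
// bs is assumed to be the output row stride in float elements (the kernels
// index s by row * bs + column), so it is 8 here rather than 8 * sizeof(float).
ggml_gemm_q4_0_8x8_q8_0(256, output, 8,
    interleaved_weights, quantized_activations, 8, 8);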
