Implementation:Ggml org Ggml Cpu weight repack

Metadata

Field	Value
Page Type	Implementation (Weight Repacking)
Knowledge Sources	GGML
Domains	ML_Infrastructure, Tensor_Computing, CPU_Backend, Quantization
Last Updated	2025-05-15 12:00 GMT

Overview

Implements weight repacking for optimized quantized GEMM/GEMV kernels, converting standard quantization block layouts into interleaved formats for better SIMD utilization.

Description

repack.cpp converts standard quantized weight layouts into interleaved block formats that enable faster SIMD processing. It provides:

Interleaved quantization: Generic implementations of ggml_quantize_mat_q8_0_4x4, ggml_quantize_mat_q8_0_4x8, and the corresponding q8_K variants. Each call quantizes four rows at a time into the interleaved block_q8_0x4 layout (or its q8_K equivalent), where the four deltas are grouped first and the quants are interleaved in fixed-size chunks of 4 or 8 bytes. The weight-side layouts follow the same scheme: block_q4_0x4 groups 4 source blocks and block_q4_0x8 groups 8.
Optimized GEMV/GEMM kernels: Implements ggml_gemv_q4_0_4x4_q8_0, ggml_gemm_q4_0_4x4_q8_0, and variants for q4_K, q5_K, q6_K, iq4_nl, q8_0, q2_K formats. These operate on the repacked data for better cache utilization and vectorization.
Backend buffer type: Registers as an extra_buffer_type with custom tensor_traits that intercept GGML_OP_MUL_MAT operations and redirect them to the repacked kernel implementations.
Architecture fallback: Functions use the _generic suffix and are aliased via arch-fallback.h. Architecture-specific optimized versions in arch/arm/repack.cpp, arch/x86/repack.cpp, etc., override these when available.
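
The "deltas first, quants interleaved in fixed-size chunks" layout described above can be illustrated with a small standalone sketch. The names toy_block, toy_block_x4 and toy_interleave are hypothetical and exist only for this illustration; the real ggml structs store fp16 deltas and, for q4_0, two 4-bit quants per byte, so this is an approximation of the idea rather than the library's code.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical toy types, NOT the real ggml structs.
constexpr int TOY_QK = 32;                 // values per source block

struct toy_block {                         // one standard block
    float  d;                              // delta (scale)
    int8_t qs[TOY_QK];                     // quantized values
};

struct toy_block_x4 {                      // four blocks, interleaved
    float  d[4];                           // deltas grouped first
    int8_t qs[4 * TOY_QK];                 // quants interleaved in chunks
};

// Interleave four blocks (one from each of four adjacent rows) using
// chunks of `chunk` bytes: chunk 0 of row 0, chunk 0 of row 1, and so on.
inline toy_block_x4 toy_interleave(const std::array<toy_block, 4> & in, int chunk) {
    assert(chunk > 0 && TOY_QK % chunk == 0);
    toy_block_x4 out{};
    for (int r = 0; r < 4; ++r) {
        out.d[r] = in[r].d;
    }
    const int n_chunks = 4 * TOY_QK / chunk;
    for (int i = 0; i < n_chunks; ++i) {
        const int row = i % 4;             // source row cycles fastest
        const int off = (i / 4) * chunk;   // offset inside that row's block
        std::memcpy(out.qs + i * chunk, in[row].qs + off, chunk);
    }
    return out;
}
```

With chunk = 4 the first 16 output bytes hold bytes 0..3 of rows 0, 1, 2 and 3 in turn, so a kernel can load one contiguous span and obtain matching slices of four rows at once.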

Usage

Weight repacking is activated automatically when the CPU backend's extra buffer types include the repack buffer type (enabled by GGML_USE_CPU_REPACK). Tensors allocated through this buffer type are transparently repacked.

Code Reference

Source Location

GGML repo, file: src/ggml-cpu/repack.cpp (3247 lines).

Signature

// Interleaved quantization packing
void ggml_quantize_mat_q8_0_4x4_generic(const float * GGML_RESTRICT x,
    void * GGML_RESTRICT vy, int64_t k);
void ggml_quantize_mat_q8_0_4x8_generic(const float * GGML_RESTRICT x,
    void * GGML_RESTRICT vy, int64_t k);
void ggml_quantize_mat_q8_K_4x4_generic(const float * GGML_RESTRICT x,
    void * GGML_RESTRICT vy, int64_t k);

// Optimized GEMV/GEMM on repacked data
void ggml_gemv_q4_0_4x4_q8_0(int n, float * GGML_RESTRICT s,
    size_t bs, const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc);
void ggml_gemm_q4_0_4x4_q8_0(int n, float * GGML_RESTRICT s,
    size_t bs, const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc);

// Backend buffer type registration
ggml_backend_buffer_type_t ggml_backend_cpu_repack_buffer_type(void);

Import

#include "repack.h"

I/O Contract

Inputs

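Parameter	Type	Required	Description
`x`	`const float *`	Yes (quantization)	Source float data: four consecutive rows of `k` values each.
`vy`	`void *`	Yes (quantization)	Destination buffer for interleaved quantized blocks. In the GEMV/GEMM kernels `vy` is instead a `const void *` input holding the quantized activations.
`k`	`int64_t`	Yes (quantization)	Number of elements per row (must be a multiple of the block size).
`n`	`int`	Yes (GEMM)	Inner dimension of the matrix multiplication.
`vx`	`const void *`	Yes (GEMM)	Repacked (interleaved) quantized weights.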

Outputs

Output	Type	Description
`vy`	`void *`	Interleaved quantized block data.
`s`	`float *`	Matrix multiplication result buffer.
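
To make the roles of n, vx, vy and s concrete, here is a minimal scalar sketch of a GEMV over an interleaved weight layout. Everything prefixed toy_ or TOY_ is a hypothetical name introduced for this illustration. The sketch assumes 8-bit weights with float deltas and omits the bs and nr parameters, whereas the real kernels take untyped pointers, use packed low-bit weights with fp16 deltas, and are vectorized per architecture.

```cpp
#include <cassert>
#include <cstdint>

constexpr int TOY_QK    = 32;  // values per block
constexpr int TOY_NCOLS = 4;   // weight rows interleaved per packed block
constexpr int TOY_CHUNK = 4;   // interleave chunk size in bytes

struct toy_act_block    { float d; int8_t qs[TOY_QK]; };                        // activations
struct toy_wgt_block_x4 { float d[TOY_NCOLS]; int8_t qs[TOY_NCOLS * TOY_QK]; }; // weights

// s[c] = dot(weight row c, activation row) for c in [0, nc).
//   n  : inner dimension (multiple of TOY_QK)
//   vx : nc / 4 groups of n / TOY_QK packed weight blocks
//   vy : n / TOY_QK activation blocks
inline void toy_gemv(int n, float * s, const toy_wgt_block_x4 * vx,
                     const toy_act_block * vy, int nc) {
    assert(n % TOY_QK == 0 && nc % TOY_NCOLS == 0);
    const int nb = n / TOY_QK;
    for (int x = 0; x < nc / TOY_NCOLS; ++x) {
        const toy_wgt_block_x4 * w = vx + x * nb;
        float acc[TOY_NCOLS] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (int b = 0; b < nb; ++b) {
            for (int j = 0; j < TOY_NCOLS; ++j) {
                int32_t sumi = 0;
                for (int k = 0; k < TOY_QK / TOY_CHUNK; ++k) {
                    for (int i = 0; i < TOY_CHUNK; ++i) {
                        // chunk k of weight row j sits at (k * 4 + j) * CHUNK
                        const int8_t wq = w[b].qs[(k * TOY_NCOLS + j) * TOY_CHUNK + i];
                        const int8_t aq = vy[b].qs[k * TOY_CHUNK + i];
                        sumi += int32_t(wq) * int32_t(aq);
                    }
                }
                // integer dot product scaled by both deltas
                acc[j] += float(sumi) * w[b].d[j] * vy[b].d;
            }
        }
        for (int j = 0; j < TOY_NCOLS; ++j) {
            s[x * TOY_NCOLS + j] = acc[j];
        }
    }
}
```

Each pass over one packed block produces four outputs, which is why interleaving suits SIMD: the four partial sums can occupy adjacent vector lanes.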

Usage Examples

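Quantizing Four Float Rows into the Q8_0 4x4 Interleaved Layout

#include <vector>
#include "repack.h"

// Source: 4 rows of k floats each (k must be a multiple of QK8_0)
const int64_t k = 4 * QK8_0;
std::vector<float> src(4 * k);
// One block_q8_0x4 covers QK8_0 columns of all 4 rows
std::vector<block_q8_0x4> packed(k / QK8_0);

// Quantize and pack into the interleaved 4x4 format
ggml_quantize_mat_q8_0_4x4_generic(src.data(), packed.data(), k);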
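
For orientation, the per-block arithmetic behind Q8_0 quantization can be sketched as below: the delta is the block's maximum absolute value divided by 127, and each quant is the input divided by that delta, rounded to the nearest integer. The names toy_q8_block and toy_quantize_q8_0 are hypothetical, and the delta is kept as a plain float here, whereas ggml stores it in fp16. The interleaved variant applies the same step to each of the four rows before interleaving the results.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int TOY_QK8_0 = 32;  // values per block

// Hypothetical toy type: float delta instead of ggml's fp16.
struct toy_q8_block {
    float  d;                  // delta (scale)
    int8_t qs[TOY_QK8_0];      // quantized values in [-127, 127]
};

// Quantize one block of 32 floats.
inline toy_q8_block toy_quantize_q8_0(const float * x) {
    assert(x != nullptr);
    float amax = 0.0f;         // largest magnitude in the block
    for (int i = 0; i < TOY_QK8_0; ++i) {
        amax = std::fmax(amax, std::fabs(x[i]));
    }
    toy_q8_block out{};
    out.d = amax / 127.0f;
    // Guard against an all-zero block to avoid dividing by zero.
    const float id = out.d != 0.0f ? 1.0f / out.d : 0.0f;
    for (int i = 0; i < TOY_QK8_0; ++i) {
        out.qs[i] = int8_t(std::lround(x[i] * id));
    }
    return out;
}
```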

Related Pages

Ggml_org_Ggml_Cpu_backend_interface -- Registers the repack buffer type as an extra buffer.
Ggml_org_Ggml_Cpu_quantization -- Base quantization primitives that repack builds upon.
Ggml_org_Ggml_Cpu_tensor_ops -- Tensor operations that benefit from repacked weights.
Ggml_org_Ggml_Cpu_simd_mappings -- SIMD macros used in the repacked GEMM kernels.
