Implementation:Ggml org Ggml Cpu weight repack
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation (Weight Repacking) |
| Knowledge Sources | GGML |
| Domains | ML_Infrastructure, Tensor_Computing, CPU_Backend, Quantization |
| Last Updated | 2025-05-15 12:00 GMT |
Overview
Implements weight repacking for optimized quantized GEMM/GEMV kernels, converting standard quantization block layouts into interleaved formats for better SIMD utilization.
Description
repack.cpp converts standard quantized weight layouts into interleaved block formats that enable faster SIMD processing. It provides:
- Interleaved quantization: Generic implementations of ggml_quantize_mat_q8_0_4x4, ggml_quantize_mat_q8_0_4x8, and corresponding q8_K variants. These pack 4 adjacent quantization blocks into interleaved layouts (block_q4_0x4, block_q4_0x8, block_q8_0x4) where deltas are grouped first and quants are interleaved in fixed-size chunks.
- Optimized GEMV/GEMM kernels: Implements ggml_gemv_q4_0_4x4_q8_0, ggml_gemm_q4_0_4x4_q8_0, and variants for q4_K, q5_K, q6_K, iq4_nl, q8_0, q2_K formats. These operate on the repacked data for better cache utilization and vectorization.
- Backend buffer type: Registers as an extra_buffer_type with custom tensor_traits that intercept GGML_OP_MUL_MAT operations and redirect them to the repacked kernel implementations.
- Architecture fallback: Functions use the _generic suffix and are aliased via arch-fallback.h. Architecture-specific optimized versions in arch/arm/repack.cpp, arch/x86/repack.cpp, etc., override these when available.
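The "deltas grouped first, quants interleaved in fixed-size chunks" layout can be sketched in a few lines. This is a minimal illustration, not the code from repack.cpp: BlockQ8, BlockQ8x4, and interleave4 are hypothetical names, deltas are stored as float rather than ggml_half, and the chunk size is passed as a parameter (4 and 8 correspond to the 4x4 and 4x8 variants named above).

```cpp
#include <array>
#include <cstdint>
#include <cstring>

constexpr int QK = 32;  // elements per quantization block (Q8_0 uses 32)

// Hypothetical simplified source block: one delta plus QK quants.
struct BlockQ8 {
    float  d;        // delta (scale); ggml stores this as ggml_half
    int8_t qs[QK];   // quantized values
};

// Hypothetical simplified interleaved block holding 4 source blocks.
struct BlockQ8x4 {
    float  d[4];        // the 4 deltas, grouped first
    int8_t qs[QK * 4];  // quants, interleaved in fixed-size chunks
};

// Interleave 4 blocks by copying `chunk` quants at a time, round-robin.
BlockQ8x4 interleave4(const std::array<BlockQ8, 4>& in, int chunk) {
    BlockQ8x4 out{};
    for (int r = 0; r < 4; ++r) {
        out.d[r] = in[r].d;
    }
    const int chunks_total = QK * 4 / chunk;
    for (int c = 0; c < chunks_total; ++c) {
        const int src_block  = c % 4;            // which source block
        const int src_offset = (c / 4) * chunk;  // position inside it
        std::memcpy(&out.qs[c * chunk], &in[src_block].qs[src_offset], chunk);
    }
    return out;
}
```

With chunk = 4, the output holds quants 0..3 of block 0, then quants 0..3 of block 1, and so on, so a kernel can load the matching chunk of all four blocks with one contiguous read.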
Usage
Weight repacking is activated automatically when the CPU backend's extra buffer types include the repack buffer type (enabled by GGML_USE_CPU_REPACK). Tensors allocated through this buffer type are transparently repacked.
Code Reference
Source Location
GGML repo, file: src/ggml-cpu/repack.cpp (3247 lines).
Signature
// Interleaved quantization packing
void ggml_quantize_mat_q8_0_4x4_generic(const float * GGML_RESTRICT x,
void * GGML_RESTRICT vy, int64_t k);
void ggml_quantize_mat_q8_0_4x8_generic(const float * GGML_RESTRICT x,
void * GGML_RESTRICT vy, int64_t k);
void ggml_quantize_mat_q8_K_4x4_generic(const float * GGML_RESTRICT x,
void * GGML_RESTRICT vy, int64_t k);
// Optimized GEMV/GEMM on repacked data
void ggml_gemv_q4_0_4x4_q8_0(int n, float * GGML_RESTRICT s,
size_t bs, const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc);
void ggml_gemm_q4_0_4x4_q8_0(int n, float * GGML_RESTRICT s,
size_t bs, const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc);
// Backend buffer type registration
ggml_backend_buffer_type_t ggml_backend_cpu_repack_buffer_type(void);
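To show how a GEMV kernel might walk interleaved data, here is a scalar sketch under simplifying assumptions. WeightX4, ActQ8, and gemv_x4_sketch are hypothetical names; the weights are kept as 8-bit values with float deltas instead of packed 4-bit nibbles with ggml_half deltas, and the bs and nr parameters from the signatures above are omitted.

```cpp
#include <cstddef>
#include <cstdint>

constexpr int QK    = 32;  // elements per quantization block
constexpr int NCOLS = 4;   // weight rows interleaved per packed block
constexpr int CHUNK = 4;   // interleave chunk size, in elements

// Hypothetical simplified layouts (not the ggml structs).
struct WeightX4 { float d[NCOLS]; int8_t qs[QK * NCOLS]; };
struct ActQ8    { float d;        int8_t qs[QK]; };

// Computes s[0..nc) = W * a. W has nc rows of n elements, stored as
// (nc / NCOLS) groups of (n / QK) interleaved blocks; a has n / QK blocks.
void gemv_x4_sketch(int n, float* s, const WeightX4* vx, const ActQ8* vy, int nc) {
    const int nb = n / QK;  // blocks along the inner dimension
    for (int x = 0; x < nc / NCOLS; ++x) {
        const WeightX4* w = vx + static_cast<std::size_t>(x) * nb;
        float acc[NCOLS] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (int b = 0; b < nb; ++b) {
            for (int k = 0; k < QK / CHUNK; ++k) {
                for (int j = 0; j < NCOLS; ++j) {
                    // Integer dot product of one weight chunk with the
                    // matching activation chunk.
                    int32_t sum = 0;
                    for (int i = 0; i < CHUNK; ++i) {
                        const int8_t wv = w[b].qs[(k * NCOLS + j) * CHUNK + i];
                        const int8_t av = vy[b].qs[k * CHUNK + i];
                        sum += int32_t(wv) * int32_t(av);
                    }
                    // Rescale by the weight row's delta and the activation delta.
                    acc[j] += float(sum) * w[b].d[j] * vy[b].d;
                }
            }
        }
        for (int j = 0; j < NCOLS; ++j) {
            s[x * NCOLS + j] = acc[j];
        }
    }
}
```

Each pass of the inner loops produces partial sums for four output rows from one contiguous stretch of weight data, which is the access pattern that SIMD versions of such kernels are meant to exploit.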
Import
#include "repack.h"
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| x | const float * | Yes | Source float data for quantization packing. (The GEMV/GEMM kernels take their repacked weights through vx instead.) |
| vy | void * | Yes | Destination buffer for interleaved quantized blocks. |
| k | int64_t | Yes | Number of elements per row (must be a multiple of the block size). |
| n | int | Yes (GEMM) | Inner dimension of the matrix multiplication. |
Outputs
| Output | Type | Description |
|---|---|---|
| vy | void * | Interleaved quantized block data. |
| s | float * | Matrix multiplication result buffer. |
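The constraint on k and the contents of the quantized output can be illustrated with a plain 8-bit block quantizer. This is a sketch rather than the library routine: BlockQ8Sketch and quantize_row_q8_sketch are hypothetical names, the delta is kept as float instead of ggml_half, and the output is not interleaved. It assumes the usual Q8_0 scheme, in which each block of 32 values shares one delta equal to the block's maximum absolute value divided by 127.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

constexpr int QK = 32;  // Q8_0 block size

// Hypothetical simplified block: one delta plus QK 8-bit quants.
struct BlockQ8Sketch {
    float  d;
    int8_t qs[QK];
};

// Quantize one row of k floats; k must be a multiple of QK.
std::vector<BlockQ8Sketch> quantize_row_q8_sketch(const float* x, int64_t k) {
    if (k % QK != 0) {
        throw std::invalid_argument("k must be a multiple of the block size");
    }
    std::vector<BlockQ8Sketch> out(static_cast<std::size_t>(k / QK));
    for (int64_t b = 0; b < k / QK; ++b) {
        // Largest magnitude in the block determines the scale.
        float amax = 0.0f;
        for (int i = 0; i < QK; ++i) {
            amax = std::fmax(amax, std::fabs(x[b * QK + i]));
        }
        const float d  = amax / 127.0f;
        const float id = (d != 0.0f) ? 1.0f / d : 0.0f;  // avoid dividing by zero
        out[b].d = d;
        for (int i = 0; i < QK; ++i) {
            out[b].qs[i] = static_cast<int8_t>(std::round(x[b * QK + i] * id));
        }
    }
    return out;
}
```

A row of k floats therefore yields k / QK blocks; the interleaved variants produce the same number of blocks per row but store four rows' worth of data in each output block.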
Usage Examples
Repacking Q8_0 Weights for 4x4 Interleaving
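#include <cstdint>
#include <vector>
#include "repack.h"
// Source: 4 rows of k floats; k must be a multiple of QK8_0
const int64_t k = 4 * QK8_0;
std::vector<float> weights(4 * k);
// One block_q8_0x4 per group of QK8_0 columns, covering all 4 rows
std::vector<block_q8_0x4> packed(k / QK8_0);
// Pack into interleaved 4x4 format
ggml_quantize_mat_q8_0_4x4_generic(weights.data(), packed.data(), k);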
Related Pages
- Ggml_org_Ggml_Cpu_backend_interface -- Registers the repack buffer type as an extra buffer.
- Ggml_org_Ggml_Cpu_quantization -- Base quantization primitives that repack builds upon.
- Ggml_org_Ggml_Cpu_tensor_ops -- Tensor operations that benefit from repacked weights.
- Ggml_org_Ggml_Cpu_simd_mappings -- SIMD macros used in the repacked GEMM kernels.