Principle:Ggml org Ggml Weight Repacking
| Attribute | Value |
|---|---|
| Page Type | Principle |
| Full Name | Ggml_org_Ggml_Weight_Repacking |
| Short Name | Weight_Repacking |
| Domain Tags | Quantization, Performance |
| Knowledge Source | GGML |
| Last Updated | 2026-02-10 |
Overview
Weight repacking transforms quantized weight layouts from the standard block format into interleaved formats optimized for specific hardware GEMM/GEMV kernels. Matching the data layout to the hardware's execution pattern enables higher throughput.
Description
Weight Repacking is the principle of rearranging quantized weight data from the standard portable block format (e.g., block_q4_0) into interleaved multi-block structures (e.g., block_q4_0x4, block_q4_0x8) that are optimized for specific hardware matrix multiplication patterns. While the standard block format prioritizes portability and simplicity, repacked formats are designed to maximize data throughput when feeding SIMD registers and matrix multiplication hardware units.
In the standard format, a single block_q4_0 contains one scale factor (d) and 32 quantized values packed into 16 bytes. In the repacked block_q4_0x4 format, four consecutive blocks are interleaved: the four scale factors (d[4]) are stored first, followed by the quantized values from all four blocks interleaved at a configurable granularity (blck_size_interleave). This layout ensures that when a SIMD instruction loads a vector-width chunk of data, it gets values from multiple output rows simultaneously, matching the access pattern of tiled GEMM/GEMV kernels.
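The two layouts described above can be sketched as plain structs. This is a minimal illustration, not GGML's actual definition: the scale is modelled as a raw 16-bit value (`half_bits`), the helper `interleave4` is hypothetical, and the sketch assumes one block is taken from each of four rows. The XOR transformation that GGML applies to some formats is left out here.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumption: the FP16 scale is modelled as its raw 16-bit pattern.
using half_bits = uint16_t;
constexpr int QK4_0 = 32;  // quantized values per Q4_0 block

struct block_q4_0 {
    half_bits d;              // one scale factor
    uint8_t   qs[QK4_0 / 2];  // 32 four-bit values packed into 16 bytes
};

struct block_q4_0x4 {
    half_bits d[4];               // the four scale factors, stored first
    uint8_t   qs[4 * QK4_0 / 2];  // 64 bytes of interleaved quants
};

// Hypothetical helper: interleave four blocks in chunks of `chunk` bytes.
// Output order: row0 chunk0, row1 chunk0, row2 chunk0, row3 chunk0, row0 chunk1, ...
// `chunk` plays the role of blck_size_interleave in the text.
inline block_q4_0x4 interleave4(const block_q4_0 in[4], int chunk) {
    assert(chunk > 0 && (QK4_0 / 2) % chunk == 0);
    block_q4_0x4 out{};
    for (int r = 0; r < 4; ++r) {
        out.d[r] = in[r].d;
    }
    const int n_chunks = 4 * (QK4_0 / 2) / chunk;
    for (int i = 0; i < n_chunks; ++i) {
        const int src_row    = i % 4;            // which input block
        const int src_offset = (i / 4) * chunk;  // byte offset inside that block
        std::memcpy(out.qs + i * chunk, in[src_row].qs + src_offset, chunk);
    }
    return out;
}
```

With `chunk = 4`, the first 16 output bytes hold bytes 0..3 of each of the four input blocks, so a single 128-bit load touches all four rows.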
GGML implements repacking for multiple quantization types and interleave widths:
| Repacked Type | Base Type | Interleave Factor | Target Use |
|---|---|---|---|
| block_q4_0x4 | Q4_0 | 4 blocks | ARM NEON 4-wide GEMV/GEMM |
| block_q4_0x8 | Q4_0 | 8 blocks | ARM SVE/SME, AVX-512 8-wide GEMM |
| block_q8_0x4 | Q8_0 | 4 blocks | 4-wide quantized activation packing |
| block_q8_0x8 | Q8_0 | 8 blocks | 8-wide quantized activation packing |
| block_q4_Kx8 | Q4_K | 8 super-blocks | K-quant 8-wide GEMM |
| block_q2_Kx8 | Q2_K | 8 super-blocks | 2-bit K-quant 8-wide GEMM |
| block_q5_Kx8 | Q5_K | 8 super-blocks | 5-bit K-quant 8-wide GEMM |
| block_q6_Kx8 | Q6_K | 8 super-blocks | 6-bit K-quant 8-wide GEMM |
| block_q8_Kx4 | Q8_K | 4 super-blocks | K-quant activation packing |
| block_iq4_nlx4 | IQ4_NL | 4 blocks | IQ4 4-wide operations |
| block_iq4_nlx8 | IQ4_NL | 8 blocks | IQ4 8-wide operations |
Repacking is performed through a dedicated buffer type (ggml_backend_cpu_repack_buffer_type) that transparently converts weights from the standard to the interleaved format when they are loaded into the repack buffer. Architecture-specific implementations exist for ARM (arch/arm/repack.cpp), x86 (arch/x86/repack.cpp), and RISC-V (arch/riscv/repack.cpp), each optimized for its respective instruction set.
An XOR mask transformation is applied during repacking for some formats: the nibbles in Q4_0 quants are converted from bias-offset form (values 0-15 representing -8 to +7) to pure signed form using an XOR with 0x88. This eliminates a subtract-8 operation during unpacking in the GEMM kernel, saving one instruction per element in the inner loop.
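The equivalence behind the XOR trick can be checked exhaustively. The sketch below uses hypothetical helper names (`decode_biased`, `decode_signed`, `to_signed_form`) and only illustrates the arithmetic; it is not taken from the GGML kernels.

```cpp
#include <cassert>
#include <cstdint>

// Decode a nibble stored in bias-offset form: 0..15 represents -8..+7.
inline int decode_biased(uint8_t nibble) {
    return static_cast<int>(nibble & 0x0F) - 8;
}

// Decode a nibble stored as a 4-bit two's-complement value.
inline int decode_signed(uint8_t nibble) {
    const int v = nibble & 0x0F;
    return (v & 0x08) ? v - 16 : v;  // sign-extend from 4 bits
}

// Convert both nibbles of a packed byte at once by flipping each nibble's top bit.
inline uint8_t to_signed_form(uint8_t packed) {
    return static_cast<uint8_t>(packed ^ 0x88);
}

// Exhaustive self-check over all 256 packed bytes: decoding the XOR-ed byte as
// signed nibbles gives the same values as subtracting 8 from the original nibbles.
inline void check_all_bytes() {
    for (int b = 0; b < 256; ++b) {
        const uint8_t orig = static_cast<uint8_t>(b);
        const uint8_t conv = to_signed_form(orig);
        assert(decode_signed(conv & 0x0F) == decode_biased(orig & 0x0F));
        assert(decode_signed(conv >> 4) == decode_biased(orig >> 4));
    }
}
```

Flipping the top bit of a 4-bit value is the same as adding 8 modulo 16, which is why the stored pattern becomes the two's-complement encoding of the value minus 8.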
Usage
Weight repacking is applied as a one-time preprocessing step when loading model weights:
- Model loading with repack buffer: When an application allocates weights into a repack buffer type, the weights are automatically converted from standard block format to the interleaved format matching the current hardware. This happens once at load time, amortized over all subsequent inference calls.
- GEMM/GEMV kernel selection: The tiled GEMM/GEMV kernels (e.g., ggml_gemm_q4_0_4x4_q8_0, ggml_gemm_q4_0_8x8_q8_0) expect inputs in the corresponding repacked format. The "4x4" or "8x8" suffix indicates the interleave factor and tile dimensions.
- Platform-specific optimization: The ARM repack path uses NEON instructions for fast interleaving, the x86 path uses AVX/AVX-512 shuffle instructions, and the RISC-V path uses vector gather operations. Each produces the same logical interleaved format but with architecture-optimal conversion code.
- Activation repacking: In addition to weight repacking, the activation vectors (inputs) are quantized into interleaved Q8_0 format (ggml_quantize_mat_q8_0_4x4, ggml_quantize_mat_q8_0_4x8) on the fly during inference to match the kernel's expected layout.
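The activation side can be sketched in the same spirit. The code below is an illustration under stated assumptions, not the body of ggml_quantize_mat_q8_0_4x4: the struct `q8_0x4_sketch` and the function `quantize_q8_0x4` are hypothetical, scales are kept as `float` rather than FP16, and only one 32-value block per row is processed. It uses the usual Q8_0 rule (scale = max absolute value / 127, quant = rounded value / scale) and writes the quants interleaved in chunks of `chunk` values.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int QK8_0 = 32;  // values per Q8_0 block

// Assumption: scales are stored as float here for simplicity.
struct q8_0x4_sketch {
    float  d[4];            // one scale per row
    int8_t qs[4 * QK8_0];   // interleaved 8-bit quants from four rows
};

// Hypothetical helper. `x` holds 4 rows of QK8_0 floats, row-major:
// x[r * QK8_0 + j]. Quants are interleaved in chunks of `chunk` values.
inline q8_0x4_sketch quantize_q8_0x4(const float* x, int chunk) {
    assert(chunk > 0 && QK8_0 % chunk == 0);
    q8_0x4_sketch out{};
    float inv_d[4];
    for (int r = 0; r < 4; ++r) {
        float amax = 0.0f;
        for (int j = 0; j < QK8_0; ++j) {
            amax = std::fmax(amax, std::fabs(x[r * QK8_0 + j]));
        }
        out.d[r] = amax / 127.0f;
        inv_d[r] = (out.d[r] != 0.0f) ? 1.0f / out.d[r] : 0.0f;
    }
    for (int i = 0; i < 4 * QK8_0; ++i) {
        const int group = i / chunk;                     // which chunk of the output
        const int r     = group % 4;                     // source row
        const int j     = (group / 4) * chunk + i % chunk;  // source column
        out.qs[i] = static_cast<int8_t>(std::lround(x[r * QK8_0 + j] * inv_d[r]));
    }
    return out;
}
```

Because the weights and the activations are interleaved with the same chunk size, a kernel can walk both buffers linearly and pair up matching chunks without any shuffling.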
Theoretical Basis
Data Layout for SIMD Throughput
SIMD instructions load contiguous memory into vector registers. In a naive layout where all elements of one matrix row are contiguous, a SIMD load gets data for one output element at a time. In an interleaved layout where elements from N rows are interleaved, a single SIMD load retrieves data contributing to N output elements simultaneously. For a 128-bit NEON register processing 4-bit values, interleaving 4 rows means each vector load provides 4 partial dot products in parallel. For 512-bit AVX-512, interleaving 8 rows provides 8 partial dot products per load. This directly maps to the hardware's ability to compute multiple output elements per clock cycle.
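A scalar sketch makes the access pattern concrete. The function names below are hypothetical and the inner four-lane loop stands in for what a real kernel would do with a single vector multiply-accumulate; plain `int8_t` weights are used instead of packed 4-bit values to keep the example short.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major reference: each output walks its own contiguous row.
inline std::array<int32_t, 4> gemv4_row_major(const std::vector<int8_t>& w,
                                              const std::vector<int8_t>& x) {
    const std::size_t k = x.size();
    assert(w.size() == 4 * k);
    std::array<int32_t, 4> y{};
    for (std::size_t r = 0; r < 4; ++r) {
        for (std::size_t i = 0; i < k; ++i) {
            y[r] += w[r * k + i] * x[i];
        }
    }
    return y;
}

// Interleave four rows element by element: w_il[i * 4 + r] = w[r * k + i].
inline std::vector<int8_t> interleave_rows4(const std::vector<int8_t>& w, std::size_t k) {
    assert(w.size() == 4 * k);
    std::vector<int8_t> w_il(4 * k);
    for (std::size_t r = 0; r < 4; ++r) {
        for (std::size_t i = 0; i < k; ++i) {
            w_il[i * 4 + r] = w[r * k + i];
        }
    }
    return w_il;
}

// Interleaved layout: every group of 4 contiguous weights feeds 4 different
// accumulators, so one contiguous read advances all four outputs.
inline std::array<int32_t, 4> gemv4_interleaved(const std::vector<int8_t>& w_il,
                                                const std::vector<int8_t>& x) {
    const std::size_t k = x.size();
    assert(w_il.size() == 4 * k);
    std::array<int32_t, 4> y{};
    for (std::size_t i = 0; i < k; ++i) {
        for (std::size_t lane = 0; lane < 4; ++lane) {  // one vector op in a real kernel
            y[lane] += w_il[i * 4 + lane] * x[i];
        }
    }
    return y;
}
```

Both functions compute the same four dot products; only the order in which memory is touched differs.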
Tiled Matrix Multiplication
Tiled (blocked) matrix multiplication partitions the output matrix into small tiles (e.g., 4x4 or 8x8 elements) and computes each tile using a sequence of vector dot product instructions. The tile dimensions are chosen to fit the SIMD register file: with 32 vector registers (AArch64 NEON, AVX-512), the tile's accumulators must stay resident in registers for the whole inner loop while leaving some registers free for the weight and activation operands. The repacked weight format ensures that the data needed for one tile is laid out contiguously in memory, enabling sequential access patterns that are cache-friendly and avoid gather operations.
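The tiling idea can be shown with a small integer example. This is a generic sketch of blocked matrix multiplication, not GGML's kernel: the names `matmul_tiled` and `matmul_naive` and the fixed `TILE` size are assumptions, and the local `acc` array plays the role of the register-resident accumulators.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int TILE = 4;  // assumed tile edge; a real kernel picks this per architecture

// C (m x n) = A (m x k) * B (k x n), all row-major, computed one TILE x TILE tile at a time.
inline std::vector<int32_t> matmul_tiled(const std::vector<int32_t>& a,
                                         const std::vector<int32_t>& b,
                                         int m, int k, int n) {
    assert(m % TILE == 0 && n % TILE == 0);
    assert(a.size() == static_cast<std::size_t>(m) * k);
    assert(b.size() == static_cast<std::size_t>(k) * n);
    std::vector<int32_t> c(static_cast<std::size_t>(m) * n, 0);
    for (int i0 = 0; i0 < m; i0 += TILE) {
        for (int j0 = 0; j0 < n; j0 += TILE) {
            int32_t acc[TILE][TILE] = {};  // stays "in registers" for the whole k loop
            for (int p = 0; p < k; ++p) {
                for (int i = 0; i < TILE; ++i) {
                    for (int j = 0; j < TILE; ++j) {
                        acc[i][j] += a[(i0 + i) * k + p] * b[p * n + j0 + j];
                    }
                }
            }
            for (int i = 0; i < TILE; ++i) {
                for (int j = 0; j < TILE; ++j) {
                    c[(i0 + i) * n + j0 + j] = acc[i][j];  // write the tile back once
                }
            }
        }
    }
    return c;
}

// Straightforward triple loop used as a reference.
inline std::vector<int32_t> matmul_naive(const std::vector<int32_t>& a,
                                         const std::vector<int32_t>& b,
                                         int m, int k, int n) {
    std::vector<int32_t> c(static_cast<std::size_t>(m) * n, 0);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            for (int p = 0; p < k; ++p)
                c[i * n + j] += a[i * k + p] * b[p * n + j];
    return c;
}
```

Each tile reads a TILE-row strip of A and a TILE-column strip of B; repacking stores the weight strip for a tile contiguously so that the inner loop is a linear scan.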
Compile-Time Size Verification
The repacked block structures use C++ templates (template <int K, int N> struct block) and static_assert statements to verify at compile time that the structure sizes match expectations. For example, block<4,4> (Q4_0 interleaved 4-wide) must have exactly 4 * sizeof(ggml_half) + QK8_0 * 2 bytes. This catches alignment or padding errors at compile time rather than at runtime, a critical safety measure for binary data formats.
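A minimal sketch of such a templated block with compile-time size checks follows. It is an approximation for illustration: the scale type is modelled as `uint16_t`, `QK` is assumed to be 32 for both the 4-bit and 8-bit base formats, and K is read as bits per value with N as the number of interleaved blocks.

```cpp
#include <cassert>
#include <cstdint>

// Assumptions: FP16 scale modelled as uint16_t; 32 values per base block.
using half_bits = uint16_t;
constexpr int QK = 32;

// K = bits per quantized value, N = number of interleaved blocks.
template <int K, int N>
struct block {
    half_bits d[N];                // N scale factors
    int8_t    qs[(QK * N * K) / 8];  // N blocks of QK values at K bits each
};

using q4_0x4 = block<4, 4>;
using q4_0x8 = block<4, 8>;
using q8_0x4 = block<8, 4>;
using q8_0x8 = block<8, 8>;

// Any unexpected padding or a wrong array length fails the build.
static_assert(sizeof(q4_0x4) == 4 * sizeof(half_bits) + QK * 2, "wrong q4_0x4 size/padding");
static_assert(sizeof(q4_0x8) == 8 * sizeof(half_bits) + QK * 4, "wrong q4_0x8 size/padding");
static_assert(sizeof(q8_0x4) == 4 * sizeof(half_bits) + QK * 4, "wrong q8_0x4 size/padding");
static_assert(sizeof(q8_0x8) == 8 * sizeof(half_bits) + QK * 8, "wrong q8_0x8 size/padding");

// Hypothetical helper: bytes needed to store n_blocks repacked blocks of type B.
template <typename B>
constexpr std::size_t repacked_bytes(std::size_t n_blocks) {
    return n_blocks * sizeof(B);
}
```

Since repacked buffers are reinterpreted from raw bytes, a size mismatch would silently shift every later block; checking at compile time turns that into a build error.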
Amortized Repacking Cost
Repacking has a one-time cost proportional to the model size. For a 7B-parameter model in Q4_0 format (~3.5 GB), repacking is a single linear pass over the weights, bounded by memory bandwidth, and is typically small compared to the total load time from disk. The per-inference benefit, however, is a sustained throughput improvement on every matrix multiplication operation (potentially hundreds per token), making the amortized cost per token essentially zero.