Principle:Ggml org Ggml Weight Repacking

Attribute	Value
Page Type	Principle
Full Name	Ggml_org_Ggml_Weight_Repacking
Short Name	Weight_Repacking
Domain Tags	Quantization, Performance
Knowledge Source	GGML
Last Updated	2026-02-10

Overview

Transforming quantized weight layouts from standard block format to interleaved formats optimized for specific hardware GEMM/GEMV kernels, enabling higher throughput by matching data layout to hardware execution patterns.

Description

Weight Repacking is the principle of rearranging quantized weight data from the standard portable block format (e.g., block_q4_0) into interleaved multi-block structures (e.g., block_q4_0x4, block_q4_0x8) that are optimized for specific hardware matrix multiplication patterns. While the standard block format prioritizes portability and simplicity, repacked formats are designed to maximize data throughput when feeding SIMD registers and matrix multiplication hardware units.

In the standard format, a single block_q4_0 contains one scale factor (d) and 32 quantized 4-bit values packed into 16 bytes. In the repacked block_q4_0x4 format, four blocks, one from each of four adjacent weight rows, are interleaved: the four scale factors (d[4]) are stored first, followed by the quantized values from all four blocks interleaved at a configurable granularity (blck_size_interleave). With this layout, a SIMD instruction that loads a vector-width chunk of data receives values from multiple output rows at once, which matches the access pattern of tiled GEMM/GEMV kernels.
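The layout described above can be sketched in a few lines of C++. This is a minimal illustration, not GGML's code: the type names BlockQ4_0 and BlockQ4_0x4, the half_bits alias (a 16-bit stand-in for ggml_half), and the function interleave_q4_0x4 are hypothetical, and the sketch leaves out any nibble transformation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

constexpr int QK4_0 = 32;        // quantized values per Q4_0 block

using half_bits = uint16_t;      // assumption: raw fp16 bits stand in for ggml_half

struct BlockQ4_0 {               // standard layout: 2 + 16 = 18 bytes
    half_bits d;                 // scale factor
    uint8_t   qs[QK4_0 / 2];     // 32 nibbles packed into 16 bytes
};

struct BlockQ4_0x4 {             // interleaved layout: 8 + 64 = 72 bytes
    half_bits d[4];              // the four scale factors come first
    uint8_t   qs[QK4_0 * 2];     // quants of all four blocks, interleaved
};

static_assert(sizeof(BlockQ4_0) == 18, "unexpected padding in BlockQ4_0");
static_assert(sizeof(BlockQ4_0x4) == 72, "unexpected padding in BlockQ4_0x4");

// Interleave four source blocks in chunks of `interleave` bytes:
// chunk 0 comes from block 0, chunk 1 from block 1, ..., chunk 4 from block 0 again.
inline BlockQ4_0x4 interleave_q4_0x4(const BlockQ4_0 in[4], int interleave) {
    assert(interleave > 0 && (QK4_0 / 2) % interleave == 0);
    BlockQ4_0x4 out{};
    for (int i = 0; i < 4; ++i) {
        out.d[i] = in[i].d;
    }
    const int chunks = (QK4_0 * 2) / interleave;
    for (int c = 0; c < chunks; ++c) {
        const int src_block  = c % 4;                 // which source block
        const int src_offset = (c / 4) * interleave;  // byte offset inside it
        std::memcpy(&out.qs[c * interleave], &in[src_block].qs[src_offset], interleave);
    }
    return out;
}
```

With interleave set to 4, bytes 0..3 of the output come from block 0, bytes 4..7 from block 1, and so on, so a 16-byte load covers one 4-byte chunk from each of the four blocks.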

GGML implements repacking for multiple quantization types and interleave widths:

Repacked Type	Base Type	Interleave Factor	Target Use
`block_q4_0x4`	Q4_0	4 blocks	ARM NEON 4-wide GEMV/GEMM
`block_q4_0x8`	Q4_0	8 blocks	ARM SVE/SME, AVX-512 8-wide GEMM
`block_q8_0x4`	Q8_0	4 blocks	4-wide quantized activation packing
`block_q8_0x8`	Q8_0	8 blocks	8-wide quantized activation packing
`block_q4_Kx8`	Q4_K	8 super-blocks	K-quant 8-wide GEMM
`block_q2_Kx8`	Q2_K	8 super-blocks	2-bit K-quant 8-wide GEMM
`block_q5_Kx8`	Q5_K	8 super-blocks	5-bit K-quant 8-wide GEMM
`block_q6_Kx8`	Q6_K	8 super-blocks	6-bit K-quant 8-wide GEMM
`block_q8_Kx4`	Q8_K	4 super-blocks	K-quant activation packing
`block_iq4_nlx4`	IQ4_NL	4 blocks	IQ4 4-wide operations
`block_iq4_nlx8`	IQ4_NL	8 blocks	IQ4 8-wide operations

The repacking is performed through a dedicated buffer type (ggml_backend_cpu_repack_buffer_type) that transparently converts weights from standard to interleaved format when they are loaded into the repack buffer. Architecture-specific implementations exist for ARM (arch/arm/repack.cpp), x86 (arch/x86/repack.cpp), and RISC-V (arch/riscv/repack.cpp), each optimized for its respective instruction set.

An XOR mask transformation is applied during repacking for some formats: the nibbles in Q4_0 quants are converted from bias-offset form (values 0-15 representing -8 to +7) to signed two's-complement form using an XOR with 0x88. The mask flips the top bit of both nibbles in each byte, which for a 4-bit value is equivalent to subtracting 8 and reinterpreting the result as signed. This removes the subtract-8 step from unpacking in the GEMM kernel's inner loop.
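The equivalence behind the XOR mask can be checked exhaustively. The sketch below is illustrative only; the helper names (decode_biased, decode_signed, to_signed_form, xor_matches_bias_for_all_bytes) are hypothetical and are not GGML functions.

```cpp
#include <cassert>
#include <cstdint>

// Bias-offset decoding: a stored nibble 0..15 represents -8..+7.
inline int decode_biased(uint8_t nibble) {
    return static_cast<int>(nibble & 0x0F) - 8;
}

// Two's-complement decoding of a 4-bit value (sign-extend bit 3).
inline int decode_signed(uint8_t nibble) {
    nibble &= 0x0F;
    return (nibble & 0x08) ? static_cast<int>(nibble) - 16 : static_cast<int>(nibble);
}

// Flip the top bit of both nibbles packed in one byte.
inline uint8_t to_signed_form(uint8_t packed) {
    return static_cast<uint8_t>(packed ^ 0x88);
}

// Check every possible packed byte: decoding the XOR-ed nibbles as signed
// must give the same values as decoding the original nibbles with the bias.
inline bool xor_matches_bias_for_all_bytes() {
    for (int b = 0; b < 256; ++b) {
        const uint8_t p = static_cast<uint8_t>(b);
        const uint8_t r = to_signed_form(p);
        if (decode_biased(p & 0x0F) != decode_signed(r & 0x0F)) return false;
        if (decode_biased(p >> 4) != decode_signed(r >> 4)) return false;
    }
    return true;
}
```

Because the conversion is done once at repack time, a kernel that consumes the repacked data can sign-extend the nibbles directly instead of subtracting the bias for every value.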

Usage

Weight repacking is applied as a one-time preprocessing step when loading model weights:

Model loading with repack buffer: When an application allocates weights into a repack buffer type, the weights are automatically converted from standard block format to the interleaved format matching the current hardware. This happens once at load time, amortized over all subsequent inference calls.
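GEMM/GEMV kernel selection: The tiled GEMM/GEMV kernels (e.g., ggml_gemm_q4_0_4x4_q8_0, ggml_gemm_q4_0_8x8_q8_0) expect inputs in the corresponding repacked format. In the "4x4" or "8x8" suffix, the first number is the count of interleaved rows and the second is the interleave granularity (the blck_size_interleave chunk size), so a kernel must be paired with weights repacked using the same two parameters.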
Platform-specific optimization: The ARM repack path uses NEON instructions for fast interleaving, the x86 path uses AVX/AVX-512 shuffle instructions, and the RISC-V path uses vector gather operations. Each produces the same logical interleaved format but with architecture-optimal conversion code.
Activation repacking: In addition to weight repacking, the activation vectors (inputs) are quantized into interleaved Q8_0 format (ggml_quantize_mat_q8_0_4x4, ggml_quantize_mat_q8_0_4x8) on the fly during inference to match the kernel's expected layout.
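The on-the-fly activation packing mentioned in the last item can be sketched as follows. This is a simplified illustration under stated assumptions, not the GGML routine: the struct BlockQ8_0x4 and the function quantize_q8_0x4 are hypothetical, the scales are kept as float rather than fp16, and the scale is taken as the row block's maximum absolute value divided by 127.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int QK8_0 = 32;  // values per Q8_0 block

struct BlockQ8_0x4 {       // hypothetical: float scales instead of fp16
    float  d[4];           // one scale per row
    int8_t qs[QK8_0 * 4];  // quants of four rows, interleaved
};

using Rows4 = std::array<std::array<float, QK8_0>, 4>;

// Quantize one 32-value block from each of four activation rows and
// interleave the results in chunks of `interleave` values.
inline BlockQ8_0x4 quantize_q8_0x4(const Rows4& rows, int interleave) {
    assert(interleave > 0 && QK8_0 % interleave == 0);
    BlockQ8_0x4 out{};
    float inv[4];
    for (int r = 0; r < 4; ++r) {
        float amax = 0.0f;
        for (int j = 0; j < QK8_0; ++j) {
            amax = std::max(amax, std::fabs(rows[r][j]));
        }
        out.d[r] = amax / 127.0f;
        inv[r] = out.d[r] != 0.0f ? 1.0f / out.d[r] : 0.0f;
    }
    for (int j = 0; j < QK8_0 * 4; ++j) {
        const int chunk = j / interleave;
        const int r     = chunk % 4;                                // source row
        const int src   = (chunk / 4) * interleave + j % interleave; // index in that row
        out.qs[j] = static_cast<int8_t>(std::lround(rows[r][src] * inv[r]));
    }
    return out;
}
```

Calling quantize_q8_0x4(rows, 4) or quantize_q8_0x4(rows, 8) corresponds loosely to the two granularities implied by the 4x4 and 4x8 function names in the text.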

Theoretical Basis

Data Layout for SIMD Throughput

SIMD instructions load contiguous memory into vector registers. In a naive layout where all elements of one matrix row are contiguous, a SIMD load gets data for one output element at a time. In an interleaved layout where elements from N rows are interleaved, a single SIMD load retrieves data contributing to N output elements simultaneously. For a 128-bit NEON register processing 4-bit values, interleaving 4 rows means each vector load provides 4 partial dot products in parallel. For 512-bit AVX-512, interleaving 8 rows provides 8 partial dot products per load. This directly maps to the hardware's ability to compute multiple output elements per clock cycle.
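A scalar simulation makes the effect of interleaving concrete. In this sketch (all names hypothetical, int8 weights used for simplicity instead of packed 4-bit values), one contiguous 16-byte span plays the role of a single 128-bit vector load and feeds four row accumulators at once.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int ROWS  = 4;  // rows interleaved together
constexpr int CHUNK = 4;  // values taken from each row per interleaved chunk

// One simulated 128-bit load: 16 contiguous int8 weights hold CHUNK values
// from each of ROWS rows. All rows are multiplied by the same CHUNK activations,
// so one load advances ROWS partial dot products.
inline void dot_step(const int8_t* w16, const int8_t* a4, std::array<int32_t, ROWS>& acc) {
    for (int r = 0; r < ROWS; ++r) {
        for (int k = 0; k < CHUNK; ++k) {
            acc[r] += static_cast<int32_t>(w16[r * CHUNK + k]) * static_cast<int32_t>(a4[k]);
        }
    }
}

// GEMV over ROWS interleaved rows of length n: the weights are walked
// strictly sequentially, 16 bytes per step.
inline std::array<int32_t, ROWS> gemv_interleaved(const int8_t* w, const int8_t* a, int n) {
    assert(n % CHUNK == 0);
    std::array<int32_t, ROWS> acc{};
    for (int k = 0; k < n; k += CHUNK) {
        dot_step(w + k * ROWS, a + k, acc);
    }
    return acc;
}
```

In a real kernel the inner loops of dot_step would be replaced by a vector dot-product instruction; the point of the sketch is only that the memory walk is sequential and every load contributes to all four outputs.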

Tiled Matrix Multiplication

Tiled (blocked) matrix multiplication partitions the output matrix into small tiles (e.g., 4x4 or 8x8 elements) and computes each tile using a sequence of vector dot product instructions. The tile dimensions are chosen to fit the SIMD register file: with 32 vector registers (as on AArch64 NEON or AVX-512), the tile's accumulators have to leave registers free for the loaded weights and activations, so kernels typically hold several output elements per accumulator register rather than one register per element. The repacked weight format ensures that the data needed for one tile is laid out contiguously in memory, enabling sequential access patterns that are cache-friendly and avoid gather operations.
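The tiling idea can be shown with a plain floating-point sketch. It is not a GGML kernel: matmul_tiled and TILE are hypothetical names, the weights are stored one row per output column (N x K, row-major), and no quantization or SIMD is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int TILE = 4;  // tile edge; output is computed in TILE x TILE blocks

// C (M x N) = A (M x K) * W^T, where W is N x K row-major.
inline std::vector<float> matmul_tiled(const std::vector<float>& A,
                                       const std::vector<float>& W,
                                       int M, int N, int K) {
    assert(M % TILE == 0 && N % TILE == 0);
    assert(A.size() == static_cast<std::size_t>(M) * K);
    assert(W.size() == static_cast<std::size_t>(N) * K);
    std::vector<float> C(static_cast<std::size_t>(M) * N, 0.0f);
    for (int i0 = 0; i0 < M; i0 += TILE) {
        for (int j0 = 0; j0 < N; j0 += TILE) {
            // Accumulators for one tile; a real kernel keeps these in vector registers.
            float acc[TILE][TILE] = {};
            for (int k = 0; k < K; ++k) {
                for (int i = 0; i < TILE; ++i) {
                    for (int j = 0; j < TILE; ++j) {
                        acc[i][j] += A[(i0 + i) * K + k] * W[(j0 + j) * K + k];
                    }
                }
            }
            // Write the finished tile back once.
            for (int i = 0; i < TILE; ++i) {
                for (int j = 0; j < TILE; ++j) {
                    C[(i0 + i) * N + (j0 + j)] = acc[i][j];
                }
            }
        }
    }
    return C;
}
```

Each tile reads TILE rows of A and TILE rows of W and writes its outputs once, which is why storing those TILE weight rows interleaved lets the k loop stream through memory sequentially.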

Compile-Time Size Verification

The repacked block structures use C++ templates (template <int K, int N> struct block) and static_assert statements to verify at compile time that the structure sizes match expectations. For example, block<4,4> (Q4_0 interleaved 4-wide) must have exactly 4 * sizeof(ggml_half) + QK8_0 * 2 bytes. This catches alignment or padding errors at compile time rather than at runtime, a critical safety measure for binary data formats.
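A compile-time size check of this kind might look like the sketch below. The names InterleavedBlock and half_bits are hypothetical stand-ins, the reading of K as bits per value and N as the number of interleaved blocks is an assumption, and only the first assertion mirrors the size formula quoted in the text; the other two are extrapolated for illustration.

```cpp
#include <cstdint>

using half_bits = uint16_t;  // assumption: 16-bit stand-in for ggml_half
constexpr int QK8_0 = 32;    // values per block, the constant named in the text

// K = bits per quantized value, N = number of interleaved blocks (assumed meaning).
template <int K, int N>
struct InterleavedBlock {
    half_bits d[N];                       // one scale per interleaved block
    int8_t    qs[(QK8_0 * N * K) / 8];    // N blocks of 32 K-bit values
};

// 4-bit, 4-wide: 8 bytes of scales + 64 bytes of quants.
static_assert(sizeof(InterleavedBlock<4, 4>) == 4 * sizeof(half_bits) + QK8_0 * 2,
              "wrong InterleavedBlock<4,4> size/padding");
// 4-bit, 8-wide: 16 bytes of scales + 128 bytes of quants.
static_assert(sizeof(InterleavedBlock<4, 8>) == 8 * sizeof(half_bits) + QK8_0 * 4,
              "wrong InterleavedBlock<4,8> size/padding");
// 8-bit, 4-wide: 8 bytes of scales + 128 bytes of quants.
static_assert(sizeof(InterleavedBlock<8, 4>) == 4 * sizeof(half_bits) + QK8_0 * 4,
              "wrong InterleavedBlock<8,4> size/padding");
```

If a compiler inserted padding, or a member type changed size, the translation unit would fail to build instead of silently misreading weight data.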

Amortized Repacking Cost

Repacking has a one-time cost proportional to the model size. For a 7B-parameter model in Q4_0 format (~3.5 GB), it amounts to a single streaming pass over the weights, which is small compared to the total load time from disk. The per-inference benefit, however, is a sustained throughput improvement on every matrix multiplication operation (potentially hundreds per token), so the repacking cost amortized per token approaches zero as more tokens are generated.
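The amortization argument can be written out explicitly. The symbols are introduced here for illustration only: T_repack is the one-time repacking time, N_tok the number of tokens processed after loading, and t_std and t_rep the per-token times with the standard and repacked layouts.

```latex
\[
t_{\text{eff}}(N_{\text{tok}}) = t_{\text{rep}} + \frac{T_{\text{repack}}}{N_{\text{tok}}},
\qquad
t_{\text{eff}} < t_{\text{std}}
\iff
N_{\text{tok}} > \frac{T_{\text{repack}}}{t_{\text{std}} - t_{\text{rep}}}
\quad (t_{\text{rep}} < t_{\text{std}}).
\]
```

The second term vanishes as N_tok grows, so repacking pays off once the number of processed tokens exceeds the break-even count on the right.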

Related Pages

Implemented By

Implementation:Ggml_org_Ggml_Cpu_weight_repack
