Principle:Ggml org Ggml Weight Repacking

From Leeroopedia


| Attribute | Value |
| --- | --- |
| Page Type | Principle |
| Full Name | Ggml_org_Ggml_Weight_Repacking |
| Short Name | Weight_Repacking |
| Domain Tags | Quantization, Performance |
| Knowledge Source | GGML |
| Last Updated | 2026-02-10 |

Overview

Weight repacking transforms quantized weight layouts from the standard block format into interleaved formats optimized for specific hardware GEMM/GEMV kernels, raising throughput by matching the data layout to hardware execution patterns.

Description

Weight Repacking is the principle of rearranging quantized weight data from the standard portable block format (e.g., block_q4_0) into interleaved multi-block structures (e.g., block_q4_0x4, block_q4_0x8) that are optimized for specific hardware matrix multiplication patterns. While the standard block format prioritizes portability and simplicity, repacked formats are designed to maximize data throughput when feeding SIMD registers and matrix multiplication hardware units.

In the standard format, a single block_q4_0 contains one scale factor (d) and 32 quantized values packed into 16 bytes. In the repacked block_q4_0x4 format, four consecutive blocks are interleaved: the four scale factors (d[4]) are stored first, followed by the quantized values from all four blocks interleaved at a configurable granularity (blck_size_interleave). This layout ensures that when a SIMD instruction loads a vector-width chunk of data, it gets values from multiple output rows simultaneously, matching the access pattern of tiled GEMM/GEMV kernels.
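The interleaving described above can be sketched in a few lines. This is a simplified illustration, not the GGML source: the struct names `BlockQ4_0` and `BlockQ4_0x4` and the function `repack_q4_0x4` are hypothetical stand-ins, the scale is kept as raw 16-bit storage instead of `ggml_half`, and the XOR step discussed later on this page is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr int QK4_0 = 32;  // quantized values per Q4_0 block

// Hypothetical mirror of block_q4_0: one scale plus 32 nibbles in 16 bytes.
struct BlockQ4_0 {
    uint16_t d;              // scale, stored here as raw fp16 bits
    uint8_t  qs[QK4_0 / 2];  // two 4-bit values per byte
};

// Hypothetical mirror of block_q4_0x4: four scales first, then 64 quant bytes.
struct BlockQ4_0x4 {
    uint16_t d[4];
    uint8_t  qs[4 * QK4_0 / 2];
};

// Interleave four blocks (one per output row) in chunks of `interleave` bytes.
// Output chunk i is taken from source block (i % 4) at byte offset
// (i / 4) * interleave, so consecutive chunks cycle through the four rows.
inline BlockQ4_0x4 repack_q4_0x4(const BlockQ4_0 in[4], int interleave) {
    BlockQ4_0x4 out{};
    for (int r = 0; r < 4; ++r) {
        out.d[r] = in[r].d;
    }
    const int chunks = 4 * (QK4_0 / 2) / interleave;
    for (int i = 0; i < chunks; ++i) {
        const int src_block  = i % 4;
        const int src_offset = (i / 4) * interleave;
        std::memcpy(out.qs + i * interleave,
                    in[src_block].qs + src_offset,
                    static_cast<std::size_t>(interleave));
    }
    return out;
}
```

With `interleave = 4`, the first 16 output bytes hold bytes 0 to 3 of each of the four source blocks in turn, which is the pattern a kernel reading 16 contiguous bytes would see.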

GGML implements repacking for multiple quantization types and interleave widths:

| Repacked Type | Base Type | Interleave Factor | Target Use |
| --- | --- | --- | --- |
| block_q4_0x4 | Q4_0 | 4 blocks | ARM NEON 4-wide GEMV/GEMM |
| block_q4_0x8 | Q4_0 | 8 blocks | ARM SVE/SME, AVX-512 8-wide GEMM |
| block_q8_0x4 | Q8_0 | 4 blocks | 4-wide quantized activation packing |
| block_q8_0x8 | Q8_0 | 8 blocks | 8-wide quantized activation packing |
| block_q4_Kx8 | Q4_K | 8 super-blocks | K-quant 8-wide GEMM |
| block_q2_Kx8 | Q2_K | 8 super-blocks | 2-bit K-quant 8-wide GEMM |
| block_q5_Kx8 | Q5_K | 8 super-blocks | 5-bit K-quant 8-wide GEMM |
| block_q6_Kx8 | Q6_K | 8 super-blocks | 6-bit K-quant 8-wide GEMM |
| block_q8_Kx4 | Q8_K | 4 super-blocks | K-quant activation packing |
| block_iq4_nlx4 | IQ4_NL | 4 blocks | IQ4 4-wide operations |
| block_iq4_nlx8 | IQ4_NL | 8 blocks | IQ4 8-wide operations |

The repacking is performed through a dedicated buffer type (ggml_backend_cpu_repack_buffer_type) that transparently converts weights from standard to interleaved format when they are loaded into the repack buffer. Architecture-specific implementations exist for ARM (arch/arm/repack.cpp), x86 (arch/x86/repack.cpp), and RISC-V (arch/riscv/repack.cpp), each optimized for their respective instruction set.

An XOR mask transformation is applied during repacking for some formats: the nibbles in Q4_0 quants are converted from bias-offset form (values 0-15 representing -8 to +7) to two's-complement signed form by XORing each byte with 0x88, which flips the top bit of both nibbles. This eliminates a subtract-8 operation during unpacking in the GEMM kernel, saving one instruction per element in the inner loop.
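The equivalence behind this XOR can be checked exhaustively. The sketch below uses hypothetical helper names (`decode_biased`, `decode_signed`, `to_signed_form`) that do not come from GGML; it only shows that flipping the top bit of a biased nibble yields the 4-bit two's-complement encoding of the same value.

```cpp
#include <cstdint>

// Decode a 4-bit field stored with a +8 bias (0..15 meaning -8..+7).
inline int decode_biased(uint8_t nibble) {
    return static_cast<int>(nibble & 0x0F) - 8;
}

// Decode a 4-bit two's-complement field by sign-extending bit 3.
inline int decode_signed(uint8_t nibble) {
    const int v = nibble & 0x0F;
    return (v & 0x08) ? v - 16 : v;
}

// XOR with 0x88 flips the top bit of both nibbles packed in one byte.
inline uint8_t to_signed_form(uint8_t packed) {
    return static_cast<uint8_t>(packed ^ 0x88);
}

// Returns true if, for every possible byte, both nibbles decode to the same
// value before (biased decode) and after (signed decode) the XOR.
inline bool xor_preserves_values() {
    for (int b = 0; b < 256; ++b) {
        const uint8_t before = static_cast<uint8_t>(b);
        const uint8_t after  = to_signed_form(before);
        if (decode_biased(before & 0x0F) != decode_signed(after & 0x0F)) return false;
        if (decode_biased(before >> 4)   != decode_signed(after >> 4))   return false;
    }
    return true;
}
```

Because the conversion is done once at repack time, a kernel can sign-extend the nibbles directly instead of subtracting 8 from each one.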

Usage

Weight repacking is applied as a one-time preprocessing step when loading model weights:

  • Model loading with repack buffer: When an application allocates weights into a repack buffer type, the weights are automatically converted from standard block format to the interleaved format matching the current hardware. This happens once at load time, amortized over all subsequent inference calls.
  • GEMM/GEMV kernel selection: The tiled GEMM/GEMV kernels (e.g., ggml_gemm_q4_0_4x4_q8_0, ggml_gemm_q4_0_8x8_q8_0) expect inputs in the corresponding repacked format. In a suffix such as "4x4" or "8x8", the first number is the count of interleaved blocks (4 or 8) and the second is the interleave granularity (blck_size_interleave).
  • Platform-specific optimization: The ARM repack path uses NEON instructions for fast interleaving, the x86 path uses AVX/AVX-512 shuffle instructions, and the RISC-V path uses vector gather operations. Each produces the same logical interleaved format but with architecture-optimal conversion code.
  • Activation repacking: In addition to weight repacking, the activation vectors (inputs) are quantized into interleaved Q8_0 format (ggml_quantize_mat_q8_0_4x4, ggml_quantize_mat_q8_0_4x8) on the fly during inference to match the kernel's expected layout.
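The activation-side packing in the last item can be sketched as follows. This is an illustrative scalar version under simplifying assumptions, not the GGML routine: `BlockQ8_0x4Sketch` and `quantize_q8_0x4` are hypothetical names, and the scales are kept as `float` where GGML stores `ggml_half`.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr int QK8_0 = 32;  // values per Q8_0 block

// Hypothetical interleaved activation block: four scales, then 128 int8 quants.
struct BlockQ8_0x4Sketch {
    float  d[4];
    int8_t qs[4 * QK8_0];
};

// Quantize one 32-value block from each of four rows to int8 and interleave
// the results in chunks of `interleave` values.
inline BlockQ8_0x4Sketch quantize_q8_0x4(const float rows[4][QK8_0], int interleave) {
    BlockQ8_0x4Sketch out{};
    float inv[4];
    for (int r = 0; r < 4; ++r) {
        float amax = 0.0f;  // largest magnitude in this row's block
        for (int j = 0; j < QK8_0; ++j) {
            amax = std::max(amax, std::fabs(rows[r][j]));
        }
        out.d[r] = amax / 127.0f;
        inv[r]   = out.d[r] != 0.0f ? 1.0f / out.d[r] : 0.0f;
    }
    for (int i = 0; i < 4 * QK8_0; ++i) {
        const int chunk = i / interleave;                          // which chunk of the output
        const int r     = chunk % 4;                               // source row for this chunk
        const int j     = (chunk / 4) * interleave + i % interleave;  // index within that row
        out.qs[i] = static_cast<int8_t>(std::lround(rows[r][j] * inv[r]));
    }
    return out;
}
```

The chunk arithmetic is the same as on the weight side, so a kernel can walk the weight and activation buffers in lockstep.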

Theoretical Basis

Data Layout for SIMD Throughput

SIMD instructions load contiguous memory into vector registers. In a naive layout where all elements of one matrix row are contiguous, a SIMD load gets data for one output element at a time. In an interleaved layout where elements from N rows are interleaved, a single SIMD load retrieves data contributing to N output elements simultaneously. For a 128-bit NEON register processing 4-bit values, interleaving 4 rows means each vector load provides 4 partial dot products in parallel. For 512-bit AVX-512, interleaving 8 rows provides 8 partial dot products per load. This directly maps to the hardware's ability to compute multiple output elements per clock cycle.
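A scalar model makes the access pattern concrete. In the sketch below, which uses plain int8 weights and hypothetical names (`Rows4`, `interleave4`, `gemv4_interleaved`) rather than any GGML type, each group of four adjacent weights plays the role of one vector load: a single contiguous read feeds four accumulators, one per output row.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using Rows4 = std::array<std::vector<int8_t>, 4>;

// Interleave four equally long rows element by element:
// r0[0], r1[0], r2[0], r3[0], r0[1], r1[1], ...
inline std::vector<int8_t> interleave4(const Rows4& rows) {
    const std::size_t n = rows[0].size();
    std::vector<int8_t> out(4 * n);
    for (std::size_t k = 0; k < n; ++k) {
        for (std::size_t r = 0; r < 4; ++r) {
            out[4 * k + r] = rows[r][k];
        }
    }
    return out;
}

// Walk the interleaved weights front to back. Each step k touches four
// adjacent weights and updates all four output accumulators.
inline std::array<int32_t, 4> gemv4_interleaved(const std::vector<int8_t>& w,
                                                const std::vector<int8_t>& x) {
    std::array<int32_t, 4> acc{};
    for (std::size_t k = 0; k < x.size(); ++k) {
        for (std::size_t r = 0; r < 4; ++r) {
            acc[r] += static_cast<int32_t>(w[4 * k + r]) * x[k];
        }
    }
    return acc;
}
```

In a real kernel the inner loop over `r` would collapse into one SIMD multiply-accumulate over a loaded vector, which is only possible because the four weights are adjacent in memory.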

Tiled Matrix Multiplication

Tiled (blocked) matrix multiplication partitions the output matrix into small tiles (e.g., 4x4 or 8x8 elements) and computes each tile using a sequence of vector dot product instructions. The tile dimensions are chosen to fit the SIMD register file: AArch64 NEON and AVX-512 each expose 32 vector registers, and these must hold the tile's accumulators together with the weight and activation operands being multiplied, so the tile is sized to keep all accumulators resident without spilling. The repacked weight format ensures that the data needed for one tile is laid out contiguously in memory, enabling sequential access patterns that are cache-friendly and avoid gather operations.
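The tiling idea can be illustrated with a generic float matrix multiply. This is a minimal sketch of the technique, not a GGML kernel: `matmul_tiled` and `TILE` are hypothetical names, the matrices are row-major, and M and N are assumed to be multiples of the tile size.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t TILE = 4;

// C (M x N) = A (M x K) * B (K x N), all row-major.
// Assumes M and N are multiples of TILE.
inline std::vector<float> matmul_tiled(const std::vector<float>& A,
                                       const std::vector<float>& B,
                                       std::size_t M, std::size_t N, std::size_t K) {
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t i0 = 0; i0 < M; i0 += TILE) {
        for (std::size_t j0 = 0; j0 < N; j0 += TILE) {
            // The accumulators for one tile stay in this small local array,
            // standing in for a block of vector registers.
            float acc[TILE][TILE] = {};
            for (std::size_t k = 0; k < K; ++k) {
                for (std::size_t i = 0; i < TILE; ++i) {
                    for (std::size_t j = 0; j < TILE; ++j) {
                        acc[i][j] += A[(i0 + i) * K + k] * B[k * N + (j0 + j)];
                    }
                }
            }
            // Write the finished tile back once.
            for (std::size_t i = 0; i < TILE; ++i) {
                for (std::size_t j = 0; j < TILE; ++j) {
                    C[(i0 + i) * N + (j0 + j)] = acc[i][j];
                }
            }
        }
    }
    return C;
}
```

Each output element is written once per tile rather than once per `k` step, which is the property that register-resident accumulators provide in the hardware kernels.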

Compile-Time Size Verification

The repacked block structures use C++ templates (template <int K, int N> struct block) and static_assert statements to verify at compile time that the structure sizes match expectations. For example, block<4,4> (Q4_0 interleaved 4-wide) must have exactly 4 * sizeof(ggml_half) + QK8_0 * 2 bytes. This catches alignment or padding errors at compile time rather than at runtime, a critical safety measure for binary data formats.
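A self-contained sketch of this pattern is shown below. The names `half_bits`, `qk_0`, and `block_sketch` are hypothetical stand-ins; the sketch assumes a 2-byte scale type and 32 values per base block, and the array-size expression is an illustration rather than a copy of the GGML definition.

```cpp
#include <cstdint>

using half_bits = uint16_t;  // stand-in for ggml_half (assumed 2 bytes)
constexpr int QK8_0 = 32;    // values per Q8_0 block

// Values per base block for a K-bit type; assumed to be 32 for both 4 and 8 bits.
template <int K> constexpr int qk_0() { return 32; }

// K = bits per quantized value, N = number of interleaved blocks.
template <int K, int N> struct block_sketch {
    half_bits d[N];                         // N scale factors
    int8_t    qs[(qk_0<K>() * N * K) / 8];  // N blocks of K-bit quants, in bytes
};

// The build fails if padding or a wrong array size changes the layout.
static_assert(sizeof(block_sketch<4, 4>) == 4 * sizeof(half_bits) + QK8_0 * 2,
              "unexpected size for 4-bit, 4-wide block");
static_assert(sizeof(block_sketch<4, 8>) == 8 * sizeof(half_bits) + QK8_0 * 4,
              "unexpected size for 4-bit, 8-wide block");
static_assert(sizeof(block_sketch<8, 4>) == 4 * sizeof(half_bits) + QK8_0 * 4,
              "unexpected size for 8-bit, 4-wide block");
```

Under these assumptions the 4-bit, 4-wide block is 8 bytes of scales plus 64 bytes of quants, 72 bytes in total.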

Amortized Repacking Cost

Repacking has a one-time cost proportional to the model size: it is a single linear pass that reads and rewrites each weight block once. For a 7B-parameter model in Q4_0 format (~3.5 GB), this cost is typically small compared to the total load time from disk, although the exact figure depends on memory bandwidth and on the conversion code. The per-inference benefit, however, is a sustained throughput improvement on every matrix multiplication operation (potentially hundreds per token), making the amortized cost per token essentially zero.

Related Pages

Implemented By
