Principle:Ggml org Ggml GGUF Tensor Serialization

From Leeroopedia
Revision as of 18:18, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Ggml_org_Ggml_GGUF_Tensor_Serialization.md)


Metadata

Page Type: Principle
Knowledge Sources: Repo (GGML)
Domains: Model_Serialization, File_Format
Last Updated: 2025-05-15 12:00 GMT

Overview

GGUF tensor serialization writes tensor data (weights and biases) into a binary container file together with its metadata, enabling efficient storage, distribution, and memory-mapped loading of machine learning model parameters.

Description

Tensor serialization in the GGUF format requires storing each tensor's name, number of dimensions, dimension sizes, data type, and raw data with proper alignment. The design separates tensor information (a lightweight directory of all tensors) from tensor data (the bulk binary payload), so a reader parses only the small header and directory before locating and loading individual tensors, without reading the rest of the payload.

Tensor Info Section

The tensor info section serves as a directory or table of contents for all tensors stored in the file. For each tensor, it stores:

  • Name: A string identifier (e.g., blk.0.attn_q.weight) used to match tensors to their role in a model architecture.
  • n_dims: The number of dimensions (1 for biases, 2 for weight matrices, etc.).
  • Dimensions (ne[]): An array of up to 4 dimension sizes describing the shape of the tensor.
  • Type: The GGML data type, which determines the encoding of the raw data. This supports all GGML types including floating-point formats (F32, F16) and quantized formats (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1, Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, and others).
  • Offset: A byte offset relative to the start of the tensor data section (not the start of the file), enabling random access to any tensor without loading all preceding data.

This structure allows a reader to inspect and selectively load tensors -- for instance, loading only the embedding layer or a specific transformer block -- without reading the entire file into memory.
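The fields listed above can be sketched as a small encoder and decoder. This is a minimal illustration, assuming the GGUF version 3 little-endian layout (a length-prefixed UTF-8 name, a 32-bit dimension count, 64-bit dimension sizes, a 32-bit type id, and a 64-bit offset); the helper names pack_tensor_info and unpack_tensor_info are hypothetical and not part of GGML.

```python
import struct

# A few ggml_type enum values; the full enum lives in ggml.h.
GGML_TYPE_F32 = 0
GGML_TYPE_F16 = 1
GGML_TYPE_Q4_0 = 2
GGML_TYPE_Q8_0 = 8

MAX_DIMS = 4  # the tensor info stores at most 4 dimension sizes


def pack_gguf_string(text: str) -> bytes:
    """Encode a string as a uint64 byte length followed by UTF-8 bytes."""
    data = text.encode("utf-8")
    return struct.pack("<Q", len(data)) + data


def pack_tensor_info(name, shape, ggml_type, offset) -> bytes:
    """Hypothetical helper: serialize one tensor info entry."""
    if not 1 <= len(shape) <= MAX_DIMS:
        raise ValueError("shape must have between 1 and 4 dimensions")
    out = pack_gguf_string(name)
    out += struct.pack("<I", len(shape))            # n_dims
    out += struct.pack(f"<{len(shape)}Q", *shape)   # ne[] sizes
    out += struct.pack("<IQ", ggml_type, offset)    # type id, data offset
    return out


def unpack_tensor_info(buf: bytes, pos: int = 0):
    """Hypothetical helper: parse one entry, return (info, next_pos)."""
    (name_len,) = struct.unpack_from("<Q", buf, pos)
    pos += 8
    name = buf[pos:pos + name_len].decode("utf-8")
    pos += name_len
    (n_dims,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    shape = struct.unpack_from(f"<{n_dims}Q", buf, pos)
    pos += 8 * n_dims
    ggml_type, offset = struct.unpack_from("<IQ", buf, pos)
    pos += 12
    info = {"name": name, "shape": shape, "type": ggml_type, "offset": offset}
    return info, pos


if __name__ == "__main__":
    entry = pack_tensor_info("blk.0.attn_q.weight", (4096, 4096), GGML_TYPE_F16, 0)
    print(unpack_tensor_info(entry))
```

Because every entry records its own length-prefixed name and dimension count, a reader can walk the directory entry by entry using the returned position, without touching the tensor data.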

Alignment

Tensor data is padded to an alignment boundary to enable efficient memory-mapped access. The boundary is GGUF_DEFAULT_ALIGNMENT (32 bytes) unless the file overrides it with the general.alignment metadata key. When a GGUF file is memory-mapped, aligned data can be read in place without costly unaligned memory operations. Padding is inserted between the end of the metadata/tensor-info header and the start of the tensor data section, and each tensor's offset within the data section is itself rounded up to a multiple of the alignment.
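The offset calculation can be sketched as rounding a running cursor up to the next multiple of the alignment. The function names align_up and assign_offsets are hypothetical; only the 32-byte default comes from the text above.

```python
GGUF_DEFAULT_ALIGNMENT = 32


def align_up(offset: int, alignment: int = GGUF_DEFAULT_ALIGNMENT) -> int:
    """Round offset up to the next multiple of alignment."""
    return offset + (alignment - offset % alignment) % alignment


def assign_offsets(sizes, alignment: int = GGUF_DEFAULT_ALIGNMENT):
    """Hypothetical helper: place tensors back to back with padding.

    sizes is a list of tensor byte sizes. Returns (offsets, total_bytes),
    where each offset is relative to the start of the data section.
    """
    offsets = []
    cursor = 0
    for size in sizes:
        offsets.append(cursor)
        cursor = align_up(cursor + size, alignment)
    return offsets, cursor


if __name__ == "__main__":
    print(assign_offsets([10, 64, 5]))
```

The same align_up step applies once more to the end of the tensor info array to find where the data section itself begins.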

Supported Data Types

The GGUF tensor serialization format supports all GGML tensor types, including:

  • Floating-point: F32, F16, BF16
  • Quantized (legacy): Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1
  • Quantized (k-quant): Q2_K, Q3_K, Q4_K, Q5_K, Q6_K (names such as Q4_K_M or Q3_K_S denote llama.cpp quantization presets that mix these tensor types, not tensor types themselves)
  • Integer: I8, I16, I32, I64

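Each quantized type encodes fixed-size blocks of values with an associated scale factor and, for some types, an offset (minimum). This achieves significant compression with controlled accuracy loss: a Q4_0 block, for example, stores 32 weights as 16 bytes of 4-bit values plus a 2-byte F16 scale, which is 18 bytes per block or 4.5 bits per weight.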
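The bits-per-weight arithmetic can be checked with a short calculation. The table below covers only a handful of types whose block layouts are simple to state, and the helper names are hypothetical; the check on total element count is a simplification of the per-row constraint a real writer would apply.

```python
# type name -> (elements per block, bytes per block)
BLOCK_LAYOUT = {
    "F32": (1, 4),
    "F16": (1, 2),
    "Q4_0": (32, 18),  # 2-byte F16 scale + 32 x 4-bit values (16 bytes)
    "Q4_1": (32, 20),  # 2-byte scale + 2-byte minimum + 16 bytes of values
    "Q8_0": (32, 34),  # 2-byte F16 scale + 32 x 8-bit values (32 bytes)
}


def bits_per_weight(type_name: str) -> float:
    """Average storage cost of one weight, in bits."""
    elements, nbytes = BLOCK_LAYOUT[type_name]
    return nbytes * 8 / elements


def tensor_nbytes(type_name: str, n_elements: int) -> int:
    """Byte size of a tensor with n_elements values of the given type."""
    elements, nbytes = BLOCK_LAYOUT[type_name]
    if n_elements % elements != 0:
        raise ValueError("element count must be a multiple of the block size")
    return n_elements // elements * nbytes


if __name__ == "__main__":
    for name in BLOCK_LAYOUT:
        print(name, bits_per_weight(name), tensor_nbytes(name, 4096 * 4096))
```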

Theoretical Basis

Binary Serialization with Metadata

Tensor serialization in GGUF follows a header-plus-payload pattern common in binary container formats. The file structure is:

  1. Header: Magic number, version, tensor count, metadata key-value count.
  2. Metadata key-value pairs: Arbitrary metadata (model architecture, vocabulary, hyperparameters).
  3. Tensor info array: One entry per tensor with name, shape, type, and offset.
  4. Alignment padding: Padding to the next alignment boundary.
  5. Tensor data: Concatenated raw tensor data, with each tensor's data located at its recorded offset.

This separation of metadata from data is a deliberate design choice that enables:

  • Streaming writes: Tensor info can be written first, then data appended sequentially.
  • Random access reads: Any tensor can be located by its offset without scanning the data section.
  • Memory mapping: The entire data section (or portions of it) can be memory-mapped for zero-copy loading.
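As an illustration of the first item in the file structure, the fixed-size header could be packed and parsed as follows. This sketch assumes the version 3 layout (a 4-byte magic, a 32-bit version, then 64-bit tensor and key-value counts, little-endian); pack_header and unpack_header are hypothetical names.

```python
import struct

GGUF_MAGIC = b"GGUF"
GGUF_VERSION = 3
HEADER_FORMAT = "<4sIQQ"  # magic, version, tensor count, metadata KV count


def pack_header(tensor_count: int, kv_count: int,
                version: int = GGUF_VERSION) -> bytes:
    """Hypothetical helper: build the fixed-size file header."""
    return struct.pack(HEADER_FORMAT, GGUF_MAGIC, version, tensor_count, kv_count)


def unpack_header(buf: bytes):
    """Hypothetical helper: return (version, tensor_count, kv_count)."""
    magic, version, tensor_count, kv_count = struct.unpack_from(HEADER_FORMAT, buf, 0)
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file: bad magic")
    return version, tensor_count, kv_count


if __name__ == "__main__":
    header = pack_header(tensor_count=2, kv_count=0)
    print(len(header), unpack_header(header))
```

Reading the two counts first tells a parser exactly how many key-value pairs and tensor info entries to consume before it reaches the alignment padding.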

Alignment for Memory-Mapped I/O

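Memory-mapped file access performs best when data is aligned to natural boundaries. The GGUF_DEFAULT_ALIGNMENT value (32 bytes) matches the width of a 256-bit AVX register and exceeds the 16-byte requirement of aligned SSE loads; aligned 512-bit AVX-512 loads require 64 bytes, which the default does not guarantee. Because the operating system maps files at page-aligned addresses, tensor data whose absolute file position is a multiple of 32 also lands at a 32-byte-aligned address in memory. Aligned tensor data can be passed directly to GGML computation kernels without intermediate copies or realignment.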
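A minimal sketch of an aligned, memory-mapped read using Python's standard mmap module. The demo file is a stand-in, not a valid GGUF file: it has 64 filler bytes where the header would be, followed by two small F32 tensors at aligned offsets. Both function names are hypothetical.

```python
import mmap
import os
import struct
import tempfile

ALIGNMENT = 32  # GGUF_DEFAULT_ALIGNMENT


def write_demo_file(path: str) -> int:
    """Write the stand-in file and return where its data section starts."""
    data_start = 64  # hypothetical header size, already a multiple of 32
    first = struct.pack("<3f", 1.0, 2.0, 3.0)  # 12 bytes at offset 0
    second = struct.pack("<2f", 4.0, 5.0)      # 8 bytes at offset 32
    with open(path, "wb") as f:
        f.write(b"\x00" * data_start)
        f.write(first.ljust(ALIGNMENT, b"\x00"))  # pad to the next boundary
        f.write(second)
    return data_start


def read_tensor_bytes(path: str, data_start: int, offset: int, nbytes: int) -> bytes:
    """Map the file and copy out only the bytes of one tensor."""
    start = data_start + offset
    if start % ALIGNMENT != 0:
        raise ValueError("tensor data is not aligned")
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing touches only the pages that hold this range.
            return mm[start:start + nbytes]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "demo.bin")
        start = write_demo_file(demo_path)
        raw = read_tensor_bytes(demo_path, start, offset=32, nbytes=8)
        print(struct.unpack("<2f", raw))
```

Slicing an mmap object copies the requested range into a bytes object; wrapping the mapping in a memoryview instead would avoid even that copy, at the cost of keeping the mapping open for as long as the view is used.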

Usage

GGUF tensor serialization is applied when:

  • Saving quantized models: After converting a model from a training framework (PyTorch, TensorFlow) to GGML's quantized formats, the tensors are serialized into a GGUF file for distribution and deployment.
  • Model distribution: GGUF files are the standard distribution format for models used with llama.cpp and other GGML-based inference engines.
  • Memory-mapped inference: At inference time, the GGUF file can be memory-mapped so that tensor data is paged in from disk on demand. This allows a model larger than available RAM to run, although throughput then depends on how often pages must be re-read from disk.
  • Incremental model loading: Using the tensor info directory, an application can selectively load specific layers or tensors, useful for layer-wise processing or model sharding.
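Selective loading can be sketched as a filter over the tensor info directory that yields absolute byte ranges to read. The TensorInfo record and select_ranges function are hypothetical; nbytes is assumed to have been computed from each tensor's shape and type beforehand.

```python
from typing import List, NamedTuple, Tuple


class TensorInfo(NamedTuple):
    name: str
    offset: int  # relative to the start of the data section
    nbytes: int  # assumed precomputed from shape and type


def select_ranges(infos: List[TensorInfo], prefix: str,
                  data_start: int) -> List[Tuple[str, int, int]]:
    """Return (name, begin, end) file ranges for tensors matching prefix."""
    return [
        (t.name, data_start + t.offset, data_start + t.offset + t.nbytes)
        for t in infos
        if t.name.startswith(prefix)
    ]


if __name__ == "__main__":
    directory = [
        TensorInfo("token_embd.weight", 0, 64),
        TensorInfo("blk.0.attn_q.weight", 64, 32),
        TensorInfo("blk.1.attn_q.weight", 96, 32),
    ]
    # The trailing dot keeps "blk.1." from also matching "blk.10.", "blk.11.", ...
    print(select_ranges(directory, "blk.0.", data_start=128))
```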
