Implementation:Ggml org Llama cpp Llama Model Quantize

From Leeroopedia
Revision as of 12:40, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Ggml_org_Llama_cpp_Llama_Model_Quantize.md)
Implementation Name: Llama Model Quantize
Doc Type: API Doc
Topic: Model Quantization
Workflow: Model_Quantization
Category: Core Quantization
Repository: Ggml_org_Llama_cpp

Overview

Description

The llama_model_quantize() function is the public API entry point for quantizing GGUF model files. It accepts input and output file paths along with a configuration struct, and runs the full quantization pipeline: it loads the source model, determines per-tensor quantization types from the selected ftype and the mixed-precision rules, applies block-wise quantization with optional importance matrix weighting, and writes the quantized tensors to a new GGUF file. The function is a thin wrapper around llama_model_quantize_impl(), which spans approximately 550 lines of tensor iteration, type mapping, and parallel quantization logic.
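To make "block-wise quantization" concrete, here is a minimal sketch of the general idea: weights are split into fixed-size blocks, and each block stores one scale plus low-precision integers. The QuantBlock layout, the block size of 32, and the function names are assumptions chosen for illustration; this is not the ggml implementation or any of its actual formats.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical block layout for illustration only (not a ggml type).
constexpr std::size_t kBlockSize = 32;

struct QuantBlock {
    float       scale;               // per-block scale factor
    std::int8_t values[kBlockSize];  // quantized weights in [-127, 127]
};

// Quantize a row whose length is a multiple of kBlockSize.
std::vector<QuantBlock> quantize_row(const std::vector<float> & row) {
    std::vector<QuantBlock> blocks(row.size() / kBlockSize);
    for (std::size_t b = 0; b < blocks.size(); ++b) {
        const float * src = row.data() + b * kBlockSize;

        // The largest magnitude in the block determines the scale.
        float amax = 0.0f;
        for (std::size_t i = 0; i < kBlockSize; ++i) {
            amax = std::max(amax, std::fabs(src[i]));
        }
        const float scale = amax / 127.0f;
        const float inv   = scale > 0.0f ? 1.0f / scale : 0.0f;

        blocks[b].scale = scale;
        for (std::size_t i = 0; i < kBlockSize; ++i) {
            blocks[b].values[i] = static_cast<std::int8_t>(std::lround(src[i] * inv));
        }
    }
    return blocks;
}

// Reconstruct approximate floats from the quantized blocks.
std::vector<float> dequantize_row(const std::vector<QuantBlock> & blocks) {
    std::vector<float> out;
    out.reserve(blocks.size() * kBlockSize);
    for (const QuantBlock & blk : blocks) {
        for (std::size_t i = 0; i < kBlockSize; ++i) {
            out.push_back(blk.scale * static_cast<float>(blk.values[i]));
        }
    }
    return out;
}
```

Because each block carries its own scale, an outlier weight only degrades precision within its own block rather than across the whole tensor.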

Usage

This function is called by the llama-quantize command-line tool and can be called directly from C/C++ applications that need programmatic model quantization.

Code Reference

Source Location

  • Public API entry: src/llama-quant.cpp (lines 1057-1069)
  • Implementation: src/llama-quant.cpp (lines 482-1036) -- llama_model_quantize_impl()
  • Header declaration: include/llama.h (lines 613-617)

Signature

// Public API (include/llama.h)
// Returns 0 on success
LLAMA_API uint32_t llama_model_quantize(
        const char * fname_inp,
        const char * fname_out,
        const llama_model_quantize_params * params);

Wrapper definition and internal implementation:

// src/llama-quant.cpp

// Internal implementation; reports failure by throwing
static void llama_model_quantize_impl(
        const std::string & fname_inp,
        const std::string & fname_out,
        const llama_model_quantize_params * params);

// Public wrapper: converts any std::exception into a non-zero return code
uint32_t llama_model_quantize(
        const char * fname_inp,
        const char * fname_out,
        const llama_model_quantize_params * params) {
    try {
        llama_model_quantize_impl(fname_inp, fname_out, params);
    } catch (const std::exception & err) {
        LLAMA_LOG_ERROR("%s: failed to quantize: %s\n", __func__, err.what());
        return 1;
    }
    return 0;
}
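The wrapper shown above follows a common pattern for C-compatible APIs: the internal function signals failure by throwing, and the public function catches the exception and returns a status code so that no exception crosses the API boundary. The following is a standalone sketch of that pattern; do_work() and do_work_impl() are hypothetical names and are not part of llama.cpp.

```cpp
#include <cstdint>
#include <cstdio>
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for an internal implementation that throws on failure.
static void do_work_impl(const std::string & fname_inp) {
    if (fname_inp.empty()) {
        throw std::runtime_error("input path is empty");
    }
    // ... real work would happen here ...
}

// C-style entry point: 0 on success, 1 on failure, no exception escapes.
extern "C" std::uint32_t do_work(const char * fname_inp) {
    try {
        // Treat a null pointer as an empty path so it fails cleanly.
        do_work_impl(fname_inp ? fname_inp : "");
    } catch (const std::exception & err) {
        std::fprintf(stderr, "%s: failed: %s\n", __func__, err.what());
        return 1;
    }
    return 0;
}
```

Callers of such an API only need to compare the return value against 0; the error detail goes to the log rather than to the caller.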

Import

#include "llama.h"

I/O Contract

Direction | Type | Description
Input (fname_inp) | const char * | Path to the source GGUF model file (typically F32 or F16 precision)
Input (fname_out) | const char * | Path for the output quantized GGUF model file
Input (params) | const llama_model_quantize_params * | Configuration struct specifying quantization type, threading, importance matrix, and other options
Output | uint32_t | Return code: 0 on success, 1 on failure
Side Effect | File system | Creates a new GGUF file at fname_out containing quantized model weights

Internal processing pipeline (within llama_model_quantize_impl):

  1. Type resolution -- Maps llama_ftype to ggml_type via a switch statement covering all 30+ quantization types
  2. Model loading -- Opens the input GGUF file using llama_model_loader with memory mapping (on Linux/Windows)
  3. Architecture parsing -- Loads model hyperparameters to determine tensor roles and mixed-precision assignments
  4. Importance matrix validation -- If an imatrix is provided, validates all values are finite
  5. GGUF output setup -- Initializes the output context, copies metadata, updates file type and quantization version
  6. Tensor iteration -- For each tensor in the model:
    • Determines the target quantization type based on tensor name, role, and ftype rules
    • Loads the tensor data from the input file
    • Applies quantization using the ggml quantization functions with optional importance weighting
    • Validates the quantized data via dequantization check
    • Writes the quantized tensor to the output file
  7. Finalization -- Closes the output file and reports compression statistics
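Step 4 of the pipeline above (importance matrix validation) can be sketched as a scan for non-finite values. The container type matches the one used in Example 2 on this page; the function name and the choice to return the offending tensor name are assumptions for illustration, not the code in llama-quant.cpp.

```cpp
#include <cmath>
#include <string>
#include <unordered_map>
#include <vector>

using imatrix_map = std::unordered_map<std::string, std::vector<float>>;

// Hypothetical helper: returns the name of a tensor entry that contains a
// NaN or infinite value, or an empty string if every value is finite.
std::string first_non_finite_entry(const imatrix_map & imatrix) {
    for (const auto & kv : imatrix) {
        for (float v : kv.second) {
            if (!std::isfinite(v)) {
                return kv.first;
            }
        }
    }
    return std::string();
}
```

Running a check like this before quantization starts means a corrupt importance matrix is rejected up front instead of silently producing bad weights partway through the tensor loop.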

Usage Examples

Example 1: Basic quantization to Q4_K_M

#include "llama.h"
#include <cstdio>  // fprintf, printf

int main() {
    llama_model_quantize_params params = llama_model_quantize_default_params();
    params.ftype = LLAMA_FTYPE_MOSTLY_Q4_K_M;
    params.nthread = 8;

    uint32_t result = llama_model_quantize(
        "models/llama-7b-f16.gguf",
        "models/llama-7b-q4_k_m.gguf",
        &params
    );

    if (result != 0) {
        fprintf(stderr, "Quantization failed\n");
        return 1;
    }
    printf("Quantization completed successfully\n");
    return 0;
}

Example 2: Quantization with importance matrix

#include "llama.h"
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    // Assume imatrix_data has been loaded from a previously generated imatrix file
    std::unordered_map<std::string, std::vector<float>> imatrix_data;
    // ... load imatrix_data ...

    llama_model_quantize_params params = llama_model_quantize_default_params();
    params.ftype = LLAMA_FTYPE_MOSTLY_IQ4_XS;
    params.imatrix = &imatrix_data;  // must stay alive for the duration of the call

    uint32_t result = llama_model_quantize(
        "model-f16.gguf",
        "model-iq4_xs.gguf",
        &params
    );

    if (result != 0) {
        fprintf(stderr, "Quantization failed\n");
        return 1;
    }
    return 0;
}
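As a rough illustration of what importance weighting can mean, the sketch below scores candidate quantization scales by a weighted squared error, so that values with larger importance weights dominate the choice. This is a simplified model written for this page under stated assumptions; the function names are hypothetical and it does not reproduce the ggml quantization routines.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Weighted squared rounding error for a given scale:
//   sum_i w[i] * (x[i] - scale * round(x[i] / scale))^2
float weighted_error(const std::vector<float> & x,
                     const std::vector<float> & w,
                     float scale) {
    float err = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const float q    = std::round(x[i] / scale) * scale;
        const float diff = x[i] - q;
        err += w[i] * diff * diff;
    }
    return err;
}

// Pick the candidate scale with the lowest weighted error.
// Assumes candidates is non-empty and all scales are positive.
float best_scale(const std::vector<float> & x,
                 const std::vector<float> & w,
                 const std::vector<float> & candidates) {
    float best     = candidates.front();
    float best_err = std::numeric_limits<float>::max();
    for (float s : candidates) {
        const float e = weighted_error(x, w, s);
        if (e < best_err) {
            best_err = e;
            best     = s;
        }
    }
    return best;
}
```

With the same values but different weights, the selected scale changes: the scale that reproduces the heavily weighted value most accurately wins, at the cost of larger error on the lightly weighted one.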

Example 3: Copy-only mode (format conversion without quantization)

llama_model_quantize_params params = llama_model_quantize_default_params();
params.only_copy = true;

llama_model_quantize("model-split-00001.gguf", "model-merged.gguf", &params);

Example 4: Command-line usage via llama-quantize

# Basic quantization
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# With importance matrix
./llama-quantize --imatrix imatrix.gguf model-f16.gguf model-iq3_m.gguf IQ3_M

# With custom thread count (optional trailing positional argument)
./llama-quantize model-f16.gguf model-q8_0.gguf Q8_0 16
