Implementation:Ggml org Llama cpp Llama Model Quantize

Field	Value
Implementation Name	Llama Model Quantize
Doc Type	API Doc
Topic	Model Quantization
Workflow	Model_Quantization
Category	Core Quantization
Repository	Ggml_org_Llama_cpp

Overview

Description

The llama_model_quantize() function is the public API entry point for quantizing GGUF model files. It accepts input and output file paths along with a configuration struct, and performs the full quantization pipeline: loading the source model, determining per-tensor quantization types based on the selected ftype and mixed-precision rules, applying block-wise quantization with optional importance matrix weighting, and writing the quantized tensors to a new GGUF file. The function delegates to llama_model_quantize_impl(), which contains the full implementation spanning approximately 550 lines of tensor iteration, type mapping, and parallel quantization logic.

Usage

This function is called by the llama-quantize command-line tool and can be called directly from C/C++ applications that need programmatic model quantization.

Code Reference

Source Location

Public API entry: src/llama-quant.cpp (lines 1057-1069)
Implementation: src/llama-quant.cpp (lines 482-1036) -- llama_model_quantize_impl()
Header declaration: include/llama.h (lines 613-617)

Signature

// Public API (include/llama.h)
// Returns 0 on success
LLAMA_API uint32_t llama_model_quantize(
        const char * fname_inp,
        const char * fname_out,
        const llama_model_quantize_params * params);

Wrapper definition and internal implementation:

// src/llama-quant.cpp
uint32_t llama_model_quantize(
        const char * fname_inp,
        const char * fname_out,
        const llama_model_quantize_params * params) {
    try {
        llama_model_quantize_impl(fname_inp, fname_out, params);
    } catch (const std::exception & err) {
        LLAMA_LOG_ERROR("%s: failed to quantize: %s\n", __func__, err.what());
        return 1;
    }
    return 0;
}

// Internal implementation (defined earlier in the same file; reports failure by throwing)
static void llama_model_quantize_impl(
        const std::string & fname_inp,
        const std::string & fname_out,
        const llama_model_quantize_params * params);
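The wrapper above turns any exception thrown by the implementation into a nonzero return code, so no C++ exception crosses the C API boundary. A minimal sketch of that pattern follows. The names quantize_sketch and quantize_impl_sketch are hypothetical stand-ins and are not part of llama.cpp, and the empty-path check is only there to give the sketch a failure case.

```cpp
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the internal implementation: signals failure by throwing.
static void quantize_impl_sketch(const std::string & fname_inp, const std::string & fname_out) {
    if (fname_inp.empty() || fname_out.empty()) {
        throw std::runtime_error("empty file path");
    }
    // ... the real work (load, quantize, write) would happen here ...
}

// C-style entry point: catches everything derived from std::exception,
// logs the message, and maps it to the return code 1. Success returns 0.
uint32_t quantize_sketch(const char * fname_inp, const char * fname_out) {
    try {
        quantize_impl_sketch(fname_inp, fname_out);
    } catch (const std::exception & err) {
        std::fprintf(stderr, "%s: failed to quantize: %s\n", __func__, err.what());
        return 1;
    }
    return 0;
}
```

Callers only ever see 0 or 1. The reason for a failure is available in the log output, not in the return value.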

Import

#include "llama.h"

I/O Contract

Direction	Type	Description
Input (fname_inp)	`const char *`	Path to the source GGUF model file (typically F32 or F16 precision)
Input (fname_out)	`const char *`	Path for the output quantized GGUF model file
Input (params)	`const llama_model_quantize_params *`	Configuration struct specifying quantization type, threading, importance matrix, and other options
Output	`uint32_t`	Return code: 0 on success, 1 on failure
Side Effect	File system	Creates a new GGUF file at `fname_out` containing quantized model weights

Internal processing pipeline (within llama_model_quantize_impl):

1. Type resolution -- Maps llama_ftype to ggml_type via a switch statement covering all 30+ quantization types
2. Model loading -- Opens the input GGUF file using llama_model_loader with memory mapping (on Linux/Windows)
3. Architecture parsing -- Loads model hyperparameters to determine tensor roles and mixed-precision assignments
4. Importance matrix validation -- If an imatrix is provided, validates all values are finite
5. GGUF output setup -- Initializes the output context, copies metadata, updates file type and quantization version
6. Tensor iteration -- For each tensor in the model:
   - Determines the target quantization type based on tensor name, role, and ftype rules
   - Loads the tensor data from the input file
   - Applies quantization using the ggml quantization functions with optional importance weighting
   - Validates the quantized data via dequantization check
   - Writes the quantized tensor to the output file
7. Finalization -- Closes the output file and reports compression statistics
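To make the "block-wise quantization" step more concrete, the sketch below quantizes floats in blocks of 32 values to 8-bit integers with one scale per block, in the spirit of the Q8_0 format. It is an illustration under stated assumptions, not the ggml code: BlockQ8, quantize_blocks, and dequantize_blocks are hypothetical names, the scale is kept as a 32-bit float for simplicity, and importance weighting is left out.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

constexpr std::size_t kBlockSize = 32; // values per block (assumed, Q8_0-like)

// One quantized block: a shared scale plus one signed 8-bit value per weight.
struct BlockQ8 {
    float  scale;
    int8_t q[kBlockSize];
};

// Quantize src block by block; src.size() must be a multiple of kBlockSize.
std::vector<BlockQ8> quantize_blocks(const std::vector<float> & src) {
    if (src.size() % kBlockSize != 0) {
        throw std::invalid_argument("input length must be a multiple of the block size");
    }
    std::vector<BlockQ8> out(src.size() / kBlockSize);
    for (std::size_t b = 0; b < out.size(); ++b) {
        const float * x = src.data() + b * kBlockSize;
        float amax = 0.0f; // largest magnitude in this block
        for (std::size_t i = 0; i < kBlockSize; ++i) {
            amax = std::max(amax, std::fabs(x[i]));
        }
        const float d  = amax / 127.0f;                 // scale maps amax to 127
        const float id = d != 0.0f ? 1.0f / d : 0.0f;   // avoid division by zero
        out[b].scale = d;
        for (std::size_t i = 0; i < kBlockSize; ++i) {
            out[b].q[i] = static_cast<int8_t>(std::lround(x[i] * id));
        }
    }
    return out;
}

// Reconstruct approximate floats: value = scale * quantized integer.
std::vector<float> dequantize_blocks(const std::vector<BlockQ8> & blocks) {
    std::vector<float> out;
    out.reserve(blocks.size() * kBlockSize);
    for (const BlockQ8 & blk : blocks) {
        for (std::size_t i = 0; i < kBlockSize; ++i) {
            out.push_back(blk.scale * static_cast<float>(blk.q[i]));
        }
    }
    return out;
}
```

Because each block carries its own scale, one outlier only degrades the precision of the 32 values that share its block. The round-trip error per value is at most about half the block's scale.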

Usage Examples

Example 1: Basic quantization to Q4_K_M

#include "llama.h"
#include <cstdio>

int main() {
    llama_model_quantize_params params = llama_model_quantize_default_params();
    params.ftype = LLAMA_FTYPE_MOSTLY_Q4_K_M;
    params.nthread = 8;

    uint32_t result = llama_model_quantize(
        "models/llama-7b-f16.gguf",
        "models/llama-7b-q4_k_m.gguf",
        &params
    );

    if (result != 0) {
        fprintf(stderr, "Quantization failed\n");
        return 1;
    }
    printf("Quantization completed successfully\n");
    return 0;
}

Example 2: Quantization with importance matrix

#include "llama.h"
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    // Assume imatrix_data has been loaded from a previously generated imatrix file
    std::unordered_map<std::string, std::vector<float>> imatrix_data;
    // ... load imatrix_data ...

    llama_model_quantize_params params = llama_model_quantize_default_params();
    params.ftype   = LLAMA_FTYPE_MOSTLY_IQ4_XS;
    params.imatrix = &imatrix_data; // must stay alive for the duration of the call

    uint32_t result = llama_model_quantize(
        "model-f16.gguf",
        "model-iq4_xs.gguf",
        &params
    );
    if (result != 0) {
        fprintf(stderr, "Quantization failed\n");
        return 1;
    }
    return 0;
}
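Since the pipeline validates that every importance matrix value is finite, a caller may want to run the same check before starting a long quantization job. The helper below is a minimal sketch of such a pre-check. The name first_non_finite_entry is hypothetical and not part of llama.cpp, and the map layout simply mirrors the imatrix_data type used in Example 2.

```cpp
#include <cmath>
#include <string>
#include <unordered_map>
#include <vector>

using imatrix_t = std::unordered_map<std::string, std::vector<float>>;

// Returns the name of a tensor whose importance values contain NaN or
// infinity, or an empty string if every value in the matrix is finite.
std::string first_non_finite_entry(const imatrix_t & imatrix) {
    for (const auto & entry : imatrix) {
        for (float v : entry.second) {
            if (!std::isfinite(v)) {
                return entry.first;
            }
        }
    }
    return std::string();
}
```

A caller could skip quantization and report the offending tensor name when the result is non-empty. Note that unordered_map iteration order is unspecified, so with several bad entries the one reported can vary.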

Example 3: Copy-only mode (format conversion without quantization)

llama_model_quantize_params params = llama_model_quantize_default_params();
params.only_copy = true;

llama_model_quantize("model-split-00001.gguf", "model-merged.gguf", &params);

Example 4: Command-line usage via llama-quantize

# Basic quantization
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# With importance matrix
./llama-quantize --imatrix imatrix.gguf model-f16.gguf model-iq3_m.gguf IQ3_M

# With custom thread count (nthreads is the optional trailing positional argument)
./llama-quantize model-f16.gguf model-q8_0.gguf Q8_0 16
