Implementation:Ggml org Llama cpp Llama Model Quantize
| Field | Value |
|---|---|
| Implementation Name | Llama Model Quantize |
| Doc Type | API Doc |
| Topic | Model Quantization |
| Workflow | Model_Quantization |
| Category | Core Quantization |
| Repository | Ggml_org_Llama_cpp |
Overview
Description
The llama_model_quantize() function is the public API entry point for quantizing GGUF model files. It accepts input and output file paths along with a configuration struct, and performs the full quantization pipeline: loading the source model, determining per-tensor quantization types based on the selected ftype and mixed-precision rules, applying block-wise quantization with optional importance matrix weighting, and writing the quantized tensors to a new GGUF file. The function delegates to llama_model_quantize_impl(), which contains the full implementation spanning approximately 550 lines of tensor iteration, type mapping, and parallel quantization logic.
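To make "block-wise quantization" concrete, the following is a minimal sketch of absmax 8-bit quantization over fixed-size blocks. It is an illustration only, not ggml's implementation: the names `QuantBlock`, `quantize_blocks`, and `dequantize_blocks` are hypothetical, the scale is stored as a plain float, and importance-matrix weighting is omitted.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative block size (assumption for this sketch).
constexpr std::size_t kBlockSize = 32;

// Hypothetical stand-in for a quantized block: one scale plus 8-bit codes.
struct QuantBlock {
    float       scale;
    std::int8_t codes[kBlockSize];
};

// Quantize `values` block by block using absmax scaling.
// Assumes values.size() is a multiple of kBlockSize.
std::vector<QuantBlock> quantize_blocks(const std::vector<float> & values) {
    std::vector<QuantBlock> blocks(values.size() / kBlockSize);
    for (std::size_t b = 0; b < blocks.size(); ++b) {
        const float * src = values.data() + b * kBlockSize;

        // Largest magnitude in the block determines the scale.
        float amax = 0.0f;
        for (std::size_t i = 0; i < kBlockSize; ++i) {
            amax = std::fmax(amax, std::fabs(src[i]));
        }
        const float scale = amax / 127.0f;
        const float inv   = scale > 0.0f ? 1.0f / scale : 0.0f;

        blocks[b].scale = scale;
        for (std::size_t i = 0; i < kBlockSize; ++i) {
            // Round to the nearest integer code in [-127, 127].
            blocks[b].codes[i] = static_cast<std::int8_t>(std::lround(src[i] * inv));
        }
    }
    return blocks;
}

// Reconstruct approximate floats from the quantized blocks.
std::vector<float> dequantize_blocks(const std::vector<QuantBlock> & blocks) {
    std::vector<float> out;
    out.reserve(blocks.size() * kBlockSize);
    for (const QuantBlock & blk : blocks) {
        for (std::size_t i = 0; i < kBlockSize; ++i) {
            out.push_back(blk.scale * static_cast<float>(blk.codes[i]));
        }
    }
    return out;
}
```

Because each block carries its own scale, an outlier only degrades the precision of the block that contains it. The round-trip error per value is bounded by roughly half the block's scale.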
Usage
This function is called by the llama-quantize command-line tool and can be called directly from C/C++ applications that need programmatic model quantization.
Code Reference
Source Location
- Public API entry: src/llama-quant.cpp (lines 1057-1069)
- Implementation: src/llama-quant.cpp (lines 482-1036) -- llama_model_quantize_impl()
- Header declaration: include/llama.h (lines 613-617)
Signature
// Public API (include/llama.h)
// Returns 0 on success
LLAMA_API uint32_t llama_model_quantize(
        const char * fname_inp,
        const char * fname_out,
        const llama_model_quantize_params * params);
Internal implementation:
// src/llama-quant.cpp

// Internal implementation (declared before the public wrapper that calls it)
static void llama_model_quantize_impl(
        const std::string & fname_inp,
        const std::string & fname_out,
        const llama_model_quantize_params * params);

// Public wrapper: converts exceptions from the implementation into a return code
uint32_t llama_model_quantize(
        const char * fname_inp,
        const char * fname_out,
        const llama_model_quantize_params * params) {
    try {
        llama_model_quantize_impl(fname_inp, fname_out, params);
    } catch (const std::exception & err) {
        LLAMA_LOG_ERROR("%s: failed to quantize: %s\n", __func__, err.what());
        return 1;
    }
    return 0;
}
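The wrapper above follows a common pattern for C-compatible entry points: the implementation throws and the public function turns any exception into a numeric status. The sketch below reproduces that pattern in isolation so it can be compiled without llama.cpp. Both `quantize_impl_stub` and `quantize_entry_stub` are hypothetical stand-ins, and the empty-name check is an invented failure condition used only for demonstration.

```cpp
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the internal implementation: signals failure by throwing.
static void quantize_impl_stub(const std::string & fname_inp, const std::string & fname_out) {
    if (fname_inp.empty() || fname_out.empty()) {
        throw std::runtime_error("empty file name");
    }
    // ... the real work (load, quantize, write) would happen here ...
}

// C-style entry point: no exception escapes, failures become a non-zero code.
std::uint32_t quantize_entry_stub(const char * fname_inp, const char * fname_out) {
    try {
        quantize_impl_stub(fname_inp, fname_out);
    } catch (const std::exception & err) {
        std::fprintf(stderr, "%s: failed to quantize: %s\n", __func__, err.what());
        return 1;
    }
    return 0;
}
```

Keeping exceptions behind the boundary lets callers written in C, or bindings from other languages, check a plain integer instead of handling C++ exceptions.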
Import
#include "llama.h"
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input (fname_inp) | const char * | Path to the source GGUF model file (typically F32 or F16 precision) |
| Input (fname_out) | const char * | Path for the output quantized GGUF model file |
| Input (params) | const llama_model_quantize_params * | Configuration struct specifying quantization type, threading, importance matrix, and other options |
| Output | uint32_t | Return code: 0 on success, 1 on failure |
| Side Effect | File system | Creates a new GGUF file at fname_out containing quantized model weights |
Internal processing pipeline (within llama_model_quantize_impl):
- Type resolution -- Maps llama_ftype to ggml_type via a switch statement covering all 30+ quantization types
- Model loading -- Opens the input GGUF file using llama_model_loader with memory mapping (on Linux/Windows)
- Architecture parsing -- Loads model hyperparameters to determine tensor roles and mixed-precision assignments
- Importance matrix validation -- If an imatrix is provided, validates all values are finite
- GGUF output setup -- Initializes the output context, copies metadata, updates file type and quantization version
- Tensor iteration -- For each tensor in the model:
  - Determines the target quantization type based on tensor name, role, and ftype rules
  - Loads the tensor data from the input file
  - Applies quantization using the ggml quantization functions with optional importance weighting
  - Validates the quantized data via dequantization check
  - Writes the quantized tensor to the output file
- Finalization -- Closes the output file and reports compression statistics
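The importance matrix validation step in the pipeline above can be sketched as a scan for non-finite values. This is a minimal illustration under assumptions: the container type mirrors the one used in Example 2 below, `first_non_finite_entry` is a hypothetical helper name, and the actual implementation may report the problem differently.

```cpp
#include <cmath>
#include <string>
#include <unordered_map>
#include <vector>

// Assumed in-memory layout: tensor name -> per-column importance values.
using ImatrixData = std::unordered_map<std::string, std::vector<float>>;

// Returns the name of a tensor whose importance data contains NaN or
// infinity, or an empty string if every value is finite.
std::string first_non_finite_entry(const ImatrixData & imatrix) {
    for (const auto & entry : imatrix) {
        for (float value : entry.second) {
            if (!std::isfinite(value)) {
                return entry.first;
            }
        }
    }
    return std::string();
}
```

Checking up front avoids spending time quantizing a large model only to have corrupted importance weights distort the result.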
Usage Examples
Example 1: Basic quantization to Q4_K_M
#include "llama.h"
#include <cstdio>   // fprintf, printf

int main() {
    // Start from the library defaults, then override only what is needed
    llama_model_quantize_params params = llama_model_quantize_default_params();
    params.ftype   = LLAMA_FTYPE_MOSTLY_Q4_K_M;
    params.nthread = 8;

    uint32_t result = llama_model_quantize(
        "models/llama-7b-f16.gguf",
        "models/llama-7b-q4_k_m.gguf",
        &params
    );
    if (result != 0) {
        fprintf(stderr, "Quantization failed\n");
        return 1;
    }
    printf("Quantization completed successfully\n");
    return 0;
}
Example 2: Quantization with importance matrix
#include "llama.h"
#include <string>
#include <unordered_map>
#include <vector>

// Assume imatrix_data has been loaded from a previously generated imatrix file
std::unordered_map<std::string, std::vector<float>> imatrix_data;
// ... load imatrix_data ...

llama_model_quantize_params params = llama_model_quantize_default_params();
params.ftype   = LLAMA_FTYPE_MOSTLY_IQ4_XS;
params.imatrix = &imatrix_data;   // passed by pointer; must outlive the call

uint32_t result = llama_model_quantize(
    "model-f16.gguf",
    "model-iq4_xs.gguf",
    &params
);
Example 3: Copy-only mode (format conversion without quantization)
llama_model_quantize_params params = llama_model_quantize_default_params();
params.only_copy = true;
llama_model_quantize("model-split-00001.gguf", "model-merged.gguf", &params);
Example 4: Command-line usage via llama-quantize
# Basic quantization
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
# With importance matrix
./llama-quantize --imatrix imatrix.gguf model-f16.gguf model-iq3_m.gguf IQ3_M
# With custom thread count (optional trailing positional argument)
./llama-quantize model-f16.gguf model-q8_0.gguf Q8_0 16