Implementation:Ggml org Llama cpp Llama Model Quantize Params
| Field | Value |
|---|---|
| Implementation Name | Llama Model Quantize Params |
| Doc Type | API Doc |
| Topic | Model Quantization |
| Workflow | Model_Quantization |
| Category | Quantization Configuration |
| Repository | Ggml_org_Llama_cpp |
Overview
Description
The llama_model_quantize_params struct and the llama_model_quantize_default_params() function together define the configuration interface for model quantization in llama.cpp. The struct contains all parameters that control quantization behavior, including the target quantization type (ftype), threading configuration, importance matrix data, and fine-grained tensor type overrides. The default params function returns a pre-initialized struct with sensible defaults (Q5_1 quantization, auto-detected thread count, no importance matrix).
The llama_ftype enum lists the supported quantization types, from full-precision F32 down to sub-2-bit formats such as IQ1_S, IQ1_M, and the ternary TQ1_0.
Usage
Users obtain a default params struct, modify the fields they need to customize, and pass the struct to llama_model_quantize(). This pattern allows the API to evolve with new parameters while maintaining backward compatibility through defaults.
Code Reference
Source Location
- Params struct: include/llama.h (lines 382-396)
- Default params function: include/llama.h (line 421), implemented in src/llama-quant.cpp (lines 1037-1055)
- Ftype enum: include/llama.h (lines 115-157)
Signature
// Quantization parameters struct
typedef struct llama_model_quantize_params {
int32_t nthread; // number of threads to use for quantizing, if <=0 will use std::thread::hardware_concurrency()
enum llama_ftype ftype; // quantize to this llama_ftype
enum ggml_type output_tensor_type; // output tensor type
enum ggml_type token_embedding_type; // token embeddings tensor type
bool allow_requantize; // allow quantizing non-f32/f16 tensors
bool quantize_output_tensor; // quantize output.weight
bool only_copy; // only copy tensors - ftype, allow_requantize and quantize_output_tensor are ignored
bool pure; // quantize all tensors to the default type
bool keep_split; // quantize to the same number of shards
void * imatrix; // pointer to importance matrix data
void * kv_overrides; // pointer to vector containing overrides
void * tensor_types; // pointer to vector containing tensor types
void * prune_layers; // pointer to vector containing layer indices to prune
} llama_model_quantize_params;
// Default params factory function
LLAMA_API struct llama_model_quantize_params llama_model_quantize_default_params(void);
The llama_ftype enum defines available quantization types:
enum llama_ftype {
LLAMA_FTYPE_ALL_F32 = 0,
LLAMA_FTYPE_MOSTLY_F16 = 1, // except 1d tensors
LLAMA_FTYPE_MOSTLY_Q4_0 = 2, // except 1d tensors
LLAMA_FTYPE_MOSTLY_Q4_1 = 3, // except 1d tensors
LLAMA_FTYPE_MOSTLY_Q8_0 = 7, // except 1d tensors
LLAMA_FTYPE_MOSTLY_Q5_0 = 8, // except 1d tensors
LLAMA_FTYPE_MOSTLY_Q5_1 = 9, // except 1d tensors
LLAMA_FTYPE_MOSTLY_Q2_K = 10, // except 1d tensors
LLAMA_FTYPE_MOSTLY_Q3_K_S = 11, // except 1d tensors
LLAMA_FTYPE_MOSTLY_Q3_K_M = 12, // except 1d tensors
LLAMA_FTYPE_MOSTLY_Q3_K_L = 13, // except 1d tensors
LLAMA_FTYPE_MOSTLY_Q4_K_S = 14, // except 1d tensors
LLAMA_FTYPE_MOSTLY_Q4_K_M = 15, // except 1d tensors
LLAMA_FTYPE_MOSTLY_Q5_K_S = 16, // except 1d tensors
LLAMA_FTYPE_MOSTLY_Q5_K_M = 17, // except 1d tensors
LLAMA_FTYPE_MOSTLY_Q6_K = 18, // except 1d tensors
LLAMA_FTYPE_MOSTLY_IQ2_XXS = 19, // except 1d tensors
LLAMA_FTYPE_MOSTLY_IQ2_XS = 20, // except 1d tensors
LLAMA_FTYPE_MOSTLY_Q2_K_S = 21, // except 1d tensors
LLAMA_FTYPE_MOSTLY_IQ3_XS = 22, // except 1d tensors
LLAMA_FTYPE_MOSTLY_IQ3_XXS = 23, // except 1d tensors
LLAMA_FTYPE_MOSTLY_IQ1_S = 24, // except 1d tensors
LLAMA_FTYPE_MOSTLY_IQ4_NL = 25, // except 1d tensors
LLAMA_FTYPE_MOSTLY_IQ3_S = 26, // except 1d tensors
LLAMA_FTYPE_MOSTLY_IQ3_M = 27, // except 1d tensors
LLAMA_FTYPE_MOSTLY_IQ2_S = 28, // except 1d tensors
LLAMA_FTYPE_MOSTLY_IQ2_M = 29, // except 1d tensors
LLAMA_FTYPE_MOSTLY_IQ4_XS = 30, // except 1d tensors
LLAMA_FTYPE_MOSTLY_IQ1_M = 31, // except 1d tensors
LLAMA_FTYPE_MOSTLY_BF16 = 32, // except 1d tensors
LLAMA_FTYPE_MOSTLY_TQ1_0 = 36, // except 1d tensors
LLAMA_FTYPE_MOSTLY_TQ2_0 = 37, // except 1d tensors
LLAMA_FTYPE_MOSTLY_MXFP4_MOE = 38, // except 1d tensors
LLAMA_FTYPE_GUESSED = 1024, // not specified in the model file
};
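Callers often receive the target type as text (for example from a command-line argument) and need to map it onto an enum value. The following is a minimal sketch of such a lookup, not part of llama.h: `parse_ftype`, `ftype_name`, and the `sketch` namespace are hypothetical names, and the enum is a stand-in copy of five entries from the listing above so the block compiles on its own.

```cpp
#include <cassert>
#include <cstring>

namespace sketch {

// Stand-in copy of a few llama_ftype entries (values taken from the listing
// above) so the sketch compiles on its own. Real code would include "llama.h".
enum llama_ftype {
    LLAMA_FTYPE_MOSTLY_Q4_0   = 2,
    LLAMA_FTYPE_MOSTLY_Q8_0   = 7,
    LLAMA_FTYPE_MOSTLY_Q5_1   = 9,
    LLAMA_FTYPE_MOSTLY_Q4_K_M = 15,
    LLAMA_FTYPE_MOSTLY_IQ4_XS = 30,
};

// One row of the name-to-value table.
struct ftype_name {
    const char * name;
    llama_ftype  value;
};

static const ftype_name k_ftype_names[] = {
    { "Q4_0",   LLAMA_FTYPE_MOSTLY_Q4_0   },
    { "Q8_0",   LLAMA_FTYPE_MOSTLY_Q8_0   },
    { "Q5_1",   LLAMA_FTYPE_MOSTLY_Q5_1   },
    { "Q4_K_M", LLAMA_FTYPE_MOSTLY_Q4_K_M },
    { "IQ4_XS", LLAMA_FTYPE_MOSTLY_IQ4_XS },
};

// Hypothetical helper: looks up `name` and writes the match to *out.
// Returns false (leaving *out untouched) when the name is unknown.
inline bool parse_ftype(const char * name, llama_ftype * out) {
    assert(name != nullptr && out != nullptr);
    for (const ftype_name & entry : k_ftype_names) {
        if (std::strcmp(name, entry.name) == 0) {
            *out = entry.value;
            return true;
        }
    }
    return false;
}

} // namespace sketch
```

A caller would assign the parsed value to `params.ftype` only when `parse_ftype` returns true, and otherwise keep the default or report the unknown name.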
Import
#include "llama.h"
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input (ftype) | enum llama_ftype | Target quantization type determining bits-per-weight and encoding scheme |
| Input (nthread) | int32_t | Thread count for parallel quantization; 0 or negative for auto-detection via std::thread::hardware_concurrency() |
| Input (output_tensor_type) | enum ggml_type | Override type for the output tensor; GGML_TYPE_COUNT means use default |
| Input (token_embedding_type) | enum ggml_type | Override type for token embeddings; GGML_TYPE_COUNT means use default |
| Input (allow_requantize) | bool | When true, allows quantizing tensors that are already in a non-F32/F16 format |
| Input (quantize_output_tensor) | bool | When true, the output.weight tensor is quantized along with other tensors |
| Input (only_copy) | bool | When true, tensors are copied without quantization (format conversion only) |
| Input (pure) | bool | When true, all tensors are quantized to the same default type (no mixed precision) |
| Input (keep_split) | bool | When true, the output retains the same number of file shards as the input |
| Input (imatrix) | void * | Pointer to importance matrix data for importance-weighted quantization; nullptr to disable |
| Input (kv_overrides) | void * | Pointer to vector of key-value metadata overrides |
| Input (tensor_types) | void * | Pointer to vector of per-tensor type overrides |
| Input (prune_layers) | void * | Pointer to vector of layer indices to prune during quantization |
| Output | llama_model_quantize_params | Initialized params struct (from llama_model_quantize_default_params) |
Default values returned by llama_model_quantize_default_params():
llama_model_quantize_params result = {
/*.nthread =*/ 0,
/*.ftype =*/ LLAMA_FTYPE_MOSTLY_Q5_1,
/*.output_tensor_type =*/ GGML_TYPE_COUNT,
/*.token_embedding_type =*/ GGML_TYPE_COUNT,
/*.allow_requantize =*/ false,
/*.quantize_output_tensor =*/ true,
/*.only_copy =*/ false,
/*.pure =*/ false,
/*.keep_split =*/ false,
/*.imatrix =*/ nullptr,
/*.kv_overrides =*/ nullptr,
/*.tensor_type =*/ nullptr,
/*.prune_layers =*/ nullptr
};
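The default `nthread` of 0 defers to auto-detection, as described in the struct comment. A minimal sketch of that rule follows; `resolve_nthread` is a hypothetical helper, and the fallback to 1 when `std::thread::hardware_concurrency()` reports 0 is an assumption of this sketch rather than documented library behavior.

```cpp
#include <cstdint>
#include <thread>

// Hypothetical helper mirroring the documented rule for nthread:
// a positive value is used as-is, anything else asks the runtime.
inline int32_t resolve_nthread(int32_t requested) {
    if (requested > 0) {
        return requested;
    }
    const unsigned hw = std::thread::hardware_concurrency();
    // hardware_concurrency() may return 0 when the count cannot be determined;
    // falling back to a single thread is an assumption made by this sketch.
    return hw > 0 ? static_cast<int32_t>(hw) : 1;
}
```

Setting `params.nthread` explicitly is mainly useful when quantization shares a machine with other work and should not claim every core.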
Usage Examples
Example 1: Quantize with default params (Q5_1)
#include <cstdint>
#include <cstdio>
#include "llama.h"
llama_model_quantize_params params = llama_model_quantize_default_params();
uint32_t result = llama_model_quantize("model-f16.gguf", "model-q5_1.gguf", &params);
if (result != 0) {
fprintf(stderr, "Quantization failed\n");
}
Example 2: Quantize to Q4_K_M with custom thread count
llama_model_quantize_params params = llama_model_quantize_default_params();
params.ftype = LLAMA_FTYPE_MOSTLY_Q4_K_M;
params.nthread = 8;
llama_model_quantize("model-f16.gguf", "model-q4_k_m.gguf", &params);
Example 3: Quantize with importance matrix
llama_model_quantize_params params = llama_model_quantize_default_params();
params.ftype = LLAMA_FTYPE_MOSTLY_IQ4_XS;
params.imatrix = (void *)&imatrix_data; // previously loaded importance matrix
llama_model_quantize("model-f16.gguf", "model-iq4_xs.gguf", &params);
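Because `imatrix` is a `void *`, the caller and the library must agree on the concrete type behind it, and the pointed-to object must stay alive for the whole quantize call. The sketch below illustrates that type-erased round trip. It assumes the importance data is a `std::unordered_map` from tensor name to a vector of floats; that container type, the `imatrix_map` alias, and `find_imatrix_entry` are assumptions of this sketch and should be checked against the llama.cpp sources in use.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Assumed container type: tensor name -> per-column importance weights.
using imatrix_map = std::unordered_map<std::string, std::vector<float>>;

// Hypothetical consumer side: recovers the typed map from the void * field
// and returns the entry for one tensor, or nullptr when there is none.
inline const std::vector<float> * find_imatrix_entry(const void * imatrix,
                                                     const std::string & tensor_name) {
    if (imatrix == nullptr) {
        return nullptr; // importance weighting disabled
    }
    const auto * data = static_cast<const imatrix_map *>(imatrix);
    const auto it = data->find(tensor_name);
    return it == data->end() ? nullptr : &it->second;
}
```

Casting back to any type other than the one originally stored is undefined behavior, which is why the container type has to match on both sides of the call.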