Implementation:Ggml org Llama cpp Llama Model Quantize Params

Field	Value
Implementation Name	Llama Model Quantize Params
Doc Type	API Doc
Topic	Model Quantization
Workflow	Model_Quantization
Category	Quantization Configuration
Repository	Ggml_org_Llama_cpp

Overview

Description

The llama_model_quantize_params struct and the llama_model_quantize_default_params() function together define the configuration interface for model quantization in llama.cpp. The struct contains all parameters that control quantization behavior, including the target quantization type (ftype), threading configuration, importance matrix data, and fine-grained tensor type overrides. The default params function returns a pre-initialized struct with sensible defaults (Q5_1 quantization, auto-detected thread count, no importance matrix).

The llama_ftype enum defines the complete set of supported quantization types, from full-precision F32 down to sub-2-bit formats such as IQ1_S and the ternary TQ1_0.

Usage

Users obtain a default params struct, modify the fields they need to customize, and pass the struct to llama_model_quantize(). This pattern allows the API to evolve with new parameters while maintaining backward compatibility through defaults.

Code Reference

Source Location

Params struct: include/llama.h (lines 382-396)
Default params function: include/llama.h (line 421), implemented in src/llama-quant.cpp (lines 1037-1055)
Ftype enum: include/llama.h (lines 115-157)

Signature

// Quantization parameters struct
typedef struct llama_model_quantize_params {
    int32_t nthread;                      // number of threads to use for quantizing, if <=0 will use std::thread::hardware_concurrency()
    enum llama_ftype ftype;               // quantize to this llama_ftype
    enum ggml_type output_tensor_type;    // output tensor type
    enum ggml_type token_embedding_type;  // token embeddings tensor type
    bool allow_requantize;                // allow quantizing non-f32/f16 tensors
    bool quantize_output_tensor;          // quantize output.weight
    bool only_copy;                       // only copy tensors - ftype, allow_requantize and quantize_output_tensor are ignored
    bool pure;                            // quantize all tensors to the default type
    bool keep_split;                      // quantize to the same number of shards
    void * imatrix;                       // pointer to importance matrix data
    void * kv_overrides;                  // pointer to vector containing overrides
    void * tensor_types;                  // pointer to vector containing tensor types
    void * prune_layers;                  // pointer to vector containing layer indices to prune
} llama_model_quantize_params;

// Default params factory function
LLAMA_API struct llama_model_quantize_params llama_model_quantize_default_params(void);

The llama_ftype enum defines available quantization types:

enum llama_ftype {
    LLAMA_FTYPE_ALL_F32              = 0,
    LLAMA_FTYPE_MOSTLY_F16           = 1,   // except 1d tensors
    LLAMA_FTYPE_MOSTLY_Q4_0          = 2,   // except 1d tensors
    LLAMA_FTYPE_MOSTLY_Q4_1          = 3,   // except 1d tensors
    LLAMA_FTYPE_MOSTLY_Q8_0          = 7,   // except 1d tensors
    LLAMA_FTYPE_MOSTLY_Q5_0          = 8,   // except 1d tensors
    LLAMA_FTYPE_MOSTLY_Q5_1          = 9,   // except 1d tensors
    LLAMA_FTYPE_MOSTLY_Q2_K          = 10,  // except 1d tensors
    LLAMA_FTYPE_MOSTLY_Q3_K_S        = 11,  // except 1d tensors
    LLAMA_FTYPE_MOSTLY_Q3_K_M        = 12,  // except 1d tensors
    LLAMA_FTYPE_MOSTLY_Q3_K_L        = 13,  // except 1d tensors
    LLAMA_FTYPE_MOSTLY_Q4_K_S        = 14,  // except 1d tensors
    LLAMA_FTYPE_MOSTLY_Q4_K_M        = 15,  // except 1d tensors
    LLAMA_FTYPE_MOSTLY_Q5_K_S        = 16,  // except 1d tensors
    LLAMA_FTYPE_MOSTLY_Q5_K_M        = 17,  // except 1d tensors
    LLAMA_FTYPE_MOSTLY_Q6_K          = 18,  // except 1d tensors
    LLAMA_FTYPE_MOSTLY_IQ2_XXS       = 19,  // except 1d tensors
    LLAMA_FTYPE_MOSTLY_IQ2_XS        = 20,  // except 1d tensors
    LLAMA_FTYPE_MOSTLY_Q2_K_S        = 21,  // except 1d tensors
    LLAMA_FTYPE_MOSTLY_IQ3_XS        = 22,  // except 1d tensors
    LLAMA_FTYPE_MOSTLY_IQ3_XXS       = 23,  // except 1d tensors
    LLAMA_FTYPE_MOSTLY_IQ1_S         = 24,  // except 1d tensors
    LLAMA_FTYPE_MOSTLY_IQ4_NL        = 25,  // except 1d tensors
    LLAMA_FTYPE_MOSTLY_IQ3_S         = 26,  // except 1d tensors
    LLAMA_FTYPE_MOSTLY_IQ3_M         = 27,  // except 1d tensors
    LLAMA_FTYPE_MOSTLY_IQ2_S         = 28,  // except 1d tensors
    LLAMA_FTYPE_MOSTLY_IQ2_M         = 29,  // except 1d tensors
    LLAMA_FTYPE_MOSTLY_IQ4_XS        = 30,  // except 1d tensors
    LLAMA_FTYPE_MOSTLY_IQ1_M         = 31,  // except 1d tensors
    LLAMA_FTYPE_MOSTLY_BF16          = 32,  // except 1d tensors
    LLAMA_FTYPE_MOSTLY_TQ1_0         = 36,  // except 1d tensors
    LLAMA_FTYPE_MOSTLY_TQ2_0         = 37,  // except 1d tensors
    LLAMA_FTYPE_MOSTLY_MXFP4_MOE     = 38,  // except 1d tensors
    LLAMA_FTYPE_GUESSED = 1024,             // not specified in the model file
};
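Callers often need to turn a user-facing name such as "Q4_K_M" into an ftype value before filling in the params struct. The following is a minimal sketch of one way to do that, not the lookup used by llama.cpp itself. The `sketch_ftype` enum and the `parse_ftype_name` helper are hypothetical names introduced here; the enum mirrors a handful of numeric values from the listing above so that the block compiles without llama.h.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical local mirror of a few llama_ftype values from the listing above.
// Real code would include "llama.h" and use the LLAMA_FTYPE_* constants directly.
enum sketch_ftype : int32_t {
    SKETCH_FTYPE_MOSTLY_Q4_0   = 2,
    SKETCH_FTYPE_MOSTLY_Q8_0   = 7,
    SKETCH_FTYPE_MOSTLY_Q5_1   = 9,
    SKETCH_FTYPE_MOSTLY_Q4_K_M = 15,
    SKETCH_FTYPE_MOSTLY_IQ4_XS = 30,
};

// Hypothetical helper: map a user-facing name to an ftype value.
// Returns false and leaves `out` untouched when the name is not in the table.
inline bool parse_ftype_name(const std::string & name, sketch_ftype & out) {
    static const std::unordered_map<std::string, sketch_ftype> table = {
        {"Q4_0",   SKETCH_FTYPE_MOSTLY_Q4_0},
        {"Q8_0",   SKETCH_FTYPE_MOSTLY_Q8_0},
        {"Q5_1",   SKETCH_FTYPE_MOSTLY_Q5_1},
        {"Q4_K_M", SKETCH_FTYPE_MOSTLY_Q4_K_M},
        {"IQ4_XS", SKETCH_FTYPE_MOSTLY_IQ4_XS},
    };
    const auto it = table.find(name);
    if (it == table.end()) {
        return false;
    }
    out = it->second;
    return true;
}
```

Rejecting unknown names explicitly, rather than falling back to a default, keeps a typo from silently producing a model in an unintended format.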

Import

#include "llama.h"

I/O Contract

Direction	Type	Description
Input (ftype)	`enum llama_ftype`	Target quantization type determining bits-per-weight and encoding scheme
Input (nthread)	`int32_t`	Thread count for parallel quantization; 0 or negative for auto-detection via `std::thread::hardware_concurrency()`
Input (output_tensor_type)	`enum ggml_type`	Override type for the output tensor; `GGML_TYPE_COUNT` means use default
Input (token_embedding_type)	`enum ggml_type`	Override type for token embeddings; `GGML_TYPE_COUNT` means use default
Input (allow_requantize)	`bool`	When true, allows quantizing tensors that are already in a non-F32/F16 format
Input (quantize_output_tensor)	`bool`	When true, the output.weight tensor is quantized along with other tensors
Input (only_copy)	`bool`	When true, tensors are copied as-is without quantization; `ftype`, `allow_requantize`, and `quantize_output_tensor` are ignored
Input (pure)	`bool`	When true, all tensors are quantized to the same default type (no mixed precision)
Input (keep_split)	`bool`	When true, the output retains the same number of file shards as the input
Input (imatrix)	`void *`	Pointer to importance matrix data for importance-weighted quantization; nullptr to disable
Input (kv_overrides)	`void *`	Pointer to vector of key-value metadata overrides
Input (tensor_types)	`void *`	Pointer to vector of per-tensor type overrides
Input (prune_layers)	`void *`	Pointer to vector of layer indices to prune during quantization
Output	`llama_model_quantize_params`	Initialized params struct (from `llama_model_quantize_default_params`)

Default values returned by llama_model_quantize_default_params():

llama_model_quantize_params result = {
    /*.nthread                     =*/ 0,
    /*.ftype                       =*/ LLAMA_FTYPE_MOSTLY_Q5_1,
    /*.output_tensor_type          =*/ GGML_TYPE_COUNT,
    /*.token_embedding_type        =*/ GGML_TYPE_COUNT,
    /*.allow_requantize            =*/ false,
    /*.quantize_output_tensor      =*/ true,
    /*.only_copy                   =*/ false,
    /*.pure                        =*/ false,
    /*.keep_split                  =*/ false,
    /*.imatrix                     =*/ nullptr,
    /*.kv_overrides                =*/ nullptr,
    /*.tensor_types                =*/ nullptr,
    /*.prune_layers                =*/ nullptr
};
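Two of these defaults are sentinels rather than literal settings: nthread = 0 requests auto-detection, and GGML_TYPE_COUNT means "no override". The sketch below shows how such sentinels might be resolved into effective values. The helper names `resolve_nthread` and `pick_type` and the `sketch_type` enum are hypothetical, the enum values are placeholders rather than real ggml_type numbers, and the clamp to one thread is an assumption of this sketch, not documented llama.cpp behavior.

```cpp
#include <cstdint>
#include <thread>

// Hypothetical helper mirroring the documented nthread rule:
// values <= 0 fall back to std::thread::hardware_concurrency().
inline int32_t resolve_nthread(int32_t requested) {
    if (requested > 0) {
        return requested;
    }
    // hardware_concurrency() may return 0 when the count cannot be determined.
    // Clamping to 1 is an assumption of this sketch.
    const unsigned detected = std::thread::hardware_concurrency();
    return detected > 0 ? static_cast<int32_t>(detected) : 1;
}

// Hypothetical stand-in for ggml_type; the numeric values are placeholders.
enum sketch_type : int32_t {
    SKETCH_TYPE_A     = 1,
    SKETCH_TYPE_B     = 2,
    SKETCH_TYPE_COUNT = 3,  // plays the role of GGML_TYPE_COUNT ("use default")
};

// Return the override when one was set, otherwise the type chosen by default.
inline sketch_type pick_type(sketch_type override_type, sketch_type default_type) {
    return override_type == SKETCH_TYPE_COUNT ? default_type : override_type;
}
```

Using the enum's count value as the sentinel avoids a separate "has override" flag, since the count is never a valid tensor type.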

Usage Examples

Example 1: Quantize with default params (Q5_1)

#include <cstdio>   // fprintf
#include "llama.h"

llama_model_quantize_params params = llama_model_quantize_default_params();
// llama_model_quantize() returns 0 on success
uint32_t result = llama_model_quantize("model-f16.gguf", "model-q5_1.gguf", &params);
if (result != 0) {
    fprintf(stderr, "Quantization failed\n");
}

Example 2: Quantize to Q4_K_M with custom thread count

llama_model_quantize_params params = llama_model_quantize_default_params();
params.ftype = LLAMA_FTYPE_MOSTLY_Q4_K_M;
params.nthread = 8;
llama_model_quantize("model-f16.gguf", "model-q4_k_m.gguf", &params);

Example 3: Quantize with importance matrix

llama_model_quantize_params params = llama_model_quantize_default_params();
params.ftype = LLAMA_FTYPE_MOSTLY_IQ4_XS;
params.imatrix = (void *)&imatrix_data;  // previously loaded importance matrix
llama_model_quantize("model-f16.gguf", "model-iq4_xs.gguf", &params);
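Example 3 passes the importance matrix through a void * field, so the caller and the library must agree on the concrete type behind the pointer. The sketch below illustrates that opaque-pointer pattern in isolation. The container shape (a map from tensor name to a vector of floats) is an assumption that should be checked against the llama.cpp version in use, and `imatrix_map`, `sketch_params`, `get_imatrix`, and `find_weights` are hypothetical names introduced for illustration.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Assumed container shape for the importance matrix: tensor name -> weights.
using imatrix_map = std::unordered_map<std::string, std::vector<float>>;

// Hypothetical stand-in for the params struct, reduced to the one field used here.
struct sketch_params {
    void * imatrix = nullptr;
};

// Hypothetical consumer side: recover the typed container from the opaque pointer.
inline const imatrix_map * get_imatrix(const sketch_params & params) {
    return static_cast<const imatrix_map *>(params.imatrix);
}

// Look up the weights for one tensor. Returns nullptr when no matrix was
// supplied or when the tensor has no entry in it.
inline const std::vector<float> * find_weights(const sketch_params & params,
                                               const std::string & tensor_name) {
    const imatrix_map * m = get_imatrix(params);
    if (m == nullptr) {
        return nullptr;
    }
    const auto it = m->find(tensor_name);
    return it == m->end() ? nullptr : &it->second;
}
```

Because the struct stores only a raw pointer, the container has to stay alive and unmodified until the quantization call returns, and casting to a type other than the one the library expects is undefined behavior.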
