Implementation: ggml-org/llama.cpp, llama_model_quantize for LoRA

From Leeroopedia
Implementation Name: Llama Model Quantize For LoRA
Doc Type: API Doc
Workflow: LoRA_Adapter_Workflow
Step: 5 of 5
Source Files: include/llama.h, src/llama-quant.cpp

Overview

Description

This implementation documents the llama_model_quantize API function as used in the post-merge LoRA quantization workflow. After merging LoRA adapters into a base model with llama-export-lora (which produces an F16 GGUF file), this function quantizes the merged model to a smaller, deployment-ready format.

The function is the same general-purpose quantization API used throughout llama.cpp, but in the LoRA workflow context it specifically operates on the F16 output from the merge step. The function reads the input GGUF file, applies the requested quantization scheme to each tensor, and writes a new quantized GGUF file.

Usage

# CLI usage via llama-quantize
./llama-quantize merged-model-f16.gguf quantized-model-q4_k_m.gguf Q4_K_M

Code Reference

Source Location (header): include/llama.h:614-617
Source Location (impl): src/llama-quant.cpp:1057-1069
Import: #include "llama.h"

API signature:

// Returns 0 on success
LLAMA_API uint32_t llama_model_quantize(
        const char * fname_inp,
        const char * fname_out,
        const llama_model_quantize_params * params);

Implementation (src/llama-quant.cpp:1057-1069):

uint32_t llama_model_quantize(
        const char * fname_inp,
        const char * fname_out,
        const llama_model_quantize_params * params) {
    try {
        llama_model_quantize_impl(fname_inp, fname_out, params);
    } catch (const std::exception & err) {
        LLAMA_LOG_ERROR("%s: failed to quantize: %s\n", __func__, err.what());
        return 1;
    }

    return 0;
}

Default quantization params (obtained via llama_model_quantize_default_params):

struct llama_model_quantize_params {
    int32_t  nthread;                // number of threads to use for quantizing
    enum llama_ftype ftype;          // quantize to this llama_ftype
    enum ggml_type output_tensor_type; // output tensor type
    enum ggml_type token_embedding_type; // token embeddings tensor type
    bool allow_requantize;           // allow quantizing non-f32/f16 tensors
    bool quantize_output_tensor;     // quantize output.weight
    bool only_copy;                  // only copy tensors (no quantization)
    bool pure;                       // quantize all tensors to the default type
    bool keep_split;                 // keep split info
    void * imatrix;                  // importance matrix data
    void * kv_overrides;             // KV overrides
    void * tensor_type;              // per-tensor type overrides
    void * prune_layers;             // layers to prune
};

I/O Contract

  • Input: fname_inp (const char *): Path to the input GGUF file (F16 merged model from export-lora)
  • Input: fname_out (const char *): Path for the output quantized GGUF file
  • Input: params (const llama_model_quantize_params *): Quantization parameters (type, threading, etc.)
  • Output (return): uint32_t: 0 on success, 1 on failure
  • Output: quantized GGUF file (binary file): Quantized model in the specified format

Typical quantization types for merged LoRA models:

  • LLAMA_FTYPE_MOSTLY_Q4_K_M: Good balance of quality and size
  • LLAMA_FTYPE_MOSTLY_Q5_K_M: Higher quality, recommended to preserve fine-tuning
  • LLAMA_FTYPE_MOSTLY_Q8_0: Near-lossless, best for preserving LoRA adaptations

Usage Examples

C API usage for post-merge quantization:

#include "llama.h"

// Get default parameters
llama_model_quantize_params params = llama_model_quantize_default_params();
params.ftype = LLAMA_FTYPE_MOSTLY_Q4_K_M;
params.nthread = 8;

// Quantize the merged model
uint32_t result = llama_model_quantize(
    "merged-model-f16.gguf",
    "merged-model-q4_k_m.gguf",
    &params);

if (result != 0) {
    fprintf(stderr, "Quantization failed\n");
    return 1;
}

CLI usage (complete LoRA merge + quantize pipeline):

# Step 1: Convert LoRA adapter to GGUF
python convert_lora_to_gguf.py ./my-lora --base ./base-model

# Step 2: Merge LoRA into base model (produces F16)
./llama-export-lora -m base-model.gguf --lora my-lora/ggml-adapter-model.gguf -o merged-f16.gguf

# Step 3: Quantize the merged model
./llama-quantize merged-f16.gguf merged-q4_k_m.gguf Q4_K_M

# Step 4: Use the quantized merged model
./llama-cli -m merged-q4_k_m.gguf -p "Hello, world!"
