Implementation (ggml-org/llama.cpp): Llama Model Quantize For LoRA
| Field | Value |
|---|---|
| Implementation Name | Llama Model Quantize For LoRA |
| Doc Type | API Doc |
| Workflow | LoRA_Adapter_Workflow |
| Step | 5 of 5 |
| Source Files | include/llama.h, src/llama-quant.cpp |
Overview
Description
This implementation documents the llama_model_quantize API function as used in the post-merge LoRA quantization workflow. After merging LoRA adapters into a base model with llama-export-lora (which produces an F16 GGUF file), this function quantizes the merged model to a smaller, deployment-ready format.
The function is the same general-purpose quantization API used throughout llama.cpp, but in the LoRA workflow context it specifically operates on the F16 output from the merge step. The function reads the input GGUF file, applies the requested quantization scheme to each tensor, and writes a new quantized GGUF file.
Usage
# CLI usage via llama-quantize
./llama-quantize merged-model-f16.gguf quantized-model-q4_k_m.gguf Q4_K_M
Code Reference
| Field | Value |
|---|---|
| Source Location (header) | include/llama.h:614-617 |
| Source Location (impl) | src/llama-quant.cpp:1057-1069 |
| Import | #include "llama.h" |
API signature:
// Returns 0 on success
LLAMA_API uint32_t llama_model_quantize(
const char * fname_inp,
const char * fname_out,
const llama_model_quantize_params * params);
Implementation (src/llama-quant.cpp:1057-1069):
uint32_t llama_model_quantize(
const char * fname_inp,
const char * fname_out,
const llama_model_quantize_params * params) {
try {
llama_model_quantize_impl(fname_inp, fname_out, params);
} catch (const std::exception & err) {
LLAMA_LOG_ERROR("%s: failed to quantize: %s\n", __func__, err.what());
return 1;
}
return 0;
}
Default quantization params (obtained via llama_model_quantize_default_params):
struct llama_model_quantize_params {
int32_t nthread; // number of threads to use for quantizing
enum llama_ftype ftype; // quantize to this llama_ftype
enum ggml_type output_tensor_type; // output tensor type
enum ggml_type token_embedding_type; // token embeddings tensor type
bool allow_requantize; // allow quantizing non-f32/f16 tensors
bool quantize_output_tensor; // quantize output.weight
bool only_copy; // only copy tensors (no quantization)
bool pure; // quantize all tensors to the default type
bool keep_split; // keep split info
void * imatrix; // importance matrix data
void * kv_overrides; // KV overrides
void * tensor_type; // per-tensor type overrides
void * prune_layers; // layers to prune
};
I/O Contract
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | fname_inp | const char * | Path to the input GGUF file (F16 merged model from export-lora) |
| Input | fname_out | const char * | Path for the output quantized GGUF file |
| Input | params | const llama_model_quantize_params * | Quantization parameters (type, threading, etc.) |
| Output | (return) | uint32_t | 0 on success, 1 on failure |
| Output | quantized GGUF file | binary file | Quantized model in the specified format |
Typical quantization types for merged LoRA models:
- LLAMA_FTYPE_MOSTLY_Q4_K_M: Good balance of quality and size
- LLAMA_FTYPE_MOSTLY_Q5_K_M: Higher quality, recommended to preserve fine-tuning
- LLAMA_FTYPE_MOSTLY_Q8_0: Near-lossless, best for preserving LoRA adaptations
Usage Examples
C API usage for post-merge quantization:
#include "llama.h"
// Get default parameters
llama_model_quantize_params params = llama_model_quantize_default_params();
params.ftype = LLAMA_FTYPE_MOSTLY_Q4_K_M;
params.nthread = 8;
// Quantize the merged model
uint32_t result = llama_model_quantize(
"merged-model-f16.gguf",
"merged-model-q4_k_m.gguf",
    &params);
if (result != 0) {
fprintf(stderr, "Quantization failed\n");
return 1;
}
CLI usage (complete LoRA merge + quantize pipeline):
# Step 1: Convert LoRA adapter to GGUF
python convert_lora_to_gguf.py ./my-lora --base ./base-model
# Step 2: Merge LoRA into base model (produces F16)
./llama-export-lora -m base-model.gguf --lora my-lora/ggml-adapter-model.gguf -o merged-f16.gguf
# Step 3: Quantize the merged model
./llama-quantize merged-f16.gguf merged-q4_k_m.gguf Q4_K_M
# Step 4: Use the quantized merged model
./llama-cli -m merged-q4_k_m.gguf -p "Hello, world!"