Implementation: ggml-org/llama.cpp, llama_model_quantize for LoRA

From Leeroopedia
Implementation Name: Llama Model Quantize For LoRA
Doc Type: API Doc
Workflow: LoRA_Adapter_Workflow
Step: 5 of 5
Source Files: include/llama.h, src/llama-quant.cpp

Overview

Description

This implementation documents the llama_model_quantize API function as used in the post-merge LoRA quantization workflow. After merging LoRA adapters into a base model with llama-export-lora (which produces an F16 GGUF file), this function quantizes the merged model to a smaller, deployment-ready format.

The function is the same general-purpose quantization API used throughout llama.cpp, but in the LoRA workflow context it specifically operates on the F16 output from the merge step. The function reads the input GGUF file, applies the requested quantization scheme to each tensor, and writes a new quantized GGUF file.

Usage

# CLI usage via llama-quantize
./llama-quantize merged-model-f16.gguf quantized-model-q4_k_m.gguf Q4_K_M

Code Reference

Source Location (header): include/llama.h:614-617
Source Location (impl): src/llama-quant.cpp:1057-1069
Import: #include "llama.h"

API signature:

// Returns 0 on success
LLAMA_API uint32_t llama_model_quantize(
        const char * fname_inp,
        const char * fname_out,
        const llama_model_quantize_params * params);

Implementation (src/llama-quant.cpp:1057-1069):

uint32_t llama_model_quantize(
        const char * fname_inp,
        const char * fname_out,
        const llama_model_quantize_params * params) {
    try {
        llama_model_quantize_impl(fname_inp, fname_out, params);
    } catch (const std::exception & err) {
        LLAMA_LOG_ERROR("%s: failed to quantize: %s\n", __func__, err.what());
        return 1;
    }

    return 0;
}

Default quantization params (obtained via llama_model_quantize_default_params):

struct llama_model_quantize_params {
    int32_t  nthread;                // number of threads to use for quantizing
    enum llama_ftype ftype;          // quantize to this llama_ftype
    enum ggml_type output_tensor_type; // output tensor type
    enum ggml_type token_embedding_type; // token embeddings tensor type
    bool allow_requantize;           // allow quantizing non-f32/f16 tensors
    bool quantize_output_tensor;     // quantize output.weight
    bool only_copy;                  // only copy tensors (no quantization)
    bool pure;                       // quantize all tensors to the default type
    bool keep_split;                 // keep split info
    void * imatrix;                  // importance matrix data
    void * kv_overrides;             // KV overrides
    void * tensor_type;              // per-tensor type overrides
    void * prune_layers;             // layers to prune
};

I/O Contract

  • Input: fname_inp (const char *): Path to the input GGUF file (F16 merged model from export-lora)
  • Input: fname_out (const char *): Path for the output quantized GGUF file
  • Input: params (const llama_model_quantize_params *): Quantization parameters (type, threading, etc.)
  • Output (return): uint32_t: 0 on success, 1 on failure
  • Output: quantized GGUF file (binary file): Quantized model in the specified format

Typical quantization types for merged LoRA models:

  • LLAMA_FTYPE_MOSTLY_Q4_K_M: Good balance of quality and size
  • LLAMA_FTYPE_MOSTLY_Q5_K_M: Higher quality, recommended to preserve fine-tuning
  • LLAMA_FTYPE_MOSTLY_Q8_0: Near-lossless, best for preserving LoRA adaptations

Usage Examples

C API usage for post-merge quantization:

#include "llama.h"

// Get default parameters
llama_model_quantize_params params = llama_model_quantize_default_params();
params.ftype = LLAMA_FTYPE_MOSTLY_Q4_K_M;
params.nthread = 8;

// Quantize the merged model
uint32_t result = llama_model_quantize(
    "merged-model-f16.gguf",
    "merged-model-q4_k_m.gguf",
    &params);

if (result != 0) {
    fprintf(stderr, "Quantization failed\n");
    return 1;
}

CLI usage (complete LoRA merge + quantize pipeline):

# Step 1: Convert LoRA adapter to GGUF
python convert_lora_to_gguf.py ./my-lora --base ./base-model

# Step 2: Merge LoRA into base model (produces F16)
./llama-export-lora -m base-model.gguf --lora my-lora/ggml-adapter-model.gguf -o merged-f16.gguf

# Step 3: Quantize the merged model
./llama-quantize merged-f16.gguf merged-q4_k_m.gguf Q4_K_M

# Step 4: Use the quantized merged model
./llama-cli -m merged-q4_k_m.gguf -p "Hello, world!"
