Implementation:Ollama Ollama Llama Quant
| Knowledge Sources | |
|---|---|
| Domains | LLM Inference, Quantization |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Implements model quantization, converting full-precision model tensors to lower-precision formats (Q4, Q5, Q8, IQ, etc.) for reduced memory usage and faster inference.
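To make "lower-precision format" concrete, the following is a minimal sketch of block quantization in the style of Q8_0: each group of 32 weights is stored as one scale plus 32 signed 8-bit integers. The names BlockQ8Sketch, quantize_q8_sketch, and dequantize_q8_sketch are hypothetical and are not part of llama-quant.cpp; ggml's own Q8_0 stores the scale as fp16, whereas this sketch uses a float for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBlockSize = 32;  // weights per block, as in ggml's Q8_0

// Hypothetical illustration type, not ggml's block_q8_0.
struct BlockQ8Sketch {
    float       scale;          // per-block scale (ggml stores this as fp16)
    std::int8_t q[kBlockSize];  // quantized weights in [-127, 127]
};

// Quantize src block by block; src.size() must be a multiple of kBlockSize.
std::vector<BlockQ8Sketch> quantize_q8_sketch(const std::vector<float> & src) {
    assert(src.size() % kBlockSize == 0);
    std::vector<BlockQ8Sketch> out(src.size() / kBlockSize);
    for (std::size_t b = 0; b < out.size(); ++b) {
        const float * x = src.data() + b * kBlockSize;
        // The largest magnitude in the block determines the scale.
        float amax = 0.0f;
        for (std::size_t i = 0; i < kBlockSize; ++i) {
            amax = std::max(amax, std::fabs(x[i]));
        }
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        out[b].scale = d;
        for (std::size_t i = 0; i < kBlockSize; ++i) {
            out[b].q[i] = static_cast<std::int8_t>(std::lround(x[i] * id));
        }
    }
    return out;
}

// Reconstruct approximate floats: value = scale * q.
std::vector<float> dequantize_q8_sketch(const std::vector<BlockQ8Sketch> & blocks) {
    std::vector<float> out;
    out.reserve(blocks.size() * kBlockSize);
    for (const auto & blk : blocks) {
        for (std::size_t i = 0; i < kBlockSize; ++i) {
            out.push_back(blk.scale * static_cast<float>(blk.q[i]));
        }
    }
    return out;
}
```

In this scheme the round-trip error per weight is bounded by about half the block's scale, which is why formats with fewer bits per weight trade more accuracy for less memory.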
Description
The tensor_quantization struct pairs a tensor name with a target ggml_type and defines per-tensor quantization type overrides. quantize_state_impl holds the quantization state: totals and running indices for the attention V, FFN down, FFN gate, and FFN up tensors, counters for k-quantized and fallback tensors, and flags for whether an importance matrix and an output tensor are present. The main quantization function reads tensors from the source GGUF file, determines the target quantization type from each tensor's role, quantizes the data with ggml's quantization functions, and writes the quantized tensors to a new GGUF file. It also supports an importance matrix (imatrix) for better quantization quality, layer pruning via remap_layer, and multi-threaded parallel quantization.
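The per-tensor override idea can be sketched as a lookup that falls back to a default type. This is a simplified illustration under stated assumptions, not the selection logic in llama-quant.cpp: QuantTypeSketch stands in for ggml_type, TensorOverrideSketch mirrors the shape of tensor_quantization, and matching names by regular expression is an assumption of the sketch.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical stand-in for ggml_type, limited to a few values.
enum class QuantTypeSketch { F16, Q4_K, Q6_K, Q8_0 };

// Same shape as tensor_quantization: a name pattern and a target type.
struct TensorOverrideSketch {
    std::string     name;   // pattern matched against the tensor name
    QuantTypeSketch quant;  // type to use when the pattern matches
};

// Return the first matching override, or default_type if none matches.
QuantTypeSketch pick_type_sketch(const std::string & tensor_name,
                                 QuantTypeSketch default_type,
                                 const std::vector<TensorOverrideSketch> & overrides) {
    assert(!tensor_name.empty());
    for (const auto & o : overrides) {
        if (std::regex_search(tensor_name, std::regex(o.name))) {
            return o.quant;
        }
    }
    return default_type;
}
```

For example, an override with the pattern `attn_v\.weight` and type Q6_K would keep every attention V tensor at Q6_K while the remaining tensors use the default chosen for the file type.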
Usage
Enables the quantization workflow that makes large models usable on consumer hardware. Ollama's model creation process uses this to convert models to quantized formats that fit in available memory while maintaining acceptable quality.
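A rough size estimate shows why this matters for consumer hardware. The helper below is a hypothetical sketch (packed_bytes_sketch is not an Ollama or ggml function) that computes storage for a block-based format. It relies on the ggml block layouts for Q8_0 (34 bytes per 32 weights) and Q4_0 (18 bytes per 32 weights) and ignores metadata and the fact that real files mix several types.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to store n_weights when every block_size weights occupy block_bytes.
inline std::uint64_t packed_bytes_sketch(std::uint64_t n_weights,
                                         std::uint64_t block_size,
                                         std::uint64_t block_bytes) {
    assert(block_size > 0);
    const std::uint64_t n_blocks = (n_weights + block_size - 1) / block_size;  // round up
    return n_blocks * block_bytes;
}
```

For a model with 7,000,000,000 weights this gives 14,000,000,000 bytes at F16 (2 bytes per weight), 7,437,500,000 bytes at Q8_0, and 3,937,500,000 bytes at Q4_0, which corresponds to 16, 8.5, and 4.5 bits per weight respectively.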
Code Reference
Source Location
- Repository: Ollama
- File: llama/llama.cpp/src/llama-quant.cpp
- Lines: 1-1072
Signature
struct tensor_quantization {
    std::string name;
    ggml_type quant = GGML_TYPE_COUNT;
};

struct quantize_state_impl {
    const llama_model & model;
    const llama_model_quantize_params * params;
    int n_attention_wv, n_ffn_down, n_ffn_gate, n_ffn_up;
    int i_attention_wv, i_ffn_down, i_ffn_gate, i_ffn_up;
    int n_k_quantized, n_fallback;
    bool has_imatrix, has_output;
};

static void llama_tensor_dequantize_impl(
    ggml_tensor * tensor, std::vector<no_init<float>> & output,
    std::vector<std::thread> & workers, const size_t nelements, const int nthread);

static std::string remap_layer(const std::string & orig_name,
    const std::vector<int> & prune, std::map<int, std::string> & mapped, int & next_id);
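The remap_layer signature suggests how layer pruning can work: tensors in pruned blocks are dropped and the remaining blocks are renumbered. The following is a simplified sketch of that idea, not the code in llama-quant.cpp. The function name remap_layer_sketch is hypothetical, and the sketch assumes tensor names of the form blk.N.…, that an empty return value means "skip this tensor", and that tensors are visited in ascending layer order with next_id starting at 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <regex>
#include <string>
#include <vector>

// Returns "" for tensors in a pruned block; otherwise rewrites "blk.N." so that
// the surviving blocks are numbered consecutively. Non-layer tensors pass through.
std::string remap_layer_sketch(const std::string & orig_name,
                               const std::vector<int> & prune,
                               std::map<int, std::string> & mapped,
                               int & next_id) {
    assert(next_id >= 0);
    static const std::regex pattern(R"(blk\.(\d+)\.)");
    std::smatch match;
    if (prune.empty() || !std::regex_search(orig_name, match, pattern)) {
        return orig_name;  // nothing to prune, or not a per-layer tensor
    }
    const int blk = std::stoi(match[1].str());
    if (mapped.count(blk) == 0) {
        // First tensor seen for this block: decide its fate once and cache it.
        const bool pruned = std::find(prune.begin(), prune.end(), blk) != prune.end();
        mapped[blk] = pruned ? std::string() : std::to_string(next_id++);
    }
    if (mapped[blk].empty()) {
        return "";  // block is pruned
    }
    std::string new_name = orig_name;
    new_name.replace(static_cast<std::size_t>(match.position(1)),
                     static_cast<std::size_t>(match.length(1)), mapped[blk]);
    return new_name;
}
```

Caching the decision in mapped keeps every tensor of a block on the same new index, so pruning block 1 turns blk.2.attn_q.weight and blk.2.ffn_up.weight into blk.1.attn_q.weight and blk.1.ffn_up.weight.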
Import
#include "llama-quant.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | const llama_model & | Yes | Source model to quantize |
| params | const llama_model_quantize_params * | Yes | Quantization configuration |
| fname_inp | const char * | Yes | Input GGUF file path |
| fname_out | const char * | Yes | Output GGUF file path |
Outputs
| Name | Type | Description |
|---|---|---|
| GGUF file | file | Quantized model saved to disk |
Usage Examples
// Quantization is invoked through the public API:
llama_model_quantize_params params = llama_model_quantize_default_params();
params.nthread = 8;
params.ftype = LLAMA_FTYPE_MOSTLY_Q4_K_M;
llama_model_quantize("model-f16.gguf", "model-q4_k_m.gguf", &params);