Implementation:Ollama Ollama Llama Quant

Knowledge Sources	Ollama
Domains	LLM Inference, Quantization
Last Updated	2025-02-15 00:00 GMT

Overview

Implements model quantization, converting full-precision model tensors to lower-precision formats (Q4, Q5, Q8, IQ, etc.) for reduced memory usage and faster inference.
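To make the idea of lower-precision storage concrete, the following is a minimal sketch of block-wise 8-bit quantization, loosely modelled on the Q8_0 idea of one scale per block of 32 values. The names BlockQ8, quantize_block, and dequantize_value are hypothetical, and the scale is kept as a float for simplicity; this is an illustration under those assumptions, not the ggml implementation.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical block layout: one scale plus 32 signed 8-bit values.
// Assumption: the scale is a plain float here to keep the sketch simple.
constexpr std::size_t kBlockSize = 32;

struct BlockQ8 {
    float scale;
    std::array<std::int8_t, kBlockSize> q;
};

// Quantize exactly kBlockSize floats into one block.
inline BlockQ8 quantize_block(const float * x) {
    assert(x != nullptr);

    // The largest magnitude in the block determines the scale.
    float amax = 0.0f;
    for (std::size_t i = 0; i < kBlockSize; ++i) {
        amax = std::max(amax, std::fabs(x[i]));
    }

    BlockQ8 b{};
    b.scale = amax / 127.0f;
    const float inv = b.scale > 0.0f ? 1.0f / b.scale : 0.0f;

    // Map each value to the nearest integer in [-127, 127].
    for (std::size_t i = 0; i < kBlockSize; ++i) {
        b.q[i] = static_cast<std::int8_t>(std::lround(x[i] * inv));
    }
    return b;
}

// Recover an approximation of the original value at index i.
inline float dequantize_value(const BlockQ8 & b, std::size_t i) {
    return b.scale * static_cast<float>(b.q[i]);
}
```

With this scheme each reconstructed value differs from the original by at most about half the block's scale, which is the trade-off between size and accuracy that the lower-bit formats push further.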

Description

The file defines tensor_quantization, which pairs a tensor name with a quantization type override for that tensor, and quantize_state_impl, which holds the state of a quantization run: the model and parameters, counts and running indices for the attention V, FFN down, FFN gate, and FFN up tensors, counters for k-quantized and fallback tensors, and flags recording whether an importance matrix and an output tensor are present. The main quantization function reads each tensor from the source GGUF file, chooses a target quantization type based on the tensor's role, quantizes the data with ggml's quantization functions, and writes the result to a new GGUF file. It also supports an importance matrix (imatrix) for better quantization quality, layer pruning via remap_layer, and multi-threaded quantization.
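Layer pruning can be pictured as renumbering the layers that survive. The sketch below is a simplified, hypothetical stand-in (remap_layer_sketch) rather than the code in llama-quant.cpp. It assumes per-layer tensor names contain "blk.N.", that tensors are visited in ascending layer order, and that next_id starts at 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper, not the real remap_layer.
// Returns an empty string for tensors in pruned layers; otherwise returns the
// name with its "blk.N." index replaced by the layer's position among kept layers.
inline std::string remap_layer_sketch(const std::string & name,
                                      const std::vector<int> & prune,
                                      std::map<int, int> & mapped,
                                      int & next_id) {
    assert(next_id >= 0);

    static const std::regex pattern(R"(blk\.(\d+)\.)");
    std::smatch match;
    if (prune.empty() || !std::regex_search(name, match, pattern)) {
        return name; // nothing to prune, or not a per-layer tensor
    }

    const int blk = std::stoi(match[1].str());
    if (std::find(prune.begin(), prune.end(), blk) != prune.end()) {
        return ""; // this layer is dropped
    }

    // Assign the next free index the first time a kept layer is seen.
    auto it = mapped.find(blk);
    if (it == mapped.end()) {
        it = mapped.emplace(blk, next_id++).first;
    }

    std::string out = name;
    out.replace(static_cast<std::size_t>(match.position(1)),
                static_cast<std::size_t>(match.length(1)),
                std::to_string(it->second));
    return out;
}
```

For example, with layer 1 pruned, "blk.2.ffn_up.weight" becomes "blk.1.ffn_up.weight", while tensors without a layer index such as "output.weight" pass through unchanged.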

Usage

Enables the quantization workflow that makes large models usable on consumer hardware. Ollama's model creation process uses this to convert models to quantized formats that fit in available memory while maintaining acceptable quality.
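A rough size estimate shows why quantization matters for memory. The helper below (weight_bytes, a hypothetical name) computes storage for weights packed in fixed-size blocks. The block sizes used in the note after it, 18 bytes per 32 weights for Q4_0 and 34 bytes per 32 weights for Q8_0, follow from a 2-byte scale plus the packed values; the 7-billion-weight model is an illustrative assumption, and metadata and unquantized tensors are ignored.

```cpp
#include <cassert>
#include <cstdint>

// Approximate storage in bytes for n_weights values stored in blocks of
// block_weights values, where each block occupies block_bytes bytes.
inline std::uint64_t weight_bytes(std::uint64_t n_weights,
                                  std::uint64_t block_weights,
                                  std::uint64_t block_bytes) {
    assert(block_weights > 0);
    // Round up so a partial trailing block still counts as a full block.
    const std::uint64_t n_blocks = (n_weights + block_weights - 1) / block_weights;
    return n_blocks * block_bytes;
}
```

Under these assumptions, 7 billion weights take 14.0 GB at F16 (weight_bytes(7000000000, 1, 2)), about 7.4 GB at Q8_0 (weight_bytes(7000000000, 32, 34)), and about 3.9 GB at Q4_0 (weight_bytes(7000000000, 32, 18)).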

Code Reference

Source Location

Repository: Ollama
File: llama/llama.cpp/src/llama-quant.cpp
Lines: 1-1072

Signature

struct tensor_quantization {
    std::string name;
    ggml_type quant = GGML_TYPE_COUNT;
};

struct quantize_state_impl {
    const llama_model                 & model;
    const llama_model_quantize_params * params;
    int n_attention_wv, n_ffn_down, n_ffn_gate, n_ffn_up;
    int i_attention_wv, i_ffn_down, i_ffn_gate, i_ffn_up;
    int n_k_quantized, n_fallback;
    bool has_imatrix, has_output;
};

static void llama_tensor_dequantize_impl(
    ggml_tensor * tensor, std::vector<no_init<float>> & output,
    std::vector<std::thread> & workers, const size_t nelements, const int nthread);

static std::string remap_layer(const std::string & orig_name,
    const std::vector<int> & prune, std::map<int, std::string> & mapped, int & next_id);
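The way a per-tensor override might be applied can be sketched as follows. QuantType, TensorOverride, and pick_type are hypothetical stand-ins so that the example compiles without ggml headers. The sketch assumes that override names are treated as regular expressions and that the last matching entry wins.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Stand-in for ggml_type (hypothetical values, not the real enum).
enum class QuantType { F16, Q4_K, Q6_K, Q8_0, Count };

// Mirrors the shape of tensor_quantization: a name pattern and the type to force.
struct TensorOverride {
    std::string name;
    QuantType   quant = QuantType::Count;
};

// Return the type to use for tensor_name: the default unless an override matches.
// Assumption: patterns are regular expressions and the last match wins.
inline QuantType pick_type(const std::string & tensor_name,
                           QuantType default_type,
                           const std::vector<TensorOverride> & overrides) {
    QuantType result = default_type;
    for (const auto & ov : overrides) {
        assert(ov.quant != QuantType::Count); // Count marks an unset override
        if (std::regex_search(tensor_name, std::regex(ov.name))) {
            result = ov.quant;
        }
    }
    return result;
}
```

With overrides for "ffn_down" (Q6_K) followed by "blk\.0\." (Q8_0), the tensor "blk.3.ffn_down.weight" resolves to Q6_K, "blk.0.ffn_down.weight" resolves to Q8_0 because the later entry wins, and any other tensor keeps the default type.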

Import

#include "llama-quant.h"

I/O Contract

Inputs

Name	Type	Required	Description
model	const llama_model &	Yes	Source model to quantize (held in quantize_state_impl)
params	const llama_model_quantize_params *	Yes	Quantization configuration (target file type, thread count, and related options)
fname_inp	const char *	Yes	Input GGUF file path
fname_out	const char *	Yes	Output GGUF file path

Outputs

Name	Type	Description
GGUF file	file	Quantized model saved to disk

Usage Examples

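#include <cstdio>
#include "llama.h"

// Quantization is invoked through the public API:
llama_model_quantize_params params = llama_model_quantize_default_params();
params.nthread = 8;                         // number of worker threads
params.ftype   = LLAMA_FTYPE_MOSTLY_Q4_K_M; // target file type

// llama_model_quantize returns 0 on success.
if (llama_model_quantize("model-f16.gguf", "model-q4_k_m.gguf", &params) != 0) {
    fprintf(stderr, "quantization failed\n");
}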

Related Pages

Principle:Ollama_Ollama_Model_Loading
