Implementation:Ollama Ollama Llama Quant

From Leeroopedia
Knowledge Sources
Domains: LLM Inference, Quantization
Last Updated: 2025-02-15 00:00 GMT

Overview

Implements model quantization: converts full-precision model tensors to lower-precision formats (the Q4, Q5, Q8, and IQ families, among others) to reduce memory usage and speed up inference.
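To make "lower-precision formats" concrete, the following is a minimal sketch of block-wise 8-bit quantization in the spirit of ggml's Q8_0, where a block of 32 values shares one scale. It is an illustration only, not ggml's code: BlockQ8, quantize_q8, and dequantize_q8 are hypothetical names, and the scale is kept as a 32-bit float for simplicity, whereas ggml stores a 16-bit one.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical block layout: 32 weights share one scale factor.
constexpr std::size_t kBlockSize = 32;

struct BlockQ8 {
    float  scale;            // per-block scale d
    int8_t q[kBlockSize];    // quantized values, x ~= d * q
};

// Quantize n floats (n must be a multiple of kBlockSize) into 8-bit blocks.
std::vector<BlockQ8> quantize_q8(const float * x, std::size_t n) {
    assert(n % kBlockSize == 0);
    std::vector<BlockQ8> out(n / kBlockSize);
    for (std::size_t b = 0; b < out.size(); ++b) {
        const float * src = x + b * kBlockSize;
        float amax = 0.0f;   // largest magnitude in this block
        for (std::size_t i = 0; i < kBlockSize; ++i) {
            amax = std::fmax(amax, std::fabs(src[i]));
        }
        const float d  = amax / 127.0f;               // map [-amax, amax] onto [-127, 127]
        const float id = d != 0.0f ? 1.0f / d : 0.0f; // all-zero blocks keep q = 0
        out[b].scale = d;
        for (std::size_t i = 0; i < kBlockSize; ++i) {
            out[b].q[i] = static_cast<int8_t>(std::lround(src[i] * id));
        }
    }
    return out;
}

// Reconstruct approximate floats from the blocks.
std::vector<float> dequantize_q8(const std::vector<BlockQ8> & blocks) {
    std::vector<float> out;
    out.reserve(blocks.size() * kBlockSize);
    for (const BlockQ8 & blk : blocks) {
        for (std::size_t i = 0; i < kBlockSize; ++i) {
            out.push_back(blk.scale * static_cast<float>(blk.q[i]));
        }
    }
    return out;
}
```

Each reconstructed value differs from the original by at most about half a quantization step (scale / 2), which is the accuracy cost paid for the smaller storage.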

Description

The file defines two helper structures. tensor_quantization holds a per-tensor override: a tensor name and the ggml_type to force for it. quantize_state_impl carries state through the quantization run: the model and parameters, total counts (n_*) and running indices (i_*) for the attention V and the FFN down, gate, and up tensors, counters for k-quantized and fallback tensors, and flags recording whether an importance matrix and an output tensor are present.

The main quantization function works in four steps: it reads each tensor from the source GGUF file, selects a target quantization type based on the tensor's role, quantizes the data with ggml's quantization functions, and writes the quantized tensor to a new GGUF file. It also supports an importance matrix (imatrix) for better quantization quality, layer pruning via remap_layer, and multi-threaded quantization.
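The interplay between role-based type selection, running indices, and per-tensor overrides can be sketched as follows. This is a simplified illustration under stated assumptions, not the logic in llama-quant.cpp: QuantType, TensorOverride, QuantState, wants_more_bits, and pick_type are hypothetical names, the "more bits for the first and last eighth of the layers" policy is an assumed example rule, and overrides are matched by plain substring.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for ggml_type values; only the names matter here.
enum class QuantType { Q4_K, Q6_K, Q8_0, COUNT };

// Mirrors the shape of tensor_quantization: a name pattern plus a forced type.
struct TensorOverride {
    std::string name;
    QuantType   quant = QuantType::COUNT;
};

// Reduced version of quantize_state_impl: totals (n_*) and running indices (i_*).
struct QuantState {
    int n_attention_wv = 0, n_ffn_down = 0;
    int i_attention_wv = 0, i_ffn_down = 0;
};

// Assumed policy for illustration: the first and last eighth of the layers keep more bits.
static bool wants_more_bits(int i_layer, int n_layers) {
    return i_layer < n_layers / 8 || i_layer >= 7 * n_layers / 8;
}

QuantType pick_type(const std::string & name, QuantType base,
                    const std::vector<TensorOverride> & overrides, QuantState & st) {
    QuantType chosen = base;
    // Role-based default; the running index advances for every tensor of that role.
    if (name.find("attn_v.weight") != std::string::npos) {
        if (wants_more_bits(st.i_attention_wv, st.n_attention_wv)) chosen = QuantType::Q6_K;
        ++st.i_attention_wv;
    } else if (name.find("ffn_down") != std::string::npos) {
        if (wants_more_bits(st.i_ffn_down, st.n_ffn_down)) chosen = QuantType::Q6_K;
        ++st.i_ffn_down;
    }
    // Explicit per-tensor overrides win over the role-based choice.
    for (const TensorOverride & o : overrides) {
        if (o.quant != QuantType::COUNT && name.find(o.name) != std::string::npos) {
            chosen = o.quant;
        }
    }
    return chosen;
}
```

Applying overrides after the role-based step keeps the running indices consistent even when a tensor's type is forced, so later tensors of the same role still see the correct layer position.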

Usage

Enables the quantization workflow that makes large models usable on consumer hardware. Ollama's model creation process uses this to convert models to quantized formats that fit in available memory while maintaining acceptable quality.
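A rough way to see why quantized models fit in less memory is to count bytes per block of weights. The sketch below is an estimate only: BlockFormat and tensor_bytes are hypothetical names, the block sizes reflect our understanding of ggml's F16, Q8_0, and Q4_0 layouts, and the 7-billion-weight figure is an arbitrary example. It ignores GGUF metadata and the fact that mixed presets such as Q4_K_M use several types within one file.

```cpp
#include <cassert>
#include <cstdint>

// Storage cost of one block of weights in a given format.
struct BlockFormat {
    std::uint64_t weights_per_block;
    std::uint64_t bytes_per_block;
};

const BlockFormat kF16  = {1, 2};    // one 16-bit float per weight
const BlockFormat kQ8_0 = {32, 34};  // 2-byte scale + 32 x 8-bit values  (8.5 bits/weight)
const BlockFormat kQ4_0 = {32, 18};  // 2-byte scale + 32 x 4-bit values  (4.5 bits/weight)

// Bytes needed to store n_weights values; n_weights must fill whole blocks.
inline std::uint64_t tensor_bytes(std::uint64_t n_weights, BlockFormat f) {
    assert(n_weights % f.weights_per_block == 0);
    return (n_weights / f.weights_per_block) * f.bytes_per_block;
}
```

For the example of 7,000,000,000 weights this gives 14.0 GB in F16, about 7.44 GB in Q8_0, and about 3.94 GB in Q4_0 (decimal gigabytes).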

Code Reference

Source Location

  • Repository: Ollama
  • File: llama/llama.cpp/src/llama-quant.cpp
  • Lines: 1-1072

Signature

struct tensor_quantization {
    std::string name;
    ggml_type quant = GGML_TYPE_COUNT;
};

struct quantize_state_impl {
    const llama_model                 & model;
    const llama_model_quantize_params * params;
    int n_attention_wv, n_ffn_down, n_ffn_gate, n_ffn_up;
    int i_attention_wv, i_ffn_down, i_ffn_gate, i_ffn_up;
    int n_k_quantized, n_fallback;
    bool has_imatrix, has_output;
};

static void llama_tensor_dequantize_impl(
    ggml_tensor * tensor, std::vector<no_init<float>> & output,
    std::vector<std::thread> & workers, const size_t nelements, const int nthread);

static std::string remap_layer(const std::string & orig_name,
    const std::vector<int> & prune, std::map<int, std::string> & mapped, int & next_id);
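
The remap_layer signature suggests how layer pruning can renumber tensors. The sketch below shows one plausible behaviour behind that parameter list and is not a copy of the implementation: remap_layer_sketch is a hypothetical name, tensor names are assumed to follow the blk.N. convention, an empty return value is assumed to mean "skip this tensor", and tensors are assumed to be visited in increasing layer order with next_id starting at 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <regex>
#include <string>
#include <vector>

std::string remap_layer_sketch(const std::string & orig_name, const std::vector<int> & prune,
                               std::map<int, std::string> & mapped, int & next_id) {
    if (prune.empty()) {
        return orig_name;                 // nothing to prune, names stay unchanged
    }

    static const std::regex pattern(R"(blk\.(\d+)\.)");
    std::smatch match;
    if (!std::regex_search(orig_name, match, pattern)) {
        return orig_name;                 // not a per-layer tensor (e.g. embeddings)
    }

    const int blk = std::stoi(match[1].str());
    if (mapped.count(blk) == 0) {
        const bool pruned = std::find(prune.begin(), prune.end(), blk) != prune.end();
        // Pruned layers map to "", kept layers receive the next consecutive id.
        mapped[blk] = pruned ? std::string() : std::to_string(next_id++);
    }
    if (mapped[blk].empty()) {
        return "";                        // caller drops tensors of pruned layers
    }

    // Replace only the layer number inside the original name.
    std::string new_name = orig_name;
    new_name.replace(static_cast<std::size_t>(match.position(1)),
                     static_cast<std::size_t>(match.length(1)), mapped[blk]);
    return new_name;
}
```

With prune = {1}, blk.0 tensors keep their name, blk.1 tensors are dropped, and blk.2 tensors are renamed to blk.1, so the output model has contiguous layer numbers.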

Import

#include "llama-quant.h"

I/O Contract

Inputs

Name       Type                                 Required  Description
model      const llama_model &                  Yes       Source model to quantize
params     const llama_model_quantize_params *  Yes       Quantization configuration
fname_inp  const char *                         Yes       Input GGUF file path
fname_out  const char *                         Yes       Output GGUF file path

Outputs

Name       Type  Description
GGUF file  file  Quantized model saved to disk

Usage Examples

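#include "llama.h"  // public API header that declares llama_model_quantize

// Quantization is invoked through the public API:
llama_model_quantize_params params = llama_model_quantize_default_params();
params.nthread = 8;                          // worker threads
params.ftype   = LLAMA_FTYPE_MOSTLY_Q4_K_M;  // target quantization preset

// llama_model_quantize returns 0 on success and non-zero on failure.
const uint32_t rc = llama_model_quantize("model-f16.gguf", "model-q4_k_m.gguf", &params);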
