Implementation:Ggml org Llama cpp IMatrixCollector

Field	Value
Implementation Name	IMatrixCollector
Doc Type	Wrapper Doc
Topic	Model Quantization
Workflow	Model_Quantization
Category	Calibration Data
Repository	Ggml_org_Llama_cpp

Overview

Description

The IMatrixCollector class implements importance matrix data collection by intercepting tensor operations during model inference. Its primary method, collect_imatrix(), serves as the evaluation callback that the ggml backend scheduler invokes for graph nodes; the collector accepts only matrix multiplication operations. For each weight tensor, it accumulates the squared values of the activations multiplied against that weight, summed across all forward passes. The resulting statistical profile of weight importance is later used to guide quantization precision allocation.
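
The accumulation step can be illustrated with a minimal sketch. It is not the code in imatrix.cpp: the helper accumulate_sq is hypothetical, it uses a single counter per tensor, and it ignores the MoE expert handling. It only shows the idea of summing squared activations per input column into a Stats-like entry.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Same layout as the Stats struct documented on this page.
struct Stats {
    std::vector<float>   values;  // running sum of x[j]^2 per input column j
    std::vector<int64_t> counts;  // number of rows (tokens) accumulated so far
};

// Hypothetical helper (not part of llama.cpp): add one row-major activation
// matrix of n_rows tokens by n_cols features to the entry for `name`.
// Returns false if the column count disagrees with earlier calls.
bool accumulate_sq(std::unordered_map<std::string, Stats> & stats,
                   const std::string & name,
                   const float * x, size_t n_rows, size_t n_cols) {
    Stats & e = stats[name];
    if (e.values.empty()) {
        e.values.assign(n_cols, 0.0f);
        e.counts.assign(1, 0);  // assumption: one shared counter per tensor
    } else if (e.values.size() != n_cols) {
        return false;
    }
    for (size_t r = 0; r < n_rows; ++r) {
        for (size_t j = 0; j < n_cols; ++j) {
            const float v = x[r * n_cols + j];
            e.values[j] += v * v;
        }
    }
    e.counts[0] += static_cast<int64_t>(n_rows);
    return true;
}
```

Columns that consistently carry large activations end up with large sums, which is the signal a quantizer can use to decide where precision matters most.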

The class also manages persistence through save_imatrix() and load_imatrix() methods, storing importance data in GGUF format with metadata keys for dataset provenance, chunk counts, and chunk sizes.

Usage

The IMatrixCollector is instantiated by the llama-imatrix tool, configured with runtime parameters, and registered as an evaluation callback. During inference over calibration text, it automatically collects importance statistics for all relevant weight tensors.

Code Reference

Source Location

Class definition: tools/imatrix/imatrix.cpp (lines 60-78)
Metadata constants: tools/imatrix/imatrix.cpp (lines 36-38)
collect_imatrix method: tools/imatrix/imatrix.cpp (lines 219-340+)

Signature

class IMatrixCollector {
public:
    IMatrixCollector() = default;
    void set_params(common_params params) { m_params = std::move(params); }
    bool collect_imatrix(struct ggml_tensor * t, bool ask, void * user_data);
    void save_imatrix_legacy(int32_t ncall = -1) const;
    void save_imatrix(int32_t n_chunk = -1) const;
    bool load_imatrix_legacy(const char * fname);
    bool load_imatrix(const char * file_name);
    const std::unordered_map<std::string, Stats> & get_mstats() const { return m_stats; }
private:
    std::unordered_map<std::string, Stats> m_stats;
    common_params                          m_params;
    std::mutex                             m_mutex;
    std::vector<std::string>               m_datasets;
    int32_t                                m_last_chunk = 0;
    std::vector<char>                      m_src1_data;
    std::vector<char>                      m_ids;  // the expert ids from ggml_mul_mat_id
};

Supporting data structures:

struct Stats {
    std::vector<float>   values;
    std::vector<int64_t> counts;
};

// Metadata keys for imatrix GGUF files
static const char * const LLM_KV_IMATRIX_DATASETS    = "imatrix.datasets";
static const char * const LLM_KV_IMATRIX_CHUNK_COUNT = "imatrix.chunk_count";
static const char * const LLM_KV_IMATRIX_CHUNK_SIZE  = "imatrix.chunk_size";

Import

#include "common.h"
#include "llama.h"
#include "gguf.h"

I/O Contract

Direction	Type	Description
Input (t)	`struct ggml_tensor *`	The tensor operation being evaluated; `t->src[0]` contains the weight tensor, `t->src[1]` contains the activation tensor
Input (ask)	`bool`	When true, the scheduler asks if the collector is interested in this tensor's data; when false, actual data collection occurs
Input (user_data)	`void *`	User-provided context pointer (unused in current implementation)
Output	`bool`	When `ask=true`: returns true if the collector wants data for this tensor. When `ask=false`: returns true on success.
Side Effect	`m_stats` map	Accumulated squared activation values and counts per tensor name
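
The two-phase contract in the table (ask first, then observe) can be sketched with a mock scheduler. Everything here is illustrative: FakeNode, Observer, observe and run_graph are hypothetical stand-ins, not ggml types or functions. Only the shape of the callback, (tensor, ask, user_data) returning bool, follows the signature documented above.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a graph node; real code receives ggml_tensor *.
struct FakeNode {
    std::string name;
    bool        is_matmul;
};

// Same shape as the documented callback: (tensor, ask, user_data) -> bool.
using eval_cb = bool (*)(FakeNode * t, bool ask, void * user_data);

// Hypothetical observer state passed through user_data.
struct Observer {
    std::vector<std::string> seen;
};

bool observe(FakeNode * t, bool ask, void * user_data) {
    if (ask) {
        return t->is_matmul;  // phase 1: declare interest in this node
    }
    // phase 2: the node's data is available, record it
    static_cast<Observer *>(user_data)->seen.push_back(t->name);
    return true;              // returning false would abort the run
}

// Mock scheduler loop: ask, compute (elided), then hand over wanted nodes.
bool run_graph(std::vector<FakeNode> & nodes, eval_cb cb, void * user_data) {
    for (FakeNode & n : nodes) {
        const bool wanted = cb(&n, /*ask=*/true, user_data);
        // ... the node would be computed here ...
        if (wanted && !cb(&n, /*ask=*/false, user_data)) {
            return false;
        }
    }
    return true;
}
```

Splitting the call in two lets the scheduler skip the cost of exposing tensor data for nodes the observer does not care about.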

Tensor filtering logic (when ask=true):

Always collects GGML_OP_MUL_MAT_ID operations (MoE expert routing)
For GGML_OP_MUL_MAT: requires batch size >= 16 tokens, F32 activations, and tensor name starting with "blk." or being "output.weight" (if process_output is enabled)
Rejects all other operation types
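
The filtering rules listed above can be written as a single predicate. This is a sketch under simplifying assumptions: Op, DType, NodeInfo and wants_tensor are hypothetical names used for illustration, whereas the real method inspects a ggml_tensor and its sources directly.

```cpp
#include <cstdint>
#include <string>

// Hypothetical, simplified view of the fields the filter looks at.
enum class Op    { MulMat, MulMatId, Other };
enum class DType { F32, F16 };

struct NodeInfo {
    Op          op;
    std::string weight_name;  // name of the weight tensor (src[0])
    int64_t     n_tokens;     // batch size of the activation tensor (src[1])
    DType       act_type;     // element type of the activation tensor
};

// Returns true if the collector would ask for this node's data.
bool wants_tensor(const NodeInfo & n, bool process_output) {
    if (n.op == Op::MulMatId) {
        return true;   // MoE expert routing is always collected
    }
    if (n.op != Op::MulMat) {
        return false;  // every other operation type is rejected
    }
    if (n.n_tokens < 16 || n.act_type != DType::F32) {
        return false;  // small batches and non-F32 activations are skipped
    }
    const bool is_block  = n.weight_name.compare(0, 4, "blk.") == 0;
    const bool is_output = process_output && n.weight_name == "output.weight";
    return is_block || is_output;
}
```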

Usage Examples

Example 1: Command-line imatrix generation

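# Generate importance matrix from calibration text
# (-c sets the context size, i.e. the number of tokens per chunk;
#  --chunk would instead skip ahead to a starting chunk)
./llama-imatrix \
    -m model-f16.gguf \
    -f calibration-text.txt \
    -o imatrix.gguf \
    --output-format gguf \
    -c 512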

Example 2: Using imatrix with quantization

# First generate the importance matrix
./llama-imatrix -m model-f16.gguf -f wiki.train.raw -o imatrix.gguf

# Then quantize using the importance matrix
./llama-quantize --imatrix imatrix.gguf model-f16.gguf model-iq4_xs.gguf IQ4_XS

Example 3: Programmatic usage of collect_imatrix callback

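IMatrixCollector collector;

// Captureless lambda, so it converts to the plain function pointer
// expected by the eval callback field.
auto callback = [](struct ggml_tensor * t, bool ask, void * user_data) -> bool {
    return static_cast<IMatrixCollector *>(user_data)->collect_imatrix(t, ask, user_data);
};

// Register as eval callback before the context is created from params
params.cb_eval           = callback;
params.cb_eval_user_data = &collector;
collector.set_params(params);

// ... create the model and context from params, then run inference
//     over the calibration text ...

// After inference completes, save the collected data
collector.save_imatrix(n_chunks_processed);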

Example 4: Incremental collection with previously saved data

# Load a previously computed imatrix and continue collecting
./llama-imatrix \
    -m model-f16.gguf \
    -f additional-text.txt \
    --in-file imatrix-prev.gguf \
    -o imatrix-combined.gguf
