Implementation:Ggml org Llama cpp IMatrixCollector

From Leeroopedia
Field | Value
Implementation Name | IMatrixCollector
Doc Type | Wrapper Doc
Topic | Model Quantization
Workflow | Model_Quantization
Category | Calibration Data
Repository | Ggml_org_Llama_cpp

Overview

Description

The IMatrixCollector class implements importance matrix data collection by intercepting tensor operations during model inference. Its primary method, collect_imatrix(), serves as the evaluation callback that the ggml backend scheduler invokes during graph computation; the collector opts in only for matrix multiplication operations (GGML_OP_MUL_MAT and GGML_OP_MUL_MAT_ID). For each weight tensor it accumulates the squared activation values across all forward passes, building a statistical profile of weight importance that later guides how quantization precision is allocated.
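The accumulation step described above can be sketched in isolation. This is a minimal illustration, not the code in imatrix.cpp: SketchStats and accumulate_sq are hypothetical names, and the layout (one running sum per activation column plus a single row counter) is a simplifying assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Simplified stand-in for the Stats struct: one running sum of squares per
// activation column and a single row counter (this layout is an assumption).
struct SketchStats {
    std::vector<float> values;
    int64_t            count = 0;
};

// Hypothetical helper: fold a row-major [n_rows x n_cols] activation batch
// into the running statistics kept for the weight tensor `name`.
void accumulate_sq(std::unordered_map<std::string, SketchStats> & stats,
                   const std::string & name,
                   const float * x, size_t n_rows, size_t n_cols) {
    assert(n_cols > 0);
    SketchStats & e = stats[name];
    if (e.values.empty()) {
        e.values.assign(n_cols, 0.0f); // first time this tensor is seen
    }
    assert(e.values.size() == n_cols); // shape must stay stable across calls
    for (size_t r = 0; r < n_rows; ++r) {
        for (size_t c = 0; c < n_cols; ++c) {
            const float v = x[r * n_cols + c];
            e.values[c] += v * v; // sum of squared activations per column
        }
    }
    e.count += static_cast<int64_t>(n_rows);
}
```

Keying the map by tensor name lets statistics from many forward passes land in the same entry, which matches the role of the m_stats map.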

The class also manages persistence through save_imatrix() and load_imatrix() methods, storing importance data in GGUF format with metadata keys for dataset provenance, chunk counts, and chunk sizes.

Usage

The IMatrixCollector is instantiated by the llama-imatrix tool, configured with runtime parameters, and registered as an evaluation callback. During inference over calibration text, it automatically collects importance statistics for all relevant weight tensors.

Code Reference

Source Location

  • Class definition: tools/imatrix/imatrix.cpp (lines 60-78)
  • Metadata constants: tools/imatrix/imatrix.cpp (lines 36-38)
  • collect_imatrix method: tools/imatrix/imatrix.cpp (lines 219-340+)

Signature

class IMatrixCollector {
public:
    IMatrixCollector() = default;
    void set_params(common_params params) { m_params = std::move(params); }
    bool collect_imatrix(struct ggml_tensor * t, bool ask, void * user_data);
    void save_imatrix_legacy(int32_t ncall = -1) const;
    void save_imatrix(int32_t n_chunk = -1) const;
    bool load_imatrix_legacy(const char * fname);
    bool load_imatrix(const char * file_name);
    const std::unordered_map<std::string, Stats> & get_mstats() const { return m_stats; }
private:
    std::unordered_map<std::string, Stats> m_stats;
    common_params                          m_params;
    std::mutex                             m_mutex;
    std::vector<std::string>               m_datasets;
    int32_t                                m_last_chunk = 0;
    std::vector<char>                      m_src1_data;
    std::vector<char>                      m_ids;  // the expert ids from ggml_mul_mat_id
};

Supporting data structures:

struct Stats {
    std::vector<float>   values;
    std::vector<int64_t> counts;
};

// Metadata keys for imatrix GGUF files
static const char * const LLM_KV_IMATRIX_DATASETS    = "imatrix.datasets";
static const char * const LLM_KV_IMATRIX_CHUNK_COUNT = "imatrix.chunk_count";
static const char * const LLM_KV_IMATRIX_CHUNK_SIZE  = "imatrix.chunk_size";

Import

#include "common.h"
#include "llama.h"
#include "gguf.h"

I/O Contract

Direction | Type | Description
Input (t) | struct ggml_tensor * | The tensor operation being evaluated; t->src[0] contains the weight tensor, t->src[1] contains the activation tensor
Input (ask) | bool | When true, the scheduler asks if the collector is interested in this tensor's data; when false, actual data collection occurs
Input (user_data) | void * | User-provided context pointer (unused in current implementation)
Output | bool | When ask=true: returns true if the collector wants data for this tensor. When ask=false: returns true on success.
Side Effect | m_stats map | Accumulated squared activation values and counts per tensor name
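The two-phase meaning of ask can be illustrated with a small mock driver. This is a sketch under stated assumptions, not the ggml scheduler itself: MockNode, mock_eval_callback, run_graph, and record_matmuls are hypothetical names, and stopping when the data phase returns false is an assumption about how a caller might treat failure.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a graph node; the real callback receives a ggml_tensor.
struct MockNode {
    std::string name;
    bool        is_matmul;
};

// Same shape as the callback in the contract: (node, ask, user_data) -> bool.
using mock_eval_callback = bool (*)(const MockNode * node, bool ask, void * user_data);

// Hypothetical driver: ask first, then deliver data only for accepted nodes.
// Returns the number of nodes for which the data phase ran successfully.
int run_graph(const std::vector<MockNode> & nodes, mock_eval_callback cb, void * user_data) {
    assert(cb != nullptr);
    int delivered = 0;
    for (const MockNode & node : nodes) {
        if (!cb(&node, /*ask=*/true, user_data)) {
            continue; // callback is not interested in this node
        }
        if (!cb(&node, /*ask=*/false, user_data)) {
            break;    // callback reported failure; stop early (assumption)
        }
        ++delivered;
    }
    return delivered;
}

// Example callback: accept matmul nodes, record their names in the data phase.
bool record_matmuls(const MockNode * node, bool ask, void * user_data) {
    if (ask) {
        return node->is_matmul;
    }
    static_cast<std::vector<std::string> *>(user_data)->push_back(node->name);
    return true;
}
```

Splitting the call into an interest query and a data delivery lets the caller skip any work needed to expose tensor data for nodes the callback would ignore anyway.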

Tensor filtering logic (when ask=true):

  • Always collects GGML_OP_MUL_MAT_ID operations (MoE expert routing)
  • For GGML_OP_MUL_MAT: requires batch size >= 16 tokens, F32 activations, and tensor name starting with "blk." or being "output.weight" (if process_output is enabled)
  • Rejects all other operation types
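The filtering rules above can be written as a single predicate. The sketch below uses hypothetical mock types (MockOp, MockType, MockMatmul) in place of the ggml enums and tensor struct, so it shows only the decision logic listed here, not the actual body of collect_imatrix().

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-ins for the ggml op and type enums.
enum class MockOp   { MulMat, MulMatId, Add };
enum class MockType { F32, F16 };

// Hypothetical flattened view of the fields the filter looks at.
struct MockMatmul {
    MockOp      op;
    std::string weight_name;     // name of src[0]
    MockType    activation_type; // type of src[1]
    int64_t     n_tokens;        // batch size of src[1]
};

// Returns true when the collector would ask for this tensor's data.
bool wants_tensor(const MockMatmul & t, bool process_output) {
    assert(t.n_tokens >= 0);
    if (t.op == MockOp::MulMatId) {
        return true;  // MoE expert matmuls are always collected
    }
    if (t.op != MockOp::MulMat) {
        return false; // every other operation type is rejected
    }
    if (t.n_tokens < 16 || t.activation_type != MockType::F32) {
        return false; // small batches and non-F32 activations are skipped
    }
    const bool is_block  = t.weight_name.rfind("blk.", 0) == 0; // prefix match
    const bool is_output = process_output && t.weight_name == "output.weight";
    return is_block || is_output;
}
```

For instance, a 32-token F32 matmul against "output.weight" is accepted only when process_output is true, while the same matmul against a "blk." tensor is accepted either way.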

Usage Examples

Example 1: Command-line imatrix generation

# Generate importance matrix from calibration text in 512-token chunks
# (-c sets the context size, which determines the chunk size)
./llama-imatrix \
    -m model-f16.gguf \
    -f calibration-text.txt \
    -o imatrix.gguf \
    --output-format gguf \
    -c 512

Example 2: Using imatrix with quantization

# First generate the importance matrix
./llama-imatrix -m model-f16.gguf -f wiki.train.raw -o imatrix.gguf

# Then quantize using the importance matrix
./llama-quantize --imatrix imatrix.gguf model-f16.gguf model-iq4_xs.gguf IQ4_XS

Example 3: Programmatic usage of collect_imatrix callback

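IMatrixCollector collector;
collector.set_params(params);

// Forward each scheduler callback to the collector passed through user_data
auto callback = [](struct ggml_tensor * t, bool ask, void * user_data) -> bool {
    return static_cast<IMatrixCollector *>(user_data)->collect_imatrix(t, ask, user_data);
};

// Register as eval callback before the model and context are created
params.cb_eval           = callback;
params.cb_eval_user_data = &collector;

// ... run inference over the calibration text ...

// After inference completes, save the collected data
collector.save_imatrix(n_chunks_processed);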

Example 4: Incremental collection with previously saved data

# Load a previously computed imatrix and continue collecting
./llama-imatrix \
    -m model-f16.gguf \
    -f additional-text.txt \
    --in-file imatrix-prev.gguf \
    -o imatrix-combined.gguf
