Principle:Ollama Ollama Model Quantization

From Leeroopedia
Knowledge Sources
Domains Model_Optimization, Compression
Last Updated 2026-02-14 00:00 GMT

Overview

A post-training quantization mechanism that reduces model weights from floating-point to lower-bit integer representations, decreasing model size and memory requirements while aiming to keep inference quality close to that of the original model.

Description

Model quantization compresses neural network weights from full precision (FP32/FP16) to lower-bit representations (Q4_0, Q4_K_M, Q5_K_M, Q8_0, etc.). This reduces model file size by roughly 2-8x, depending on the source and target precision, and reduces the memory needed to hold the weights by about the same factor, enabling larger models to run on consumer hardware.

Quantization is performed per tensor, with the quantization type selected according to the tensor's role: attention weights, feed-forward layers, embeddings, and output heads may use different quantization types to balance size reduction against quality preservation. Critical layers (embeddings, output) are often kept at higher precision.
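The role-based selection described above can be sketched as a small lookup. This is an illustration only: the tensor names, the `choose_quant_type` helper, and the specific policy (Q8_0 for embeddings and output, Q4_K_M elsewhere) are hypothetical assumptions, not Ollama's actual selection rules.

```python
from typing import Dict, List

# Hypothetical name prefixes for tensors treated as quality-critical.
SENSITIVE_PREFIXES = ("embedding.", "output.")


def choose_quant_type(
    tensor_name: str,
    default_type: str = "Q4_K_M",
    sensitive_type: str = "Q8_0",
) -> str:
    """Pick a quantization type from the tensor's role (assumed policy)."""
    if tensor_name.startswith(SENSITIVE_PREFIXES):
        return sensitive_type  # keep critical layers at higher precision
    return default_type        # everything else gets the smaller type


def plan_quantization(tensor_names: List[str]) -> Dict[str, str]:
    """Map every tensor name to the type it would be quantized to."""
    return {name: choose_quant_type(name) for name in tensor_names}


# Illustrative tensor names, not taken from a real model file.
example_plan = plan_quantization([
    "embedding.weight",
    "layers.0.attention.wq.weight",
    "layers.0.feed_forward.w1.weight",
    "output.weight",
])
```

In this sketch the attention and feed-forward tensors receive the default type, while the embedding and output tensors receive the higher-precision one.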

Usage

Use this principle when deploying models on hardware with limited memory. Quantization is the standard technique for making 7B+ parameter models practical on consumer GPUs and CPUs.

Theoretical Basis

Quantization maps floating-point values to integer representations using block-wise scaling:

$$q_i = \operatorname{round}\left(\frac{x_i}{s}\right), \qquad s = \frac{\max(|x|)}{2^{n-1} - 1}$$

Where:

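  • $x_i$ is the original floating-point weight
  • $q_i$ is the quantized integer
  • $s$ is the per-block scale factor, computed from the largest absolute weight in the block
  • $n$ is the bit width, so quantized values lie in the range $[-(2^{n-1} - 1),\ 2^{n-1} - 1]$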
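A minimal pure-Python sketch of the formula above, assuming symmetric quantization and a block length of 32. The function names and the `BLOCK_SIZE` constant are illustrative; real quantization formats pack the integers and scales into a binary layout that is not shown here.

```python
from typing import List, Tuple

BLOCK_SIZE = 32  # assumed block length


def quantize_block(block: List[float], n_bits: int) -> Tuple[List[int], float]:
    """Quantize one block: s = max(|x|) / (2^(n-1) - 1), q_i = round(x_i / s)."""
    q_max = 2 ** (n_bits - 1) - 1  # 7 for n = 4, 127 for n = 8
    abs_max = max(abs(x) for x in block)
    if abs_max == 0.0:
        return [0] * len(block), 0.0  # all-zero block needs no scale
    scale = abs_max / q_max
    return [round(x / scale) for x in block], scale


def dequantize_block(q: List[int], scale: float) -> List[float]:
    """Approximate the original weights: x_i is recovered as q_i * s."""
    return [qi * scale for qi in q]


def quantize(weights: List[float], n_bits: int,
             block_size: int = BLOCK_SIZE) -> List[Tuple[List[int], float]]:
    """Split the weights into blocks and quantize each with its own scale."""
    return [
        quantize_block(weights[start:start + block_size], n_bits)
        for start in range(0, len(weights), block_size)
    ]
```

Because each block carries its own scale, a single large weight only coarsens the blocks that contain it. For a nonzero block, the reconstruction error of each weight is at most about half the block's scale.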

Common quantization types:

  • Q4_0: 4-bit quantization with one scale per 32 elements
  • Q4_K_M: 4-bit k-quant, medium variant (generally better quality than Q4_0)
  • Q5_K_M: 5-bit k-quant, medium variant
  • Q8_0: 8-bit quantization (highest quality of the types listed here, largest size)
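As a rough worked example of the storage cost of the simple block formats, the sketch below computes effective bits per weight. It assumes one 16-bit scale stored per 32-element block, an assumption that covers only the Q4_0 and Q8_0 style layouts; the k-quant types use a different layout that this does not model. The helper names are hypothetical.

```python
def bits_per_weight(n_bits: int, block_size: int = 32, scale_bits: int = 16) -> float:
    """Effective bits per weight, counting the per-block scale overhead."""
    return (n_bits * block_size + scale_bits) / block_size


def compression_ratio(source_bits: int, target_bpw: float) -> float:
    """How many times smaller the weights become versus the source precision."""
    return source_bits / target_bpw


def estimated_size_gib(n_params: int, bpw: float) -> float:
    """Approximate weight storage in GiB, ignoring metadata and unquantized tensors."""
    return n_params * bpw / 8 / 2 ** 30


q4_bpw = bits_per_weight(4)  # 4.5 bits per weight under the stated assumption
q8_bpw = bits_per_weight(8)  # 8.5 bits per weight under the stated assumption
```

Under these assumptions, going from FP32 to the 4-bit layout gives about a 7.1x reduction, and going from FP16 to the 8-bit layout gives about 1.9x.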

Related Pages

Implemented By

Uses Heuristic
