Principle:Ollama Ollama Model Quantization

Knowledge Sources	Ollama GPTQ GGML Quantization
Domains	Model_Optimization, Compression
Last Updated	2026-02-14 00:00 GMT

Overview

A post-training quantization mechanism that converts model weights from floating-point to lower-bit integer representations, reducing model size and memory requirements while largely preserving inference quality.

Description

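Model quantization compresses neural network weights from full precision (FP32/FP16) to lower-bit representations (Q4_0, Q4_K_M, Q5_K_M, Q8_0, etc.). This reduces model file size by roughly 2-8x, depending on the source and target precision (FP16 to Q8_0 is close to 2x, FP32 to a 4-bit type is close to 8x), and reduces the memory needed to hold the weights by about the same factor, enabling larger models to run on consumer hardware.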
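The size reduction can be checked with simple arithmetic. The sketch below is illustrative only: the 7B parameter count is a round example figure, and the effective bits-per-weight values for Q8_0 and Q4_0 (8.5 and 4.5) assume a 16-bit scale stored with every block of 32 weights. It estimates weight storage only and ignores metadata and any tensors kept at higher precision.

```python
def estimated_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS_7B = 7e9  # illustrative parameter count, not a specific model

# Effective bits per weight. The Q8_0 and Q4_0 values are assumptions that
# include the overhead of one 16-bit scale per 32-element block.
BITS_PER_WEIGHT = {"FP32": 32.0, "FP16": 16.0, "Q8_0": 8.5, "Q4_0": 4.5}

sizes = {
    name: estimated_size_gb(PARAMS_7B, bpw)
    for name, bpw in BITS_PER_WEIGHT.items()
}

for name, gb in sizes.items():
    ratio = sizes["FP32"] / gb
    print(f"{name}: {gb:.1f} GB ({ratio:.1f}x smaller than FP32)")
```

Under these assumptions, a 7B model drops from about 28 GB at FP32 to about 4 GB at Q4_0, which is what brings it within reach of consumer hardware.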

The quantization is performed per-tensor with type selection based on the tensor's role: attention weights, feed-forward layers, embeddings, and output heads may use different quantization types to balance size reduction against quality preservation. Critical layers (embeddings, output) are often kept at higher precision.
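As a rough illustration of role-based type selection, the sketch below picks a quantization type from a tensor's name. The function `choose_quant_type`, the matching rule, and the example tensor names are hypothetical and chosen for illustration; this is not Ollama's actual selection heuristic.

```python
def choose_quant_type(tensor_name: str, default: str = "Q4_K_M") -> str:
    """Pick a quantization type from a tensor's role (illustrative rule only)."""
    # Hypothetical rule: keep embeddings and the output head at higher
    # precision, and use the requested default type for everything else.
    if "embd" in tensor_name or tensor_name.startswith("output"):
        return "Q8_0"
    return default


# Example tensor names are assumptions made for this sketch.
for name in (
    "token_embd.weight",
    "blk.0.attn_q.weight",
    "blk.0.ffn_down.weight",
    "output.weight",
):
    print(f"{name} -> {choose_quant_type(name)}")
```

A real implementation would typically consider more than the name, for example the tensor's shape and its position in the layer stack.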

Usage

Use this principle when deploying models on hardware with limited memory. Quantization is the standard technique for making 7B+ parameter models practical on consumer GPUs and CPUs.

Theoretical Basis

Quantization maps floating-point values to integer representations using block-wise scaling:

$q_{i} = \operatorname{round}\left(\frac{x_{i}}{s}\right), \quad s = \frac{\max_{j} |x_{j}|}{2^{n-1} - 1}$

Where:

$x_{i}$ is the original floating-point weight
$q_{i}$ is the quantized integer
$s$ is the per-block scale factor
$n$ is the bit width
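The formula can be sketched in plain Python. This is a minimal illustration of symmetric block-wise quantization as written above, not the bit-packed layout used by any real format; the function names and the zero-block fallback are assumptions made for the example.

```python
from typing import List, Tuple


def quantize_block(block: List[float], n_bits: int) -> Tuple[List[int], float]:
    """Quantize one block: s = max|x| / (2^(n-1) - 1), q_i = round(x_i / s)."""
    q_max = 2 ** (n_bits - 1) - 1
    max_abs = max(abs(x) for x in block)
    # Assumption: an all-zero block gets scale 1.0 to avoid dividing by zero.
    scale = max_abs / q_max if max_abs > 0 else 1.0
    # Python's round() rounds exact halves to the nearest even integer.
    q = [round(x / scale) for x in block]
    return q, scale


def dequantize_block(q: List[int], scale: float) -> List[float]:
    """Reconstruct approximate weights: x_i is approximately q_i * s."""
    return [qi * scale for qi in q]


def quantize(
    weights: List[float], n_bits: int = 4, block_size: int = 32
) -> List[Tuple[List[int], float]]:
    """Split the weights into fixed-size blocks and quantize each separately."""
    return [
        quantize_block(weights[i:i + block_size], n_bits)
        for i in range(0, len(weights), block_size)
    ]


q, s = quantize_block([1.75, -0.5, 0.25, 0.0], n_bits=4)
print(q, s, dequantize_block(q, s))
```

Because each block has its own scale, a single large weight only coarsens the resolution of the block it sits in rather than that of the whole tensor. The reconstruction error per weight is at most about half the block's scale.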

Common quantization types:

Q4_0: 4-bit quantization with one scale per 32 elements
Q4_K_M: 4-bit k-quant, medium variant (better quality than Q4_0 at a slightly larger size)
Q5_K_M: 5-bit k-quant, medium variant
Q8_0: 8-bit quantization with one scale per 32 elements (highest quality of the types listed, largest size)
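The per-block scale adds a small storage overhead on top of the nominal bit width. The sketch below computes the effective bits per weight for the two simple block formats, assuming each block of 32 elements stores one 16-bit scale; the k-quant types use a more elaborate block layout and are not covered here.

```python
def bits_per_weight(
    block_size: int, bits_per_value: int, scale_bits: int = 16
) -> float:
    """Effective bits per weight for a block format with one scale per block."""
    # scale_bits=16 is an assumption: one half-precision scale per block.
    return (block_size * bits_per_value + scale_bits) / block_size


q4_0_bpw = bits_per_weight(block_size=32, bits_per_value=4)
q8_0_bpw = bits_per_weight(block_size=32, bits_per_value=8)
print(f"Q4_0: {q4_0_bpw} bits per weight")
print(f"Q8_0: {q8_0_bpw} bits per weight")
```

Under this assumption the scale costs half a bit per weight, so Q4_0 comes to 4.5 and Q8_0 to 8.5 effective bits per weight.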

Related Pages

Implemented By

Implementation:Ollama_Ollama_Quantize

Uses Heuristic

Heuristic:Ollama_Ollama_Quantization_Layer_Selection
