
Implementation: LLMBook-zh/LLMBook-zh.github.io Quantize Func

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, Model_Compression, Inference
Last Updated 2026-02-08 00:00 GMT

Overview

Concrete tool for 8-bit affine (asymmetric, zero-point-based) quantization and dequantization of tensors, provided by the LLMBook repository.

Description

The quantize_func function maps float32 tensors to int8 by dividing by the scale factor S, adding the zero point Z, rounding, and clamping to the integer range [alpha_q, beta_q]. The dequantize_func function reverses the process, computing S * (x_q - Z) to reconstruct approximate float values. Together they demonstrate the fundamental quantization-dequantization round-trip.
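A minimal sketch of the round-trip described above, written against the documented signatures; the actual repository code may differ in details:

```python
import torch
from torch import Tensor


def quantize_func(x: Tensor, scales: float, zero_point: int, n_bits: int = 8) -> Tensor:
    """Map floats onto the signed integer grid: clamp(round(x / S + Z))."""
    alpha_q = -(2 ** (n_bits - 1))      # -128 for 8 bits
    beta_q = 2 ** (n_bits - 1) - 1      # 127 for 8 bits
    x_q = torch.round(x / scales + zero_point)
    return torch.clamp(x_q, alpha_q, beta_q)


def dequantize_func(x_q: Tensor, scales: float, zero_point: int) -> Tensor:
    """Invert the mapping: S * (x_q - Z) reconstructs approximate floats."""
    return scales * (x_q - zero_point)
```

Note that rounding is the only lossy step for in-range values, so the per-element reconstruction error is bounded by S / 2.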

Usage

Use these functions to understand the basic quantization algorithm. For production quantization, use libraries like bitsandbytes or auto-gptq.

Code Reference

Source Location

  • Repository: LLMBook-zh
  • File: code/9.2 量化示例.py
  • Lines: 4-13

Signature

def quantize_func(x: Tensor, scales: float, zero_point: int, n_bits: int = 8) -> Tensor:
    """
    Quantizes a float tensor to integer representation.

    Args:
        x: Input float32 tensor.
        scales: Scale factor S = (beta - alpha) / (beta_q - alpha_q).
        zero_point: Zero point offset Z.
        n_bits: Bit width (default 8).

    Returns:
        Clamped integer tensor in [alpha_q, beta_q].
    """

def dequantize_func(x_q: Tensor, scales: float, zero_point: int) -> Tensor:
    """
    Dequantizes an integer tensor back to float32.

    Args:
        x_q: Quantized integer tensor.
        scales: Scale factor.
        zero_point: Zero point offset.

    Returns:
        Reconstructed float32 tensor.
    """

Import

from quantization import quantize_func, dequantize_func

I/O Contract

Inputs

Name Type Required Description
x Tensor Yes Float32 tensor to quantize
scales float Yes Quantization scale factor
zero_point int Yes Zero point offset
n_bits int No Bit width (default 8)

Outputs

Name Type Description
quantize_func returns Tensor Integer tensor clamped to [alpha_q, beta_q]
dequantize_func returns Tensor Reconstructed float32 tensor

Usage Examples

import torch

from quantization import quantize_func, dequantize_func

# Configuration
alpha, beta = -100.0, 80.0
n_bits = 8
alpha_q, beta_q = -128, 127

# Compute quantization parameters
S = (beta - alpha) / (beta_q - alpha_q)
Z = int((beta * alpha_q - alpha * beta_q) / (beta - alpha))

# Quantize
float_x = torch.tensor([[-1.2136, 28.7341, 8.4974],
                         [-1.9210, -23.7421, 16.2609]])
x_q = quantize_func(float_x, S, Z)
print(f"Quantized: {x_q}")

# Dequantize
x_re = dequantize_func(x_q, S, Z)
print(f"Reconstructed: {x_re}")
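The reconstruction quality of the round-trip can be checked directly. The following self-contained sketch repeats the example with the formulas from the Description (the helper calls are inlined here so it runs standalone) and verifies that the per-element error stays within half a scale step:

```python
import torch

# Recreate the example's quantization parameters.
alpha, beta = -100.0, 80.0
alpha_q, beta_q = -128, 127
S = (beta - alpha) / (beta_q - alpha_q)                    # scale step
Z = int((beta * alpha_q - alpha * beta_q) / (beta - alpha))  # zero point

float_x = torch.tensor([[-1.2136, 28.7341, 8.4974],
                        [-1.9210, -23.7421, 16.2609]])

# Inlined round-trip: quantize then dequantize.
x_q = torch.clamp(torch.round(float_x / S + Z), alpha_q, beta_q)
x_re = S * (x_q - Z)

# For inputs inside [alpha, beta], rounding is the only lossy step,
# so the absolute error per element is bounded by S / 2.
err = (float_x - x_re).abs().max().item()
print(f"max abs error = {err:.4f}, bound S/2 = {S / 2:.4f}")
```

Values outside [alpha, beta] saturate at alpha_q or beta_q, so the S / 2 bound applies only to in-range inputs.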

Related Pages

Implements Principle

Requires Environment
