Implementation:InternLM Lmdeploy QuantizationKernels

From Leeroopedia


Knowledge Sources
Domains GPU_Kernels, Quantization
Last Updated 2026-02-07 15:00 GMT

Overview

CUDA kernel API for per-tensor symmetric, block-wise symmetric, and group-wise quantization and dequantization of tensors.

Description

This header declares kernel launchers that convert tensors between floating-point and quantized representations. QuantizeSymm() performs per-tensor symmetric quantization and computes a single scale factor; DequantizeSymm() applies that scale to recover a floating-point tensor. QuantizeSymmBlock() and DequantizeSymmBlock() do the same at block granularity, so each block carries its own scale and quantization error is lower. QuantizeGroupwise() performs group-wise quantization with a separate scale and zero point per group. It also writes a dequantized reconstruction of the input and accepts a buffer of random bits for stochastic rounding.
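
To make the per-tensor symmetric scheme concrete, here is a minimal CPU reference sketch. It is not the turbomind kernel: the names `SymmQuantized`, `quantize_symm_ref`, and `dequantize_symm_ref` are hypothetical, and the signed 8-bit target range of [-127, 127] is an assumption, since the header does not state the quantized data type.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference for per-tensor symmetric quantization.
// Assumption: the target type is signed 8-bit with range [-127, 127].
struct SymmQuantized {
    std::vector<int8_t> values;  // quantized elements
    float               scale;   // single scale for the whole tensor
};

SymmQuantized quantize_symm_ref(const std::vector<float>& src)
{
    assert(!src.empty());
    // The scale maps the largest magnitude in the tensor onto 127.
    float amax = 0.f;
    for (float x : src) {
        amax = std::max(amax, std::fabs(x));
    }
    // Fall back to 1 for an all-zero tensor to avoid dividing by zero.
    const float scale = amax > 0.f ? amax / 127.f : 1.f;

    SymmQuantized q{std::vector<int8_t>(src.size()), scale};
    for (std::size_t i = 0; i < src.size(); ++i) {
        float r = std::nearbyint(src[i] / scale);      // round to nearest
        r       = std::min(std::max(r, -127.f), 127.f);  // clamp to range
        q.values[i] = static_cast<int8_t>(r);
    }
    return q;
}

std::vector<float> dequantize_symm_ref(const SymmQuantized& q)
{
    // Dequantization multiplies every element by the shared scale.
    std::vector<float> out(q.values.size());
    for (std::size_t i = 0; i < q.values.size(); ++i) {
        out[i] = static_cast<float>(q.values[i]) * q.scale;
    }
    return out;
}
```

With round-to-nearest, the round-trip error of each element is at most half the scale, which is why a smaller scale (as in the block-wise variants) gives a tighter reconstruction.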

Usage

Use these kernels when quantizing model weights or activations for reduced-precision inference (e.g., INT8, INT4), or when implementing weight-only quantization with group-wise scale factors.

Code Reference

Source Location

Signature

void QuantizeSymm(Tensor& out, Tensor& scale, const Tensor& src, cudaStream_t st);
void DequantizeSymm(Tensor& out, const Tensor& src, const Tensor& scale, cudaStream_t st);

void QuantizeSymmBlock(Ref<Tensor> out_, Ref<Tensor> scale_, const Tensor& src, cudaStream_t st);
void DequantizeSymmBlock(Ref<Tensor> out_, Ref<Tensor> src_, const Tensor& scale, cudaStream_t st);

void QuantizeGroupwise(Tensor quant, Tensor scales, Tensor zeros,
                       Tensor dequant, Tensor src,
                       Buffer_<unsigned> rbits, int group_size);

Import

#include "src/turbomind/kernels/quantization.h"

I/O Contract

Inputs

Name Type Required Description
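src Tensor Yes Source tensor: floating-point (m, k) for the quantize functions, quantized data for the dequantize functions
scale Tensor Yes (dequantize) Scale factors produced by the matching quantize call
group_size int Yes (groupwise) Number of elements per quantization group
rbits Buffer_<unsigned> Yes (groupwise) Random bits for stochastic rounding
st cudaStream_t Yes (except groupwise) CUDA stream on which the kernel is launched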
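
The rbits input supplies randomness for stochastic rounding. The sketch below shows one common way to consume a 32-bit random word for that purpose: turn it into a uniform value in [0, 1) and add it before taking the floor, so a value rounds up with probability equal to its fractional part. The helper name `stochastic_round_ref` is hypothetical, and how the actual kernel consumes rbits is not documented here, so treat this as an illustration of the technique only.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper illustrating stochastic rounding.
// Assumption: one 32-bit random word is consumed per element.
inline float stochastic_round_ref(float x, uint32_t rbits)
{
    // Map the 32 random bits to a uniform value u in [0, 1).
    const double u = static_cast<double>(rbits) * (1.0 / 4294967296.0);
    // floor(x + u) rounds up with probability frac(x), down otherwise,
    // so the expected value of the result equals x.
    return static_cast<float>(std::floor(static_cast<double>(x) + u));
}
```

Unlike round-to-nearest, this keeps the rounding error zero-mean, which can matter when quantization error accumulates over many elements.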

Outputs

Name Type Description
out / quant Tensor Quantized output tensor (for the dequantize functions, out is the floating-point result)
scale / scales Tensor Per-tensor, per-block, or per-group scale factors
zeros Tensor Per-group zero points (groupwise only)
dequant Tensor Dequantized reconstruction (groupwise only)
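
The four group-wise outputs relate to each other as in the following CPU reference sketch. It is a generic asymmetric min/max scheme, not the turbomind kernel: the names `GroupwiseResult` and `quantize_groupwise_ref`, the 4-bit unsigned range, the exact zero-point convention, and the use of round-to-nearest instead of stochastic rounding are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference for group-wise quantization with zero points.
struct GroupwiseResult {
    std::vector<uint8_t> quant;    // quantized values in [0, 2^bits - 1]
    std::vector<float>   scales;   // one scale per group
    std::vector<float>   zeros;    // one zero point per group
    std::vector<float>   dequant;  // reconstruction of the input
};

GroupwiseResult quantize_groupwise_ref(const std::vector<float>& src, int group_size, int bits = 4)
{
    assert(group_size > 0 && src.size() % static_cast<std::size_t>(group_size) == 0);
    const float       qmax   = static_cast<float>((1 << bits) - 1);
    const std::size_t gs     = static_cast<std::size_t>(group_size);
    const std::size_t groups = src.size() / gs;

    GroupwiseResult r;
    r.quant.resize(src.size());
    r.dequant.resize(src.size());
    r.scales.resize(groups);
    r.zeros.resize(groups);

    for (std::size_t g = 0; g < groups; ++g) {
        const float* x  = src.data() + g * gs;
        const auto   mm = std::minmax_element(x, x + gs);
        const float  lo = *mm.first;
        const float  hi = *mm.second;
        // Spread the group's [lo, hi] range over the available levels.
        const float scale = hi > lo ? (hi - lo) / qmax : 1.f;
        // The zero point shifts lo onto quantized level 0.
        const float zero = std::nearbyint(-lo / scale);
        r.scales[g] = scale;
        r.zeros[g]  = zero;
        for (std::size_t i = 0; i < gs; ++i) {
            float q = std::nearbyint(x[i] / scale) + zero;
            q       = std::min(std::max(q, 0.f), qmax);
            r.quant[g * gs + i]   = static_cast<uint8_t>(q);
            r.dequant[g * gs + i] = (q - zero) * scale;
        }
    }
    return r;
}
```

Comparing dequant against src gives the per-element quantization error directly, which is presumably why the real function returns the reconstruction alongside the quantized tensor.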

Usage Examples

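using namespace turbomind;

// The tensor, buffer, and stream names below are placeholders prepared by the caller.

// Symmetric per-tensor quantization, then dequantization back to floating point
QuantizeSymm(quantized, scale, input_tensor, stream);
DequantizeSymm(output, quantized, scale, stream);

// Group-wise quantization with group_size=128 (this signature takes no stream argument)
QuantizeGroupwise(quant, scales, zeros, dequant, src, rbits, 128);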
