Implementation:InternLM Lmdeploy QuantizationKernels
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, Quantization |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
CUDA kernel API for symmetric and group-wise tensor quantization and dequantization operations.
Description
This header declares kernel launchers that convert tensors between floating-point and quantized representations:
- QuantizeSymm() performs per-tensor symmetric quantization and computes a single scale factor.
- DequantizeSymm() reverses that process, reconstructing floating-point values from the quantized tensor and its scale.
- QuantizeSymmBlock() and DequantizeSymmBlock() do the same at block granularity, so each block gets its own scale and quantization error is reduced.
- QuantizeGroupwise() performs group-wise quantization with a separate scale and zero point per group. It also writes a dequantized reconstruction and accepts random bits for stochastic rounding. Unlike the other functions, its declared signature takes no CUDA stream.
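To make the difference between per-tensor and block granularity concrete, here is a minimal CPU sketch of symmetric quantization. It is not the lmdeploy kernel: the names `QuantizeSymmRef`, `DequantizeSymmRef`, and `SymmQuantized` are hypothetical, and the int8 target type and the absmax/127 scale rule are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference (NOT the lmdeploy kernel). Assumes int8 output
// and scale = max|x| / 127 for each run of `block` consecutive values.
// Passing block == src.size() gives the per-tensor case.
struct SymmQuantized {
    std::vector<int8_t> q;      // quantized values
    std::vector<float>  scale;  // one scale per block
    std::size_t         block;  // elements per block
};

SymmQuantized QuantizeSymmRef(const std::vector<float>& src, std::size_t block)
{
    assert(block > 0);
    SymmQuantized r{std::vector<int8_t>(src.size()), {}, block};
    for (std::size_t b = 0; b < src.size(); b += block) {
        const std::size_t e = std::min(b + block, src.size());
        // Largest magnitude in this block determines the scale.
        float amax = 0.f;
        for (std::size_t i = b; i < e; ++i) {
            amax = std::max(amax, std::fabs(src[i]));
        }
        const float s = amax > 0.f ? amax / 127.f : 1.f;  // avoid divide-by-zero
        r.scale.push_back(s);
        for (std::size_t i = b; i < e; ++i) {
            // Round to nearest, then clamp to the symmetric int8 range.
            const long v = std::lround(src[i] / s);
            r.q[i] = static_cast<int8_t>(std::max(-127L, std::min(127L, v)));
        }
    }
    return r;
}

std::vector<float> DequantizeSymmRef(const SymmQuantized& r)
{
    std::vector<float> out(r.q.size());
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = static_cast<float>(r.q[i]) * r.scale[i / r.block];
    }
    return out;
}
```

With the input `{0.1, -0.2, 0.3, 100}`, a single per-tensor scale is dominated by the outlier and the small values collapse to zero, whereas a block size of 2 keeps them distinguishable. That is the accuracy argument for the block variants.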
Usage
Use these kernels when quantizing model weights or activations for reduced-precision inference (e.g., INT8, INT4), or when implementing weight-only quantization with group-wise scale factors.
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/quantization.h
Signature
void QuantizeSymm(Tensor& out, Tensor& scale, const Tensor& src, cudaStream_t st);
void DequantizeSymm(Tensor& out, const Tensor& src, const Tensor& scale, cudaStream_t st);
void QuantizeSymmBlock(Ref<Tensor> out_, Ref<Tensor> scale_, const Tensor& src, cudaStream_t st);
void DequantizeSymmBlock(Ref<Tensor> out_, Ref<Tensor> src_, const Tensor& scale, cudaStream_t st);
void QuantizeGroupwise(Tensor quant, Tensor scales, Tensor zeros,
                       Tensor dequant, Tensor src,
                       Buffer_<unsigned> rbits, int group_size);
Import
#include "src/turbomind/kernels/quantization.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
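| src | Tensor | Yes | Source tensor. For the quantize functions this is the floating-point input of shape (m, k); for the dequantize functions it is the quantized tensor to reconstruct |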
| group_size | int | Yes (groupwise) | Number of elements per quantization group |
| rbits | Buffer_<unsigned> | Yes (groupwise) | Random bits for stochastic rounding |
| st | cudaStream_t | Yes (all except QuantizeGroupwise) | CUDA stream on which the kernel is launched |
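The rbits input supplies randomness for stochastic rounding. How the kernel consumes those bits is not documented here, so the following is only a sketch of the general technique: round up with probability equal to the fractional part, so that the rounded value equals the input in expectation. `StochasticRound` is a hypothetical helper, and using one 32-bit word per element is an assumption.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper (NOT taken from lmdeploy): stochastic rounding of x.
// `rbits` is assumed to hold 32 uniformly distributed random bits.
inline int StochasticRound(float x, uint32_t rbits)
{
    const float lo   = std::floor(x);
    const float frac = x - lo;  // fractional part, in [0, 1)
    // Use the top 24 bits so the value converts to float exactly; u is in [0, 1).
    const float u = static_cast<float>(rbits >> 8) * (1.0f / 16777216.0f);
    // Round up with probability `frac`, otherwise round down.
    return static_cast<int>(lo) + (u < frac ? 1 : 0);
}
```

Compared with round-to-nearest, this removes the systematic bias that arises when many values share the same fractional part, at the cost of extra per-element noise.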
Outputs
| Name | Type | Description |
|---|---|---|
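| out / quant | Tensor | Quantized output tensor (for the dequantize functions, out holds the reconstructed floating-point tensor) |
| scale / scales | Tensor | Per-tensor, per-block, or per-group scale factors (an input to the dequantize functions) |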
| zeros | Tensor | Per-group zero points (groupwise only) |
| dequant | Tensor | Dequantized reconstruction (groupwise only) |
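The relationship between the four group-wise outputs can be illustrated with a small CPU reference. This is a sketch under stated assumptions, not the kernel's actual arithmetic: the name `QuantizeGroupwiseRef`, the unsigned `bits`-wide code range, the min/max scale rule, and the integer zero-point convention `dequant = (q - zero) * scale` are all illustrative choices, and round-to-nearest is used in place of stochastic rounding.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference (NOT the lmdeploy kernel).
struct GroupwiseResult {
    std::vector<uint8_t> quant;    // codes in [0, 2^bits - 1]
    std::vector<float>   scales;   // one per group
    std::vector<float>   zeros;    // one per group, stored in code units
    std::vector<float>   dequant;  // (quant - zero) * scale
};

GroupwiseResult QuantizeGroupwiseRef(const std::vector<float>& src,
                                     std::size_t group_size, int bits = 4)
{
    assert(group_size > 0 && src.size() % group_size == 0);
    assert(bits >= 1 && bits <= 8);
    const float qmax = static_cast<float>((1 << bits) - 1);
    GroupwiseResult r;
    r.quant.resize(src.size());
    r.dequant.resize(src.size());
    for (std::size_t g = 0; g < src.size(); g += group_size) {
        const auto first = src.begin() + static_cast<std::ptrdiff_t>(g);
        const auto last  = first + static_cast<std::ptrdiff_t>(group_size);
        const auto mm    = std::minmax_element(first, last);
        const float lo = *mm.first;
        const float hi = *mm.second;
        // Spread the group's value range over the available codes.
        const float scale = hi > lo ? (hi - lo) / qmax : 1.f;
        const float zero  = std::round(-lo / scale);  // code that represents 0.0
        r.scales.push_back(scale);
        r.zeros.push_back(zero);
        for (std::size_t i = g; i < g + group_size; ++i) {
            float q = std::round(src[i] / scale + zero);
            q = std::max(0.f, std::min(qmax, q));  // clamp to the code range
            r.quant[i]   = static_cast<uint8_t>(q);
            r.dequant[i] = (q - zero) * scale;
        }
    }
    return r;
}
```

Comparing `dequant` with `src` gives the per-element quantization error directly, which is presumably why the kernel emits the reconstruction alongside the codes.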
Usage Examples
using namespace turbomind;
// Symmetric per-tensor quantization
QuantizeSymm(quantized, scale, input_tensor, stream);
// Dequantize the per-tensor result back to floating point
DequantizeSymm(output, quantized, scale, stream);
// Group-wise quantization with group_size=128 (this signature takes no stream argument)
QuantizeGroupwise(quant, scales, zeros, dequant, src, rbits, 128);