Implementation: Bitsandbytes Int8 Vectorwise Quant
Metadata
| Field | Value |
|---|---|
| Sources | Repo: bitsandbytes, Paper: LLM.int8() |
| Domains | Quantization |
| Type | API Doc |
| Last updated | 2026-02-07 14:00 GMT |
Overview
Concrete tool for quantizing tensors to INT8 using per-row scaling provided by the bitsandbytes library.
Description
int8_vectorwise_quant quantizes a torch.float16 tensor to torch.int8 with per-row scaling factors, implementing the vectorwise quantization step of the LLM.int8() algorithm. The function dispatches to a native CUDA kernel via torch.ops.bitsandbytes.int8_vectorwise_quant.default.
When the threshold parameter is set to a value greater than 0, the function also performs outlier decomposition: it identifies columns where any element exceeds the threshold in absolute value, suppresses those columns in the quantized output (sets them to zero), and returns the column indices separately so the caller can handle them in FP16.
When threshold is 0 (the default), no outlier detection is performed and the third element of the return tuple is None.
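The behavior described above can be sketched as a plain-PyTorch CPU reference. This is an illustrative re-implementation of the logic, not the library's CUDA kernel; the function name `vectorwise_quant_ref` and the exact order of operations (zeroing outlier columns before computing row statistics) are assumptions for clarity:

```python
import torch

def vectorwise_quant_ref(A: torch.Tensor, threshold: float = 0.0):
    """Illustrative sketch of per-row INT8 quantization with optional
    outlier-column decomposition. Not the bitsandbytes CUDA kernel."""
    outlier_cols = None
    A_work = A.float()
    if threshold > 0:
        # Columns where any element exceeds the threshold in absolute value
        mask = A_work.abs().amax(dim=0) > threshold
        outlier_cols = mask.nonzero().flatten().to(torch.int32)
        A_work = A_work.clone()
        A_work[:, mask] = 0.0  # suppress outlier columns in the quantized output
    # Per-row scale: max absolute value of the row mapped onto [-127, 127]
    scales = A_work.abs().amax(dim=1) / 127.0
    safe = torch.where(scales > 0, scales, torch.ones_like(scales))
    quantized = torch.round(A_work / safe.unsqueeze(1)).to(torch.int8)
    return quantized, scales, outlier_cols
```

With `threshold=0.0` the third return value stays `None`, mirroring the documented contract.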
Code Reference
- Source: bitsandbytes repo
- File: bitsandbytes/functional.py, lines 1944-1962
- Import:

```python
from bitsandbytes.functional import int8_vectorwise_quant
```

- Signature:

```python
def int8_vectorwise_quant(
    A: torch.Tensor,
    threshold: float = 0.0,
) -> tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]:
```
I/O Contract
Inputs
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| A | torch.Tensor (dtype torch.float16) | Yes | -- | The input tensor to quantize. Must have dtype float16. |
| threshold | float | No | 0.0 | Outlier detection threshold. When 0.0, no outlier decomposition is performed. When > 0, columns with values exceeding this threshold in absolute value are identified and returned separately. |
Outputs
| Index | Type | Description |
|---|---|---|
| 0 | torch.Tensor (dtype torch.int8) | The quantized data. Same shape as input. Outlier columns are zeroed out when threshold > 0. |
| 1 | torch.Tensor (dtype torch.float32) | The per-row scale factors, computed as max(\|A_row\|) / 127. |
| 2 | torch.Tensor (dtype torch.int32) or None | Column indices of outlier features. None when threshold=0.0. |
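Given that the scale tensor holds max(|A_row|) / 127 per row, an approximate reconstruction of the input is a single multiply. This is a sketch of the inverse mapping under that assumption; `dequantize_rows` is an illustrative helper, not a bitsandbytes API:

```python
import torch

def dequantize_rows(quantized: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    """Approximate inverse of per-row INT8 quantization (illustrative sketch).

    Assumes scales[i] == max(|A[i]|) / 127, so multiplying each int8 row by
    its scale recovers the original values up to rounding error."""
    return quantized.float() * scales.unsqueeze(1)
```

The per-element reconstruction error is bounded by half a quantization step, i.e. scales[i] / 2 for row i.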
Usage Examples
Quantize a tensor without outlier detection:
```python
import torch
from bitsandbytes.functional import int8_vectorwise_quant

# Create a float16 tensor
A = torch.randn(4, 8, dtype=torch.float16, device="cuda")

# Quantize to INT8 with per-row scaling
quantized, scales, outlier_cols = int8_vectorwise_quant(A)

print(quantized.dtype)   # torch.int8
print(quantized.shape)   # torch.Size([4, 8])
print(scales.dtype)      # torch.float32
print(scales.shape)      # torch.Size([4])
print(outlier_cols)      # None (no outlier detection)
```
Quantize a tensor with outlier decomposition:
```python
import torch
from bitsandbytes.functional import int8_vectorwise_quant

# Create a tensor with some large outlier values
A = torch.randn(4, 8, dtype=torch.float16, device="cuda")
A[0, 2] = 10.0  # Inject an outlier in column 2
A[1, 5] = -8.0  # Inject an outlier in column 5

# Quantize with outlier threshold of 6.0
quantized, scales, outlier_cols = int8_vectorwise_quant(A, threshold=6.0)

print(quantized.dtype)     # torch.int8
print(scales.dtype)        # torch.float32
print(outlier_cols.dtype)  # torch.int32
print(outlier_cols)        # tensor([2, 5], dtype=torch.int32) - outlier column indices

# Note: quantized[:, outlier_cols] will be zeroed out
```
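The zeroed outlier columns are meant to be handled in FP16 by the caller, as in the LLM.int8() matmul decomposition. The sketch below simulates that idea in plain PyTorch on CPU: the quantized part covers non-outlier columns and a full-precision matmul covers the outliers. It is an illustration of the decomposition, not the library's fused int8 matmul path, and `mixed_matmul_sketch` is a hypothetical name:

```python
import torch

def mixed_matmul_sketch(A: torch.Tensor, B: torch.Tensor,
                        threshold: float = 6.0) -> torch.Tensor:
    """Illustrative LLM.int8()-style decomposition (not the CUDA path):
    per-row INT8 matmul for non-outlier columns plus a full-precision
    matmul over the outlier columns."""
    Af = A.float()
    mask = Af.abs().amax(dim=0) > threshold  # outlier columns of A
    A_main = Af.clone()
    A_main[:, mask] = 0.0
    # Per-row quantization of the non-outlier part
    scales = A_main.abs().amax(dim=1) / 127.0
    safe = torch.where(scales > 0, scales, torch.ones_like(scales))
    q = torch.round(A_main / safe.unsqueeze(1)).to(torch.int8)
    # "int8" product simulated in float; dequantize with the per-row scales
    main = (q.float() @ B.float()) * scales.unsqueeze(1)
    # Outlier columns processed at full precision
    outlier = Af[:, mask] @ B.float()[mask, :]
    return main + outlier
```

Because the large-magnitude columns never pass through the int8 path, they contribute no quantization error to the result.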