Implementation: Bitsandbytes Int8 Vectorwise Quant
Metadata
| Field | Value |
|---|---|
| Sources | Repo: bitsandbytes, Paper: LLM.int8() |
| Domains | Quantization |
| Type | API Doc |
| Last updated | 2026-02-07 14:00 GMT |
Overview
Concrete tool for quantizing tensors to INT8 using per-row scaling provided by the bitsandbytes library.
Description
int8_vectorwise_quant quantizes a torch.float16 tensor to torch.int8 with per-row scaling factors, implementing the vectorwise quantization step of the LLM.int8() algorithm. The function dispatches to a native CUDA kernel via torch.ops.bitsandbytes.int8_vectorwise_quant.default.
When the threshold parameter is set to a value greater than 0, the function also performs outlier decomposition: it identifies columns where any element exceeds the threshold in absolute value, suppresses those columns in the quantized output (sets them to zero), and returns the column indices separately so the caller can handle them in FP16.
When threshold is 0 (the default), no outlier detection is performed and the third element of the return tuple is None.
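The behavior described above can be sketched as a plain-PyTorch CPU reference. This is an illustrative re-implementation of the logic, not the library's CUDA kernel; the function name `vectorwise_quant_ref` and the exact order of operations (zeroing outlier columns before computing row statistics) are assumptions for clarity:

```python
import torch

def vectorwise_quant_ref(A: torch.Tensor, threshold: float = 0.0):
    """Illustrative sketch of per-row INT8 quantization with optional
    outlier-column decomposition. Not the bitsandbytes CUDA kernel."""
    outlier_cols = None
    A_work = A.float()
    if threshold > 0:
        # Columns where any element exceeds the threshold in absolute value
        mask = A_work.abs().amax(dim=0) > threshold
        outlier_cols = mask.nonzero().flatten().to(torch.int32)
        A_work = A_work.clone()
        A_work[:, mask] = 0.0  # suppress outlier columns in the quantized output
    # Per-row scale: max absolute value of the row mapped onto [-127, 127]
    scales = A_work.abs().amax(dim=1) / 127.0
    safe = torch.where(scales > 0, scales, torch.ones_like(scales))
    quantized = torch.round(A_work / safe.unsqueeze(1)).to(torch.int8)
    return quantized, scales, outlier_cols
```

With `threshold=0.0` the third return value stays `None`, mirroring the documented contract.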
Code Reference
- Source: bitsandbytes repo
- File: bitsandbytes/functional.py, lines 1944-1962
- Import:

```python
from bitsandbytes.functional import int8_vectorwise_quant
```

- Signature:

```python
def int8_vectorwise_quant(
    A: torch.Tensor,
    threshold: float = 0.0,
) -> tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]:
```
I/O Contract
Inputs
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| A | torch.Tensor (dtype torch.float16) | Yes | -- | The input tensor to quantize. Must have dtype float16. |
| threshold | float | No | 0.0 | Outlier detection threshold. When 0.0, no outlier decomposition is performed. When > 0, columns with values exceeding this threshold in absolute value are identified and returned separately. |
Outputs
| Index | Type | Description |
|---|---|---|
| 0 | torch.Tensor (dtype torch.int8) | The quantized data. Same shape as input. Outlier columns are zeroed out when threshold > 0. |
| 1 | torch.Tensor (dtype torch.float32) | The per-row scale factors, computed as max(\|A_row\|) / 127. |
| 2 | torch.Tensor (dtype torch.int32) or None | Column indices of outlier features. None when threshold=0.0. |
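Given that the scale tensor holds max(|A_row|) / 127 per row, an approximate reconstruction of the input is a single multiply. This is a sketch of the inverse mapping under that assumption; `dequantize_rows` is an illustrative helper, not a bitsandbytes API:

```python
import torch

def dequantize_rows(quantized: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    """Approximate inverse of per-row INT8 quantization (illustrative sketch).

    Assumes scales[i] == max(|A[i]|) / 127, so multiplying each int8 row by
    its scale recovers the original values up to rounding error."""
    return quantized.float() * scales.unsqueeze(1)
```

The per-element reconstruction error is bounded by half a quantization step, i.e. scales[i] / 2 for row i.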
Usage Examples
Quantize a tensor without outlier detection:
```python
import torch
from bitsandbytes.functional import int8_vectorwise_quant

# Create a float16 tensor
A = torch.randn(4, 8, dtype=torch.float16, device="cuda")

# Quantize to INT8 with per-row scaling
quantized, scales, outlier_cols = int8_vectorwise_quant(A)

print(quantized.dtype)   # torch.int8
print(quantized.shape)   # torch.Size([4, 8])
print(scales.dtype)      # torch.float32
print(scales.shape)      # torch.Size([4])
print(outlier_cols)      # None (no outlier detection)
```
Quantize a tensor with outlier decomposition:
```python
import torch
from bitsandbytes.functional import int8_vectorwise_quant

# Create a tensor with some large outlier values
A = torch.randn(4, 8, dtype=torch.float16, device="cuda")
A[0, 2] = 10.0  # Inject an outlier in column 2
A[1, 5] = -8.0  # Inject an outlier in column 5

# Quantize with outlier threshold of 6.0
quantized, scales, outlier_cols = int8_vectorwise_quant(A, threshold=6.0)

print(quantized.dtype)     # torch.int8
print(scales.dtype)        # torch.float32
print(outlier_cols.dtype)  # torch.int32
print(outlier_cols)        # tensor([2, 5], dtype=torch.int32) - outlier column indices

# Note: quantized[:, outlier_cols] will be zeroed out
```
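The zeroed outlier columns are meant to be handled in FP16 by the caller, as in the LLM.int8() matmul decomposition. The sketch below simulates that idea in plain PyTorch on CPU: the quantized part covers non-outlier columns and a full-precision matmul covers the outliers. It is an illustration of the decomposition, not the library's fused int8 matmul path, and `mixed_matmul_sketch` is a hypothetical name:

```python
import torch

def mixed_matmul_sketch(A: torch.Tensor, B: torch.Tensor,
                        threshold: float = 6.0) -> torch.Tensor:
    """Illustrative LLM.int8()-style decomposition (not the CUDA path):
    per-row INT8 matmul for non-outlier columns plus a full-precision
    matmul over the outlier columns."""
    Af = A.float()
    mask = Af.abs().amax(dim=0) > threshold  # outlier columns of A
    A_main = Af.clone()
    A_main[:, mask] = 0.0
    # Per-row quantization of the non-outlier part
    scales = A_main.abs().amax(dim=1) / 127.0
    safe = torch.where(scales > 0, scales, torch.ones_like(scales))
    q = torch.round(A_main / safe.unsqueeze(1)).to(torch.int8)
    # "int8" product simulated in float; dequantize with the per-row scales
    main = (q.float() @ B.float()) * scales.unsqueeze(1)
    # Outlier columns processed at full precision
    outlier = Af[:, mask] @ B.float()[mask, :]
    return main + outlier
```

Because the large-magnitude columns never pass through the int8 path, they contribute no quantization error to the result.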