
Heuristic: Bitsandbytes cuBLASLt Int8 Dimension Alignment

From Leeroopedia



Knowledge Sources
Domains: Optimization, Quantization
Last Updated: 2026-02-07 13:00 GMT

Overview

cuBLASLt INT8 matmul requires inner dimensions divisible by 4; bitsandbytes automatically falls back to slower FP32 matmul when this constraint is violated.

Description

The NVIDIA cuBLASLt library, which provides the INT8 matrix multiplication used by LLM.int8(), has a hard constraint: the inner dimensions of the matrices must be divisible by 4. When this constraint is not satisfied, bitsandbytes silently falls back to a slower FP32 computation path. This fallback casts both matrices to float32, performs standard matmul, and casts back to int32. The codebase notes this "should not be very common" since model hidden dimensions are almost always multiples of 4, but it can occur with custom model architectures or padding.
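The alignment condition itself is trivial to express. The following is a minimal sketch; `needs_fp32_fallback` is a hypothetical helper name for illustration, not part of the bitsandbytes API:

```python
def needs_fp32_fallback(lda: int) -> bool:
    """Mirror the cuBLASLt constraint: the inner dimension of an INT8
    matmul must be divisible by 4, otherwise the slow FP32 path is taken."""
    return lda % 4 != 0

# Typical transformer hidden sizes are aligned and take the fast INT8 path:
print(needs_fp32_fallback(4096))  # False
# An odd custom dimension silently triggers the FP32 fallback:
print(needs_fp32_fallback(4095))  # True
```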

Usage

Apply this heuristic when designing custom model architectures that will use INT8 quantization, or when debugging unexpected slowdowns in 8-bit inference. Ensure hidden dimensions are divisible by 4 (ideally by 64 or 128 for optimal tensor core utilization).

The Insight (Rule of Thumb)

  • Action: Ensure model hidden dimensions are divisible by 4 (minimum) for INT8 matmul.
  • Value: Divisible by 64 or 128 is optimal for tensor core tile sizes.
  • Trade-off: Non-aligned dimensions fall back to FP32 matmul, which is 2-3x slower but ensures correctness. No error is raised — the fallback is silent.
  • Detection: The check is `lda % 4 != 0` where `lda` is the inner dimension of the matmul.
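When sizing a custom architecture, the rule above can be applied by rounding hidden dimensions up to an aligned multiple. A sketch, assuming 64 as the preferred alignment from the trade-off above; `round_up` is an illustrative helper, not from bitsandbytes:

```python
def round_up(dim: int, multiple: int = 64) -> int:
    """Round a hidden dimension up to the nearest multiple.
    64 or 128 is preferred for tensor core tile sizes; 4 is the hard
    minimum for the cuBLASLt INT8 path."""
    return ((dim + multiple - 1) // multiple) * multiple

print(round_up(1000))     # 1024 (aligned for tensor core tiles)
print(round_up(1000, 4))  # 1000 (already meets the hard minimum)
```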

Reasoning

NVIDIA's INT8 tensor cores operate on 4-element vectors internally. The cuBLASLt API enforces this alignment for performance and correctness. The bitsandbytes fallback ensures that odd-shaped tensors still produce correct results, albeit more slowly.

Alignment check and fallback from `bitsandbytes/backends/cuda/ops.py:52-57`:

# cuBLASLt does not support int8 matmul with inner dimensions that are not divisible by 4.
# We'll fall back to a slower fp32 calculation in this circumstance.
# Fortunately, this should not be very common.
if lda % 4 != 0:
    result = torch.matmul(B.float(), A.float().t()).to(torch.int32)
    return out.copy_(result)
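The fallback preserves correctness because every product of two int8 values is far below float32's 24-bit exact-integer range, so the float path accumulates the same sums as exact integer arithmetic for typical inner dimensions. A minimal CPU simulation of this equivalence, in pure Python with no bitsandbytes or CUDA dependency (Python floats are float64, which only widens the exact range):

```python
import random

def int_matmul_exact(A, B):
    """Exact integer matmul with int32-style accumulation."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def fp32_style_fallback(A, B):
    """Simulate the fallback: cast to float, matmul, cast back to int."""
    return [[int(sum(float(a) * float(b) for a, b in zip(row, col)))
             for col in zip(*B)] for row in A]

random.seed(0)
m, k, n = 3, 7, 5  # k = 7 is not divisible by 4: cuBLASLt would reject it
A = [[random.randint(-128, 127) for _ in range(k)] for _ in range(m)]
B = [[random.randint(-128, 127) for _ in range(n)] for _ in range(k)]
assert int_matmul_exact(A, B) == fp32_style_fallback(A, B)
```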
