Heuristic: Bitsandbytes Blocksize Platform Defaults
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Quantization |
| Last Updated | 2026-02-07 13:00 GMT |
Overview
Platform-specific blocksize defaults: NVIDIA uses blocksize=64, AMD ROCm uses blocksize=128, driven by warp/wavefront size differences (32 vs 64 threads).
Description
Bitsandbytes 4-bit quantization divides tensors into blocks of a fixed size for independent quantization. The optimal blocksize depends on the GPU's warp/wavefront size: NVIDIA GPUs use 32-thread warps, so blocksize=64 (2 warps) is the default; AMD GPUs use 64-thread wavefronts, so blocksize=128 (2 wavefronts) is the default. This ensures that each quantization block aligns with the hardware's SIMD execution width for optimal kernel utilization.
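The block decomposition can be illustrated with a short sketch (illustrative pure Python, not the bitsandbytes kernel). Each block stores one absmax scale; the `quantize_block` helper uses a simplified uniform scheme, whereas real NF4/FP4 maps values through a nonuniform 16-entry codebook.

```python
import math

def block_layout(num_elements: int, blocksize: int = 64) -> int:
    """Number of independent quantization blocks (one absmax scale each)."""
    return math.ceil(num_elements / blocksize)

def quantize_block(block, levels: int = 7):
    """Simplified uniform absmax quantization of one block.

    Real NF4/FP4 uses a nonuniform 16-entry codebook; this sketch only
    shows the per-block absmax scaling idea.
    """
    absmax = max(abs(x) for x in block) or 1.0
    codes = [round(x / absmax * levels) for x in block]
    return codes, absmax

# A 4096x4096 weight with blocksize=64 yields 262,144 blocks,
# hence 262,144 stored absmax values.
num_blocks = block_layout(4096 * 4096, 64)
```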
Usage
Apply this heuristic when configuring 4-bit quantization parameters (NF4 or FP4) on NVIDIA or AMD GPUs. If you set blocksize manually, make sure it matches your hardware platform; if unsure, let bitsandbytes pick the default automatically. Valid blocksizes are 64, 128, 256, 512, 1024, 2048, and 4096.
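The rule above can be mirrored in a small helper. Note this is an illustrative sketch, not part of the bitsandbytes API; the function and constant names are invented for this example.

```python
# Valid blocksizes per the bitsandbytes 4-bit quantization heuristic.
VALID_BLOCKSIZES = {64, 128, 256, 512, 1024, 2048, 4096}

def default_blocksize(is_rocm_wave64: bool) -> int:
    """Mirror the bitsandbytes default: 128 on 64-wide ROCm wavefronts, else 64."""
    return 128 if is_rocm_wave64 else 64

def check_blocksize(blocksize: int, warp_size: int) -> None:
    """Reject blocksizes outside the valid set or misaligned with the warp size."""
    if blocksize not in VALID_BLOCKSIZES:
        raise ValueError(f"blocksize must be one of {sorted(VALID_BLOCKSIZES)}")
    if blocksize % warp_size != 0:
        raise ValueError(f"blocksize {blocksize} is not a multiple of warp size {warp_size}")
```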
The Insight (Rule of Thumb)
- Action: Let bitsandbytes auto-detect blocksize, or set it explicitly based on GPU vendor.
- Value: blocksize=64 for NVIDIA; blocksize=128 for AMD ROCm (64-wide wavefronts).
- Valid range: 64, 128, 256, 512, 1024, 2048, 4096.
- Trade-off: Smaller blocksizes improve quantization accuracy (more fine-grained scaling) but increase metadata overhead (more absmax values stored). Larger blocksizes reduce overhead but coarsen the quantization.
- Safety default: If ROCm warp size detection fails, bitsandbytes assumes a 64-wide wavefront (the wider of the two possibilities) to prevent kernel crashes; the resulting blocksize=128 default may be suboptimal on 32-wide hardware but remains safe.
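The metadata trade-off in the bullets above can be quantified with a simplified model: storing one float32 absmax per block costs 32/blocksize extra bits per parameter on top of the 4-bit payload (this ignores `compress_statistics`, which further quantizes the absmax values).

```python
def absmax_overhead_bits(blocksize: int, absmax_bits: int = 32) -> float:
    """Metadata bits per parameter: one absmax value shared per block."""
    return absmax_bits / blocksize

# Effective bits per parameter for a 4-bit payload (no compress_statistics):
for bs in (64, 128, 256, 4096):
    print(bs, 4 + absmax_overhead_bits(bs))
# 64 -> 4.5, 128 -> 4.25, 256 -> 4.125, 4096 -> 4.0078125
```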
Reasoning
The blocksize must be a multiple of the GPU's warp/wavefront size for efficient kernel execution. NVIDIA warps are 32 threads; two warps per block (64) is the minimum efficient configuration. AMD wavefronts are 64 threads; two wavefronts per block (128) provides the equivalent efficiency.
The auto-detection works by calling `rocminfo` at import time and parsing the reported wavefront size. For NVIDIA GPUs the function returns a hardcoded 32. The result is recorded in the boolean `ROCM_WARP_SIZE_64` global flag (True when the wavefront size is 64).
Warp size detection from `bitsandbytes/cuda_specs.py:105-128`:
# imports shown for context
import logging
import re
import subprocess

import torch

logger = logging.getLogger(__name__)

def get_rocm_warpsize() -> int:
    try:
        if torch.version.hip:
            result = subprocess.run(["rocminfo"], capture_output=True, text=True)
            match = re.search(r"Wavefront Size:\s+([0-9]{2})\(0x[0-9]{2}\)", result.stdout)
            if match:
                return int(match.group(1))
            else:
                # default to 64 to be safe
                return 64
        else:
            # nvidia cards always use 32 warp size
            return 32
    except Exception as e:
        logger.error(f"Could not detect ROCm warp size: {e}. Defaulting to 64.")
        return 64
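The wavefront-size regex can be exercised against a rocminfo-style line. The sample text below is illustrative, not captured from real hardware; only the pattern itself comes from the source above.

```python
import re

# Same pattern as in get_rocm_warpsize above.
WAVEFRONT_RE = re.compile(r"Wavefront Size:\s+([0-9]{2})\(0x[0-9]{2}\)")

# Illustrative rocminfo-style output lines (hypothetical sample):
sample_amd = "  Wavefront Size:          64(0x40)"
sample_rdna = "  Wavefront Size:          32(0x20)"

match = WAVEFRONT_RE.search(sample_amd)
warp = int(match.group(1)) if match else 64  # fall back to the safe default
```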
Blocksize selection from `bitsandbytes/functional.py:858-860`:
if blocksize is None:
blocksize = 64 if not ROCM_WARP_SIZE_64 else 128
Global flag initialization from `bitsandbytes/cextension.py:307`:
ROCM_WARP_SIZE_64 = True if get_rocm_warpsize() == 64 else False