
Heuristic: Bitsandbytes Blocksize Platform Defaults

From Leeroopedia




Knowledge Sources
Domains: Optimization, Quantization
Last Updated: 2026-02-07 13:00 GMT

Overview

Platform-specific blocksize defaults: NVIDIA uses blocksize=64, AMD ROCm uses blocksize=128, driven by warp/wavefront size differences (32 vs 64 threads).

Description

Bitsandbytes 4-bit quantization divides tensors into blocks of a fixed size for independent quantization. The optimal blocksize depends on the GPU's warp/wavefront size: NVIDIA GPUs use 32-thread warps, so blocksize=64 (2 warps) is the default; AMD GPUs use 64-thread wavefronts, so blocksize=128 (2 wavefronts) is the default. This ensures that each quantization block aligns with the hardware's SIMD execution width for optimal kernel utilization.
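As an illustration, the block-wise scheme can be sketched in NumPy. This is a toy model, not the bitsandbytes implementation: it substitutes a symmetric 4-bit integer grid for the NF4/FP4 codebooks, and the function names are made up. What it shares with the real kernels is the bookkeeping: one absmax scale stored per block of `blocksize` values.

```python
import numpy as np

def blockwise_quantize(x: np.ndarray, blocksize: int = 64):
    """Toy block-wise absmax quantization on a symmetric 4-bit grid.

    Illustrative only: real bitsandbytes kernels use NF4/FP4 codebooks
    and run on GPU, but the per-block absmax bookkeeping is the same.
    """
    assert x.size % blocksize == 0
    blocks = x.reshape(-1, blocksize)
    # one scale (absmax) stored per block -- this is the metadata
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    absmax = np.where(absmax == 0, 1.0, absmax)
    # map each value onto a symmetric 4-bit grid [-7, 7]
    q = np.round(blocks / absmax * 7).astype(np.int8)
    return q, absmax

def blockwise_dequantize(q: np.ndarray, absmax: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) / 7) * absmax

x = np.random.default_rng(0).standard_normal(256).astype(np.float32)
q, absmax = blockwise_quantize(x, blocksize=64)
x_hat = blockwise_dequantize(q, absmax).reshape(-1)
print(absmax.size)  # 256 values / blocksize 64 -> 4 stored scales
print(float(np.max(np.abs(x - x_hat))))
```

Because each block is scaled independently, an outlier in one block cannot degrade the precision of values in any other block.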

Usage

Apply this heuristic when configuring 4-bit quantization parameters (NF4 or FP4) on either NVIDIA or AMD GPUs. If you are manually setting blocksize, ensure it matches your hardware platform. Let it default automatically if unsure. Valid blocksizes are: 64, 128, 256, 512, 1024, 2048, and 4096.
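A minimal sketch of that selection logic (the helper names here are hypothetical, not bitsandbytes APIs; the defaults and the valid range mirror the values stated above):

```python
# Hypothetical helpers (not bitsandbytes APIs) mirroring the platform
# defaults and the valid blocksize range described above.
VALID_BLOCKSIZES = (64, 128, 256, 512, 1024, 2048, 4096)

def default_blocksize(warp_size: int) -> int:
    """64 on 32-thread warps (NVIDIA), 128 on 64-wide wavefronts (AMD)."""
    return 128 if warp_size == 64 else 64

def check_blocksize(blocksize: int, warp_size: int) -> int:
    """Validate a manually chosen blocksize against the platform."""
    if blocksize not in VALID_BLOCKSIZES:
        raise ValueError(f"blocksize must be one of {VALID_BLOCKSIZES}")
    if blocksize % warp_size != 0:
        raise ValueError(
            f"blocksize {blocksize} is not a multiple of warp size {warp_size}"
        )
    return blocksize

print(default_blocksize(32))  # NVIDIA -> 64
print(default_blocksize(64))  # AMD ROCm -> 128
```

In practice the simplest safe choice is to pass no blocksize at all and let the library apply the platform default.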

The Insight (Rule of Thumb)

  • Action: Let bitsandbytes auto-detect blocksize, or set it explicitly based on GPU vendor.
  • Value: blocksize=64 for NVIDIA; blocksize=128 for AMD ROCm (64-wide wavefronts).
  • Valid range: 64, 128, 256, 512, 1024, 2048, 4096.
  • Trade-off: Smaller blocksizes improve quantization accuracy (more fine-grained scaling) but increase metadata overhead (more absmax values stored). Larger blocksizes reduce overhead but coarsen the quantization.
  • Safety default: If ROCm warp size detection fails, bitsandbytes assumes the wider 64-wide wavefront (so blocksize defaults to 128, which is a multiple of both 32- and 64-wide execution units) to prevent kernel crashes, even though this may be suboptimal on 32-wide hardware.
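The trade-off can be quantified: storing one 32-bit absmax per block adds `32 / blocksize` bits of metadata per weight on top of the 4 payload bits (assuming uncompressed fp32 statistics; enabling compressed statistics would shrink this further):

```python
# One fp32 absmax per block: metadata cost in bits per weight,
# on top of the 4 payload bits of a 4-bit quantized value.
def overhead_bits_per_weight(blocksize: int, absmax_bits: int = 32) -> float:
    return absmax_bits / blocksize

for bs in (64, 128, 256, 4096):
    print(f"blocksize={bs:>4}: {4 + overhead_bits_per_weight(bs):.4f} bits/weight")
```

Halving the blocksize from 128 to 64 doubles the metadata (4.25 to 4.5 bits per weight) in exchange for finer-grained scaling.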

Reasoning

The blocksize must be a multiple of the GPU's warp/wavefront size for efficient kernel execution. NVIDIA warps are 32 threads; two warps per block (64) is the minimum efficient configuration. AMD wavefronts are 64 threads; two wavefronts per block (128) provides the equivalent efficiency.
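A quick arithmetic check of this alignment: every valid blocksize is a whole number of warps and of wavefronts, so each quantization block maps cleanly onto the SIMD units of either vendor:

```python
# Sanity check: every valid blocksize divides evenly into both
# 32-thread NVIDIA warps and 64-wide AMD wavefronts.
for blocksize in (64, 128, 256, 512, 1024, 2048, 4096):
    warps, wavefronts = blocksize // 32, blocksize // 64
    assert blocksize == warps * 32 == wavefronts * 64
    print(f"blocksize={blocksize}: {warps} warps (NVIDIA) / "
          f"{wavefronts} wavefronts (AMD)")
```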

On ROCm builds, auto-detection calls `rocminfo` at import time and parses the reported wavefront size; on NVIDIA builds the function returns a hardcoded 32. The detected value is stored in the `ROCM_WARP_SIZE_64` global flag.

Warp size detection from `bitsandbytes/cuda_specs.py:105-128`:

import logging
import re
import subprocess

import torch

logger = logging.getLogger(__name__)


def get_rocm_warpsize() -> int:
    try:
        if torch.version.hip:
            result = subprocess.run(["rocminfo"], capture_output=True, text=True)
            match = re.search(r"Wavefront Size:\s+([0-9]{2})\(0x[0-9]{2}\)", result.stdout)
            if match:
                return int(match.group(1))
            else:
                # default to the wider 64 wavefront to be safe
                return 64
        else:
            # nvidia cards always use 32 warp size
            return 32
    except Exception as e:
        logger.error(f"Could not detect ROCm warp size: {e}. Defaulting to 64.")
        return 64

Blocksize selection from `bitsandbytes/functional.py:858-860`:

if blocksize is None:
    blocksize = 64 if not ROCM_WARP_SIZE_64 else 128

Global flag initialization from `bitsandbytes/cextension.py:307`:

ROCM_WARP_SIZE_64 = True if get_rocm_warpsize() == 64 else False
