# Implementation: Bitsandbytes HPU Dequantize 4bit
| Knowledge Sources | |
|---|---|
| Domains | HPU_Backend, Dequantization, 4bit_Quantization |
| Last Updated | 2026-02-07 13:31 GMT |
## Overview
Habana Gaudi HPU backend kernel that dequantizes NF4-quantized tensors using the native HPU dequantize_nf4 operation.
## Description
This module registers the dequantize_4bit kernel for the HPU device backend using the bitsandbytes custom op registration system. It delegates to torch.ops.hpu.dequantize_nf4, the Habana-native NF4 dequantization operation. The implementation handles backward compatibility with older Gaudi software versions (pre-1.22) by reversing the 4-bit compression format, and supports both uint8 and bfloat16 quant_storage formats.
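The pre-1.22 compatibility path can be illustrated with a device-independent sketch. One plausible reading of "reversing the 4-bit compression format" is swapping the high and low nibbles of each packed byte; the helper name below is hypothetical, and the real kernel performs the equivalent operation on packed uint8 tensors rather than Python bytes:

```python
def swap_nibbles(packed: bytes) -> bytes:
    """Swap the high and low 4-bit halves of every byte.

    Hypothetical helper illustrating a nibble-order reversal;
    the actual kernel operates on packed uint8 tensors.
    """
    return bytes(((b & 0x0F) << 4) | (b >> 4) for b in packed)

# 0xAB -> 0xBA, 0x1F -> 0xF1
print(swap_nibbles(b"\xab\x1f").hex())  # → "baf1"
```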
## Usage
This kernel is automatically dispatched when bitsandbytes runs on Habana Gaudi hardware. Users do not call it directly; it is registered via `@register_kernel("bitsandbytes::dequantize_4bit", "hpu")` and invoked through the standard `bitsandbytes.functional.dequantize_4bit` API.
## Code Reference

### Source Location
- Repository: bitsandbytes
- File: bitsandbytes/backends/hpu/ops.py
- Lines: 1-55
### Signature

```python
@register_kernel("bitsandbytes::dequantize_4bit", "hpu")
def _(
    A: torch.Tensor,
    absmax: torch.Tensor,
    blocksize: int,
    quant_type: str,
    shape: Sequence[int],
    dtype: torch.dtype,
) -> torch.Tensor:
    """Dequantize NF4 tensor on Habana Gaudi HPU."""
```
### Import

```python
# Auto-registered when the HPU backend is loaded; used via:
import bitsandbytes.functional as F

F.dequantize_4bit(quantized_tensor, quant_state)
```
## I/O Contract

### Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| A | torch.Tensor | Yes | NF4-quantized tensor (uint8 or bfloat16 storage) |
| absmax | torch.Tensor | Yes | Per-block absolute maximum scaling factors |
| blocksize | int | Yes | Number of elements per quantization block |
| quant_type | str | Yes | Must be "nf4" (only NF4 supported on HPU) |
| shape | Sequence[int] | Yes | Target output shape |
| dtype | torch.dtype | Yes | Target output dtype |
### Outputs
| Name | Type | Description |
|---|---|---|
| output | torch.Tensor | Dequantized tensor with specified shape and dtype |
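The sizes in the contract above are related by simple arithmetic, sketched here for intuition. This assumes uint8 `quant_storage` (two 4-bit codes per byte) and one `absmax` scale per quantization block; the function name is illustrative, not part of the bitsandbytes API:

```python
import math

def expected_sizes(shape, blocksize):
    """Expected packed-byte count and absmax length for an NF4 tensor.

    Illustrative arithmetic only: assumes uint8 storage (two 4-bit
    codes per byte) and one absmax scale per quantization block.
    """
    n = math.prod(shape)
    packed_bytes = (n + 1) // 2            # two elements per byte
    absmax_len = math.ceil(n / blocksize)  # one scale per block
    return packed_bytes, absmax_len

print(expected_sizes((4096, 4096), 64))  # → (8388608, 262144)
```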
## Usage Examples

### Dequantize 4-bit on HPU

```python
import torch
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM

# On a Habana Gaudi device, the HPU kernel is dispatched automatically
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb.BitsAndBytesConfig(load_in_4bit=True),
    device_map="hpu",
)
# Dequantization uses torch.ops.hpu.dequantize_nf4 internally
```
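For intuition about what `torch.ops.hpu.dequantize_nf4` computes, here is a device-independent reference sketch: each 4-bit code indexes a fixed 16-entry NF4 codebook, and the looked-up value is scaled by its block's `absmax`. The codebook values below are the NF4 quantiles published with QLoRA and used by bitsandbytes (the endpoints -1.0, 0.0, 1.0 at indices 0, 7, 15 are fixed by the NF4 definition); this is an illustrative sketch, not the HPU kernel:

```python
# NF4 codebook: quantiles of a standard normal, normalized to [-1, 1]
# (values as published with QLoRA / bitsandbytes; index 7 is exactly 0.0)
NF4_CODEBOOK = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]

def dequantize_nf4_reference(codes, absmax, blocksize):
    """Reference NF4 dequantization: value = codebook[code] * block_absmax."""
    return [
        NF4_CODEBOOK[c] * absmax[i // blocksize]
        for i, c in enumerate(codes)
    ]

# Codes 0, 7, 15 map to -1, 0, +1 times the block's scale
print(dequantize_nf4_reference([0, 7, 15], [2.5], 64))  # → [-2.5, 0.0, 2.5]
```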