# Implementation: Bitsandbytes HPU Dequantize 4bit
| Knowledge Sources | |
|---|---|
| Domains | HPU_Backend, Dequantization, 4bit_Quantization |
| Last Updated | 2026-02-07 13:31 GMT |
## Overview
Habana Gaudi HPU backend kernel that dequantizes NF4-quantized tensors using the native HPU dequantize_nf4 operation.
## Description
This module registers the dequantize_4bit kernel for the HPU device backend using the bitsandbytes custom op registration system. It delegates to torch.ops.hpu.dequantize_nf4, the Habana-native NF4 dequantization operation. The implementation handles backward compatibility with older Gaudi software versions (pre-1.22) by reversing the 4-bit compression format, and supports both uint8 and bfloat16 quant_storage formats.
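The pre-1.22 compatibility path can be illustrated with a device-independent sketch. One plausible reading of "reversing the 4-bit compression format" is swapping the high and low nibbles of each packed byte; the helper name below is hypothetical, and the real kernel performs the equivalent operation on packed uint8 tensors rather than Python bytes:

```python
def swap_nibbles(packed: bytes) -> bytes:
    """Swap the high and low 4-bit halves of every byte.

    Hypothetical helper illustrating a nibble-order reversal;
    the actual kernel operates on packed uint8 tensors.
    """
    return bytes(((b & 0x0F) << 4) | (b >> 4) for b in packed)

# 0xAB -> 0xBA, 0x1F -> 0xF1
print(swap_nibbles(b"\xab\x1f").hex())  # → "baf1"
```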
## Usage
This kernel is automatically dispatched when bitsandbytes runs on Habana Gaudi hardware. Users do not call it directly; it is registered via `@register_kernel("bitsandbytes::dequantize_4bit", "hpu")` and invoked through the standard `bitsandbytes.functional.dequantize_4bit` API.
## Code Reference

### Source Location
- Repository: bitsandbytes
- File: bitsandbytes/backends/hpu/ops.py
- Lines: 1-55
### Signature

```python
@register_kernel("bitsandbytes::dequantize_4bit", "hpu")
def _(
    A: torch.Tensor,
    absmax: torch.Tensor,
    blocksize: int,
    quant_type: str,
    shape: Sequence[int],
    dtype: torch.dtype,
) -> torch.Tensor:
    """Dequantize NF4 tensor on Habana Gaudi HPU."""
```
### Import

```python
# Auto-registered when the HPU backend is loaded; used via:
import bitsandbytes.functional as F

F.dequantize_4bit(quantized_tensor, quant_state)
```
## I/O Contract

### Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| A | torch.Tensor | Yes | NF4-quantized tensor (uint8 or bfloat16 storage) |
| absmax | torch.Tensor | Yes | Per-block absolute maximum scaling factors |
| blocksize | int | Yes | Number of elements per quantization block |
| quant_type | str | Yes | Must be "nf4" (only NF4 supported on HPU) |
| shape | Sequence[int] | Yes | Target output shape |
| dtype | torch.dtype | Yes | Target output dtype |
### Outputs
| Name | Type | Description |
|---|---|---|
| output | torch.Tensor | Dequantized tensor with specified shape and dtype |
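The sizes in the contract above are related by simple arithmetic, sketched here for intuition. This assumes uint8 `quant_storage` (two 4-bit codes per byte) and one `absmax` scale per quantization block; the function name is illustrative, not part of the bitsandbytes API:

```python
import math

def expected_sizes(shape, blocksize):
    """Expected packed-byte count and absmax length for an NF4 tensor.

    Illustrative arithmetic only: assumes uint8 storage (two 4-bit
    codes per byte) and one absmax scale per quantization block.
    """
    n = math.prod(shape)
    packed_bytes = (n + 1) // 2            # two elements per byte
    absmax_len = math.ceil(n / blocksize)  # one scale per block
    return packed_bytes, absmax_len

print(expected_sizes((4096, 4096), 64))  # → (8388608, 262144)
```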
## Usage Examples

### Dequantize 4-bit on HPU

```python
import torch
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM

# On a Habana Gaudi device, the HPU kernel is dispatched automatically
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb.BitsAndBytesConfig(load_in_4bit=True),
    device_map="hpu",
)
# Dequantization uses torch.ops.hpu.dequantize_nf4 internally
```
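For intuition about what `torch.ops.hpu.dequantize_nf4` computes, here is a device-independent reference sketch: each 4-bit code indexes a fixed 16-entry NF4 codebook, and the looked-up value is scaled by its block's `absmax`. The codebook values below are the NF4 quantiles published with QLoRA and used by bitsandbytes (the endpoints -1.0, 0.0, 1.0 at indices 0, 7, 15 are fixed by the NF4 definition); this is an illustrative sketch, not the HPU kernel:

```python
# NF4 codebook: quantiles of a standard normal, normalized to [-1, 1]
# (values as published with QLoRA / bitsandbytes; index 7 is exactly 0.0)
NF4_CODEBOOK = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]

def dequantize_nf4_reference(codes, absmax, blocksize):
    """Reference NF4 dequantization: value = codebook[code] * block_absmax."""
    return [
        NF4_CODEBOOK[c] * absmax[i // blocksize]
        for i, c in enumerate(codes)
    ]

# Codes 0, 7, 15 map to -1, 0, +1 times the block's scale
print(dequantize_nf4_reference([0, 7, 15], [2.5], 64))  # → [-2.5, 0.0, 2.5]
```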