
Implementation:Bitsandbytes foundation Bitsandbytes HPU Dequantize 4bit

From Leeroopedia


Knowledge Sources
Domains: HPU_Backend, Dequantization, 4bit_Quantization
Last Updated: 2026-02-07 13:31 GMT

Overview

A Habana Gaudi HPU backend kernel that dequantizes NF4-quantized tensors using the native torch.ops.hpu.dequantize_nf4 operation.

Description

This module registers the dequantize_4bit kernel for the HPU device backend using the bitsandbytes custom op registration system. It delegates to torch.ops.hpu.dequantize_nf4, the Habana-native NF4 dequantization operation. The implementation handles backward compatibility with older Gaudi software versions (pre-1.22) by reversing the 4-bit compression format, and supports both uint8 and bfloat16 quant_storage formats.
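The pre-1.22 compatibility path can be illustrated with a small sketch. The assumption here (not confirmed by the source) is that "reversing the 4-bit compression format" means swapping the two 4-bit values packed into each uint8 byte; `swap_nibbles` is a hypothetical helper, not part of bitsandbytes:

```python
# Illustrative sketch only: the real kernel delegates to torch.ops.hpu.dequantize_nf4.
# Assumption: older Gaudi software (< 1.22) expects the two 4-bit codes in each
# packed uint8 byte in the opposite order, so the byte's nibbles are swapped
# before dequantization.

def swap_nibbles(packed: bytes) -> bytes:
    """Swap the high and low 4-bit halves of every byte."""
    return bytes(((b & 0x0F) << 4) | (b >> 4) for b in packed)

# Example: byte 0x12 (codes 1 and 2) becomes 0x21 (codes 2 and 1)
swapped = swap_nibbles(bytes([0x12, 0xF0]))
```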

Usage

This kernel is automatically dispatched when running bitsandbytes on Habana Gaudi hardware. Users do not call it directly; it is registered via @register_kernel("bitsandbytes::dequantize_4bit", "hpu") and invoked through the standard bitsandbytes.functional.dequantize_4bit API.

Code Reference

Source Location

Signature

@register_kernel("bitsandbytes::dequantize_4bit", "hpu")
def _(
    A: torch.Tensor,
    absmax: torch.Tensor,
    blocksize: int,
    quant_type: str,
    shape: Sequence[int],
    dtype: torch.dtype,
) -> torch.Tensor:
    """Dequantize NF4 tensor on Habana Gaudi HPU."""

Import

# Auto-registered when HPU backend is loaded
# Used via:
import bitsandbytes.functional as F
F.dequantize_4bit(quantized_tensor, quant_state)

I/O Contract

Inputs

Name       | Type          | Required | Description
A          | torch.Tensor  | Yes      | NF4-quantized tensor (uint8 or bfloat16 storage)
absmax     | torch.Tensor  | Yes      | Per-block absolute maximum scaling factors
blocksize  | int           | Yes      | Number of elements per quantization block
quant_type | str           | Yes      | Must be "nf4" (only NF4 is supported on HPU)
shape      | Sequence[int] | Yes      | Target output shape
dtype      | torch.dtype   | Yes      | Target output dtype

Outputs

Name   | Type         | Description
output | torch.Tensor | Dequantized tensor with the specified shape and dtype
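The contract above can be made concrete with a pure-Python reference sketch of blockwise NF4 dequantization: each packed uint8 holds two 4-bit indices into the 16-entry NF4 codebook (from the QLoRA paper), and each block of `blocksize` output elements is scaled by its `absmax` entry. The high-nibble-first unpacking order is an assumption for illustration; on Gaudi the actual work is done by torch.ops.hpu.dequantize_nf4:

```python
# The 16 NF4 codebook values (normalized-float-4 quantiles, QLoRA paper).
NF4_CODE = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]

def dequantize_nf4_reference(packed, absmax, blocksize):
    """Illustrative reference, not the HPU kernel.

    packed:    iterable of uint8 values, two 4-bit codes per byte
               (high nibble first -- an assumption for this sketch).
    absmax:    one scale factor per block of `blocksize` output elements.
    blocksize: number of output elements sharing one absmax entry.
    """
    out = []
    for byte in packed:
        for code in (byte >> 4, byte & 0x0F):  # unpack two codes per byte
            scale = absmax[len(out) // blocksize]  # pick this block's scale
            out.append(NF4_CODE[code] * scale)
    return out
```

For example, code 0 maps to -1.0 and code 15 to 1.0, so a byte 0x0F with absmax 2.0 dequantizes to [-2.0, 2.0].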

Usage Examples

Dequantize 4-bit on HPU

import torch
import bitsandbytes as bnb  # importing bitsandbytes registers the HPU kernels
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# On a Habana Gaudi device, the HPU kernel is dispatched automatically
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",  # only NF4 is supported on HPU
    ),
    device_map="hpu",
)
# Dequantization uses torch.ops.hpu.dequantize_nf4 internally
