
Heuristic: bigscience-workshop/petals NF4 Quantization Default on CUDA

From Leeroopedia

Knowledge Sources
Domains Optimization, Memory_Management
Last Updated 2026-02-09 13:00 GMT

Overview

Petals defaults to NF4 (4-bit NormalFloat) quantization on CUDA GPUs, reducing model weight memory roughly 4x compared to float16 while preserving model quality.

Description

When a server starts on a CUDA device without an explicit `--quant_type`, Petals automatically selects NF4 quantization (introduced in the QLoRA paper). NF4 uses a 4-bit NormalFloat representation with double quantization (compressed quantization statistics), stored in chunks of blocksize=64. On non-CUDA devices, quantization is disabled (QuantType.NONE), since bitsandbytes requires CUDA. INT8 quantization (LLM.int8()) is available as an alternative, with an outlier threshold of 6.0.
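To illustrate the LLM.int8() outlier threshold mentioned above, here is a minimal sketch (not bitsandbytes internals): hidden-state columns whose magnitude exceeds the threshold are kept in higher precision, while the rest are quantized to int8. The example matrix and helper name are illustrative.

```python
import numpy as np

# Outlier threshold from LLM.int8(); columns exceeding it stay in float.
THRESHOLD = 6.0

def split_outliers(x: np.ndarray):
    # A column is an "outlier feature" if any entry exceeds the threshold.
    outlier_cols = np.abs(x).max(axis=0) > THRESHOLD
    return x[:, outlier_cols], x[:, ~outlier_cols]

x = np.array([[0.5, 7.2, -1.0],
              [0.3, -6.5, 2.0]])
fp_part, int8_part = split_outliers(x)
# Column 1 has magnitude > 6.0, so it is kept in the float path;
# columns 0 and 2 go to the int8 path.
```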

Usage

Applied during server initialization when `quant_type` is None. Override with `--quant_type none` for full precision, `--quant_type int8` for 8-bit, or `--quant_type nf4` for explicit 4-bit.
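The interaction between the `--quant_type` flag and the CUDA default can be sketched as follows. This is a minimal illustration, not Petals' actual argument-parsing code; the function name `resolve_quant_type` is hypothetical.

```python
from enum import Enum
from typing import Optional

class QuantType(Enum):
    NONE = "none"
    INT8 = "int8"
    NF4 = "nf4"

def resolve_quant_type(flag: Optional[str], device_type: str) -> QuantType:
    if flag is not None:
        return QuantType(flag)  # an explicit --quant_type always wins
    # No flag given: NF4 on CUDA, no quantization elsewhere
    return QuantType.NF4 if device_type == "cuda" else QuantType.NONE

resolve_quant_type(None, "cuda")    # QuantType.NF4 (the default)
resolve_quant_type("none", "cuda")  # QuantType.NONE (full precision)
resolve_quant_type(None, "cpu")     # QuantType.NONE (no bitsandbytes)
```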

The Insight (Rule of Thumb)

  • Action: Use NF4 quantization (default) on CUDA GPUs to maximize the number of blocks a server can host.
  • Value: NF4 (4-bit) reduces model weight memory by ~4x compared to float16. INT8 reduces by ~2x.
  • Trade-off: Slight quality degradation from quantization; NF4 has slower parallel forward (see Short_Inference_Pool_Merging). Non-CUDA devices get no quantization.
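The memory figures in the bullets above follow from simple bits-per-parameter arithmetic. A back-of-envelope sketch, using a hypothetical 7B-parameter model (the parameter count is illustrative, and quantization statistics overhead is ignored):

```python
params = 7_000_000_000

bytes_fp16 = params * 2    # 16 bits per parameter
bytes_int8 = params * 1    # 8 bits per parameter
bytes_nf4  = params * 0.5  # 4 bits per parameter

print(bytes_fp16 / bytes_nf4)   # ~4x reduction vs float16
print(bytes_fp16 / bytes_int8)  # ~2x reduction vs float16
```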

Reasoning

The primary bottleneck for Petals servers is GPU memory. More blocks served per server means higher throughput for the swarm. NF4 quantization from QLoRA provides near-lossless compression at 4 bits per parameter using a normal float distribution optimized for neural network weights. The blocksize of 64 and compressed statistics (double quantization) are hardcoded defaults from the QLoRA paper.
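The effect of double quantization can be quantified. Per the QLoRA paper, each 64-weight block stores one absmax scale; without double quantization that scale is fp32 (32 bits), while double quantization re-quantizes the scales to 8 bits in second-level blocks of 256 (the second-level blocksize is the QLoRA paper's choice, stated here as an assumption about Petals' bitsandbytes backend):

```python
blocksize = 64

# One fp32 scale per 64-weight block: 4 + 32/64 = 4.5 bits/param
plain = 4 + 32 / blocksize

# Scales re-quantized to 8 bits, with one fp32 second-level scale
# per 256 first-level scales: ~4.127 bits/param
double = 4 + 8 / blocksize + 32 / (blocksize * 256)

print(round(plain, 3), round(double, 3))
```

Double quantization thus saves roughly 0.37 bits per parameter, which matters when a server hosts many transformer blocks.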

Code Evidence

Default quantization selection from `src/petals/server/server.py:189-191`:

if quant_type is None:
    quant_type = QuantType.NF4 if device.type == "cuda" else QuantType.NONE

NF4 quantization with compressed statistics from `src/petals/utils/convert_block.py:97-111`:

elif quant_type == QuantType.NF4:
    compress_statistics = True
    model._modules[n] = bnb.nn.LinearNF4(
        module.in_features, module.out_features, module.bias is not None,
        compress_statistics=compress_statistics,
    )
    model._modules[n].weight = bnb.nn.Params4bit(
        module.weight.data, requires_grad=False,
        quant_type="nf4", blocksize=64,
        compress_statistics=compress_statistics,
    ).to(module.weight.dtype)
