Heuristic: BigScience Workshop Petals NF4 Quantization Default on CUDA
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Memory_Management |
| Last Updated | 2026-02-09 13:00 GMT |
Overview
Petals defaults to NF4 (4-bit NormalFloat) quantization on CUDA GPUs, cutting model weight memory to roughly a quarter of float16 while largely preserving model quality.
Description
When a server starts on a CUDA device without an explicit `--quant_type`, Petals automatically selects NF4 quantization (from the QLoRA paper). NF4 uses a 4-bit NormalFloat representation with double quantization (compressed statistics), stored in blocks of 64 values (blocksize=64). On non-CUDA devices, quantization is disabled (QuantType.NONE), since bitsandbytes requires CUDA. INT8 quantization (LLM.int8()) is also available as an alternative, using an outlier threshold of 6.0.
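To see why this roughly quarters weight memory, the per-parameter cost can be worked out from the block layout. A sketch of the arithmetic, assuming one fp32 absmax per 64-weight block and a second-level blocksize of 256 for double quantization (the QLoRA paper defaults; the second-level blocksize is an assumption, not taken from the Petals source):

```python
# Approximate bits per parameter under NF4.
# Assumptions: one fp32 absmax per block of 64 weights; double quantization
# compresses those absmax values to 8 bits each, with one fp32 second-level
# constant per 256 blocks (QLoRA defaults, not from the Petals source).
WEIGHT_BITS = 4
BLOCKSIZE = 64       # matches blocksize=64 in convert_block.py
DQ_BLOCKSIZE = 256   # second-level blocksize (assumption)

plain_overhead = 32 / BLOCKSIZE                          # fp32 absmax per block
dq_overhead = 8 / BLOCKSIZE + 32 / (BLOCKSIZE * DQ_BLOCKSIZE)

bits_plain = WEIGHT_BITS + plain_overhead                # 4.5 bits/param
bits_dq = WEIGHT_BITS + dq_overhead                      # ~4.13 bits/param

print(f"NF4 without double quantization: {bits_plain:.3f} bits/param")
print(f"NF4 with double quantization:    {bits_dq:.3f} bits/param")
```

Double quantization is what shaves the absmax overhead from 0.5 to about 0.13 bits per parameter, bringing NF4 close to a true 4x reduction versus float16's 16 bits.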
Usage
Applied during server initialization when `quant_type` is None. Override with `--quant_type none` for full precision, `--quant_type int8` for 8-bit, or `--quant_type nf4` for explicit 4-bit.
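The override semantics can be sketched as follows. The `QuantType` stub and the `resolve_quant_type` helper below are illustrative stand-ins, not the actual Petals API; only the default rule itself mirrors the source:

```python
from enum import Enum

# Stand-in for Petals' QuantType enum; this stub and resolve_quant_type are
# illustrative, not the real Petals API.
class QuantType(Enum):
    NONE = 0
    INT8 = 1
    NF4 = 2

def resolve_quant_type(cli_value, device_type):
    """Explicit --quant_type wins; otherwise default to NF4 on CUDA only."""
    if cli_value is not None:
        return QuantType[cli_value.upper()]
    return QuantType.NF4 if device_type == "cuda" else QuantType.NONE
```

So `--quant_type none` on a CUDA device yields full precision, while omitting the flag on a CPU-only host also yields `QuantType.NONE`, because bitsandbytes kernels are unavailable there.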
The Insight (Rule of Thumb)
- Action: Use NF4 quantization (default) on CUDA GPUs to maximize the number of blocks a server can host.
- Value: NF4 (4-bit) reduces model weight memory by ~4x compared to float16. INT8 reduces by ~2x.
- Trade-off: Slight quality degradation from quantization; NF4 has a slower parallel forward pass (see Short_Inference_Pool_Merging). Non-CUDA devices get no quantization.
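The capacity gain behind the rule of thumb can be illustrated with hypothetical numbers. The 24 GB budget and 2.2 GB per-block float16 size below are made-up for illustration, not Petals measurements, and the sketch counts weights only:

```python
# Hypothetical sizing sketch: how many transformer blocks fit in a fixed GPU
# memory budget at each precision. The 24 GB budget and 2.2 GB per-block
# float16 size are illustrative numbers, not Petals measurements.
GPU_MEMORY_GB = 24.0
FP16_BLOCK_GB = 2.2

blocks = {}
for name, bits in [("float16", 16), ("int8", 8), ("nf4", 4.127)]:
    block_gb = FP16_BLOCK_GB * bits / 16  # scale weight size by bit width
    blocks[name] = int(GPU_MEMORY_GB // block_gb)

print(blocks)
```

Real capacity is lower at every precision, since attention caches and activations also consume GPU memory, but the relative ~4x advantage of NF4 over float16 holds.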
Reasoning
The primary bottleneck for Petals servers is GPU memory. More blocks served per server means higher throughput for the swarm. NF4 quantization from QLoRA provides near-lossless compression at 4 bits per parameter, using quantization levels matched to the roughly normal distribution of neural network weights. The blocksize of 64 and compressed statistics (double quantization) are hardcoded defaults taken from the QLoRA paper.
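The scheme itself is simple to sketch: each block of 64 weights is scaled by its absolute maximum, and each scaled weight snaps to the nearest of 16 fixed NormalFloat levels. The level constants below are the published QLoRA/bitsandbytes values; the quantize/dequantize functions are a plain-Python sketch, not the bitsandbytes CUDA kernel:

```python
import random

# The 16 NF4 levels on [-1, 1], as published for QLoRA/bitsandbytes.
NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]

def quantize_block(weights):
    """Quantize one block (e.g. 64 weights) to 4-bit indices plus an absmax."""
    absmax = max(abs(w) for w in weights) or 1.0
    indices = [
        min(range(16), key=lambda i: abs(w / absmax - NF4_LEVELS[i]))
        for w in weights
    ]
    return indices, absmax

def dequantize_block(indices, absmax):
    """Map 4-bit indices back to approximate weights."""
    return [NF4_LEVELS[i] * absmax for i in indices]

random.seed(0)
block = [random.gauss(0.0, 0.02) for _ in range(64)]  # typical weight scale
idx, scale = quantize_block(block)
restored = dequantize_block(idx, scale)
max_err = max(abs(a - b) for a, b in zip(block, restored))
```

Because the levels are spaced densely near zero, where normally distributed weights concentrate, the reconstruction error stays small relative to the block's absmax; the widest gap between adjacent levels bounds it.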
Code Evidence
Default quantization selection from `src/petals/server/server.py:189-191`:
```python
if quant_type is None:
    quant_type = QuantType.NF4 if device.type == "cuda" else QuantType.NONE
```
NF4 quantization with compressed statistics from `src/petals/utils/convert_block.py:97-111`:
```python
elif quant_type == QuantType.NF4:
    compress_statistics = True
    model._modules[n] = bnb.nn.LinearNF4(
        module.in_features, module.out_features, module.bias is not None,
        compress_statistics=compress_statistics,
    )
    model._modules[n].weight = bnb.nn.Params4bit(
        module.weight.data, requires_grad=False,
        quant_type="nf4", blocksize=64,
        compress_statistics=compress_statistics,
    ).to(module.weight.dtype)
```