Heuristic:Neuml Txtai Model Quantization Defaults
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Deep_Learning |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Default quantization and LoRA configurations for memory-efficient model training in txtai, with quantization automatically disabled when CUDA is unavailable.
Description
txtai provides sensible defaults for model quantization (4-bit NF4 via bitsandbytes) and LoRA fine-tuning (rank 16, alpha 8). When `quantize=True` is passed to the trainer, it configures 4-bit NF4 quantization with double quantization and a bfloat16 compute dtype. When `lora=True` is passed, it configures LoRA with rank 16, alpha 8, dropout 0.05, and no bias training, targeting all linear layers. Quantization requires CUDA: on non-CUDA platforms the quantization setting is silently replaced with `None`, so the model loads without quantization. If the tokenizer has no pad token, the pad token is set to the EOS token.
Usage
Apply these defaults when fine-tuning large language models with limited GPU memory. Set `quantize=True` to load the base model in 4-bit, or `lora=True` for parameter-efficient fine-tuning. Combine both for QLoRA-style training. To override either set of defaults, pass a dictionary instead of `True`; the dictionary is used as given rather than merged with the defaults.
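
The bool-or-dictionary convention can be sketched in plain Python. This is a minimal illustration rather than txtai's own code: `resolve`, `resolve_quantize`, and `resolve_lora` are hypothetical helper names, and the default values are copied from the defaults listed on this page.

```python
# Hypothetical helpers illustrating the bool-or-dict convention; not part of txtai.
QUANTIZE_DEFAULTS = {
    "load_in_4bit": True,
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "bfloat16",
}

LORA_DEFAULTS = {
    "r": 16,
    "lora_alpha": 8,
    "target_modules": "all-linear",
    "lora_dropout": 0.05,
    "bias": "none",
}


def resolve(option, defaults):
    """Return None when disabled, a copy of the defaults for True, or the caller's dict."""
    if not option:
        return None
    if isinstance(option, bool):
        return dict(defaults)
    return option


def resolve_quantize(quantize):
    return resolve(quantize, QUANTIZE_DEFAULTS)


def resolve_lora(lora):
    return resolve(lora, LORA_DEFAULTS)


# QLoRA-style setup: both sets of defaults enabled
qlora = (resolve_quantize(True), resolve_lora(True))

# Custom LoRA: the dictionary is used as given, so unspecified keys are NOT filled in
custom = resolve_lora({"r": 32, "lora_alpha": 64, "target_modules": "all-linear"})
```

Because a dictionary bypasses the defaults entirely in this sketch, a custom configuration has to spell out every key it needs.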
The Insight (Rule of Thumb)
- Quantization Defaults (when `quantize=True`):
  - `load_in_4bit`: True
  - `bnb_4bit_use_double_quant`: True
  - `bnb_4bit_quant_type`: "nf4"
  - `bnb_4bit_compute_dtype`: "bfloat16"
- LoRA Defaults (when `lora=True`):
  - `r`: 16 (rank)
  - `lora_alpha`: 8
  - `target_modules`: "all-linear"
  - `lora_dropout`: 0.05
  - `bias`: "none"
- Pad Token: Defaults to EOS token if not set
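- Padding Side: Set to "left" when `batch_size` is passed for generation and the tokenizer has no pad token ID
- Trade-off: 4-bit quantization cuts weight memory by roughly 75% relative to 16-bit weights, at the cost of a minor quality loss. LoRA typically cuts trainable parameters by around 99%, at the cost of limiting how far the model can be adapted.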
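
The trade-off figures can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the 7-billion-parameter model and the 4096 x 4096 layer are assumed example sizes, the memory estimate covers weights only (it ignores quantization constants, activations, and optimizer state), and the helper names are hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for model weights only, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one linear layer: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Assumed example: a 7-billion-parameter model
params = 7_000_000_000
fp16_gb = weight_memory_gb(params, 16)  # 16-bit weights
nf4_gb = weight_memory_gb(params, 4)    # 4-bit weights
reduction = 1 - nf4_gb / fp16_gb        # fraction of weight memory saved

# Assumed example: one 4096 x 4096 linear layer with the default rank of 16
full = 4096 * 4096
adapter = lora_params(4096, 4096, 16)
trainable_fraction = adapter / full

# Standard LoRA scales its low-rank update by lora_alpha / r
scaling = 8 / 16

print(f"fp16: {fp16_gb} GB, nf4: {nf4_gb} GB, saved: {reduction:.0%}")
print(f"adapter params: {adapter}, fraction of layer: {trainable_fraction:.2%}")
print(f"LoRA scaling: {scaling}")
```

With the default `lora_alpha` of 8 and rank of 16, the scaling factor is 0.5, so the adapter's contribution is halved relative to an unscaled update.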
Reasoning
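NF4 (4-bit NormalFloat) is the data type introduced in the QLoRA paper, which reports that it preserves quality better than 4-bit integer or 4-bit float formats for normally distributed weights. Double quantization quantizes the quantization constants themselves, further reducing the memory footprint. Rank 16 is a widely used LoRA default that balances capacity with parameter efficiency. Targeting all linear layers adapts every linear projection rather than a hand-picked subset, which keeps adaptation consistent across the model. The bfloat16 compute dtype is preferred over float16 because it keeps the exponent range of float32, which improves training stability on GPUs that support it natively (Ampere and newer).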
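Quantization is automatically disabled on non-CUDA platforms because the bitsandbytes quantization path used here requires a CUDA GPU. This avoids cryptic errors and lets the same code run in both GPU and CPU environments. The fallback is silent, so on a CPU-only machine the model loads at its normal precision with no warning.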
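
The CUDA gate can be sketched as a small function. This is a minimal illustration under stated assumptions: `gate_quantization` is a hypothetical name, availability is passed in as a flag (in practice it would come from `torch.cuda.is_available()`), and the warning is an addition for illustration, since the behavior described above is silent.

```python
import warnings


def gate_quantization(quantization, cuda_available):
    """Return the config on CUDA, otherwise None; warns when a config is dropped."""
    if quantization and not cuda_available:
        # Assumption for illustration: surface the fallback instead of staying silent
        warnings.warn("CUDA not available; loading model without quantization")
    return quantization if cuda_available else None


config = {"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"}

on_gpu = gate_quantization(config, cuda_available=True)   # config is kept
on_cpu = gate_quantization(config, cuda_available=False)  # config is dropped
```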
Code Evidence
Default quantization settings from `pipeline/train/hftrainer.py:289-297`:

    if quantize:
        if isinstance(quantize, bool):
            quantize = {
                "load_in_4bit": True,
                "bnb_4bit_use_double_quant": True,
                "bnb_4bit_quant_type": "nf4",
                "bnb_4bit_compute_dtype": "bfloat16",
            }

CUDA requirement enforcement from `pipeline/train/hftrainer.py:253-254`:

    quantization = quantization if torch.cuda.is_available() else None

Default LoRA settings from `pipeline/train/hftrainer.py:342-344`:

    if isinstance(lora, bool):
        lora = {"r": 16, "lora_alpha": 8, "target_modules": "all-linear", "lora_dropout": 0.05, "bias": "none"}

Pad token fallback from `pipeline/train/hftrainer.py:98-99`:

    tokenizer.pad_token = tokenizer.pad_token if tokenizer.pad_token is not None else tokenizer.eos_token

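
The pad token fallback can be exercised without loading a model. The sketch below uses `SimpleNamespace` objects as stand-ins for a tokenizer; `ensure_pad_token` is a hypothetical helper name and the token strings are assumed examples.

```python
from types import SimpleNamespace


def ensure_pad_token(tokenizer):
    """Fall back to the EOS token only when no pad token is configured."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer


# Stand-in tokenizers: one without a pad token, one with its own
no_pad = ensure_pad_token(SimpleNamespace(pad_token=None, eos_token="</s>"))
has_pad = ensure_pad_token(SimpleNamespace(pad_token="<pad>", eos_token="</s>"))

print(no_pad.pad_token, has_pad.pad_token)
```
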
Left padding for generation batching from `pipeline/llm/huggingface.py:148-150`:

    if "batch_size" in kwargs and self.pipeline.tokenizer.pad_token_id is None:
        self.pipeline.tokenizer.pad_token_id = tokenid
        self.pipeline.tokenizer.padding_side = "left"

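
Left padding matters for batched generation because a decoder-only model continues from the last position of each row. The sketch below is a plain-Python illustration with made-up token IDs; `pad_batch` is a hypothetical helper, not a txtai or tokenizer API.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token ID lists to equal length and build matching attention masks."""
    width = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        fill = [pad_id] * (width - len(seq))
        if side == "left":
            padded.append(fill + seq)
            masks.append([0] * len(fill) + [1] * len(seq))
        else:
            padded.append(seq + fill)
            masks.append([1] * len(seq) + [0] * len(fill))
    return padded, masks


batch = [[11, 12, 13], [21]]

left, left_mask = pad_batch(batch, pad_id=0, side="left")
right, right_mask = pad_batch(batch, pad_id=0, side="right")

# With left padding, the final position of every row is a real prompt token
last_left = [row[-1] for row in left]
# With right padding, the shorter row ends in padding
last_right = [row[-1] for row in right]
```

With left padding, new tokens are appended directly after each prompt; with right padding, the shorter prompt would be followed by pad tokens before generation starts.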