Heuristic:Neuml Txtai Model Quantization Defaults

Knowledge Sources	txtai bitsandbytes
Domains	Optimization, Deep_Learning
Last Updated	2026-02-10 00:00 GMT

Overview

Default quantization and LoRA configurations for memory-efficient model training and inference with automatic CUDA requirement enforcement.

Description

txtai provides defaults for model quantization (4-bit NF4 via bitsandbytes) and for LoRA fine-tuning. When `quantize=True` is passed to the trainer, it configures 4-bit NF4 quantization with double quantization and a bfloat16 compute dtype. When `lora=True` is passed, it configures LoRA with rank 16, alpha 8, dropout 0.05, and no bias terms, targeting all linear layers. Quantization requires CUDA: on platforms without CUDA the quantization setting is silently replaced with `None`, so no quantization is applied. The pad token is set to the EOS token if the tokenizer does not define one.

Usage

Apply these defaults when fine-tuning large language models with limited GPU memory. Set `quantize=True` to load the model in 4-bit, `lora=True` for parameter-efficient fine-tuning, or both for QLoRA-style training. To customize either setting, pass a dictionary of options in place of `True`.
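A minimal usage sketch, assuming txtai is installed with its pipeline extras and that training goes through `HFTrainer` from `txtai.pipeline`. The wrapper name `qlora_train`, the `task` value, and the numbers in `CUSTOM_LORA` are illustrative choices, not values taken from the source.

```python
# Custom LoRA settings (illustrative values) passed as a dictionary instead of True
CUSTOM_LORA = {
    "r": 32,
    "lora_alpha": 16,
    "target_modules": "all-linear",
    "lora_dropout": 0.05,
    "bias": "none",
}


def qlora_train(base, train_data, quantize=True, lora=True, **training_args):
    """Hypothetical wrapper: fine-tune `base` with 4-bit quantization plus LoRA."""
    # Imported lazily so this module can be loaded without txtai installed
    from txtai.pipeline import HFTrainer

    trainer = HFTrainer()
    # quantize and lora accept True (use the defaults) or a dict (override them)
    return trainer(
        base,
        train_data,
        task="language-generation",
        quantize=quantize,
        lora=lora,
        **training_args,
    )
```

Calling `qlora_train(base, rows)` relies on the defaults, while `qlora_train(base, rows, lora=CUSTOM_LORA)` overrides only the LoRA settings. On a machine without CUDA the quantization request is dropped, as described above.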

The Insight (Rule of Thumb)

Quantization Defaults (when `quantize=True`):
- `load_in_4bit`: True
- `bnb_4bit_use_double_quant`: True
- `bnb_4bit_quant_type`: "nf4"
- `bnb_4bit_compute_dtype`: "bfloat16"

LoRA Defaults (when `lora=True`):
- `r`: 16 (rank)
- `lora_alpha`: 8
- `target_modules`: "all-linear"
- `lora_dropout`: 0.05
- `bias`: "none"

Pad Token: Defaults to the EOS token if not set

Padding Side: Set to "left" when batching for generation and the tokenizer has no pad token id

Trade-off: 4-bit quantization cuts weight memory by roughly 75% relative to 16-bit weights, at the cost of a minor quality loss. LoRA cuts trainable parameters by roughly 99%, at the cost of limiting how much the model can change.
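The percentages in the trade-off can be checked with back-of-the-envelope arithmetic. The sketch below assumes a hypothetical 7B-parameter model and a hypothetical 4096 x 4096 linear layer. It counts weight storage only, ignoring quantization constants, activations, and optimizer state. The helper names are illustrative.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (10**9 bytes), ignoring quantization constants."""
    return num_params * bits_per_weight / 8 / 1e9


def lora_trainable_params(d_in, d_out, r=16):
    """Parameters LoRA adds to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Hypothetical 7B-parameter model
params = 7_000_000_000
fp16_gb = weight_memory_gb(params, 16)
nf4_gb = weight_memory_gb(params, 4)
memory_saving = 1 - nf4_gb / fp16_gb  # fraction of weight memory saved

# Hypothetical 4096 x 4096 projection layer
full = 4096 * 4096
adapter = lora_trainable_params(4096, 4096, r=16)
trainable_fraction = adapter / full  # share of the layer that LoRA trains

# LoRA scales its low-rank update by lora_alpha / r
scaling = 8 / 16
```

With these assumptions the weights shrink from 14 GB to 3.5 GB, the adapter trains under 1% of the layer's parameters, and the default alpha of 8 with rank 16 gives a scaling factor of 0.5.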

Reasoning

NF4 (4-bit NormalFloat) is a data type designed for normally distributed weights, and the QLoRA paper reports that it outperforms the other 4-bit data types it was compared against. Double quantization further reduces the memory footprint by quantizing the quantization constants themselves. Rank 16 for LoRA is a widely used default that balances capacity with parameter efficiency. Targeting all linear layers ensures consistent adaptation across the model. The bfloat16 compute dtype is preferred over float16 for training stability on modern GPUs (Ampere+).

Quantization is automatically disabled on non-CUDA platforms because the bitsandbytes quantization path is treated as CUDA-only. This avoids cryptic errors and lets the same code run in both GPU and CPU environments, although a CPU run proceeds without quantization and without any warning.
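The gating behaviour can be sketched as a pure function. This is a simplification rather than txtai's code: `resolve_quantize` is a hypothetical name, it merges the default-expansion step and the CUDA check into one place, and CUDA availability is passed in as a boolean (in practice the result of `torch.cuda.is_available()`) so the sketch runs without torch.

```python
DEFAULT_QUANTIZE = {
    "load_in_4bit": True,
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "bfloat16",
}


def resolve_quantize(quantize, cuda_available):
    """Return the effective quantization settings, or None when quantization is off."""
    # No request, or no CUDA device: quantization is dropped silently
    if not quantize or not cuda_available:
        return None
    # True expands to the defaults
    if isinstance(quantize, bool):
        return dict(DEFAULT_QUANTIZE)
    # A dictionary is used as given
    return dict(quantize)
```

Because the fallback is silent, a caller that depends on the memory savings could check for a `None` result and log or raise before training starts.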

Code Evidence

Default quantization settings from `pipeline/train/hftrainer.py:289-297`:

if quantize:
    if isinstance(quantize, bool):
        quantize = {
            "load_in_4bit": True,
            "bnb_4bit_use_double_quant": True,
            "bnb_4bit_quant_type": "nf4",
            "bnb_4bit_compute_dtype": "bfloat16",
        }

CUDA requirement enforcement from `pipeline/train/hftrainer.py:253-254`:

quantization = quantization if torch.cuda.is_available() else None

Default LoRA settings from `pipeline/train/hftrainer.py:342-344`:

if isinstance(lora, bool):
    lora = {"r": 16, "lora_alpha": 8, "target_modules": "all-linear", "lora_dropout": 0.05, "bias": "none"}
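The keys in both default dictionaries match keyword arguments of `BitsAndBytesConfig` in transformers and `LoraConfig` in peft. The sketch below shows how such dictionaries can be turned into config objects, assuming both libraries (and torch) are installed and the peft version supports `"all-linear"`. The helper names are hypothetical, and the excerpts above do not show whether txtai constructs the objects in exactly this way.

```python
QUANTIZE_DEFAULTS = {
    "load_in_4bit": True,
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "bfloat16",
}

LORA_DEFAULTS = {
    "r": 16,
    "lora_alpha": 8,
    "target_modules": "all-linear",
    "lora_dropout": 0.05,
    "bias": "none",
}


def to_bnb_config(settings=None):
    """Hypothetical helper: build a BitsAndBytesConfig from a settings dict."""
    # Lazy import keeps this module importable without transformers
    from transformers import BitsAndBytesConfig

    return BitsAndBytesConfig(**(settings or QUANTIZE_DEFAULTS))


def to_lora_config(settings=None):
    """Hypothetical helper: build a peft LoraConfig from a settings dict."""
    # Lazy import keeps this module importable without peft
    from peft import LoraConfig

    return LoraConfig(**(settings or LORA_DEFAULTS))
```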

Pad token fallback from `pipeline/train/hftrainer.py:98-99`:

tokenizer.pad_token = tokenizer.pad_token if tokenizer.pad_token is not None else tokenizer.eos_token

Left padding for generation batching from `pipeline/llm/huggingface.py:148-150`:

if "batch_size" in kwargs and self.pipeline.tokenizer.pad_token_id is None:
    self.pipeline.tokenizer.pad_token_id = tokenid
    self.pipeline.tokenizer.padding_side = "left"
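Left padding matters for batched generation because a decoder-only model appends new tokens after the last position of each row. With right padding, a short prompt would end in pad tokens and the continuation would follow the padding instead of the prompt. The sketch below illustrates the difference with plain lists. The `pad_batch` helper and the token ids are hypothetical.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token id lists to a common length on the chosen side."""
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [pad_id] * (width - len(seq))
        # Left padding keeps the real tokens at the end of each row
        padded.append(fill + list(seq) if side == "left" else list(seq) + fill)
    return padded


batch = [[11, 12, 13], [21]]
left = pad_batch(batch, pad_id=0, side="left")
right = pad_batch(batch, pad_id=0, side="right")
```

Here `left` ends every row with a real prompt token, while `right` ends the shorter row with pad ids.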
