
Heuristic:Neuml Txtai Model Quantization Defaults

From Leeroopedia
Revision as of 10:49, 16 February 2026 by Admin (auto-imported from heuristics/Neuml_Txtai_Model_Quantization_Defaults.md)

Knowledge Sources
Domains: Optimization, Deep_Learning
Last Updated: 2026-02-10 00:00 GMT

Overview

Default quantization and LoRA configurations for memory-efficient model training and inference. Quantization is applied only when CUDA is available.

Description

txtai provides defaults for model quantization (4-bit NF4 via bitsandbytes) and LoRA fine-tuning. When `quantize=True` is passed to the trainer, it configures 4-bit NF4 quantization with double quantization and a bfloat16 compute dtype. When `lora=True` is passed, it configures LoRA with rank 16, alpha 8, dropout 0.05, and no bias training, targeting all linear layers. Quantization requires CUDA: on non-CUDA platforms the quantization setting is silently replaced with `None`. The pad token is set to the EOS token if the tokenizer does not define one.

Usage

Apply these defaults when fine-tuning large language models with limited GPU memory. Set `quantize=True` to load the model in 4-bit precision, `lora=True` for parameter-efficient fine-tuning, or both for QLoRA-style training. To customize either setting, pass a dictionary instead of `True`.
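
A minimal usage sketch, assuming txtai is installed with its training extras and that `HFTrainer` is called with the `quantize` and `lora` arguments described above. The `task` value, the base model name, and the training rows are placeholders chosen for illustration, not values taken from the source.

```python
def qlora_options():
    """Keyword arguments that enable both defaults (QLoRA-style training)."""
    return {
        "task": "language-generation",  # assumed task for LLM fine-tuning
        "quantize": True,               # request the 4-bit NF4 defaults
        "lora": True,                   # request the rank 16, alpha 8 defaults
    }


def train(base, rows):
    """Run training. The import is lazy so this file loads without txtai."""
    from txtai.pipeline import HFTrainer

    trainer = HFTrainer()
    # The trainer returns the trained model and its tokenizer
    return trainer(base, rows, **qlora_options())


# Example call (placeholder model name and data; downloads the base model):
# model, tokenizer = train("your-org/your-base-model", [{"text": "example training text"}])
```

On a machine without CUDA, the same call is expected to proceed without quantization, so memory use can be much higher than on a GPU host.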

The Insight (Rule of Thumb)

  • Quantization Defaults (when `quantize=True`):
    • `load_in_4bit`: True
    • `bnb_4bit_use_double_quant`: True
    • `bnb_4bit_quant_type`: "nf4"
    • `bnb_4bit_compute_dtype`: "bfloat16"
  • LoRA Defaults (when `lora=True`):
    • `r`: 16 (rank)
    • `lora_alpha`: 8
    • `target_modules`: "all-linear"
    • `lora_dropout`: 0.05
    • `bias`: "none"
  • Pad Token: Defaults to EOS token if not set
  • Padding Side: Set to "left" when batching for generation and the tokenizer has no pad token ID
  • Trade-off: 4-bit quantization cuts weight memory by roughly 75% relative to 16-bit weights, at the cost of a minor quality loss. LoRA cuts trainable parameters by roughly 99%, at the cost of limiting how much the model can change.
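
The trade-off figures can be checked with back-of-the-envelope arithmetic. The sketch below assumes a hypothetical 7B-parameter model and a hypothetical 4096 x 4096 linear layer. It counts weights only and ignores quantization constants, activations, and optimizer state, so real savings will differ.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory for model weights only, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


def lora_adapter_params(d_in, d_out, rank):
    """LoRA adds two matrices per linear layer: (d_in x rank) and (rank x d_out)."""
    return rank * (d_in + d_out)


# Hypothetical 7B-parameter model: 16-bit weights versus 4-bit weights
num_params = 7_000_000_000
fp16_gib = weight_memory_gib(num_params, 16)
nf4_gib = weight_memory_gib(num_params, 4)
weight_saving = 1 - nf4_gib / fp16_gib

# Hypothetical 4096 x 4096 linear layer with the default rank of 16
full_params = 4096 * 4096
adapter_params = lora_adapter_params(4096, 4096, 16)
trainable_fraction = adapter_params / full_params

# LoRA scales the adapter update by lora_alpha / r
scaling = 8 / 16

print(f"16-bit: {fp16_gib:.1f} GiB, 4-bit: {nf4_gib:.1f} GiB, saving: {weight_saving:.0%}")
print(f"trainable fraction per layer: {trainable_fraction:.2%}, scaling: {scaling}")
```

With alpha 8 and rank 16, the adapter update is scaled by 0.5, which is more conservative than the common choice of setting alpha equal to or greater than the rank.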

Reasoning

NF4 (4-bit NormalFloat) is the data type introduced in the QLoRA paper, which reports that it preserves quality better than other 4-bit formats for normally distributed LLM weights. Double quantization further reduces memory by quantizing the quantization constants themselves. Rank 16 for LoRA is a widely used default that balances capacity with parameter efficiency. Targeting all linear layers applies adaptation consistently across the model rather than to a hand-picked subset of layers. The bfloat16 compute dtype is preferred over float16 for training stability because of its wider exponent range; it requires a GPU with bfloat16 support (Ampere or newer).
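
The effect of double quantization can be worked through numerically. The sketch below assumes the block sizes and bit widths described in the QLoRA paper (one 32-bit constant per 64 weights, re-quantized to 8 bits with one 32-bit second-level constant per 256 first-level constants). These figures are assumptions about the configuration, not values read from txtai.

```python
def constant_overhead_bits(block_size=64, constant_bits=32):
    """Extra bits per weight when each block stores one full-precision constant."""
    return constant_bits / block_size


def double_quant_overhead_bits(
    block_size=64,
    second_block_size=256,
    quantized_constant_bits=8,
    second_constant_bits=32,
):
    """Extra bits per weight when the constants are themselves quantized."""
    first_level = quantized_constant_bits / block_size
    second_level = second_constant_bits / (block_size * second_block_size)
    return first_level + second_level


single = constant_overhead_bits()
double = double_quant_overhead_bits()
saved_bits_per_weight = single - double

print(f"single: {single:.3f}, double: {double:.3f}, saved: {saved_bits_per_weight:.3f} bits per weight")
```

Under these assumptions the overhead drops from 0.5 to about 0.127 bits per weight, a small per-weight gain that adds up across billions of parameters.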

Quantization is automatically disabled on non-CUDA platforms because the bitsandbytes integration is built around CUDA GPUs. This avoids cryptic errors and lets the same code run in both GPU and CPU environments, though without CUDA the quantization request is dropped silently rather than reported.

Code Evidence

Default quantization settings from `pipeline/train/hftrainer.py:289-297`:

if quantize:
    if isinstance(quantize, bool):
        quantize = {
            "load_in_4bit": True,
            "bnb_4bit_use_double_quant": True,
            "bnb_4bit_quant_type": "nf4",
            "bnb_4bit_compute_dtype": "bfloat16",
        }

CUDA requirement enforcement from `pipeline/train/hftrainer.py:253-254`:

quantization = quantization if torch.cuda.is_available() else None
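
The two excerpts above can be combined into one standalone function to make the resulting behavior easier to test. This is a sketch of the logic, not txtai's code: `resolve_quantization` and `cuda_available` are hypothetical names, and the CUDA flag is passed in explicitly so the function can be exercised on any machine.

```python
QUANTIZE_DEFAULTS = {
    "load_in_4bit": True,
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "bfloat16",
}


def resolve_quantization(quantize, has_cuda):
    """Return the effective quantization settings, or None when disabled."""
    if not quantize:
        return None

    # True expands to the defaults; a dictionary is used as given
    config = dict(QUANTIZE_DEFAULTS) if isinstance(quantize, bool) else dict(quantize)

    # Without CUDA the request is dropped rather than raising an error
    return config if has_cuda else None


def cuda_available():
    """Report CUDA availability, treating a missing torch install as no CUDA."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


effective = resolve_quantization(True, cuda_available())
print("quantization active" if effective else "quantization disabled")
```

Because the fallback is silent, checking the resolved value before training is a cheap way to catch a run that unexpectedly loads full-precision weights.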

Default LoRA settings from `pipeline/train/hftrainer.py:342-344`:

if isinstance(lora, bool):
    lora = {"r": 16, "lora_alpha": 8, "target_modules": "all-linear", "lora_dropout": 0.05, "bias": "none"}
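
In the excerpt above, the defaults are applied only when `lora` is a boolean, which suggests that a dictionary is used as given and that a partial dictionary would not inherit the remaining defaults. Assuming that reading is correct, a small caller-side helper can merge overrides onto the documented defaults. `lora_with_overrides` is a hypothetical name, not part of txtai.

```python
LORA_DEFAULTS = {
    "r": 16,
    "lora_alpha": 8,
    "target_modules": "all-linear",
    "lora_dropout": 0.05,
    "bias": "none",
}


def lora_with_overrides(**overrides):
    """Start from the documented defaults and replace only the given keys."""
    config = dict(LORA_DEFAULTS)  # copy so the defaults stay untouched
    config.update(overrides)
    return config


# Double the rank and alpha while keeping the other defaults
custom = lora_with_overrides(r=32, lora_alpha=16)
print(custom)
```

The resulting dictionary can then be passed as the `lora` argument in place of `True`.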

Pad token fallback from `pipeline/train/hftrainer.py:98-99`:

tokenizer.pad_token = tokenizer.pad_token if tokenizer.pad_token is not None else tokenizer.eos_token

Left padding for generation batching from `pipeline/llm/huggingface.py:148-150`:

if "batch_size" in kwargs and self.pipeline.tokenizer.pad_token_id is None:
    self.pipeline.tokenizer.pad_token_id = tokenid
    self.pipeline.tokenizer.padding_side = "left"
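
Why left padding matters for batched generation can be shown without a model. A decoder-only model continues from the last position of each row, so that position should hold a real prompt token rather than padding. The sketch below uses plain lists of token IDs; `pad_batch`, `resolve_pad_id`, and the ID values are illustrative and not taken from txtai.

```python
def resolve_pad_id(pad_id, eos_id):
    """Fall back to the EOS token ID when no pad token ID is configured."""
    return pad_id if pad_id is not None else eos_id


def pad_batch(sequences, pad_id, side="left"):
    """Pad token ID lists to equal length and build matching attention masks."""
    width = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        fill = [pad_id] * (width - len(seq))
        real = [1] * len(seq)
        if side == "left":
            padded.append(fill + list(seq))
            masks.append([0] * len(fill) + real)
        else:
            padded.append(list(seq) + fill)
            masks.append(real + [0] * len(fill))
    return padded, masks


prompts = [[11, 12, 13], [21]]
pad_id = resolve_pad_id(None, eos_id=2)

left, left_mask = pad_batch(prompts, pad_id, side="left")
right, right_mask = pad_batch(prompts, pad_id, side="right")

print("left: ", left, left_mask)
print("right:", right, right_mask)
```

With left padding every row ends in a prompt token, while right padding leaves pad tokens between the shorter prompt and whatever the model generates next.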
