Principle:Neuml Txtai Training Configuration

Overview

Training configuration governs the full set of hyperparameters, optimization strategies, and hardware-aware settings that control how a transformer model is fine-tuned. In txtai, the configuration layer sits between the user's intent (what to train and how aggressively) and the Hugging Face Trainer framework that executes the training loop. Getting the configuration right is essential for balancing training speed, model quality, and memory consumption.

Three areas of configuration are particularly important for modern fine-tuning workflows: hyperparameter tuning, quantization for memory-efficient training, and LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning.

Hyperparameter Configuration

Training hyperparameters control the optimization dynamics of the fine-tuning process. The txtai configuration system provides sensible defaults while allowing full override of any Hugging Face TrainingArguments parameter.

Default Configuration

When no explicit arguments are provided, txtai applies the following defaults:

output_dir -- Set to an empty string, indicating a transient model that is not saved to disk unless explicitly configured.
save_strategy -- Set to "no", disabling intermediate checkpoint saving.
report_to -- Set to "none", disabling experiment tracking integrations.
log_level -- Set to "warning", reducing console noise.
use_cpu -- Automatically detected based on GPU/accelerator availability.
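
The way these defaults combine with user overrides can be sketched in plain Python. This is an illustration of the behavior described above, not txtai's source: build_args and its has_accelerator flag are hypothetical names, and the real system produces a TrainingArguments object rather than a dictionary.

```python
# Sketch only: mirrors the defaults listed above using a plain dictionary.
DEFAULTS = {
    "output_dir": "",         # transient model, nothing written to disk
    "save_strategy": "no",    # no intermediate checkpoints
    "report_to": "none",      # no experiment tracking
    "log_level": "warning",   # quieter console output
}


def build_args(overrides=None, has_accelerator=False):
    """Merge user overrides on top of the defaults (hypothetical helper)."""
    args = dict(DEFAULTS)
    # use_cpu is derived from hardware detection instead of being a fixed value
    args["use_cpu"] = not has_accelerator
    # Any user-supplied value wins over the default
    args.update(overrides or {})
    return args


transient = build_args()
saved = build_args(
    {"output_dir": "models/run1", "save_strategy": "epoch"},
    has_accelerator=True,
)
```

Here transient keeps every default, while saved changes only the two keys it names and inherits the rest.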

Key Hyperparameters

Users commonly override the following parameters:

Learning rate (learning_rate) -- Controls the step size of gradient updates. Typical values for fine-tuning range from 1e-5 to 5e-5.
Batch size (per_device_train_batch_size) -- Number of examples per device per training step. Constrained by available GPU memory.
Number of epochs (num_train_epochs) -- How many full passes through the training data.
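Weight decay (weight_decay) -- Regularization penalty that shrinks weights toward zero to reduce overfitting. With AdamW-style optimizers it is applied as decoupled weight decay rather than as a classic L2 term in the loss.
Gradient accumulation steps (gradient_accumulation_steps) -- Simulates larger batch sizes by accumulating gradients across multiple forward and backward passes before updating weights.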
FP16/BF16 (fp16, bf16) -- Mixed-precision training for reduced memory and faster computation on compatible hardware.
Seed (seed) -- Random seed for reproducibility. Applied via set_seed() before model initialization.
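
As a rough illustration of how these parameters interact, the sketch below computes the effective batch size and an approximate number of optimizer steps. The override values, the 10,000-example dataset size, and the helper names are hypothetical. The train function shows an assumed usage pattern in which extra keyword arguments passed to txtai's HFTrainer pipeline are forwarded as training arguments; it is defined but not called here.

```python
import math

# Hypothetical override values chosen for illustration only.
overrides = {
    "learning_rate": 3e-5,
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 3,
    "weight_decay": 0.01,
    "seed": 42,
}


def effective_batch_size(args, devices=1):
    """Number of examples contributing to each weight update."""
    return (
        args["per_device_train_batch_size"]
        * args["gradient_accumulation_steps"]
        * devices
    )


def optimizer_steps(args, examples, devices=1):
    """Approximate number of weight updates over the whole run."""
    per_epoch = math.ceil(examples / effective_batch_size(args, devices))
    return per_epoch * args["num_train_epochs"]


def train(base, data):
    """Assumed txtai usage; requires txtai and a model to be available."""
    # Imported lazily so the arithmetic above runs without txtai installed
    from txtai.pipeline import HFTrainer

    trainer = HFTrainer()
    return trainer(base, data, **overrides)


batch = effective_batch_size(overrides)
steps = optimizer_steps(overrides, examples=10_000)
```

With these values each update sees 8 x 4 = 32 examples, so doubling gradient_accumulation_steps halves the number of updates without raising per-step memory use.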

Output Directory Behavior

The txtai TrainingArguments subclass overrides the should_save property so that when output_dir is empty or falsy, saving is completely disabled. This allows models to be trained entirely in memory for immediate use without writing to disk, which is useful for rapid experimentation and pipeline integration.
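
The override pattern can be shown with stand-in classes. BaseArguments below is a hypothetical substitute for the parent arguments class, not the Hugging Face implementation, and TransientArguments is likewise an illustrative name.

```python
class BaseArguments:
    """Stand-in for the parent arguments class; always willing to save."""

    def __init__(self, output_dir=""):
        self.output_dir = output_dir

    @property
    def should_save(self):
        return True


class TransientArguments(BaseArguments):
    """Disables saving whenever output_dir is empty or otherwise falsy."""

    @property
    def should_save(self):
        # Defer to the parent only when an output directory was configured
        return super().should_save if self.output_dir else False


in_memory = TransientArguments()
on_disk = TransientArguments("models/run1")
```

Because the check sits in a property, the rest of the training loop can keep consulting should_save without knowing whether the model is transient.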

4-Bit Quantization (NF4 / QLoRA)

Quantization reduces the memory footprint of model weights by storing them in lower precision. The txtai training system supports 4-bit quantization through the BitsAndBytes library, enabling fine-tuning of large models on consumer hardware.

NF4 Quantization

NF4 (4-bit NormalFloat) is a data type optimized for normally distributed neural network weights. Unlike uniform 4-bit quantization, which spaces its 16 levels evenly, NF4 places its levels at quantiles of a normal distribution. Levels are therefore denser near zero, where most weights lie, which reduces information loss for typical weight distributions.
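
The idea of quantile-based levels can be illustrated with the standard library. This is a simplified sketch: it uses midpoints of equal-probability bins of a standard normal distribution, whereas the actual NF4 code table is constructed differently (for example, it includes an exact zero) and is not reproduced here. All function names are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_levels(bits=4):
    """Levels at normal quantiles, rescaled to the range [-1, 1]."""
    count = 2 ** bits
    normal = NormalDist()
    # Midpoints of `count` equal-probability bins of the standard normal
    raw = [normal.inv_cdf((i + 0.5) / count) for i in range(count)]
    scale = max(abs(value) for value in raw)
    return [value / scale for value in raw]


def uniform_levels(bits=4):
    """Evenly spaced levels on [-1, 1] for comparison."""
    count = 2 ** bits
    return [-1 + 2 * i / (count - 1) for i in range(count)]


def quantize(value, levels):
    """Round a value to the nearest available level."""
    return min(levels, key=lambda level: abs(level - value))


nf_like = normal_quantile_levels()
uniform = uniform_levels()

# Spacing between the two central levels of each scheme
nf_center_gap = nf_like[8] - nf_like[7]
uniform_center_gap = uniform[8] - uniform[7]
```

The central gap of the quantile-based levels is smaller than the uniform one, so small weights are rounded less coarsely, at the cost of wider spacing in the tails.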

QLoRA

QLoRA combines 4-bit quantization with LoRA adapters, allowing the base model to remain in 4-bit precision while LoRA adapter weights are maintained in higher precision. This approach enables fine-tuning models with billions of parameters on a single GPU.
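
A back-of-the-envelope calculation shows why 4-bit storage matters. The 7-billion-parameter model size is a hypothetical example, and the estimate covers weight storage only; it ignores quantization constants, activations, optimizer state, and the adapter weights themselves.

```python
def weight_memory_gib(parameters, bits_per_weight):
    """Memory needed to store the weights alone, in GiB."""
    return parameters * bits_per_weight / 8 / 1024 ** 3


params = 7_000_000_000  # hypothetical 7B-parameter model

fp16 = weight_memory_gib(params, 16)  # roughly 13 GiB
nf4 = weight_memory_gib(params, 4)    # roughly 3.3 GiB
```

The fourfold reduction in weight storage is what moves a model of this size from multi-GPU territory toward a single consumer GPU, with the remaining budget going to activations and the higher-precision adapters.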

Default Quantization Settings

When quantization is enabled by setting quantize=True, txtai applies:

load_in_4bit: True -- Loads model weights in 4-bit precision.
bnb_4bit_use_double_quant: True -- Applies double quantization to further reduce memory by quantizing the quantization constants themselves.
bnb_4bit_quant_type: "nf4" -- Uses the NF4 data type for weight quantization.
bnb_4bit_compute_dtype: "bfloat16" -- Performs computation in BFloat16 for numerical stability.

Users can also pass a custom dictionary with any valid BitsAndBytesConfig parameters for full control.

GPU Requirement

Quantization requires a CUDA-compatible GPU. If no GPU is available, the quantization configuration is automatically cleared and ignored, allowing the same code to run on CPU without errors (albeit without quantization benefits).
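
The default settings and the GPU fallback described above can be sketched together. resolve_quantization and its cuda_available flag are hypothetical names; the real system builds a BitsAndBytesConfig object, whereas this sketch returns plain dictionaries so it runs without a GPU or extra libraries.

```python
# Defaults applied when quantize=True, as listed above.
DEFAULT_QUANTIZATION = {
    "load_in_4bit": True,
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "bfloat16",
}


def resolve_quantization(quantize, cuda_available):
    """Return the settings to apply, or None when quantization is skipped."""
    # Without a CUDA GPU the request is dropped instead of raising an error
    if not cuda_available or not quantize:
        return None
    # True selects the defaults; a dictionary is used as given
    if quantize is True:
        return dict(DEFAULT_QUANTIZATION)
    return dict(quantize)


on_gpu = resolve_quantization(True, cuda_available=True)
on_cpu = resolve_quantization(True, cuda_available=False)
custom = resolve_quantization({"load_in_8bit": True}, cuda_available=True)
```

Returning None on CPU keeps calling code identical across machines, which is the portability benefit the section describes.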

LoRA: Parameter-Efficient Fine-Tuning

LoRA (Low-Rank Adaptation) is a technique that freezes the original model weights and injects small trainable low-rank matrices into selected layers (with the txtai defaults, all linear layers). Instead of updating all model parameters during fine-tuning, only the LoRA adapter weights are trained, dramatically reducing the number of trainable parameters.

How LoRA Works

For a pre-trained weight matrix W of dimension d x k, LoRA decomposes the weight update into two low-rank matrices:

W' = W + BA

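where B has dimension d x r and A has dimension r x k, with rank r much smaller than both d and k. Only B and A are trained, so the trainable parameter count per adapted matrix drops from d x k to r x (d + k). In common implementations such as PEFT, the product BA is additionally scaled by lora_alpha / r. For inference, the adapter weights can be merged back into the base model for zero-overhead serving.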
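
A small worked example makes the savings concrete. The 4096 x 4096 matrix size and the 2 x 2 rank-1 matrices are hypothetical values picked for illustration, the helper names are not from txtai, and the scaling factor is omitted to match the formula above.

```python
def lora_parameter_counts(d, k, r):
    """Trainable parameters for a full update versus a rank-r adapter."""
    full = d * k
    adapter = r * (d + k)  # B is d x r, A is r x k
    return full, adapter


def matmul(left, right):
    """Plain-Python matrix product of two nested lists."""
    return [
        [sum(a * b for a, b in zip(row, col)) for col in zip(*right)]
        for row in left
    ]


def merge(weight, b, a):
    """Compute W' = W + BA for serving without a separate adapter."""
    update = matmul(b, a)
    return [
        [w + u for w, u in zip(w_row, u_row)]
        for w_row, u_row in zip(weight, update)
    ]


full, adapter = lora_parameter_counts(d=4096, k=4096, r=16)

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weights, d x k = 2 x 2
B = [[1.0], [2.0]]            # d x r = 2 x 1
A = [[0.5, -0.5]]             # r x k = 1 x 2
merged = merge(W, B, A)
```

For the 4096 x 4096 case the adapter trains 131,072 values instead of 16,777,216, which is under 1% of the full matrix.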

Default LoRA Settings

When LoRA is enabled by setting lora=True, txtai applies:

r: 16 -- Rank of the low-rank decomposition. Higher values increase capacity but also memory and compute.
lora_alpha: 8 -- Scaling factor for LoRA updates. The adapter's contribution is scaled by lora_alpha / r, which is 0.5 with these defaults.
target_modules: "all-linear" -- Applies LoRA to all linear layers in the model.
lora_dropout: 0.05 -- Dropout rate applied to LoRA layers for regularization.
bias: "none" -- Does not train bias parameters.
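
The defaults and the resulting update scale can be sketched as follows. resolve_lora and scaling are hypothetical helpers; the lora_alpha / r formula reflects standard LoRA scaling, and the real system wraps the settings in a LoraConfig object rather than a dictionary.

```python
# Defaults applied when lora=True, as listed above.
DEFAULT_LORA = {
    "r": 16,
    "lora_alpha": 8,
    "target_modules": "all-linear",
    "lora_dropout": 0.05,
    "bias": "none",
}


def resolve_lora(lora):
    """Return the LoRA settings to apply, or None when LoRA is disabled."""
    if not lora:
        return None
    # True selects the defaults; a dictionary is used as given
    return dict(DEFAULT_LORA) if lora is True else dict(lora)


def scaling(config):
    """Multiplier applied to the low-rank update BA."""
    return config["lora_alpha"] / config["r"]


default_scale = scaling(resolve_lora(True))
custom_scale = scaling(resolve_lora({"r": 8, "lora_alpha": 16}))
```

Because the scale depends on the ratio, raising r without adjusting lora_alpha weakens each adapter's contribution, so the two are usually tuned together.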

Task Type Mapping

LoRA requires a task type that determines how the adapter is structured. The txtai system automatically maps training tasks to LoRA task types:

Training Task	LoRA TaskType
text-classification	SEQ_CLS
language-generation	CAUSAL_LM
language-modeling	FEATURE_EXTRACTION
question-answering	QUESTION_ANS
sequence-sequence	SEQ_2_SEQ_LM
token-detection	FEATURE_EXTRACTION

Model Preparation

When LoRA is enabled, the model undergoes two preparation steps:

prepare_model_for_kbit_training() -- Prepares a quantized model for training by handling gradient checkpointing and casting layer norms to full precision.
get_peft_model() -- Wraps the model with LoRA adapters and freezes the original parameters.

After wrapping, the number of trainable parameters is printed (via the PEFT model's print_trainable_parameters() method) to confirm that only a fraction of the total parameters will be updated.
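
The task mapping and the two preparation steps can be put together in a short sketch. build_lora_settings and wrap_with_lora are hypothetical helpers, not txtai functions. The PEFT calls inside wrap_with_lora require the peft package and a loaded model, so the function is defined here but not executed.

```python
# Training task to LoRA task type name, following the table above.
TASK_TYPES = {
    "text-classification": "SEQ_CLS",
    "language-generation": "CAUSAL_LM",
    "language-modeling": "FEATURE_EXTRACTION",
    "question-answering": "QUESTION_ANS",
    "sequence-sequence": "SEQ_2_SEQ_LM",
    "token-detection": "FEATURE_EXTRACTION",
}


def build_lora_settings(task, lora_settings):
    """Attach the mapped task type to a copy of the LoRA settings."""
    settings = dict(lora_settings)
    settings["task_type"] = TASK_TYPES[task]
    return settings


def wrap_with_lora(model, task, lora_settings):
    """Apply the two preparation steps and report trainable parameters."""
    # Imported lazily so the mapping above can be used without peft installed
    from peft import (
        LoraConfig,
        TaskType,
        get_peft_model,
        prepare_model_for_kbit_training,
    )

    settings = build_lora_settings(task, lora_settings)
    # Convert the task type name into the corresponding PEFT enum member
    settings["task_type"] = TaskType[settings["task_type"]]
    config = LoraConfig(**settings)

    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, config)
    model.print_trainable_parameters()
    return model


generation = build_lora_settings("language-generation", {"r": 16, "lora_alpha": 8})
```

Keeping the mapping as data makes an unsupported task fail early with a KeyError, before any model weights are touched.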
