Workflow: QLoRA SFT Fine-Tuning with Axolotl (axolotl-ai-cloud/axolotl)
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, QLoRA, Parameter_Efficient |
| Last Updated | 2026-02-06 22:00 GMT |
Overview
End-to-end process for parameter-efficient supervised fine-tuning (SFT) of large language models using QLoRA (Quantized Low-Rank Adapters) with Axolotl's YAML-driven configuration system.
Description
This workflow covers the most common Axolotl use case: fine-tuning an LLM on instruction-following or domain-specific data using QLoRA. The base model is loaded in 4-bit quantized form to minimize GPU memory, and small trainable LoRA adapter matrices are injected into the model's attention and feedforward layers. Only these adapter weights are trained, dramatically reducing memory and compute requirements. The workflow spans YAML configuration, optional dataset preprocessing, model loading with quantization, adapter injection, training with optimizations (flash attention, sample packing, gradient checkpointing), and saving the trained adapter weights. Optionally, the adapter can be merged back into the base model for deployment.
Usage
Execute this workflow when you have a labeled dataset (instruction-tuning, chat, or completion format) and need to adapt a base LLM to follow domain-specific instructions or adopt a particular response style, while operating under GPU memory constraints (e.g., a single GPU with 16-24GB VRAM). This is the recommended starting point for most Axolotl users.
Execution Steps
Step 1: Configuration
Create a YAML configuration file specifying the base model, dataset paths, adapter settings, and training hyperparameters. The configuration must define: the base model identifier (HuggingFace hub or local path), the dataset source and format type (e.g., alpaca, chat_template), QLoRA adapter parameters (rank, alpha, dropout, target modules), quantization settings (4-bit with NF4), sequence length, batch sizes, optimizer, learning rate schedule, and output directory.
Key considerations:
- Set `adapter: qlora` and `load_in_4bit: true`
- Choose appropriate `lora_r` (rank) and `lora_alpha` values
- Define `lora_target_modules` to specify which layers receive adapters
- Set `sequence_len` based on your data and available memory
- Enable `sample_packing: true` and `flash_attention: true` for efficiency
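A minimal configuration covering these settings might look like the following sketch. Field names follow Axolotl's YAML schema; the model identifier, dataset path, and hyperparameter values are illustrative placeholders, not recommendations:

```yaml
# Illustrative QLoRA SFT config sketch; values are placeholders.
base_model: NousResearch/Llama-2-7b-hf   # any HF Hub id or local path

load_in_4bit: true
adapter: qlora

datasets:
  - path: ./data/train.jsonl
    type: alpaca
output_dir: ./outputs/qlora-run

sequence_len: 2048
sample_packing: true
flash_attention: true

lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
optimizer: adamw_bnb_8bit
learning_rate: 0.0002
lr_scheduler: cosine
bf16: auto
```

Start small (e.g. `lora_r: 16`) and increase rank only if the adapter underfits; rank and `sequence_len` are the main memory levers.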
Step 2: Configuration Validation
Axolotl validates the YAML configuration against the system's GPU capabilities before proceeding. This includes checking compute capability for bf16/fp8 support, verifying dataset paths exist, normalizing dtype settings, setting up distributed training environment variables, and registering any configured plugins.
Key considerations:
- Validation is automatic when running `axolotl train`
- Errors are raised early for invalid configurations
- Use `axolotl preprocess` to validate data formatting separately
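In practice these checks run as part of the CLI entry points. A typical invocation with recent Axolotl releases (assuming the config file is named `config.yml`; older versions use the `python -m axolotl.cli.*` module form) looks like:

```shell
# Validate the config and tokenize the dataset without training
axolotl preprocess config.yml

# Validate the config and launch training
axolotl train config.yml
```

Running `preprocess` first surfaces dataset and formatting errors before any GPU time is spent.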
Step 3: Dataset Preparation
Load and preprocess the training dataset from local files, HuggingFace Hub, or cloud storage. Axolotl applies the configured prompt strategy (e.g., Alpaca template, chat template) to format each example, tokenizes the formatted prompts, applies train/validation splitting, and optionally enables sample packing to combine shorter sequences for efficient GPU utilization.
Key considerations:
- Use `axolotl preprocess` for large datasets to precompute tokenization
- Sample packing groups multiple examples into single sequences for higher throughput
- The `dataset_prepared_path` config option caches preprocessed data for reuse
- Multiple datasets with different formats can be combined in a single training run
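A dataset section exercising these options might look like the following sketch (paths and the Hub dataset name are placeholders):

```yaml
datasets:
  # Local instruction data in Alpaca format (instruction/input/output keys)
  - path: ./data/alpaca_style.jsonl
    type: alpaca
  # A HF Hub chat dataset rendered through the tokenizer's chat template
  - path: some-org/some-chat-dataset
    type: chat_template

# Cache tokenized data so repeated runs skip preprocessing
dataset_prepared_path: ./last_run_prepared
val_set_size: 0.05
sample_packing: true
```

With `dataset_prepared_path` set, a subsequent `axolotl train` reuses the cached tokenized dataset instead of re-tokenizing.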
Step 4: Model Loading and Quantization
Load the base model from HuggingFace Hub or local path with 4-bit NormalFloat (NF4) quantization applied on-the-fly. The model loader configures BitsAndBytes quantization, sets the compute dtype (typically bf16), applies attention mechanism patches (Flash Attention 2 or SDPA), and prepares the quantized model for adapter training.
Key considerations:
- The base model weights remain frozen in 4-bit precision
- Compute operations run in higher precision (typically bf16); double quantization separately compresses the quantization constants themselves to save additional memory
- Flash Attention patches are applied for memory-efficient attention computation
- Model configuration is validated against the tokenizer
Step 5: LoRA Adapter Injection
Inject low-rank adapter matrices into the specified model layers. For each target module (attention projections, feedforward layers), two small matrices A and B are added such that the effective weight becomes W' = W + BA. Only these small adapter matrices are trainable, typically representing less than 1% of the total model parameters.
Key considerations:
- `lora_target_modules` controls which layers receive adapters
- Alternatively, `lora_target_linear: true` targets all linear layers
- The rank `lora_r` controls adapter capacity (higher = more parameters)
- `lora_alpha` scales the adapter contribution
Step 6: Training Execution
Execute the training loop using HuggingFace's Trainer with Axolotl's customizations. The trainer handles gradient accumulation, mixed precision (bf16), gradient checkpointing for memory efficiency, learning rate scheduling (cosine with warmup), periodic evaluation, checkpointing, and optional integration with experiment trackers (WandB, MLflow).
Key considerations:
- Gradient checkpointing trades compute for memory savings
- The loss watchdog monitors for training instability
- Checkpoints are saved at configured intervals for resumption
- Training can be resumed from the last checkpoint if interrupted
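The trainer behavior above maps onto config keys such as the following (values illustrative; the WandB project name is a placeholder):

```yaml
gradient_accumulation_steps: 4
micro_batch_size: 2
bf16: auto
gradient_checkpointing: true

lr_scheduler: cosine
warmup_steps: 100
learning_rate: 0.0002

eval_steps: 50                 # periodic evaluation
save_steps: 200                # checkpoint interval
wandb_project: my-qlora-run    # optional experiment tracking
resume_from_checkpoint:        # set to a checkpoint path to resume
```

Effective batch size is `micro_batch_size × gradient_accumulation_steps × num_gpus`, so accumulation is the usual way to reach larger batches on a single memory-constrained GPU.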
Step 7: Model Saving and Export
Save the trained LoRA adapter weights, tokenizer, and configuration to the output directory. Optionally merge the adapter into the base model to produce a standalone model for deployment. The merged model can be uploaded to HuggingFace Hub or used directly for inference.
Key considerations:
- By default, only adapter weights are saved (small files)
- Use `axolotl merge-lora` to merge adapters into the base model
- The merged model is saved to `output_dir/merged`
- A model card is generated automatically for HuggingFace Hub uploads
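Assuming the training config was `config.yml` and adapters were written to `./outputs/qlora-run`, the merge step might look like this (flag names follow recent Axolotl releases; check `axolotl merge-lora --help` for your version):

```shell
axolotl merge-lora config.yml --lora-model-dir ./outputs/qlora-run
```

The merged full-precision model under `output_dir/merged` can then be served with any standard inference stack, with no PEFT/adapter loading required at runtime.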