Workflow: QLoRA SFT Fine-Tuning with Axolotl (axolotl-ai-cloud/axolotl)
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, QLoRA, Parameter_Efficient |
| Last Updated | 2026-02-06 22:00 GMT |
Overview
End-to-end process for parameter-efficient supervised fine-tuning (SFT) of large language models using QLoRA (Quantized Low-Rank Adapters) with Axolotl's YAML-driven configuration system.
Description
This workflow covers the most common Axolotl use case: fine-tuning an LLM on instruction-following or domain-specific data using QLoRA. The base model is loaded in 4-bit quantized form to minimize GPU memory, and small trainable LoRA adapter matrices are injected into the model's attention and feedforward layers. Only these adapter weights are trained, dramatically reducing memory and compute requirements. The workflow spans YAML configuration, optional dataset preprocessing, model loading with quantization, adapter injection, training with optimizations (flash attention, sample packing, gradient checkpointing), and saving the trained adapter weights. Optionally, the adapter can be merged back into the base model for deployment.
Usage
Execute this workflow when you have a labeled dataset (instruction-tuning, chat, or completion format) and need to adapt a base LLM to follow domain-specific instructions or adopt a particular response style, while operating under GPU memory constraints (e.g., a single GPU with 16-24GB VRAM). This is the recommended starting point for most Axolotl users.
Execution Steps
Step 1: Configuration
Create a YAML configuration file specifying the base model, dataset paths, adapter settings, and training hyperparameters. The configuration must define: the base model identifier (HuggingFace hub or local path), the dataset source and format type (e.g., alpaca, chat_template), QLoRA adapter parameters (rank, alpha, dropout, target modules), quantization settings (4-bit with NF4), sequence length, batch sizes, optimizer, learning rate schedule, and output directory.
Key considerations:
- Set `adapter: qlora` and `load_in_4bit: true`
- Choose appropriate `lora_r` (rank) and `lora_alpha` values
- Define `lora_target_modules` to specify which layers receive adapters
- Set `sequence_len` based on your data and available memory
- Enable `sample_packing: true` and `flash_attention: true` for efficiency
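A minimal configuration covering these settings might look like the following sketch. Field names follow Axolotl's YAML schema; the model identifier, dataset path, and hyperparameter values are illustrative placeholders, not recommendations:

```yaml
# Illustrative QLoRA SFT config sketch; values are placeholders.
base_model: NousResearch/Llama-2-7b-hf   # any HF Hub id or local path

load_in_4bit: true
adapter: qlora

datasets:
  - path: ./data/train.jsonl
    type: alpaca
output_dir: ./outputs/qlora-run

sequence_len: 2048
sample_packing: true
flash_attention: true

lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
optimizer: adamw_bnb_8bit
learning_rate: 0.0002
lr_scheduler: cosine
bf16: auto
```

Start small (e.g. `lora_r: 16`) and increase rank only if the adapter underfits; rank and `sequence_len` are the main memory levers.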
Step 2: Configuration Validation
Axolotl validates the YAML configuration against the system's GPU capabilities before proceeding. This includes checking compute capability for bf16/fp8 support, verifying dataset paths exist, normalizing dtype settings, setting up distributed training environment variables, and registering any configured plugins.
Key considerations:
- Validation is automatic when running `axolotl train`
- Errors are raised early for invalid configurations
- Use `axolotl preprocess` to validate data formatting separately
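In practice these checks run as part of the CLI entry points. A typical invocation with recent Axolotl releases (assuming the config file is named `config.yml`; older versions use the `python -m axolotl.cli.*` module form) looks like:

```shell
# Validate the config and tokenize the dataset without training
axolotl preprocess config.yml

# Validate the config and launch training
axolotl train config.yml
```

Running `preprocess` first surfaces dataset and formatting errors before any GPU time is spent.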
Step 3: Dataset Preparation
Load and preprocess the training dataset from local files, HuggingFace Hub, or cloud storage. Axolotl applies the configured prompt strategy (e.g., Alpaca template, chat template) to format each example, tokenizes the formatted prompts, applies train/validation splitting, and optionally enables sample packing to combine shorter sequences for efficient GPU utilization.
Key considerations:
- Use `axolotl preprocess` for large datasets to precompute tokenization
- Sample packing groups multiple examples into single sequences for higher throughput
- The `dataset_prepared_path` config option caches preprocessed data for reuse
- Multiple datasets with different formats can be combined in a single training run
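A dataset section exercising these options might look like the following sketch (paths and the Hub dataset name are placeholders):

```yaml
datasets:
  # Local instruction data in Alpaca format (instruction/input/output keys)
  - path: ./data/alpaca_style.jsonl
    type: alpaca
  # A HF Hub chat dataset rendered through the tokenizer's chat template
  - path: some-org/some-chat-dataset
    type: chat_template

# Cache tokenized data so repeated runs skip preprocessing
dataset_prepared_path: ./last_run_prepared
val_set_size: 0.05
sample_packing: true
```

With `dataset_prepared_path` set, a subsequent `axolotl train` reuses the cached tokenized dataset instead of re-tokenizing.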
Step 4: Model Loading and Quantization
Load the base model from HuggingFace Hub or local path with 4-bit NormalFloat (NF4) quantization applied on-the-fly. The model loader configures BitsAndBytes quantization, sets the compute dtype (typically bf16), applies attention mechanism patches (Flash Attention 2 or SDPA), and prepares the quantized model for adapter training.
Key considerations:
- The base model weights remain frozen in 4-bit precision
- Compute operations run in higher precision (typically bf16); double quantization separately compresses the quantization constants themselves to save additional memory
- Flash Attention patches are applied for memory-efficient attention computation
- Model configuration is validated against the tokenizer
Step 5: LoRA Adapter Injection
Inject low-rank adapter matrices into the specified model layers. For each target module (attention projections, feedforward layers), two small matrices A and B are added such that the effective weight becomes W' = W + BA. Only these small adapter matrices are trainable, typically representing less than 1% of the total model parameters.
Key considerations:
- `lora_target_modules` controls which layers receive adapters
- Alternatively, `lora_target_linear: true` targets all linear layers
- The rank `lora_r` controls adapter capacity (higher = more parameters)
- `lora_alpha` scales the adapter contribution
Step 6: Training Execution
Execute the training loop using HuggingFace's Trainer with Axolotl's customizations. The trainer handles gradient accumulation, mixed precision (bf16), gradient checkpointing for memory efficiency, learning rate scheduling (cosine with warmup), periodic evaluation, checkpointing, and optional integration with experiment trackers (WandB, MLflow).
Key considerations:
- Gradient checkpointing trades compute for memory savings
- The loss watchdog monitors for training instability
- Checkpoints are saved at configured intervals for resumption
- Training can be resumed from the last checkpoint if interrupted
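The trainer behavior above maps onto config keys such as the following (values illustrative; the WandB project name is a placeholder):

```yaml
gradient_accumulation_steps: 4
micro_batch_size: 2
bf16: auto
gradient_checkpointing: true

lr_scheduler: cosine
warmup_steps: 100
learning_rate: 0.0002

eval_steps: 50                 # periodic evaluation
save_steps: 200                # checkpoint interval
wandb_project: my-qlora-run    # optional experiment tracking
resume_from_checkpoint:        # set to a checkpoint path to resume
```

Effective batch size is `micro_batch_size × gradient_accumulation_steps × num_gpus`, so accumulation is the usual way to reach larger batches on a single memory-constrained GPU.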
Step 7: Model Saving and Export
Save the trained LoRA adapter weights, tokenizer, and configuration to the output directory. Optionally merge the adapter into the base model to produce a standalone model for deployment. The merged model can be uploaded to HuggingFace Hub or used directly for inference.
Key considerations:
- By default, only adapter weights are saved (small files)
- Use `axolotl merge-lora` to merge adapters into the base model
- The merged model is saved to `output_dir/merged`
- A model card is generated automatically for HuggingFace Hub uploads
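Assuming the training config was `config.yml` and adapters were written to `./outputs/qlora-run`, the merge step might look like this (flag names follow recent Axolotl releases; check `axolotl merge-lora --help` for your version):

```shell
axolotl merge-lora config.yml --lora-model-dir ./outputs/qlora-run
```

The merged full-precision model under `output_dir/merged` can then be served with any standard inference stack, with no PEFT/adapter loading required at runtime.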