Principle: Axolotl SFT Training Execution
| Knowledge Sources | |
|---|---|
| Domains | Training, Supervised_Finetuning, Deep_Learning |
| Last Updated | 2026-02-06 23:00 GMT |
Overview
The supervised fine-tuning (SFT) training loop optimizes model parameters on instruction-response pairs using the standard cross-entropy loss with gradient-based optimization.
Description
Supervised Fine-Tuning (SFT) Training Execution is the process of running the actual training loop that updates model parameters. Given a prepared model (with or without LoRA adapters) and tokenized datasets, this stage constructs the trainer, configures the optimizer and learning rate scheduler, and executes the gradient descent loop.
In Axolotl, training execution involves two key abstractions: the HFCausalTrainerBuilder which assembles all training components (model, data collator, callbacks, optimizer) into an AxolotlTrainer instance, and the train orchestration function which manages the full lifecycle including checkpoint resumption, evaluation, and model saving.
The AxolotlTrainer extends HuggingFace's Trainer with Axolotl-specific features: sample packing, custom optimizers (ScheduleFree, Lion, ADOPT), multipack batching, activation offloading, and distributed parallel save handling.
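The builder-then-trainer split described above can be illustrated with a stdlib-only sketch of the builder pattern. All class and method names here are illustrative stand-ins, not Axolotl's actual `HFCausalTrainerBuilder` API:

```python
from dataclasses import dataclass, field

# Hedged sketch: collect training components one at a time, then
# produce a single assembled object, mirroring how a trainer builder
# gathers model, collator, optimizer, and callbacks before training.
@dataclass
class TrainerParts:
    model: object = None
    data_collator: object = None
    optimizer: object = None
    callbacks: list = field(default_factory=list)

class TrainerBuilder:
    def __init__(self):
        self.parts = TrainerParts()

    def with_model(self, model):
        self.parts.model = model
        return self

    def with_optimizer(self, optimizer):
        self.parts.optimizer = optimizer
        return self

    def add_callback(self, cb):
        self.parts.callbacks.append(cb)
        return self

    def build(self):
        # A real builder would validate and wire many more components.
        assert self.parts.model is not None, "model is required"
        return self.parts

trainer = (TrainerBuilder()
           .with_model("causal-lm")
           .with_optimizer("adamw")
           .add_callback("save_checkpoint")
           .build())
```

The chained `with_*` calls return `self`, which is what lets one builder expression assemble every component before `build()` runs validation.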
Usage
Use SFT training execution when:
- Fine-tuning a language model on instruction-following data
- Running standard causal language modeling training
- Training with LoRA, QLoRA, or full fine-tuning
- Needing managed checkpointing, logging, and evaluation
Theoretical Basis
SFT optimizes the standard causal language modeling objective:

$$\mathcal{L}(\theta) = -\sum_{(x,\,y)} \sum_{t=1}^{|y|} \log p_\theta(y_t \mid x, y_{<t})$$

where $\theta$ are the trainable parameters and the loss is computed only on response tokens $y$ (not prompt tokens $x$) for instruction tuning.
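Masking the loss to response tokens can be shown numerically. This is a hedged, stdlib-only sketch (not Axolotl's code); it uses the HuggingFace convention of labeling ignored positions with -100:

```python
import math

IGNORE_INDEX = -100  # convention for positions excluded from the loss

def masked_cross_entropy(token_log_probs, labels):
    """Average negative log-likelihood over non-ignored positions.

    token_log_probs: log p(label_t | context) at each position t.
    labels: token ids, with IGNORE_INDEX marking prompt positions.
    """
    losses = [-lp for lp, y in zip(token_log_probs, labels)
              if y != IGNORE_INDEX]
    return sum(losses) / len(losses)

# The first two positions are prompt tokens and are masked out;
# only the two response tokens contribute to the loss.
log_probs = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25)]
labels = [IGNORE_INDEX, IGNORE_INDEX, 42, 7]
loss = masked_cross_entropy(log_probs, labels)
# loss is the mean of -ln(0.5) and -ln(0.25), about 1.04
```

Changing the prompt log-probabilities would leave `loss` unchanged, which is exactly the behavior the masked objective above requires.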
Training loop pseudo-code:
```python
# Abstract SFT training loop
step = 0
for epoch in range(num_epochs):
    for batch in dataloader:
        outputs = model(batch.input_ids, attention_mask=batch.attention_mask)
        # Loss is computed against labels with prompt tokens masked out
        loss = cross_entropy(outputs.logits, batch.labels)
        # Scale so accumulated gradients average over the effective batch
        (loss / gradient_accumulation_steps).backward()
        step += 1
        if step % gradient_accumulation_steps == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
        if step % save_steps == 0:
            save_checkpoint()
```
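The `scheduler.step()` call in the loop above typically follows a warmup-then-decay curve. A hedged, stdlib-only sketch of one common choice, linear warmup followed by cosine decay (function and parameter names are illustrative, not Axolotl's internals):

```python
import math

def lr_at_step(step, max_steps, base_lr, warmup_steps):
    """Learning rate with linear warmup and cosine decay to zero."""
    if step < warmup_steps:
        # Ramp linearly from 0 up to base_lr over the warmup phase.
        return base_lr * step / warmup_steps
    # Cosine-decay from base_lr at end of warmup to 0 at max_steps.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Example: 100 optimizer steps, 10-step warmup, peak LR of 2e-4.
schedule = [lr_at_step(s, max_steps=100, base_lr=2e-4, warmup_steps=10)
            for s in range(100)]
```

At step 10 the rate peaks at `base_lr`; halfway through the decay phase (step 55 here) it has fallen to exactly half the peak.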
Key optimizations in Axolotl:
- Sample Packing: Concatenate multiple short sequences into one batch to reduce padding waste
- Gradient Checkpointing: Trade compute for memory by recomputing activations
- Mixed Precision: BF16/FP16 training for faster computation
- Flash Attention: Fused attention kernels for memory and speed
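The sample-packing idea from the list above can be sketched as a greedy first-fit-decreasing bin packing of sequence lengths. This is a simplified illustration; Axolotl's multipack implementation is more sophisticated (it also constructs block-diagonal attention masks so packed sequences cannot attend to each other):

```python
def pack_sequences(lengths, max_len):
    """Group sequence indices into bins whose total length fits max_len.

    First-fit decreasing: place the longest sequences first, and put
    each one into the first bin where it fits, opening a new bin if none do.
    """
    bins, bin_lens = [], []
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        for b, used in enumerate(bin_lens):
            if used + lengths[idx] <= max_len:
                bins[b].append(idx)
                bin_lens[b] += lengths[idx]
                break
        else:
            bins.append([idx])
            bin_lens.append(lengths[idx])
    return bins

# Five variable-length sequences packed into a 1024-token window:
packed = pack_sequences([512, 300, 200, 1000, 24], max_len=1024)
# Two bins instead of five padded rows, so far less padding waste.
```

The win is that each 1024-token batch row is nearly full of real tokens rather than padding, which is exactly the waste sample packing is meant to eliminate.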