

Principle:Axolotl ai cloud Axolotl SFT Training Execution

From Leeroopedia


Knowledge Sources
Domains Training, Supervised_Finetuning, Deep_Learning
Last Updated 2026-02-06 23:00 GMT

Overview

The supervised fine-tuning training loop optimizes model parameters on instruction-response pairs using standard cross-entropy loss and gradient-based optimization.

Description

Supervised Fine-Tuning (SFT) Training Execution is the process of running the actual training loop that updates model parameters. Given a prepared model (with or without LoRA adapters) and tokenized datasets, this stage constructs the trainer, configures the optimizer and learning rate scheduler, and executes the gradient descent loop.

In Axolotl, training execution involves two key abstractions: the HFCausalTrainerBuilder which assembles all training components (model, data collator, callbacks, optimizer) into an AxolotlTrainer instance, and the train orchestration function which manages the full lifecycle including checkpoint resumption, evaluation, and model saving.
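The optimizer and scheduler wiring mentioned above can be illustrated with a linear-warmup-then-cosine-decay schedule, a common default for SFT runs. This is a pure-Python sketch; `lr_at_step`, `base_lr`, `warmup_steps`, and `total_steps` are illustrative names, not Axolotl's actual internals:

```python
import math

def lr_at_step(step, base_lr=2e-4, warmup_steps=100, total_steps=1000):
    """Linear warmup followed by cosine decay toward zero."""
    if step < warmup_steps:
        # Ramp linearly from ~0 up to base_lr over the warmup window.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Peak LR is reached at the end of warmup, then decays smoothly to ~0.
print(lr_at_step(99), lr_at_step(500), lr_at_step(1000))
```

In a real run the trainer queries such a schedule once per optimizer step and writes the result into each parameter group's learning rate.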

The AxolotlTrainer extends HuggingFace's Trainer with Axolotl-specific features: sample packing, custom optimizers (ScheduleFree, Lion, ADOPT), multipack batching, activation offloading, and distributed parallel save handling.

Usage

Use SFT training execution when:

  • Fine-tuning a language model on instruction-following data
  • Running standard causal language modeling training
  • Training with LoRA, QLoRA, or full fine-tuning
  • Needing managed checkpointing, logging, and evaluation
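In practice these scenarios are driven by a single YAML config passed to the Axolotl CLI. The sketch below uses field names that follow common Axolotl conventions, but it is illustrative only; check the project documentation for the exact keys supported by your version:

```yaml
# Illustrative SFT config sketch -- not a complete, verified example
base_model: meta-llama/Llama-2-7b-hf
datasets:
  - path: ./data/instructions.jsonl
    type: alpaca
adapter: lora              # omit for full fine-tuning
lora_r: 16
lora_alpha: 32
micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 3
learning_rate: 2.0e-4
sample_packing: true
bf16: true
flash_attention: true
output_dir: ./outputs/sft-run
```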

Theoretical Basis

SFT optimizes the standard causal language modeling objective:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P(x_t \mid x_{<t}; \theta)$$

Where θ are the trainable parameters and the loss is computed only on response tokens (not prompt tokens) for instruction tuning.
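Prompt masking is typically implemented by setting prompt-token labels to an ignore index (HuggingFace uses -100), so only response tokens contribute to the loss. A pure-Python sketch of that computation, using toy per-token probabilities rather than a real model (`masked_nll` is a hypothetical helper name):

```python
import math

IGNORE_INDEX = -100  # HuggingFace convention for masked-out labels

def masked_nll(token_probs, labels):
    """Mean negative log-likelihood over non-masked (response) tokens.

    token_probs[i] is the model's probability for the correct token at
    position i; labels[i] == IGNORE_INDEX marks prompt tokens to skip.
    """
    losses = [-math.log(p)
              for p, y in zip(token_probs, labels)
              if y != IGNORE_INDEX]
    return sum(losses) / len(losses)

# The first two (prompt) tokens are masked; only response tokens count.
probs  = [0.1, 0.2, 0.9, 0.8]
labels = [IGNORE_INDEX, IGNORE_INDEX, 42, 7]
print(masked_nll(probs, labels))
```

Note that the low-probability prompt tokens have no effect on the result; only the final two response tokens enter the average.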

Training loop pseudo-code:

# Abstract SFT training loop
step = 0
for epoch in range(num_epochs):
    for batch in dataloader:
        outputs = model(batch.input_ids, attention_mask=batch.attention_mask)
        loss = cross_entropy(outputs.logits, batch.labels)
        # Scale so accumulated micro-batch gradients average correctly
        (loss / gradient_accumulation_steps).backward()
        step += 1
        if step % gradient_accumulation_steps == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
        if step % save_steps == 0:
            save_checkpoint()
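The gradient-accumulation arithmetic in the loop above can be checked numerically: averaging per-micro-batch gradients, each scaled by 1/k, matches the gradient of one k-times-larger batch. A pure-Python sketch with a scalar model, not the trainer's actual code:

```python
def grad(w, x, t):
    """d/dw of the squared error (w*x - t)**2 for one example."""
    return 2 * (w * x - t) * x

w = 0.5
data = [(1.0, 2.0), (2.0, 1.0), (3.0, -1.0), (4.0, 0.0)]

# Full-batch gradient: mean over all examples at once.
full = sum(grad(w, x, t) for x, t in data) / len(data)

# Accumulated gradient: scale each micro-batch mean by 1/k, mirroring
# `(loss / gradient_accumulation_steps).backward()` in the loop above.
k = 2  # two micro-batches of size 2
acc = 0.0
for micro in (data[:2], data[2:]):
    micro_mean = sum(grad(w, x, t) for x, t in micro) / len(micro)
    acc += micro_mean / k

print(full, acc)  # identical up to floating-point error
```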

Key optimizations in Axolotl:

  • Sample Packing: Concatenate multiple short sequences into one batch to reduce padding waste
  • Gradient Checkpointing: Trade compute for memory by recomputing activations
  • Mixed Precision: BF16/FP16 training for faster computation
  • Flash Attention: Fused attention kernels for memory and speed
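Sample packing, the first optimization above, can be illustrated with a greedy first-fit packer that concatenates sequences until a length budget is hit. This is an illustrative sketch, not Axolotl's actual multipack implementation:

```python
def pack_sequences(lengths, max_len):
    """Greedy first-fit: place each sequence into the first bin with room."""
    bins = []  # each bin is [used_length, [sequence indices]]
    for i, n in enumerate(lengths):
        for b in bins:
            if b[0] + n <= max_len:
                b[0] += n
                b[1].append(i)
                break
        else:
            bins.append([n, [i]])
    return bins

# Eight short sequences share a 512-token budget instead of each
# being padded out to 512 individually.
lengths = [120, 300, 90, 200, 60, 400, 150, 100]
packed = pack_sequences(lengths, max_len=512)
print(len(packed), "packed batches instead of", len(lengths))
```

With packing, the padding fraction per batch drops sharply, which is where the throughput gain comes from.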

Related Pages

Implemented By

Uses Heuristic
