
Heuristic: Axolotl (axolotl-ai-cloud) Memory Optimization Tips

From Leeroopedia




Knowledge Sources
Domains Optimization, Memory_Management, Debugging
Last Updated 2026-02-06 22:33 GMT

Overview

Memory management heuristics for preventing VRAM instability, OOM errors, and fragmentation during LLM fine-tuning with Axolotl.

Description

GPU memory management is the primary bottleneck in LLM fine-tuning. Axolotl encodes several hard-won rules about batch sizing, padding, dtype selection, and optimizer configuration that prevent common memory-related failures. These rules address VRAM instability from mismatched batch sizes, memory fragmentation from variable-length sequences, and dtype-related errors during full fine-tuning.

Usage

Apply these rules when experiencing CUDA OOM errors, VRAM instability during evaluation, or when configuring training for the first time. Particularly important for consumer GPUs (24GB VRAM or less) and large models (7B+ parameters).

The Insight (Rule of Thumb)

  • Rule 1 - Batch Size Calculation: Never set `batch_size` directly. Instead, set `micro_batch_size` and `gradient_accumulation_steps`. Axolotl calculates: `batch_size = micro_batch_size * gradient_accumulation_steps`. To calculate equivalent grad accum steps: `gradient_accumulation_steps = batch_size / micro_batch_size / number_of_gpus`.
  • Rule 2 - Eval Batch Size Consistency: Set `eval_batch_size` equal to `micro_batch_size`. Mismatch causes VRAM instability because the memory profile changes between training and evaluation phases.
  • Rule 3 - Constant Buffer Padding: Set `pad_to_sequence_len: true` to use constant-sized buffers. This reduces memory fragmentation and prevents OOMs by allowing efficient memory block reuse.
  • Rule 4 - FP16 Full Fine-tune Danger: Full fine-tuning + FP16 + Sample Packing without Flash Attention causes dtype errors (`Attempting to unscale FP16 gradients` or `expected mat1 and mat2 to have the same dtype`). Solution: use LoRA adapter instead, or enable Flash Attention.
  • Rule 5 - Auto BF16 Detection: When `bf16: auto`, Axolotl auto-detects GPU bf16 support. If unsupported, falls back to FP16. BF16 is preferred when available.
  • Rule 6 - FP8 Requires torch_compile: FP8 training without `torch_compile: true` is unlikely to show any speed improvement. Enabling torch.compile is strongly recommended for FP8.
  • Rule 7 - CUDA Allocator Config: Axolotl auto-sets `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,roundup_power2_divisions:16` for torch >= 2.2 to reduce fragmentation.
  • Trade-off: Padding wastes some compute on pad tokens, but prevents catastrophic memory fragmentation. BF16 has less precision than FP32 but uses half the memory.
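The arithmetic in Rule 1 can be sketched in plain Python. These helper functions are illustrative only (they are not part of Axolotl's API); they simply implement the two formulas stated above.

```python
def axolotl_batch_size(micro_batch_size: int,
                       gradient_accumulation_steps: int) -> int:
    """Rule 1: batch_size = micro_batch_size * gradient_accumulation_steps."""
    return micro_batch_size * gradient_accumulation_steps


def equivalent_grad_accum(batch_size: int,
                          micro_batch_size: int,
                          num_gpus: int = 1) -> int:
    """Rule 1 inverse: gradient_accumulation_steps =
    batch_size / micro_batch_size / num_gpus."""
    steps, rem = divmod(batch_size, micro_batch_size * num_gpus)
    if rem:
        raise ValueError(
            "batch_size must divide evenly by micro_batch_size * num_gpus"
        )
    return steps


# A legacy config with batch_size: 64, micro_batch_size: 4 on 2 GPUs
# translates to 8 gradient accumulation steps.
print(equivalent_grad_accum(64, 4, num_gpus=2))  # 8
print(axolotl_batch_size(4, 8))                  # 32
```

Note that the inverse formula divides by the GPU count because the global throughput scales with the number of devices, while `batch_size` itself is computed per the first formula.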

Reasoning

CUDA memory allocation is block-based. When tensor sizes vary between steps (due to variable-length sequences without padding), the allocator must allocate new blocks of different sizes, eventually fragmenting the VRAM into small unusable chunks. Even though total free memory may be sufficient, no single contiguous block is large enough for the next allocation, causing OOM. Constant-sized tensors (via padding) eliminate this fragmentation.
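The effect of constant-sized buffers can be illustrated without a GPU: padding every batch to the same length means every training step requests identically shaped allocations, which the caching allocator can reuse block-for-block. A minimal pure-Python sketch of the padding itself (the pad token id and target length are made-up values):

```python
def pad_to_sequence_len(batch, max_len, pad_token_id=0):
    """Right-pad (and truncate) every sequence so all batches share one shape."""
    return [seq[:max_len] + [pad_token_id] * (max_len - len(seq))
            for seq in batch]


batch = [[11, 12, 13], [21, 22], [31]]
padded = pad_to_sequence_len(batch, max_len=4)
# Every row now has length 4, so the resulting tensor shape is constant
# across steps: (batch, 4) instead of (batch, varying_len).
print(padded)  # [[11, 12, 13, 0], [21, 22, 0, 0], [31, 0, 0, 0]]
```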

The eval batch size mismatch issue occurs because training allocates memory for `micro_batch_size` tensors, but evaluation may allocate for a different size, creating a different memory profile that fragments the existing allocations.
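In practice the fix is simply to tie the two values together in the config. Sketched here as a Python dict standing in for the YAML options named in the rules (the specific values are illustrative):

```python
# Hypothetical Axolotl-style config expressed as a Python dict; the keys
# mirror the YAML options referenced in the rules above.
cfg = {
    "micro_batch_size": 2,
    "gradient_accumulation_steps": 16,
    "eval_batch_size": 2,          # Rule 2: keep equal to micro_batch_size
    "pad_to_sequence_len": True,   # Rule 3: constant-sized buffers
}

# Training and evaluation now request identically sized batches,
# so the memory profile stays stable across phases.
assert cfg["eval_batch_size"] == cfg["micro_batch_size"]
```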

The FP16 full fine-tune error occurs because the gradient unscaling step in mixed-precision training expects FP32 master weights, but full fine-tuning stores weights in FP16, causing dtype mismatches in the gradient computation path.

Code Evidence

Batch size deprecation warning from `src/axolotl/utils/schemas/training.py:166-175`:

@field_validator("batch_size")
@classmethod
def hint_batch_size_set(cls, batch_size):
    if batch_size:
        LOG.warning(
            "%s\n%s",
            "batch_size is not recommended. Please use gradient_accumulation_steps instead.",
            "To calculate the equivalent gradient_accumulation_steps, "
            "divide batch_size / micro_batch_size / number of gpus.",
        )
    return batch_size

Eval batch size mismatch warning from `src/axolotl/utils/schemas/validation.py:266-275`:

if (
    data.get("eval_batch_size")
    and data.get("micro_batch_size")
    and data.get("eval_batch_size") != data.get("micro_batch_size")
):
    LOG.warning(
        "eval_batch_size != micro_batch_size. This can lead to VRAM instability."
    )

FP16 full fine-tune warning from `src/axolotl/utils/schemas/validation.py:391-405`:

if (
    not (self.bf16 or self.bfloat16)
    and (self.fp16 or self.float16)
    and not self.adapter
    and not self.flash_attention
    and self.sample_packing
):
    LOG.warning(
        "Full fine tune w/o FA2 w/ sample packing and fp16/float16 is likely to raise errors. Try LoRA."
    )
    # ValueError: Attempting to unscale FP16 gradients.
    # OR
    # RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::Half

FP8 torch_compile recommendation from `src/axolotl/utils/schemas/validation.py:409-425`:

if data.get("fp8") and not data.get("torch_compile"):
    LOG.warning(
        "torch_compile is strongly recommended for FP8 training in order to "
        "see speed improvements."
    )
