# Heuristic: unslothai/unsloth Gradient Checkpointing Modes
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Memory_Management |
| Last Updated | 2026-02-07 09:00 GMT |
## Overview
Unsloth's `"unsloth"` gradient checkpointing mode offloads embeddings to disk for 50-60% VRAM savings on sequences >= 512 tokens, but falls back to standard checkpointing for shorter sequences where the overhead is not worth it.
## Description
Unsloth provides three gradient checkpointing modes: `False` (disabled), `True` (standard PyTorch gradient checkpointing), and `"unsloth"` (enhanced mode with embedding offloading). The `"unsloth"` mode offloads input and output embeddings to disk during training, then reloads them for the backward pass. This provides significant VRAM savings but adds I/O overhead. A smart heuristic automatically downgrades from `"unsloth"` to `True` when `max_seq_length < 512`, as benchmarks show the crossover point where offloading becomes beneficial is around 384-512 tokens.
## Usage
Use `"unsloth"` mode (the default) when fine-tuning models with sequence lengths >= 512 tokens, which is the common case for instruction tuning and RLHF. Switch to `True` for short-sequence tasks (classification, short QA) where the embedding offload overhead exceeds the memory savings. Set to `False` only if you have abundant VRAM and want maximum training speed.
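The guidance above can be condensed into a small decision helper. This is purely illustrative: `pick_mode` and the VRAM-headroom threshold are hypothetical and not part of Unsloth's API.

```python
def pick_mode(max_seq_length: int, vram_headroom_gb: float):
    """Illustrative decision helper mirroring the usage guidance (hypothetical)."""
    if vram_headroom_gb > 20:
        return False        # abundant VRAM: skip checkpointing for maximum speed
    if max_seq_length >= 512:
        return "unsloth"    # long sequences: embedding offload pays off
    return True             # short sequences: standard checkpointing is cheaper


print(pick_mode(2048, 4.0))   # "unsloth"
print(pick_mode(128, 4.0))    # True
print(pick_mode(2048, 40.0))  # False
```

In practice you rarely need even this much logic: leaving the default `"unsloth"` lets Unsloth's own heuristic handle the short-sequence downgrade for you.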
## The Insight (Rule of Thumb)
- Action: Set `use_gradient_checkpointing="unsloth"` in `get_peft_model()` (this is the default).
- Value: The system auto-selects: `"unsloth"` for seq_len >= 512, `True` for seq_len < 512.
- Trade-off: `"unsloth"` mode reduces VRAM by ~50-60% at the cost of disk I/O for embedding offload. Standard `True` mode reduces VRAM by ~30-40% with ~20% slower training.
- Compatibility: Works with all Transformer models supported by Unsloth. Requires `use_reentrant=True` for gradient checkpointing.
## Reasoning
The `"unsloth"` mode offloads the input and output embedding matrices to disk. These are among the largest single tensors in a model (`vocab_size x hidden_dim`); for a 7B model with a 128K vocabulary, each embedding matrix consumes ~1 GB of VRAM in fp16. For sequences >= 512 tokens, activation memory dominates, so the I/O cost of offloading is amortized over the larger per-step computation. For shorter sequences, activation memory is small enough that the disk I/O overhead exceeds the VRAM savings, making standard gradient checkpointing the more efficient choice.
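The ~1 GB figure checks out with back-of-the-envelope arithmetic (assuming fp16/bf16 weights and a 4096 hidden dimension, typical for a 7B model):

```python
vocab_size = 128_000   # 128K-token vocabulary
hidden_dim = 4096      # typical hidden size for a 7B model
bytes_per_param = 2    # fp16 / bf16

embedding_bytes = vocab_size * hidden_dim * bytes_per_param
print(f"{embedding_bytes / 1e9:.2f} GB per embedding matrix")  # -> 1.05 GB
```

With untied input and output embeddings, offloading both frees roughly twice that amount.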
Code evidence from `models/_utils.py:156-187`:
```python
def apply_unsloth_gradient_checkpointing(
    use_gradient_checkpointing, max_seq_length, dtype
):
    if use_gradient_checkpointing == "unsloth":
        # Gradient offloading overhead is not worth it for small sequences.
        # Benchmarks show crossover point is around seq_len 384-512.
        if max_seq_length < 512:
            unpatch_unsloth_smart_gradient_checkpointing()
            return True
        else:
            patch_unsloth_smart_gradient_checkpointing(dtype = dtype)
            return "unsloth"
    elif use_gradient_checkpointing in (True, False):
        unpatch_unsloth_smart_gradient_checkpointing()
        return use_gradient_checkpointing
    return use_gradient_checkpointing
```
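The selection logic above can be exercised in isolation with stub patch functions. This is a standalone sketch: the real `patch_`/`unpatch_` helpers modify model internals, and `select_mode` is a hypothetical stand-in, not an Unsloth function.

```python
def patch(dtype=None):
    pass  # stand-in for patch_unsloth_smart_gradient_checkpointing

def unpatch():
    pass  # stand-in for unpatch_unsloth_smart_gradient_checkpointing

def select_mode(mode, max_seq_length):
    """Mirrors the selection logic of apply_unsloth_gradient_checkpointing."""
    if mode == "unsloth":
        if max_seq_length < 512:
            unpatch()
            return True       # downgrade: offload overhead not worth it
        patch()
        return "unsloth"
    unpatch()
    return mode

print(select_mode("unsloth", 256))   # True  (downgraded)
print(select_mode("unsloth", 2048))  # "unsloth"
print(select_mode(False, 2048))      # False
```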
Embedding offloading from `models/llama.py:3014-3031`:
```python
if use_gradient_checkpointing == "unsloth":
    if train_embed_tokens:
        print("Unsloth: Offloading input_embeddings to disk to save VRAM")
        offload_input_embeddings(model, temporary_location)
        for _ in range(3):
            gc.collect()
        clean_gpu_cache()
    if train_lm_head:
        print("Unsloth: Offloading output_embeddings to disk to save VRAM")
        offload_output_embeddings(model, temporary_location)
```
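The offload pattern itself is simple: write the tensor to a temporary file, drop the resident copy, and memory-map the file back so pages are read lazily when needed. A minimal NumPy sketch of the idea (illustrative only; the real helpers operate on the model's embedding modules and GPU memory):

```python
import os
import tempfile
import numpy as np

def offload(array, directory, name):
    """Write an array to disk and return a read-only memory-mapped view."""
    path = os.path.join(directory, f"{name}.npy")
    np.save(path, array)
    return np.load(path, mmap_mode="r")  # pages are read lazily on access

with tempfile.TemporaryDirectory() as tmp:
    # Toy stand-in for an embedding matrix.
    embeddings = np.ones((512, 64), dtype=np.float16)
    offloaded = offload(embeddings, tmp, "input_embeddings")
    del embeddings                                  # free the resident copy
    total = float(offloaded.sum(dtype=np.float32))  # reloaded transparently
    print(total)
```

The repeated `gc.collect()` / cache-clear calls in the real code serve the same purpose as the `del` here: ensuring the GPU copy is actually released before training proceeds.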