# Heuristic: unslothai/unsloth Gradient Checkpointing Modes
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Memory_Management |
| Last Updated | 2026-02-07 09:00 GMT |
## Overview
Unsloth's `"unsloth"` gradient checkpointing mode offloads embeddings to disk for 50-60% VRAM savings on sequences >= 512 tokens, but falls back to standard checkpointing for shorter sequences where the overhead is not worth it.
## Description
Unsloth provides three gradient checkpointing modes: `False` (disabled), `True` (standard PyTorch gradient checkpointing), and `"unsloth"` (enhanced mode with embedding offloading). The `"unsloth"` mode offloads input and output embeddings to disk during training, then reloads them for the backward pass. This provides significant VRAM savings but adds I/O overhead. A smart heuristic automatically downgrades from `"unsloth"` to `True` when `max_seq_length < 512`, as benchmarks show the crossover point where offloading becomes beneficial is around 384-512 tokens.
## Usage
Use `"unsloth"` mode (the default) when fine-tuning models with sequence lengths >= 512 tokens, which is the common case for instruction tuning and RLHF. Switch to `True` for short-sequence tasks (classification, short QA) where the embedding offload overhead exceeds the memory savings. Set to `False` only if you have abundant VRAM and want maximum training speed.
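The guidance above can be condensed into a small decision helper. This is purely illustrative: `pick_mode` and the VRAM-headroom threshold are hypothetical and not part of Unsloth's API.

```python
def pick_mode(max_seq_length: int, vram_headroom_gb: float):
    """Illustrative decision helper mirroring the usage guidance (hypothetical)."""
    if vram_headroom_gb > 20:
        return False        # abundant VRAM: skip checkpointing for maximum speed
    if max_seq_length >= 512:
        return "unsloth"    # long sequences: embedding offload pays off
    return True             # short sequences: standard checkpointing is cheaper


print(pick_mode(2048, 4.0))   # "unsloth"
print(pick_mode(128, 4.0))    # True
print(pick_mode(2048, 40.0))  # False
```

In practice you rarely need even this much logic: leaving the default `"unsloth"` lets Unsloth's own heuristic handle the short-sequence downgrade for you.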
## The Insight (Rule of Thumb)
- Action: Set `use_gradient_checkpointing="unsloth"` in `get_peft_model()` (this is the default).
- Value: The system auto-selects: `"unsloth"` for seq_len >= 512, `True` for seq_len < 512.
- Trade-off: `"unsloth"` mode reduces VRAM by ~50-60% at the cost of disk I/O for embedding offload. Standard `True` mode reduces VRAM by ~30-40% with ~20% slower training.
- Compatibility: Works with all Transformer models supported by Unsloth. Requires `use_reentrant=True` for gradient checkpointing.
## Reasoning
The `"unsloth"` mode offloads the input and output embedding matrices to disk. These are among the largest single tensors in a model (`vocab_size x hidden_dim`); for a 7B model with a 128K vocabulary, each embedding matrix consumes ~1 GB of VRAM in fp16. For sequences >= 512 tokens, activation memory dominates, so the I/O cost of offloading is amortized over the larger per-step computation. For shorter sequences, activation memory is small enough that the disk I/O overhead exceeds the VRAM savings, making standard gradient checkpointing the more efficient choice.
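The ~1 GB figure checks out with back-of-the-envelope arithmetic (assuming fp16/bf16 weights and a 4096 hidden dimension, typical for a 7B model):

```python
vocab_size = 128_000   # 128K-token vocabulary
hidden_dim = 4096      # typical hidden size for a 7B model
bytes_per_param = 2    # fp16 / bf16

embedding_bytes = vocab_size * hidden_dim * bytes_per_param
print(f"{embedding_bytes / 1e9:.2f} GB per embedding matrix")  # -> 1.05 GB
```

With untied input and output embeddings, offloading both frees roughly twice that amount.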
Code evidence from `models/_utils.py:156-187`:
```python
def apply_unsloth_gradient_checkpointing(
    use_gradient_checkpointing, max_seq_length, dtype
):
    if use_gradient_checkpointing == "unsloth":
        # Gradient offloading overhead is not worth it for small sequences.
        # Benchmarks show crossover point is around seq_len 384-512.
        if max_seq_length < 512:
            unpatch_unsloth_smart_gradient_checkpointing()
            return True
        else:
            patch_unsloth_smart_gradient_checkpointing(dtype = dtype)
            return "unsloth"
    elif use_gradient_checkpointing in (True, False):
        unpatch_unsloth_smart_gradient_checkpointing()
        return use_gradient_checkpointing
    return use_gradient_checkpointing
```
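The selection logic above can be exercised in isolation with stub patch functions. This is a standalone sketch: the real `patch_`/`unpatch_` helpers modify model internals, and `select_mode` is a hypothetical stand-in, not an Unsloth function.

```python
def patch(dtype=None):
    pass  # stand-in for patch_unsloth_smart_gradient_checkpointing

def unpatch():
    pass  # stand-in for unpatch_unsloth_smart_gradient_checkpointing

def select_mode(mode, max_seq_length):
    """Mirrors the selection logic of apply_unsloth_gradient_checkpointing."""
    if mode == "unsloth":
        if max_seq_length < 512:
            unpatch()
            return True       # downgrade: offload overhead not worth it
        patch()
        return "unsloth"
    unpatch()
    return mode

print(select_mode("unsloth", 256))   # True  (downgraded)
print(select_mode("unsloth", 2048))  # "unsloth"
print(select_mode(False, 2048))      # False
```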
Embedding offloading from `models/llama.py:3014-3031`:
```python
if use_gradient_checkpointing == "unsloth":
    if train_embed_tokens:
        print("Unsloth: Offloading input_embeddings to disk to save VRAM")
        offload_input_embeddings(model, temporary_location)
        for _ in range(3):
            gc.collect()
        clean_gpu_cache()
    if train_lm_head:
        print("Unsloth: Offloading output_embeddings to disk to save VRAM")
        offload_output_embeddings(model, temporary_location)
```
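The offload pattern itself is simple: write the tensor to a temporary file, drop the resident copy, and memory-map the file back so pages are read lazily when needed. A minimal NumPy sketch of the idea (illustrative only; the real helpers operate on the model's embedding modules and GPU memory):

```python
import os
import tempfile
import numpy as np

def offload(array, directory, name):
    """Write an array to disk and return a read-only memory-mapped view."""
    path = os.path.join(directory, f"{name}.npy")
    np.save(path, array)
    return np.load(path, mmap_mode="r")  # pages are read lazily on access

with tempfile.TemporaryDirectory() as tmp:
    # Toy stand-in for an embedding matrix.
    embeddings = np.ones((512, 64), dtype=np.float16)
    offloaded = offload(embeddings, tmp, "input_embeddings")
    del embeddings                                  # free the resident copy
    total = float(offloaded.sum(dtype=np.float32))  # reloaded transparently
    print(total)
```

The repeated `gc.collect()` / cache-clear calls in the real code serve the same purpose as the `del` here: ensuring the GPU copy is actually released before training proceeds.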