# Heuristic: Unslothai Unsloth Padding Free Packing
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Training, Memory_Management |
| Last Updated | 2026-02-07 09:00 GMT |
## Overview
Unsloth auto-enables padding-free batching for 2x+ faster training, but disables it for VLMs, Gemma 2, GPT-OSS, custom data collators, and when forced logit return is active.
## Description
Padding-free batching eliminates wasted computation on padding tokens by concatenating sequences and tracking boundaries via `packed_seq_lengths` metadata. Unsloth auto-enables this unless blocked by known incompatibilities. When packing is explicitly enabled via `SFTConfig(packing=True)`, sequences are fully packed into fixed-length batches. When only padding-free is enabled (auto or explicit), each sequence retains its individual identity but padding tokens are removed. Both modes set `max_seq_length = infinity` to allow overlength sequences, and inject sequence length metadata into the data collator.
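The core idea — concatenate sequences and track boundaries instead of padding — can be sketched in plain Python. This is an illustrative collator, not Unsloth's implementation; `padding_free_collate` is a hypothetical name, while `packed_seq_lengths` is the metadata key named above.

```python
def padding_free_collate(sequences):
    """Concatenate token sequences into one flat batch, recording each
    sequence's length so attention can respect boundaries (illustrative)."""
    flat = [tok for seq in sequences for tok in seq]
    packed_seq_lengths = [len(seq) for seq in sequences]
    return flat, packed_seq_lengths

# Three sequences of different lengths -- no padding tokens anywhere:
batch = [[101, 7, 8, 102], [101, 5, 102], [101, 9, 9, 9, 102]]
flat, lengths = padding_free_collate(batch)
```

Every element of `flat` is a real token, so no compute is spent on padding; the lengths list is what lets the attention kernel keep sequences from attending to each other.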
## Usage
Let Unsloth auto-enable padding-free (the default) for most text-only SFT workflows. Set `packing=True` in `SFTConfig` explicitly for maximum throughput (2x+ speedup). Disable auto-detection with the `UNSLOTH_DISABLE_AUTO_PADDING_FREE=1` environment variable if you observe training instability. Packing is automatically disabled for:
- Vision-language models (VLMs)
- Gemma 2 (slow_attention_softcapping issues)
- GPT-OSS (Flex Attention incompatibility)
- Custom data collators
- Forced logit return (`UNSLOTH_RETURN_LOGITS=1`)
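A minimal sketch of the two opt-in/opt-out knobs described above. The environment variable must be set before Unsloth is imported; the `SFTConfig` line is commented out here to keep the sketch dependency-free (`SFTConfig` is trl's config class, and `packing` is a real trl option).

```python
import os

# Kill switch: opt out of auto padding-free before importing Unsloth.
os.environ["UNSLOTH_DISABLE_AUTO_PADDING_FREE"] = "1"

# Or, for maximum throughput on text-only SFT, request full packing:
# from trl import SFTConfig
# config = SFTConfig(packing=True, output_dir="outputs")
```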
## The Insight (Rule of Thumb)
- Action: Trust Unsloth's auto-detection for padding-free. Set `packing=True` for explicit maximum speedup on text-only SFT.
- Value: 2x+ training speedup from eliminating padding computation.
- Trade-off: Packing changes the effective batch composition (multiple sequences per batch slot). Some custom collators or evaluation setups may break. Padding-free mode is safer (no sequence mixing) but still removes padding waste.
- Compatibility: Blocked for VLMs, Gemma 2, GPT-OSS. Graceful fallback if packing fails at runtime.
## Reasoning
In standard training, short sequences are padded to `max_seq_length`, wasting ~30-70% of compute on padding tokens. Padding-free mode eliminates this waste, while packing mode goes further by packing multiple sequences into a single training slot. The blocklist exists because certain models have attention implementations that cannot handle variable-length sequences (Gemma 2's softcap attention, GPT-OSS's Flex Attention). The graceful fallback ensures that if packing causes a ValueError at trainer initialization, it is silently disabled rather than crashing.
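A back-of-envelope calculation makes the waste figure concrete. The batch lengths below are hypothetical, chosen only to illustrate how a typical mix of short sequences lands in the ~30-70% waste range quoted above.

```python
# Hypothetical batch: five sequences padded to a 2048-token budget.
max_seq_length = 2048
seq_lengths = [300, 512, 1024, 150, 700]

total_slots = max_seq_length * len(seq_lengths)  # tokens actually computed
real_tokens = sum(seq_lengths)                   # tokens that matter
waste = 1 - real_tokens / total_slots
# With these lengths, roughly 74% of compute goes to padding tokens.
```

Padding-free mode drops `waste` to zero for this batch, since only the 2,686 real tokens are ever computed.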
Blocklist from `trainer.py:57-60`:
```python
PADDING_FREE_BLOCKLIST = {
    "gemma2",   # Uses slow_attention_softcapping with torch.compile issues
    "gpt_oss",  # Uses Flex Attention which doesn't handle padding_free correctly
}
```
Auto-detection logic from `trainer.py:69-76`:
```python
def _should_auto_padding_free(config) -> bool:
    if (
        config is None
        or _AUTO_PADDING_FREE_ENV_DISABLED
        or getattr(config, "packing", False)
    ):
        return False
    return not getattr(config, "padding_free", False)
```
Blocked conditions from `trainer.py:317-340`:
```python
blocked = (
    (data_collator is not None)
    or isinstance(processing_class, ProcessorMixin)
    or is_vlm
    or is_unsupported_model
    or (os.environ.get("UNSLOTH_RETURN_LOGITS", "0") == "1")
)
```
Graceful fallback from `trainer.py:366-375`:
```python
try:
    original_init(self, *args, **kwargs)
except ValueError as exc:
    if packing_active and _should_skip_auto_packing_error(exc):
        _disable_sample_packing(config_arg)
        packing_active = False
        original_init(self, *args, **kwargs)
    else:
        raise
```