# Heuristic: Unslothai Unsloth Padding Free Packing
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Training, Memory_Management |
| Last Updated | 2026-02-07 09:00 GMT |
## Overview
Unsloth auto-enables padding-free batching for 2x+ faster training, but disables it for VLMs, Gemma 2, GPT-OSS, custom data collators, and when forced logit return is active.
## Description
Padding-free batching eliminates wasted computation on padding tokens by concatenating sequences and tracking boundaries via `packed_seq_lengths` metadata. Unsloth auto-enables this unless blocked by known incompatibilities. When packing is explicitly enabled via `SFTConfig(packing=True)`, sequences are fully packed into fixed-length batches. When only padding-free is enabled (auto or explicit), each sequence retains its individual identity but padding tokens are removed. Both modes set `max_seq_length = infinity` to allow overlength sequences, and inject sequence length metadata into the data collator.
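The core idea — concatenate sequences and track boundaries instead of padding — can be sketched in plain Python. This is an illustrative collator, not Unsloth's implementation; `padding_free_collate` is a hypothetical name, while `packed_seq_lengths` is the metadata key named above.

```python
def padding_free_collate(sequences):
    """Concatenate token sequences into one flat batch, recording each
    sequence's length so attention can respect boundaries (illustrative)."""
    flat = [tok for seq in sequences for tok in seq]
    packed_seq_lengths = [len(seq) for seq in sequences]
    return flat, packed_seq_lengths

# Three sequences of different lengths -- no padding tokens anywhere:
batch = [[101, 7, 8, 102], [101, 5, 102], [101, 9, 9, 9, 102]]
flat, lengths = padding_free_collate(batch)
```

Every element of `flat` is a real token, so no compute is spent on padding; the lengths list is what lets the attention kernel keep sequences from attending to each other.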
## Usage
Let Unsloth auto-enable padding-free (the default) for most text-only SFT workflows. Set `packing=True` in `SFTConfig` explicitly for maximum throughput (2x+ speedup). Disable auto-detection with the `UNSLOTH_DISABLE_AUTO_PADDING_FREE=1` environment variable if you observe training instability. Packing is automatically disabled for:
- Vision-language models (VLMs)
- Gemma 2 (slow_attention_softcapping issues)
- GPT-OSS (Flex Attention incompatibility)
- Custom data collators
- Forced logit return (`UNSLOTH_RETURN_LOGITS=1`)
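A minimal sketch of the two opt-in/opt-out knobs described above. The environment variable must be set before Unsloth is imported; the `SFTConfig` line is commented out here to keep the sketch dependency-free (`SFTConfig` is trl's config class, and `packing` is a real trl option).

```python
import os

# Kill switch: opt out of auto padding-free before importing Unsloth.
os.environ["UNSLOTH_DISABLE_AUTO_PADDING_FREE"] = "1"

# Or, for maximum throughput on text-only SFT, request full packing:
# from trl import SFTConfig
# config = SFTConfig(packing=True, output_dir="outputs")
```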
## The Insight (Rule of Thumb)
- Action: Trust Unsloth's auto-detection for padding-free. Set `packing=True` for explicit maximum speedup on text-only SFT.
- Value: 2x+ training speedup from eliminating padding computation.
- Trade-off: Packing changes the effective batch composition (multiple sequences per batch slot). Some custom collators or evaluation setups may break. Padding-free mode is safer (no sequence mixing) but still removes padding waste.
- Compatibility: Blocked for VLMs, Gemma 2, GPT-OSS. Graceful fallback if packing fails at runtime.
## Reasoning
In standard training, short sequences are padded to `max_seq_length`, wasting ~30-70% of compute on padding tokens. Padding-free mode eliminates this waste, while packing mode goes further by packing multiple sequences into a single training slot. The blocklist exists because certain models have attention implementations that cannot handle variable-length sequences (Gemma 2's softcap attention, GPT-OSS's Flex Attention). The graceful fallback ensures that if packing causes a ValueError at trainer initialization, it is silently disabled rather than crashing.
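A back-of-envelope calculation makes the waste figure concrete. The batch lengths below are hypothetical, chosen only to illustrate how a typical mix of short sequences lands in the ~30-70% waste range quoted above.

```python
# Hypothetical batch: five sequences padded to a 2048-token budget.
max_seq_length = 2048
seq_lengths = [300, 512, 1024, 150, 700]

total_slots = max_seq_length * len(seq_lengths)  # tokens actually computed
real_tokens = sum(seq_lengths)                   # tokens that matter
waste = 1 - real_tokens / total_slots
# With these lengths, roughly 74% of compute goes to padding tokens.
```

Padding-free mode drops `waste` to zero for this batch, since only the 2,686 real tokens are ever computed.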
Blocklist from `trainer.py:57-60`:
```python
PADDING_FREE_BLOCKLIST = {
    "gemma2",   # Uses slow_attention_softcapping with torch.compile issues
    "gpt_oss",  # Uses Flex Attention which doesn't handle padding_free correctly
}
```
Auto-detection logic from `trainer.py:69-76`:
```python
def _should_auto_padding_free(config) -> bool:
    if (
        config is None
        or _AUTO_PADDING_FREE_ENV_DISABLED
        or getattr(config, "packing", False)
    ):
        return False
    return not getattr(config, "padding_free", False)
```
Blocked conditions from `trainer.py:317-340`:
```python
blocked = (
    (data_collator is not None)
    or isinstance(processing_class, ProcessorMixin)
    or is_vlm
    or is_unsupported_model
    or (os.environ.get("UNSLOTH_RETURN_LOGITS", "0") == "1")
)
```
Graceful fallback from `trainer.py:366-375`:
```python
try:
    original_init(self, *args, **kwargs)
except ValueError as exc:
    if packing_active and _should_skip_auto_packing_error(exc):
        _disable_sample_packing(config_arg)
        packing_active = False
        original_init(self, *args, **kwargs)
    else:
        raise
```