# Heuristic: Axolotl Sample Packing Best Practices
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLMs, Memory_Management |
| Last Updated | 2026-02-06 22:33 GMT |
## Overview
Configuration rules and requirements for sample packing to maximize GPU utilization while preventing memory leaks and cross-sample contamination.
## Description
Sample packing (sequence packing) combines multiple shorter training examples into a single sequence up to the maximum length, dramatically improving GPU utilization by reducing padding waste. However, it has strict requirements around attention mechanisms, padding configuration, and compatibility with other features. Violating these rules leads to memory leaks, cross-sample contamination (training signals leaking between packed examples), or outright errors.
## Usage
Apply these rules whenever `sample_packing: true` is set in the training configuration. Sample packing is the recommended approach for SFT training when examples vary significantly in length, but it requires careful configuration to work correctly.
## The Insight (Rule of Thumb)
- Rule 1 - Require Optimized Attention: Sample packing MUST be used with one of: `flash_attention`, `sdp_attention`, `flex_attention`, or `xformers_attention`. Without these, cross-sample contamination occurs because standard attention cannot mask between packed samples.
- Rule 2 - Enable Padding: Always set `pad_to_sequence_len: true` when using sample packing. This prevents memory leaks by ensuring constant-sized buffers that allow efficient memory reuse.
- Rule 3 - No RL Training: Sample packing is incompatible with RL training (DPO, KTO, ORPO, GRPO). This combination raises a ValueError.
- Rule 4 - Eval Packing Alignment: If `sample_packing: true` and `eval_sample_packing` is not set, it defaults to true. If `eval_sample_packing: false`, then `remove_unused_columns` must be false.
- Rule 5 - No Eval Table with Packing: Cannot use `eval_table_size` when `sample_packing: true` unless `eval_sample_packing: false`.
- Rule 6 - Batch Flattening Exclusion: `batch_flattening` is incompatible with `sample_packing`.
- Value - Packing Parameters: Default `sample_packing_group_size: 100000` and `sample_packing_bin_size: 200`. Increasing group_size gives less than 1% improvement. Increase bin_size for large sequence_len with many short samples.
- Trade-off: Higher GPU utilization and faster training vs. increased configuration complexity and attention mechanism requirements.
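Rules 1-3 can be expressed as a simple pre-flight check. The sketch below is illustrative only, not Axolotl's actual API: the `check_packing_config` helper and its return format are hypothetical, though the option names it inspects come from the rules above.

```python
# Option names taken from the rules above; the helper itself is hypothetical.
ATTENTION_BACKENDS = (
    "flash_attention", "sdp_attention", "flex_attention", "xformers_attention"
)
RL_METHODS = ("dpo", "kto", "orpo", "grpo")

def check_packing_config(cfg: dict) -> list:
    """Return a list of human-readable problems with a sample-packing config."""
    problems = []
    if not cfg.get("sample_packing"):
        return problems  # the rules only apply when packing is enabled
    if not any(cfg.get(backend) for backend in ATTENTION_BACKENDS):
        problems.append("Rule 1: enable flash, sdp, flex, or xformers attention")
    if not cfg.get("pad_to_sequence_len"):
        problems.append("Rule 2: set pad_to_sequence_len: true")
    if cfg.get("rl") in RL_METHODS:
        problems.append("Rule 3: sample packing is incompatible with RL training")
    return problems
```

Running this against a config before training surfaces all violations at once, rather than failing on the first one at runtime.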
## Reasoning
Sample packing uses First-Fit Decreasing (FFD) bin packing to combine variable-length sequences. Without optimized attention that supports per-sample masking (Flash Attention, SDP, etc.), the self-attention mechanism allows tokens from different samples to attend to each other, causing cross-sample contamination that corrupts training signals.
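Axolotl's actual packer is more elaborate (it works in groups and bins, per the parameters above), but the core FFD idea can be sketched in a few lines. The function name `ffd_pack` is hypothetical; it packs sample lengths, longest first, into the first bin with enough remaining room.

```python
def ffd_pack(lengths, max_len):
    """First-Fit Decreasing bin packing (illustrative sketch).

    Sort samples longest-first, then place each into the first bin
    with enough remaining capacity, opening a new bin otherwise.
    Returns bins as lists of sample lengths.
    """
    bins = []       # packed sample lengths per bin
    remaining = []  # remaining capacity per bin
    for length in sorted(lengths, reverse=True):
        for i, room in enumerate(remaining):
            if length <= room:
                bins[i].append(length)
                remaining[i] -= length
                break
        else:
            # no existing bin fits: open a new one
            bins.append([length])
            remaining.append(max_len - length)
    return bins
```

For example, samples of length 800, 300, 700, and 200 with `max_len=1024` pack into two bins (800+200 and 700+300) instead of four padded sequences, cutting padding waste substantially.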
The `pad_to_sequence_len` requirement prevents memory fragmentation: without padding, each batch can have a different tensor size, causing PyTorch's CUDA memory allocator to fragment VRAM over time, eventually leading to OOM errors even when total memory usage is within limits. Constant-sized tensors allow memory blocks to be reused efficiently.
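The effect of `pad_to_sequence_len` amounts to the following (a minimal sketch with a hypothetical `pad_batch` helper; real padding happens on tensors in the collator):

```python
def pad_batch(token_batches, sequence_len, pad_id=0):
    """Pad every packed sequence to a constant sequence_len (sketch).

    Constant-shaped batches let the CUDA caching allocator reuse the
    same memory blocks across steps instead of fragmenting VRAM with
    many differently sized allocations.
    """
    return [seq + [pad_id] * (sequence_len - len(seq)) for seq in token_batches]
```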
The RL incompatibility exists because RL training (DPO, etc.) requires pairs of examples (chosen/rejected) to be processed together, which conflicts with the arbitrary packing order.
## Code Evidence
Sample packing attention requirement from `src/axolotl/utils/schemas/validation.py:181-192`:
```python
@model_validator(mode="before")
@classmethod
def check_sample_packing_without_attention(cls, data):
    if (
        data.get("sample_packing")
        and not data.get("flash_attention")
        and not data.get("sdp_attention")
        and not data.get("flex_attention")
        and not data.get("xformers_attention")
    ):
        LOG.warning(
            "sample_packing without flash, sdp, xformers or flex attention does not handle cross sample decontamination."
        )
```
Padding requirement from `src/axolotl/utils/schemas/validation.py:230-241`:
```python
@model_validator(mode="before")
@classmethod
def hint_sample_packing_padding(cls, data):
    if data.get("sample_packing"):
        pad_to_sequence_len = data.get("pad_to_sequence_len")
        if pad_to_sequence_len is False:
            LOG.warning(
                "`pad_to_sequence_len: true` is recommended when using sample_packing"
            )
        elif pad_to_sequence_len is None:
            LOG.info(
                "Setting `pad_to_sequence_len: true` to prevent memory leaks when sample_packing"
            )
            data["pad_to_sequence_len"] = True
```
RL incompatibility from `src/axolotl/utils/schemas/validation.py:697-700`:
```python
@model_validator(mode="before")
@classmethod
def check_sample_packing_w_rl(cls, data):
    if data.get("sample_packing") and data.get("rl"):
        raise ValueError("`sample_packing: true` does not work with RLHF training")
```