# Heuristic: OpenRLHF Packing Samples Efficiency Tip
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLMs, Deep_Learning |
| Last Updated | 2026-02-07 10:00 GMT |
## Overview
Enable `--packing_samples` to concatenate multiple sequences into fixed-length batches, eliminating padding waste and improving GPU utilization.
## Description
Standard batching pads all sequences to the longest length in the batch, wasting GPU computation on padding tokens. Sample packing concatenates multiple shorter sequences into a single fixed-length input, using attention masking (via Flash Attention) to prevent cross-contamination between packed sequences. This dramatically improves GPU utilization, especially when sequence lengths vary widely. OpenRLHF requires Flash Attention 2 for packing and automatically enforces `use_cache=False` during training.
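The mechanism above can be sketched in a few lines. This is an illustrative sketch, not OpenRLHF's internals: it concatenates token-id lists into one packed row and builds the cumulative-sequence-length index (commonly called `cu_seqlens`) that Flash Attention's variable-length kernels use to keep packed sequences from attending to each other.

```python
# Illustrative sketch of sample packing (names are assumptions, not
# OpenRLHF's actual code). Flash Attention varlen kernels take a list of
# cumulative boundaries so attention never crosses a sequence boundary.

def pack_sequences(seqs):
    """Concatenate token-id lists; return (packed_ids, cu_seqlens)."""
    packed = []
    cu_seqlens = [0]  # attention masks are derived from these offsets
    for seq in seqs:
        packed.extend(seq)
        cu_seqlens.append(cu_seqlens[-1] + len(seq))
    return packed, cu_seqlens

packed, cu_seqlens = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]])
# packed     == [1, 2, 3, 4, 5, 6, 7, 8, 9]
# cu_seqlens == [0, 3, 5, 9]  -> three isolated sequences in one row
```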
## Usage
Use this heuristic for all training workflows where sequence lengths vary significantly. Enable with `--packing_samples`. This is a standard best practice recommended in OpenRLHF documentation. Combine with `--use_dynamic_batch` for maximum efficiency with variable-length data.
## The Insight (Rule of Thumb)
- Action: Add `--packing_samples` to the training command.
- Value: Can improve throughput by 20-50%+ depending on sequence length variance.
- Trade-off: Requires Flash Attention 2+ (auto-enforced). KV cache must be disabled during training.
- Interaction: Combine with `--use_dynamic_batch` which sets micro batch size to 1 for variable-length packed sequences.
## Reasoning
In a typical RLHF dataset, response lengths vary dramatically (e.g., 50 to 2048 tokens). Without packing, a batch with one 2048-token sequence and seven 50-token sequences spends roughly 85% of its compute on padding; each 50-token row, padded out to 2048, is ~97% padding on its own. Packing concatenates these into a single 2398-token sequence, using Flash Attention's variable-length masking to keep sequences isolated. The `use_cache=False` requirement exists because packed sequences have variable internal structure incompatible with KV cache assumptions.
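The waste figures in the example above are straightforward arithmetic and can be checked directly:

```python
# Check the padding-waste numbers for the example batch:
# one 2048-token sequence plus seven 50-token sequences.
lengths = [2048] + [50] * 7

padded_total = len(lengths) * max(lengths)  # every row padded to 2048
real_total = sum(lengths)                   # 2398 useful tokens

waste = 1 - real_total / padded_total       # padding share of the batch
short_waste = 1 - 50 / 2048                 # padding share of one short row

print(f"batch waste: {waste:.1%}")          # 85.4%
print(f"per-short-row waste: {short_waste:.1%}")  # 97.6%
```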
Code evidence for packing flag from `openrlhf/cli/train_dpo.py:256-257`:
```python
# packing samples using Flash Attention2
parser.add_argument("--packing_samples", action="store_true", default=False)
```
Flash Attention enforcement from `openrlhf/cli/train_dpo.py:314-316`:
```python
if args.packing_samples and "flash_attention" not in args.attn_implementation:
    print("[Warning] Please use --attn_implementation with flash_attention...")
    args.attn_implementation = "flash_attention_2"
```
KV cache disabled for training from `openrlhf/models/model.py:151-152`:
```python
# https://github.com/huggingface/transformers/issues/26877
model.config.use_cache = False
```
Dynamic batch interaction from `openrlhf/utils/deepspeed/deepspeed.py:267-268`:
```python
if self.use_dynamic_batch:
    ds_config["train_micro_batch_size_per_gpu"] = 1
```
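Why a micro batch size of 1 pairs naturally with packing: each "micro batch" becomes one packed row filled greedily up to a token budget, so batch size is measured in tokens rather than sequences. The sketch below illustrates that idea; it is an assumption for exposition, not OpenRLHF's actual batching code.

```python
# Illustrative greedy token-budget packing (not OpenRLHF's implementation).
# Each pack holds as many sequences as fit under max_tokens; a sequence
# longer than the budget still gets its own pack rather than being split.

def pack_to_budget(lengths, max_tokens):
    """Group sequence lengths into packs of at most max_tokens each."""
    packs, current, used = [], [], 0
    for n in lengths:
        if used + n > max_tokens and current:
            packs.append(current)
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        packs.append(current)
    return packs

packs = pack_to_budget([2048, 50, 50, 50, 50, 50, 50, 50], 2300)
# → [[2048, 50, 50, 50, 50, 50], [50, 50]]
```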