
Heuristic: OpenRLHF Packing Samples Efficiency Tip

From Leeroopedia




Knowledge Sources
Domains Optimization, LLMs, Deep_Learning
Last Updated 2026-02-07 10:00 GMT

Overview

Enable `--packing_samples` to concatenate multiple sequences into fixed-length batches, eliminating padding waste and improving GPU utilization.

Description

Standard batching pads all sequences to the longest length in the batch, wasting GPU computation on padding tokens. Sample packing concatenates multiple shorter sequences into a single fixed-length input, using attention masking (via Flash Attention) to prevent cross-contamination between packed sequences. This dramatically improves GPU utilization, especially when sequence lengths vary widely. OpenRLHF requires Flash Attention 2 for packing and automatically enforces `use_cache=False` during training.
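The mechanism can be illustrated with a minimal sketch (not OpenRLHF's actual implementation): sequences are concatenated into one row, and cumulative-length offsets record where each sequence starts and ends, which is how Flash Attention's variable-length kernels keep packed sequences from attending to each other.

```python
# Illustrative sketch only: pack variable-length token sequences into a single
# row and record per-sequence boundaries (FlashAttention-style cu_seqlens).
from typing import List, Tuple

def pack_sequences(seqs: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Concatenate token sequences; return (packed_tokens, cu_seqlens), where
    cu_seqlens[i] is the offset at which sequence i begins in the packed row."""
    packed: List[int] = []
    cu_seqlens = [0]
    for seq in seqs:
        packed.extend(seq)
        cu_seqlens.append(cu_seqlens[-1] + len(seq))
    return packed, cu_seqlens

tokens, cu = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]])
# tokens == [1, 2, 3, 4, 5, 6, 7, 8, 9]; cu == [0, 3, 5, 9]
# Every position holds a real token -- no padding anywhere in the row.
```

The attention kernel then treats each `[cu[i], cu[i+1])` slice as an independent sequence, so no cross-contamination occurs despite the shared row.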

Usage

Use this heuristic for all training workflows where sequence lengths vary significantly. Enable with `--packing_samples`. This is a standard best practice recommended in OpenRLHF documentation. Combine with `--use_dynamic_batch` for maximum efficiency with variable-length data.
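A launch sketch follows. Only `--packing_samples` and `--use_dynamic_batch` come from this page; the module path matches the code references cited here, but every other argument (model, dataset) is a placeholder you would replace with your own values.

```python
# Hypothetical launch sketch -- flags other than --packing_samples and
# --use_dynamic_batch are placeholders, not verified OpenRLHF defaults.
import subprocess

cmd = [
    "deepspeed", "--module", "openrlhf.cli.train_dpo",
    "--pretrain", "your/base-model",          # placeholder
    "--dataset", "your/preference-dataset",   # placeholder
    "--packing_samples",                      # enable sample packing
    "--use_dynamic_batch",                    # pair with packing for variable lengths
]
# subprocess.run(cmd, check=True)  # uncomment to actually launch training
```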

The Insight (Rule of Thumb)

  • Action: Add `--packing_samples` to the training command.
  • Value: Can improve throughput by 20-50%+ depending on sequence length variance.
  • Trade-off: Requires Flash Attention 2+ (auto-enforced). KV cache must be disabled during training.
  • Interaction: Combine with `--use_dynamic_batch` which sets micro batch size to 1 for variable-length packed sequences.

Reasoning

In a typical RLHF dataset, response lengths vary dramatically (e.g., 50 to 2048 tokens). Without packing, a batch containing one 2048-token sequence and seven 50-token sequences pads every row to 2048 tokens, wasting roughly 85% of the batch's compute on padding (the seven short rows alone are ~98% padding). Packing concatenates all eight into a single 2398-token sequence, using Flash Attention's variable-length masking to keep the sequences isolated. The `use_cache=False` requirement exists because packed sequences have variable internal structure that is incompatible with KV-cache assumptions.
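The arithmetic behind this example is easy to check directly:

```python
# Back-of-envelope check: one 2048-token response plus seven 50-token
# responses, padded to the batch max versus packed into one row.
lengths = [2048] + [50] * 7
padded_tokens = len(lengths) * max(lengths)   # 8 * 2048 = 16384
useful_tokens = sum(lengths)                  # 2048 + 350 = 2398
waste = 1 - useful_tokens / padded_tokens     # ~0.854 wasted overall
short_rows_waste = 1 - (7 * 50) / (7 * 2048)  # ~0.976 wasted on short rows
```

Packing replaces those 16384 processed positions with a single 2398-token row, which is where the throughput gain comes from.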

Code evidence for packing flag from `openrlhf/cli/train_dpo.py:256-257`:

```python
# packing samples using Flash Attention2
parser.add_argument("--packing_samples", action="store_true", default=False)
```

Flash Attention enforcement from `openrlhf/cli/train_dpo.py:314-316`:

```python
if args.packing_samples and "flash_attention" not in args.attn_implementation:
    print("[Warning] Please use --attn_implementation with flash_attention...")
    args.attn_implementation = "flash_attention_2"
```

KV cache disabled for training from `openrlhf/models/model.py:151-152`:

```python
# https://github.com/huggingface/transformers/issues/26877
model.config.use_cache = False
```

Dynamic batch interaction from `openrlhf/utils/deepspeed/deepspeed.py:267-268`:

```python
if self.use_dynamic_batch:
    ds_config["train_micro_batch_size_per_gpu"] = 1
```
