Principle:Deepspeedai DeepSpeed RLHF Experience Generation
Overview
Generating text rollouts from the actor policy using optimized inference mode within the Hybrid Engine for RLHF experience collection.
Description
Experience generation is the inference phase of RLHF where the actor model generates text responses to prompts. The Hybrid Engine switches to inference mode via engine.eval(), which performs the following transitions:
- Gathers ZeRO-3 partitioned parameters: When using ZeRO Stage 3, model parameters are partitioned across data-parallel ranks. For inference, these must be gathered so each rank has the full model weights.
- Fuses LoRA adapters into base weights: If LoRA is active, the adapter weights (low-rank matrices A and B) are multiplied and added to the base weights, eliminating the overhead of separate adapter computation during generation.
- Replaces forward functions with optimized inference containers: Each transformer layer's forward method is swapped to use the DeepSpeed inference container, which applies fused kernels for attention, projection, and normalization.
- Optionally sets up tensor parallelism for generation: If inference_tp_size > 1, input tensors are gathered across TP ranks and the generation is distributed.
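The LoRA fusion step in the list above can be illustrated with a small, framework-free sketch. It is not DeepSpeed's implementation: the function names, the nested-list matrices, and the scaling argument are assumptions made for illustration. It only shows that folding scaling * (B @ A) into the base weight gives the same output as running the adapter separately.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(m: Matrix, x: List[float]) -> List[float]:
    """Matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in m]


def fuse_lora(weight: Matrix, lora_a: Matrix, lora_b: Matrix,
              scaling: float) -> Matrix:
    """Return weight + scaling * (lora_b @ lora_a).

    Assumed shapes: weight is (out, in), lora_a is (r, in), lora_b is (out, r).
    """
    delta = matmul(lora_b, lora_a)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


def unfused_forward(weight: Matrix, lora_a: Matrix, lora_b: Matrix,
                    scaling: float, x: List[float]) -> List[float]:
    """Base projection plus a separate adapter pass (the path fusion avoids)."""
    base = matvec(weight, x)
    adapter = matvec(lora_b, matvec(lora_a, x))
    return [b + scaling * a for b, a in zip(base, adapter)]


# Toy example: out=2, in=3, rank=1 (all values are illustrative).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]
B = [[0.5], [-1.0]]
x = [1.0, 1.0, 1.0]

fused = fuse_lora(W, A, B, scaling=2.0)
fused_out = matvec(fused, x)
unfused_out = unfused_forward(W, A, B, 2.0, x)
```

With the fused weight, generation needs one matrix product per projection instead of three, which is the overhead the fusion step removes.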
After the mode switch, the engine.generate() method produces text using HuggingFace's generation API with DeepSpeed optimizations active. The generated sequences, along with their log-probabilities, form the "experience" that is used for the subsequent PPO policy update.
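As a rough sketch of how per-token log-probabilities can be obtained for a generated sequence, the snippet below applies a log-softmax to each step's logits and picks out the entry for the token that was actually generated. It is a plain-Python illustration under assumed inputs (a list of per-step logit vectors and the generated token ids); the helper names are hypothetical and are not part of the DeepSpeed or HuggingFace APIs.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]


def gather_token_log_probs(step_logits: Sequence[Sequence[float]],
                           token_ids: Sequence[int]) -> List[float]:
    """Log-probability of each generated token under its step's distribution."""
    if len(step_logits) != len(token_ids):
        raise ValueError("need exactly one logit vector per generated token")
    return [log_softmax(logits)[tok]
            for logits, tok in zip(step_logits, token_ids)]


# Toy example with a 2-token vocabulary and a 2-token response.
example_logits = [[0.0, 0.0], [math.log(3.0), 0.0]]
example_tokens = [1, 0]
example_log_probs = gather_token_log_probs(example_logits, example_tokens)
```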
The quality and speed of experience generation directly affect RLHF training efficiency. Faster generation yields more experience per unit of wall-clock time, which allows more PPO updates in the same time budget and therefore faster convergence in wall-clock terms. The Hybrid Engine's optimized inference path speeds up generation relative to plain PyTorch generation through kernel fusion and efficient memory management.
Theoretical Basis
In PPO for RLHF, the actor policy generates rollouts (text sequences) that are scored by the reward model. The generation process samples from the actor's conditional distribution:
y_t ~ pi_theta(. | x, y_<t)
where pi_theta is the actor policy parameterized by theta, x is the input prompt, and y_<t is the text generated so far. The generation typically uses sampling with temperature and top-p (nucleus sampling) to balance diversity and quality.
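A minimal sketch of one sampling step with temperature and top-p filtering is given below. It uses only the Python standard library and is meant to illustrate the idea, not to reproduce the exact filtering logic of any generation library. The function names and the toy logits are assumptions.

```python
import math
import random
from typing import List, Sequence


def top_p_probs(logits: Sequence[float], temperature: float = 1.0,
                top_p: float = 1.0) -> List[float]:
    """Temperature-scaled softmax restricted to the smallest set of tokens
    whose cumulative probability reaches top_p, then renormalized."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Visit tokens from most to least likely until the nucleus is filled.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / kept_mass
    return filtered


def sample_token(logits: Sequence[float], temperature: float, top_p: float,
                 rng: random.Random) -> int:
    """Draw y_t from the filtered distribution."""
    probs = top_p_probs(logits, temperature, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy distribution with probabilities 0.5, 0.3, 0.2 at temperature 1.
toy_logits = [math.log(0.5), math.log(0.3), math.log(0.2)]
nucleus = top_p_probs(toy_logits, temperature=1.0, top_p=0.7)
rng = random.Random(0)
draws = [sample_token(toy_logits, 1.0, 0.7, rng) for _ in range(200)]
```

Lower temperatures sharpen the distribution before filtering, and smaller top_p values shrink the nucleus. Both reduce diversity in exchange for more conservative outputs.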
The generated sequences are then scored by the reward model to produce reward signals, and the reference model's log-probabilities are computed for the KL penalty term. Together, these form the complete experience tuple (prompt, response, reward, log_probs, ref_log_probs) used for the PPO update.
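One common way to combine these quantities is to give every response token a reward equal to a negative KL-style penalty and to add the (clipped) reward-model score at the final token. The sketch below assumes that formulation. The Experience container, the function name, and the kl_coef and clip_value parameters are hypothetical and chosen for illustration; they are not taken from the DeepSpeed source.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Experience:
    """Illustrative container mirroring the tuple described above."""
    prompt: str
    response: str
    reward: float               # scalar score from the reward model
    log_probs: List[float]      # actor log-probs of the response tokens
    ref_log_probs: List[float]  # reference-model log-probs of the same tokens


def kl_shaped_rewards(exp: Experience, kl_coef: float = 0.1,
                      clip_value: float = 5.0) -> List[float]:
    """Per-token rewards: KL penalty everywhere, clipped score on the last token."""
    if not exp.log_probs or len(exp.log_probs) != len(exp.ref_log_probs):
        raise ValueError("log_probs and ref_log_probs must be non-empty and aligned")
    # The per-token log-ratio serves as a sample-based KL estimate.
    rewards = [-kl_coef * (lp - ref_lp)
               for lp, ref_lp in zip(exp.log_probs, exp.ref_log_probs)]
    clipped_score = max(-clip_value, min(clip_value, exp.reward))
    rewards[-1] += clipped_score
    return rewards


# Toy example (all numbers are made up for illustration).
toy = Experience(prompt="Hi", response="Hello there", reward=10.0,
                 log_probs=[-1.0, -2.0], ref_log_probs=[-1.5, -2.0])
toy_rewards = kl_shaped_rewards(toy, kl_coef=0.1, clip_value=5.0)
```

The penalty term discourages the actor from drifting far from the reference model, while the terminal score carries the reward model's judgment of the whole response.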
Optimized inference (kernel fusion, tensor parallelism) reduces generation latency, which is critical because generation is typically the bottleneck in the RLHF training loop, often consuming 50% or more of total iteration time.
References
- Proximal Policy Optimization Algorithms — https://arxiv.org/abs/1707.06347
- InstructGPT: Training language models to follow instructions with human feedback — https://arxiv.org/abs/2203.02155
Knowledge Sources
- https://github.com/deepspeedai/DeepSpeed
- https://arxiv.org/abs/1707.06347
- https://arxiv.org/abs/2203.02155
- https://arxiv.org/abs/2308.01320
Last updated: 2026-02-09 00:00 GMT