Implementation:Deepspeedai DeepSpeed HybridEngine Generate

Overview

A concrete tool, provided by the DeepSpeed library, for generating text with the Hybrid Engine's optimized inference mode during RLHF experience collection.

Description

DeepSpeedHybridEngine.eval() switches the engine to inference mode by activating the inference containers and preparing their kernel state. DeepSpeedHybridEngine.generate() then performs text generation through HuggingFace's generation API with DeepSpeed inference optimizations applied. Before generating, it gathers ZeRO-3 partitioned parameters, fuses LoRA weights, and, when inference_tp_size > 1, performs the tensor-parallel gather.

The eval() method (L381-422) performs the mode switch:

Logs performance statistics from the previous iteration (E2E latency, gather time, generate time, training time).
Calls the parent DeepSpeedEngine.eval() to set the model to evaluation mode.
For each transformer layer with an inference container, replaces the layer's forward function with the inference container's optimized forward. If ZeRO-3 is active without pinned parameters, a special _zero3_forward wrapper is used that gathers parameters on-the-fly per layer.
Calls transform_for_inference() on each container to prepare kernel state.
For non-transformer layers (embeddings, layer norms), replaces forward functions with optimized inference layer equivalents.
Triggers garbage collection and cache clearing for ZeRO-3 to free partitioned parameter memory.
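The forward-swapping step above can be illustrated with a toy model. This is a minimal sketch in plain Python, not DeepSpeed's code: ToyLayer, ToyContainer, and ToyHybridEngine are hypothetical stand-ins, and only the method name transform_for_inference() is taken from the description above.

```python
class ToyLayer:
    """Hypothetical stand-in for a transformer layer (not a DeepSpeed class)."""

    def __init__(self, name):
        self.name = name
        self.forward = self._train_forward  # default: training path

    def _train_forward(self, x):
        return f"{self.name}:train({x})"


class ToyContainer:
    """Hypothetical stand-in for an inference container with an optimized forward."""

    def __init__(self, layer):
        self.layer = layer
        self.prepared = False

    def transform_for_inference(self):
        # Placeholder for preparing kernel state.
        self.prepared = True

    def forward(self, x):
        return f"{self.layer.name}:infer({x})"


class ToyHybridEngine:
    """Hypothetical engine that swaps forward functions on mode changes."""

    def __init__(self, layers):
        self.layers = layers
        self.containers = [ToyContainer(layer) for layer in layers]
        self.training = True

    def eval(self):
        self.training = False
        for layer, container in zip(self.layers, self.containers):
            layer.forward = container.forward   # route calls to the optimized path
            container.transform_for_inference()

    def train(self):
        self.training = True
        for layer in self.layers:
            layer.forward = layer._train_forward  # restore the training path


toy_engine = ToyHybridEngine([ToyLayer("h0"), ToyLayer("h1")])
before = toy_engine.layers[0].forward("x")
toy_engine.eval()
after = toy_engine.layers[0].forward("x")
toy_engine.train()
restored = toy_engine.layers[0].forward("x")
```

Swapping the bound forward function, rather than replacing the layer object, keeps the same module instance on both paths in this sketch, so a mode change only redirects calls.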

The generate() method (L168-272) handles text generation:

Records the total batch size for throughput logging.
For ZeRO-3 with pinned parameters and TP, gathers parameters in partition groups, fuses LoRA weights, applies tensor parallelism slicing, then calls HuggingFace's model.generate().
For ZeRO-3 with pinned parameters without TP, gathers all parameters at once, fuses LoRA, and generates.
For non-ZeRO-3 or non-pinned cases, fuses LoRA if applicable and generates directly.
Optionally releases inference cache memory after generation.
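The branching described above can be summarized as a small decision function. This is an illustrative sketch only: plan_generate and its parameters are hypothetical names, the step strings paraphrase the list above, and treating LoRA fusion as conditional in every branch is an assumption.

```python
def plan_generate(zero_stage, pin_parameters, inference_tp_size,
                  has_lora=True, release_inference_cache=False):
    """Return the ordered steps for one generate() call (illustrative only)."""
    steps = ["record total batch size"]
    zero3_pinned = zero_stage == 3 and pin_parameters

    if zero3_pinned and inference_tp_size > 1:
        # ZeRO-3 with pinned parameters and tensor parallelism.
        steps.append("gather parameters in partition groups")
        if has_lora:
            steps.append("fuse LoRA")
        steps.append("apply tensor-parallel slicing")
    elif zero3_pinned:
        # ZeRO-3 with pinned parameters, no tensor parallelism.
        steps.append("gather all parameters at once")
        if has_lora:
            steps.append("fuse LoRA")
    else:
        # Non-ZeRO-3 or non-pinned case.
        if has_lora:
            steps.append("fuse LoRA")

    steps.append("call model.generate()")
    if release_inference_cache:
        steps.append("release inference cache")
    return steps


tp_plan = plan_generate(zero_stage=3, pin_parameters=True, inference_tp_size=2)
plain_plan = plan_generate(zero_stage=2, pin_parameters=False,
                           inference_tp_size=1, has_lora=False)
```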

Code Reference

Property	Value
Repository	https://github.com/deepspeedai/DeepSpeed
File	`deepspeed/runtime/hybrid_engine.py`
Lines	L381-422 (`eval`), L168-272 (`generate`)
eval signature	`def eval(self) -> None`
generate signature	`def generate(self, *inputs, **kwargs) -> torch.Tensor`
Import	Accessed via engine returned by `deepspeed.initialize()`
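A configuration along the following lines is one way the Hybrid Engine might be enabled before calling deepspeed.initialize(). This is a hedged sketch: the key names under "hybrid_engine" follow the DeepSpeed configuration documentation as best understood here and should be checked against the installed version, and all values are illustrative assumptions.

```python
# Illustrative DeepSpeed config expressed as a Python dict.
# Key names should be verified against the DeepSpeed version in use;
# the values are placeholders, not recommendations.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "zero_optimization": {"stage": 3},
    "hybrid_engine": {
        "enabled": True,
        "max_out_tokens": 512,
        "inference_tp_size": 1,           # > 1 enables the tensor-parallel path
        "release_inference_cache": False,  # free inference cache after generate()
        "pin_parameters": True,
        "tp_gather_partition_size": 8,
    },
}

# With DeepSpeed installed and a model defined, the engine would then be
# obtained roughly as follows (left commented out so the sketch runs alone):
# engine, *_ = deepspeed.initialize(model=model, config=ds_config)
```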

I/O Contract

Inputs (eval)

Name	Type	Required	Description
(none)	—	—	Switches engine to inference mode; no arguments required

Inputs (generate)

Name	Type	Required	Description
input_ids	torch.Tensor	Yes	Tokenized prompt input IDs
attention_mask	torch.Tensor	No	Attention mask for padded inputs
max_new_tokens	int	No	Maximum number of tokens to generate
temperature	float	No	Sampling temperature
top_p	float	No	Nucleus sampling probability threshold
do_sample	bool	No	Whether to use sampling (True) or greedy decoding (False)

Outputs

Name	Type	Description
generated_sequences	torch.Tensor	Generated token sequences including prompt tokens
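Because the returned sequences include the prompt tokens, callers that want only the response usually slice the prompt off. The sketch below uses plain Python lists so it runs without a GPU; strip_prompt is a hypothetical helper, and it assumes every row shares the same (padded) prompt length. With a torch.Tensor the equivalent is generated[:, prompt_len:].

```python
def strip_prompt(generated, prompt_len):
    """Drop the first prompt_len tokens from every generated sequence."""
    return [seq[prompt_len:] for seq in generated]


# Two sequences whose prompts were padded to a length of 3 tokens.
example_generated = [
    [11, 12, 13, 101, 102],
    [0, 21, 22, 201, 202],
]
responses_only = strip_prompt(example_generated, prompt_len=3)
```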

Usage Example

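# Assumes `engine` (returned by deepspeed.initialize()), `tokenizer`,
# and `prompt_texts` (a list of strings) already exist.

# Switch to inference mode
engine.eval()

# Tokenize the prompts; padding=True pads the batch to a common length
prompts = tokenizer(prompt_texts, return_tensors="pt", padding=True)

# Generate experience rollouts
generated = engine.generate(
    input_ids=prompts.input_ids.to(engine.device),
    attention_mask=prompts.attention_mask.to(engine.device),
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Decode generated sequences (each one includes its prompt tokens)
responses = tokenizer.batch_decode(generated, skip_special_tokens=True)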

Related Pages

Principle:Deepspeedai_DeepSpeed_RLHF_Experience_Generation

Knowledge Sources

Last updated: 2026-02-09 00:00 GMT
