Implementation:Deepspeedai DeepSpeed HybridEngine Generate
Overview
A concrete tool, provided by the DeepSpeed library, for generating text in the Hybrid Engine's optimized inference mode during RLHF experience collection.
Description
DeepSpeedHybridEngine.eval() transitions the engine to inference mode by performing parameter gathering, LoRA fusion, and inference container activation. DeepSpeedHybridEngine.generate() performs text generation using HuggingFace's generation API with DeepSpeed inference optimizations. It handles ZeRO-3 parameter gathering and, when inference_tp_size > 1, the tensor-parallel gather.
The eval() method (L381-422) performs the mode switch:
- Logs performance statistics from the previous iteration (E2E latency, gather time, generate time, training time).
- Calls the parent DeepSpeedEngine.eval() to set the model to evaluation mode.
- For each transformer layer with an inference container, replaces the layer's forward function with the inference container's optimized forward. If ZeRO-3 is active without pinned parameters, a special _zero3_forward wrapper is used that gathers parameters on-the-fly per layer.
- Calls transform_for_inference() on each container to prepare kernel state.
- For non-transformer layers (embeddings, layer norms), replaces forward functions with optimized inference layer equivalents.
- Triggers garbage collection and cache clearing for ZeRO-3 to free partitioned parameter memory.
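The forward-swapping step in the list above can be illustrated with a small, framework-free sketch. This is not DeepSpeed's code: ToyLayer, ToyInferenceContainer, and switch_to_eval are hypothetical names invented for illustration, and only the idea (rebinding a layer's forward to its container's forward, then preparing the container) comes from the description above.

```python
class ToyLayer:
    """Stand-in for a transformer layer (hypothetical, not a DeepSpeed class)."""

    def __init__(self, scale):
        self.scale = scale
        self.training = True

    def forward(self, x):
        # Training path: tagged so the caller can see which path ran.
        return ("train", x * self.scale)


class ToyInferenceContainer:
    """Stand-in for an inference container wrapping one layer (hypothetical)."""

    def __init__(self, layer):
        self.layer = layer
        self.ready = False

    def transform_for_inference(self):
        # Placeholder for preparing kernel state before generation.
        self.ready = True

    def forward(self, x):
        # Inference path: same arithmetic, different tag.
        return ("inference", x * self.layer.scale)


def switch_to_eval(layers, containers):
    """Mark layers as evaluating and rebind forward where a container exists."""
    for layer, container in zip(layers, containers):
        layer.training = False
        if container is not None:
            # The instance attribute shadows the class-level forward method.
            layer.forward = container.forward
            container.transform_for_inference()


layers = [ToyLayer(2), ToyLayer(3)]
containers = [ToyInferenceContainer(layers[0]), None]
switch_to_eval(layers, containers)
print(layers[0].forward(5))  # ('inference', 10)
print(layers[1].forward(5))  # ('train', 15)
```

Only the first toy layer has a container, so only its forward is replaced; the second keeps its original forward, mirroring the "for each layer with an inference container" condition.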
The generate() method (L168-272) handles text generation:
- Records the total batch size for throughput logging.
- For ZeRO-3 with pinned parameters and TP, gathers parameters in partition groups, fuses LoRA weights, applies tensor parallelism slicing, then calls HuggingFace's model.generate().
- For ZeRO-3 with pinned parameters without TP, gathers all parameters at once, fuses LoRA, and generates.
- For non-ZeRO-3 or non-pinned cases, fuses LoRA if applicable and generates directly.
- Optionally releases inference cache memory after generation.
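The branching described in this list can be summarized as a small decision function. The sketch below is an assumption-laden paraphrase of the bullets, not the library's implementation: plan_generate and its parameters are hypothetical, and the returned strings are just labels for the steps named above.

```python
def plan_generate(zero3_enabled, pin_parameters, inference_tp_size,
                  has_lora=True, release_inference_cache=False):
    """Return the ordered step labels for one generate() call (illustrative only)."""
    steps = []
    if zero3_enabled and pin_parameters:
        if inference_tp_size > 1:
            # ZeRO-3, pinned parameters, tensor parallelism enabled.
            steps.append("gather parameters in partition groups")
            if has_lora:
                steps.append("fuse LoRA")
            steps.append("slice for tensor parallelism")
        else:
            # ZeRO-3, pinned parameters, no tensor parallelism.
            steps.append("gather all parameters at once")
            if has_lora:
                steps.append("fuse LoRA")
    elif has_lora:
        # Non-ZeRO-3 or non-pinned case: only LoRA fusion, if applicable.
        steps.append("fuse LoRA")
    steps.append("model.generate()")
    if release_inference_cache:
        steps.append("release inference cache")
    return steps


print(plan_generate(True, True, 2))
print(plan_generate(True, True, 1))
print(plan_generate(False, False, 1, has_lora=False))
```

Writing the branches out this way makes it easy to see that every path ends in the same model.generate() call; only the preparation before it differs.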
Code Reference
| Property | Value |
|---|---|
| Repository | https://github.com/deepspeedai/DeepSpeed |
| File | deepspeed/runtime/hybrid_engine.py |
| Lines | L381-422 (eval), L168-272 (generate) |
| eval signature | def eval(self) -> None |
| generate signature | def generate(self, *inputs, **kwargs) -> torch.Tensor |
| Import | Accessed via engine returned by deepspeed.initialize() |
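Since the engine is obtained from deepspeed.initialize() rather than imported directly, the Hybrid Engine is typically switched on through the DeepSpeed config. The fragment below is a sketch: the hybrid_engine key names follow the configuration section as commonly used in DeepSpeed-Chat, but the values are illustrative and the keys should be checked against the installed DeepSpeed version. The commented-out initialize call assumes a model object defined elsewhere.

```python
# Illustrative DeepSpeed config with the hybrid engine enabled (values are assumptions).
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "zero_optimization": {"stage": 3},
    "hybrid_engine": {
        "enabled": True,
        "max_out_tokens": 512,
        "inference_tp_size": 1,            # > 1 enables the tensor-parallel path
        "release_inference_cache": False,  # True frees the cache after generate()
        "pin_parameters": True,            # selects the "pinned parameters" branches
        "tp_gather_partition_size": 8,
    },
}

# Assumed usage, not executed here because it needs a model and a GPU environment:
# import deepspeed
# engine, *_ = deepspeed.initialize(model=model, config=ds_config)
```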
I/O Contract
Inputs (eval)
| Name | Type | Required | Description |
|---|---|---|---|
| (none) | — | — | Switches engine to inference mode; no arguments required |
Inputs (generate)
| Name | Type | Required | Description |
|---|---|---|---|
| input_ids | torch.Tensor | Yes | Tokenized prompt input IDs |
| attention_mask | torch.Tensor | No | Attention mask for padded inputs |
| max_new_tokens | int | No | Maximum number of tokens to generate |
| temperature | float | No | Sampling temperature |
| top_p | float | No | Nucleus sampling probability threshold |
| do_sample | bool | No | Whether to use sampling (True) or greedy decoding (False) |
Outputs
| Name | Type | Description |
|---|---|---|
| generated_sequences | torch.Tensor | Generated token sequences including prompt tokens |
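Because the returned sequences include the prompt tokens, RLHF pipelines usually need to separate the response from the prompt. A minimal sketch using plain Python lists follows; split_responses is a hypothetical helper, and it assumes every prompt in the batch was padded to the same length prompt_len. With a torch.Tensor output the equivalent slice would be generated[:, prompt_len:].

```python
def split_responses(generated, prompt_len):
    """Drop the first prompt_len token IDs from every generated sequence."""
    return [seq[prompt_len:] for seq in generated]


# Two toy sequences: 4 prompt tokens followed by 2 generated tokens each.
generated = [
    [101, 7, 8, 9, 42, 43],
    [101, 5, 6, 0, 77, 78],
]
responses = split_responses(generated, prompt_len=4)
print(responses)  # [[42, 43], [77, 78]]
```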
Usage Example
# Switch to inference mode
engine.eval()
# Generate experience rollouts
prompts = tokenizer(prompt_texts, return_tensors="pt", padding=True)
generated = engine.generate(
    input_ids=prompts.input_ids.to(engine.device),
    attention_mask=prompts.attention_mask.to(engine.device),
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# Decode generated sequences
responses = tokenizer.batch_decode(generated, skip_special_tokens=True)
Last updated: 2026-02-09 00:00 GMT