

Implementation:Deepspeedai DeepSpeed HybridEngine Generate

From Leeroopedia


Overview

A concrete tool, provided by the DeepSpeed library, for generating text in the Hybrid Engine's optimized inference mode during RLHF experience collection.

Description

DeepSpeedHybridEngine.eval() switches the engine to inference mode by gathering parameters, fusing LoRA weights, and activating the inference containers. DeepSpeedHybridEngine.generate() performs text generation through HuggingFace's generation API with DeepSpeed inference optimizations applied. It handles ZeRO-3 parameter gathering and, when inference_tp_size > 1, the tensor-parallel gather.

The eval() method (L381-422) performs the mode switch:

  • Logs performance statistics from the previous iteration (E2E latency, gather time, generate time, training time).
  • Calls the parent DeepSpeedEngine.eval() to set the model to evaluation mode.
  • For each transformer layer that has an inference container, replaces the layer's forward function with the container's optimized forward. If ZeRO-3 is active without pinned parameters, a _zero3_forward wrapper is used instead; it gathers each layer's parameters on the fly.
  • Calls transform_for_inference() on each container to prepare kernel state.
  • For non-transformer layers (embeddings, layer norms), replaces forward functions with optimized inference layer equivalents.
  • Triggers garbage collection and cache clearing for ZeRO-3 to free partitioned parameter memory.
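
The forward-swapping step in the list above can be illustrated with a small toy model. This is a minimal sketch, not DeepSpeed's code: ToyLayer, ToyContainer, and ToyHybridEngine are hypothetical names, the "optimized" forward just returns a string, and restoring the original forwards in train() is an assumption made so the sketch is symmetric.

```python
import gc


class ToyLayer:
    """Stand-in for a transformer layer (hypothetical, not a DeepSpeed class)."""

    def forward(self, x):
        return f"train-forward({x})"


class ToyContainer:
    """Stand-in for an inference container that owns an optimized forward."""

    def __init__(self):
        self.prepared = False

    def transform_for_inference(self):
        # In this sketch, "preparing kernel state" is only a flag.
        self.prepared = True

    def forward(self, x):
        return f"inference-forward({x})"


class ToyHybridEngine:
    def __init__(self, layers):
        self.layers = layers
        self.containers = [ToyContainer() for _ in layers]
        # Keep the original forwards so train() can restore them (assumption).
        self._orig_forwards = [layer.forward for layer in layers]
        self.training = True

    def eval(self):
        self.training = False
        for layer, container in zip(self.layers, self.containers):
            layer.forward = container.forward  # swap in the optimized path
            container.transform_for_inference()
        gc.collect()  # mirrors the cleanup step; a real engine also clears caches

    def train(self):
        self.training = True
        for layer, orig in zip(self.layers, self._orig_forwards):
            layer.forward = orig
```

Because the swap happens on the layer instance, callers keep invoking layer.forward() and never need to know which mode is active.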

The generate() method (L168-272) handles text generation:

  • Records the total batch size for throughput logging.
  • For ZeRO-3 with pinned parameters and TP, gathers parameters in partition groups, fuses LoRA weights, applies tensor parallelism slicing, then calls HuggingFace's model.generate().
  • For ZeRO-3 with pinned parameters without TP, gathers all parameters at once, fuses LoRA, and generates.
  • For non-ZeRO-3 or non-pinned cases, fuses LoRA if applicable and generates directly.
  • Optionally releases inference cache memory after generation.
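
The three generation paths listed above amount to a dispatch on the ZeRO stage, parameter pinning, and the tensor-parallel size. The sketch below only models that decision; choose_generate_path and the step names are hypothetical labels for illustration, not functions in hybrid_engine.py.

```python
def choose_generate_path(zero_stage, pin_parameters, inference_tp_size):
    """Return the ordered steps for one generate() call (illustrative only)."""
    if zero_stage == 3 and pin_parameters:
        if inference_tp_size > 1:
            # ZeRO-3, pinned parameters, tensor parallelism enabled.
            return [
                "gather_in_partition_groups",
                "fuse_lora",
                "slice_for_tp",
                "hf_generate",
            ]
        # ZeRO-3, pinned parameters, no tensor parallelism.
        return ["gather_all_params", "fuse_lora", "hf_generate"]
    # Non-ZeRO-3, or ZeRO-3 without pinned parameters.
    return ["fuse_lora_if_applicable", "hf_generate"]
```

Every path ends in the same HuggingFace generate call; only the preparation before it differs.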

Code Reference

Property | Value
Repository | https://github.com/deepspeedai/DeepSpeed
File | deepspeed/runtime/hybrid_engine.py
Lines | L381-422 (eval), L168-272 (generate)
eval signature | def eval(self) -> None
generate signature | def generate(self, *inputs, **kwargs) -> torch.Tensor
Import | Accessed via engine returned by deepspeed.initialize()
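
Since the engine is obtained from deepspeed.initialize(), the Hybrid Engine is enabled through the DeepSpeed config. The following is a sketch of such a config, assuming the hybrid_engine section keys as they appear in DeepSpeed's configuration schema; all values are placeholders, and the initialize call is left commented out because model is not defined here.

```python
# Placeholder values throughout; tune them for the actual model and hardware.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "zero_optimization": {"stage": 3},
    "hybrid_engine": {
        "enabled": True,
        "max_out_tokens": 512,             # upper bound on generated tokens
        "inference_tp_size": 1,            # > 1 enables tensor-parallel inference
        "release_inference_cache": False,  # free the inference cache after generate()
        "pin_parameters": True,            # selects the pinned-parameter gather paths
        "tp_gather_partition_size": 8,     # layers gathered per partition group
    },
}

# Assumed usage (requires deepspeed and a model object):
# engine, *_ = deepspeed.initialize(model=model, config=ds_config)
```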

I/O Contract

Inputs (eval)

Name | Type | Required | Description
(none) | | | Switches the engine to inference mode; no arguments required

Inputs (generate)

Name | Type | Required | Description
input_ids | torch.Tensor | Yes | Tokenized prompt input IDs
attention_mask | torch.Tensor | No | Attention mask for padded inputs
max_new_tokens | int | No | Maximum number of tokens to generate
temperature | float | No | Sampling temperature
top_p | float | No | Nucleus sampling probability threshold
do_sample | bool | No | Whether to use sampling (True) or greedy decoding (False)

Outputs

Name | Type | Description
generated_sequences | torch.Tensor | Generated token sequences including prompt tokens
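
Because the returned sequences include the prompt tokens, RLHF code often needs only the newly generated part. A minimal sketch, assuming every prompt in the batch was padded to the same length prompt_len (for example input_ids.shape[1]); strip_prompt is a hypothetical helper, shown on plain lists so it runs without torch.

```python
def strip_prompt(generated, prompt_len):
    """Drop the first prompt_len tokens from every sequence in the batch."""
    return [row[prompt_len:] for row in generated]


# Two sequences with a 3-token prompt followed by generated tokens.
batch = [
    [11, 12, 13, 101, 102],
    [21, 22, 23, 201, 202],
]
responses_only = strip_prompt(batch, prompt_len=3)
```

For a 2-D torch.Tensor the equivalent slice is generated[:, prompt_len:].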

Usage Example

# Assumes `engine` (returned by deepspeed.initialize()), `tokenizer`,
# and `prompt_texts` (a list of strings) are already defined.

# Switch to inference mode
engine.eval()

# Generate experience rollouts
prompts = tokenizer(prompt_texts, return_tensors="pt", padding=True)
generated = engine.generate(
    input_ids=prompts.input_ids.to(engine.device),
    attention_mask=prompts.attention_mask.to(engine.device),
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Decode generated sequences (each one still contains its prompt tokens)
responses = tokenizer.batch_decode(generated, skip_special_tokens=True)

Last updated: 2026-02-09 00:00 GMT
