Principle:Intel Ipex llm Pipeline Parallel Generation
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, NLP, Inference |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A method for generating text with a model whose layers are distributed across GPUs in a pipeline, with device synchronization for accurate timing and output collection on the last GPU rank.
Description
Pipeline Parallel Generation coordinates autoregressive token generation across multiple GPUs, each holding a subset of the model's layers. Each rank places the input tokens on its own device (xpu:{local_rank}); the first stage consumes them, activations flow through the remaining pipeline stages, and the output is collected only on the last rank (local_rank == gpu_num - 1). Because XPU work is queued asynchronously, the device must be synchronized before a wall-clock timestamp is read; otherwise the measured time does not cover all of the generation work. The model exposes timing attributes (first_token_time, rest_cost_mean) for performance analysis.
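To make the two timing attributes concrete, here is a minimal sketch of how such values could be derived from per-step durations. It assumes that first_token_time is the duration of the step that produces the first token and that rest_cost_mean is the mean duration of the remaining steps; the function name summarize_latency and the sample numbers are hypothetical and are not taken from the library.

```python
from statistics import mean


def summarize_latency(step_seconds):
    """Split per-step durations into a first-token cost and a mean rest-token cost.

    step_seconds[0] is assumed to be the step that yields the first token;
    the remaining entries are the later decode steps.
    """
    if not step_seconds:
        raise ValueError("need at least one generation step")
    first_token_time = step_seconds[0]
    rest = step_seconds[1:]
    # With a single generated token there is no later step to average.
    rest_cost_mean = mean(rest) if rest else 0.0
    return first_token_time, rest_cost_mean


# Hypothetical durations in seconds (made-up numbers, not measurements).
first, rest = summarize_latency([0.5, 0.25, 0.25, 0.25])
print(f"first token: {first:.3f}s, mean of rest: {rest:.3f}s")
```

Reporting the first token separately is useful because it includes processing the whole prompt, while later tokens are generated one at a time.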
Usage
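Use this after loading a pipeline-parallel model. On every rank, place input_ids on that rank's XPU device (xpu:{local_rank}), call torch.xpu.synchronize() before reading the timer so that all queued device work is included, and decode or print the output only on the last rank (local_rank == gpu_num - 1).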
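The calling pattern described above can be sketched without any hardware by passing the model call and the synchronization call in as plain callables. The helper name timed_generate and its parameters are hypothetical; on a real system `generate` would be the loaded model's generate method and `synchronize` would be torch.xpu.synchronize, with input_ids already on xpu:{local_rank}.

```python
import time


def timed_generate(generate, input_ids, synchronize, local_rank, gpu_num,
                   max_new_tokens=32):
    """Run one timed generation; return (output, seconds).

    generate:    callable standing in for the model's generate method
    synchronize: callable standing in for torch.xpu.synchronize
    The output is None on every rank except the last one.
    """
    synchronize()  # drain pending device work before starting the clock
    start = time.perf_counter()
    output = generate(input_ids, max_new_tokens=max_new_tokens)
    synchronize()  # wait for generation work before stopping the clock
    elapsed = time.perf_counter() - start

    # Every rank runs generate(), but only the last pipeline stage reports.
    if local_rank == gpu_num - 1:
        return output, elapsed
    return None, elapsed


# Stand-ins so the sketch runs anywhere (not real model or device calls).
def fake_generate(input_ids, max_new_tokens):
    return input_ids + list(range(max_new_tokens))


sync_calls = []
fake_sync = lambda: sync_calls.append("sync")

out_last, _ = timed_generate(fake_generate, [101, 102], fake_sync,
                             local_rank=1, gpu_num=2, max_new_tokens=3)
out_first, _ = timed_generate(fake_generate, [101, 102], fake_sync,
                              local_rank=0, gpu_num=2, max_new_tokens=3)
print(out_last, out_first)
```

Injecting the two callables keeps the timing and rank logic testable on a machine without XPUs.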
Theoretical Basis
# Abstract pipeline generation (NOT real implementation)
# Autoregressive generation with pipeline parallelism:
max_new_tokens = 32  # illustrative value
for token_idx in range(max_new_tokens):
    # Stage 0 (GPU 0): process input through the first layer group
    # Stage 1 (GPU 1): process activations through the second layer group
    # ...
    # Last stage: produce next-token logits
    # All stages synchronize once per token
    pass  # placeholder body so the outline is valid Python
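The loop above can be made concrete with a single-process toy simulation. This is only a sketch under stated assumptions: layers are plain Python functions on integers, stages get contiguous layer ranges split as evenly as possible, and the "next token" is the last stage's value modulo 97. The names split_layers and pipeline_generate are hypothetical, and nothing here reflects how the library actually partitions layers or communicates between devices.

```python
def split_layers(num_layers, gpu_num):
    """Assign contiguous layer index ranges to stages, as evenly as possible."""
    base, extra = divmod(num_layers, gpu_num)
    ranges, start = [], 0
    for rank in range(gpu_num):
        size = base + (1 if rank < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def pipeline_generate(prompt, layers, gpu_num, max_new_tokens):
    """Toy autoregressive loop: one pass through all stages per new token."""
    stages = split_layers(len(layers), gpu_num)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        hidden = tokens[-1]  # toy "embedding": just the last token id
        for layer_ids in stages:
            # Each stage applies its own layers, then hands `hidden` onward.
            for i in layer_ids:
                hidden = layers[i](hidden)
        # Only the last stage holds the final value; it becomes the next
        # token, which every stage needs before the next step can start.
        tokens.append(hidden % 97)
    return tokens


# Four toy "layers" split across two simulated stages.
toy_layers = [lambda h: h + 1, lambda h: h * 2, lambda h: h + 3, lambda h: h * 2]
result = pipeline_generate([5], toy_layers, gpu_num=2, max_new_tokens=3)
print(result)
```

Running the same toy model with gpu_num=1 gives the same tokens, which illustrates that partitioning changes where layers execute, not what they compute.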