Principle:Intel Ipex llm Pipeline Parallel Generation

Knowledge Sources	IPEX-LLM
Domains	Distributed_Computing, NLP, Inference
Last Updated	2026-02-09 00:00 GMT

Overview

Method for generating text with a model whose layers are distributed across GPUs in a pipeline, with explicit device synchronization and output collection on the last GPU rank.

Description

Pipeline Parallel Generation coordinates autoregressive token generation across multiple GPUs among which the model's layers are distributed. Input tokens enter the pipeline at the first stage, activations flow through the remaining stages, and the output is collected only on the last rank (local_rank == gpu_num - 1). Because XPU work is queued asynchronously, the device must be synchronized before a timestamp is read; otherwise the measured time can be inaccurate. The model exposes timing attributes (first_token_time, rest_cost_mean) for performance analysis.
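To make the layer distribution and the last-rank rule concrete, here is a minimal sketch in plain Python. It assumes an even, contiguous split of layers across ranks, which is an illustrative choice and not necessarily how IPEX-LLM assigns layers; partition_layers and collects_output are hypothetical helper names, not part of the library.

```python
def partition_layers(num_layers: int, gpu_num: int) -> list[range]:
    """Split layer indices into gpu_num contiguous groups (assumed even split)."""
    if gpu_num <= 0 or num_layers < gpu_num:
        raise ValueError("need at least one layer per pipeline stage")
    base, extra = divmod(num_layers, gpu_num)
    groups, start = [], 0
    for rank in range(gpu_num):
        # The first `extra` ranks take one additional layer each.
        size = base + (1 if rank < extra else 0)
        groups.append(range(start, start + size))
        start += size
    return groups


def collects_output(local_rank: int, gpu_num: int) -> bool:
    """Condition from the text: only the last rank collects the output."""
    return local_rank == gpu_num - 1


for rank, layers in enumerate(partition_layers(num_layers=32, gpu_num=4)):
    print(rank, f"layers {layers.start}-{layers.stop - 1}", collects_output(rank, 4))
```

With 32 layers and 4 ranks, each rank holds 8 layers in this sketch, and only rank 3 satisfies the output condition.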

Usage

Use this after loading a pipeline-parallel model. On each rank, place input_ids on that rank's XPU device (xpu:{local_rank}), call torch.xpu.synchronize() before reading any timestamp so that queued device work is included in the measurement, and collect the output only on the last rank.
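The usage rules above can be sketched as two small helpers. This is an illustration under stated assumptions, not the IPEX-LLM implementation: timed_generate and output_on_last_rank are hypothetical names, and the synchronize callable is passed in so the sketch runs without an XPU. On real hardware it would presumably be torch.xpu.synchronize, with model being the loaded pipeline-parallel model.

```python
import time
from typing import Any, Callable, Optional, Tuple


def timed_generate(
    model: Any,
    input_ids: Any,
    max_new_tokens: int,
    synchronize: Callable[[], None],
) -> Tuple[Any, float]:
    """Run model.generate and return (output, wall-clock seconds)."""
    synchronize()  # drain earlier device work so it is not billed to this run
    start = time.perf_counter()
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    synchronize()  # wait for queued device work before reading the clock
    return output, time.perf_counter() - start


def output_on_last_rank(output: Any, local_rank: int, gpu_num: int) -> Optional[Any]:
    """Return the output on the last rank only; every other rank gets None."""
    return output if local_rank == gpu_num - 1 else None


class EchoModel:
    """Hypothetical stand-in for a loaded model; appends zeros as 'new tokens'."""

    def generate(self, input_ids, max_new_tokens):
        return list(input_ids) + [0] * max_new_tokens


# On an XPU system the inputs would first be moved with
# input_ids.to(f"xpu:{local_rank}") and synchronize would be torch.xpu.synchronize.
output, seconds = timed_generate(EchoModel(), [1, 2], 3, synchronize=lambda: None)
print(output_on_last_rank(output, local_rank=1, gpu_num=2), f"{seconds:.6f} s")
```

Passing the synchronization function as a parameter keeps the timing logic testable on machines without an XPU.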

Theoretical Basis

# Abstract pipeline generation (NOT the real implementation)
# Autoregressive generation with pipeline parallelism:
max_new_tokens = 32  # illustrative value
for token_idx in range(max_new_tokens):
    # Stage 0 (GPU 0): process the input through the first layer group
    # Stage 1 (GPU 1): process activations through the second layer group
    # ...
    # Last stage: produce next-token logits
    # All stages synchronize once per token
    pass  # placeholder so the sketch is valid Python
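The abstract loop above can be turned into a runnable toy that executes in a single process. Everything here is a stand-in chosen for illustration: make_stage replaces a GPU's layer group with a simple arithmetic shift, last_stage_next_token replaces the LM head and token selection with a modulo, and the whole sequence is recomputed on every step rather than using a KV cache. None of these names come from IPEX-LLM.

```python
from typing import Callable, List

Stage = Callable[[List[int]], List[int]]


def make_stage(offset: int) -> Stage:
    """Hypothetical stand-in for one GPU's layer group: shifts every activation."""
    return lambda activations: [a + offset for a in activations]


def last_stage_next_token(activations: List[int], vocab_size: int) -> int:
    """Hypothetical stand-in for the LM head plus next-token selection."""
    return sum(activations) % vocab_size


def generate(prompt: List[int], stages: List[Stage], max_new_tokens: int,
             vocab_size: int = 50) -> List[int]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        activations = tokens              # the first stage receives the token ids
        for stage in stages:              # hand activations from stage to stage
            activations = stage(activations)
        # Only the last stage yields a token; it feeds the next iteration.
        tokens.append(last_stage_next_token(activations, vocab_size))
    return tokens


print(generate([1, 2, 3], [make_stage(1), make_stage(2)], max_new_tokens=2))
# prints [1, 2, 3, 15, 33]
```

The inner loop shows why each new token requires a full pass through every stage: the next token is not known until the last stage has finished, which is the per-token dependency the pseudocode describes.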

Related Pages

Implemented By

Implementation:Intel_Ipex_llm_Model_Generate_PP

Uses Heuristic

Heuristic:Intel_Ipex_llm_Use_Cache_Training_Vs_Inference
