Principle:Intel Ipex llm Pipeline Parallel Model Loading

Knowledge Sources	IPEX-LLM
Domains	Distributed_Computing, NLP, Model_Loading
Last Updated	2026-02-09 00:00 GMT

Overview

Technique for loading a quantized language model with its layers automatically distributed across multiple Intel GPUs.

Description

Pipeline Parallel Model Loading uses the pipeline_parallel_stages parameter to automatically partition a model's transformer layers across multiple Intel XPU devices. Combined with low-bit quantization (SYM_INT4 by default), this enables inference on models whose weights do not fit in a single GPU's memory. Each GPU holds a contiguous subset of layers, and activations pass from one GPU to the next in layer order during the forward pass.
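The memory argument above can be made concrete with a rough back-of-the-envelope estimate. The sketch below is only an approximation under stated assumptions: it counts quantized weight storage alone, assumes the layers split evenly across stages, and ignores quantization scales, any unquantized tensors, the KV cache, and activations. The function name weight_gib_per_stage is hypothetical and not part of IPEX-LLM.

```python
def weight_gib_per_stage(num_params: float,
                         bits_per_weight: int = 4,
                         num_stages: int = 1) -> float:
    """Approximate weight memory (GiB) held by each pipeline stage.

    Assumptions (illustrative only): weights are split evenly across stages,
    and per-group scales, KV cache, and activations are not counted.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    total_bytes = num_params * bits_per_weight / 8  # bits -> bytes
    return total_bytes / num_stages / 2**30         # bytes -> GiB per stage


if __name__ == "__main__":
    # A 70-billion-parameter model at 4 bits per weight, on 1 GPU vs 2 GPUs.
    for stages in (1, 2):
        gib = weight_gib_per_stage(70e9, bits_per_weight=4, num_stages=stages)
        print(f"{stages} stage(s): ~{gib:.1f} GiB of weights per GPU")
```

Under these assumptions, doubling the number of stages halves the per-GPU weight footprint; real usage will be higher because of the overheads listed above.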

Usage

Load the model this way, after calling init_pipeline_parallel(), when deploying a model that is too large for a single GPU. The gpu_num value is passed as pipeline_parallel_stages and determines how many pipeline stages (one per GPU) the model is split across.
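A minimal sketch of this call order follows. It assumes that init_pipeline_parallel and AutoModelForCausalLM are importable from ipex_llm.transformers, and that from_pretrained accepts the keyword arguments shown. The load_in_low_bit value "sym_int4" and the other flags are assumptions based on typical IPEX-LLM usage, so check them against the linked implementation page. The helpers build_load_kwargs and load_pipeline_parallel_model are hypothetical names introduced only for illustration, and the script is assumed to be started by a multi-process launcher with one process per stage.

```python
from typing import Any, Dict


def build_load_kwargs(gpu_num: int, low_bit: str = "sym_int4") -> Dict[str, Any]:
    """Hypothetical helper: collect keyword arguments for from_pretrained."""
    if gpu_num < 1:
        raise ValueError("gpu_num must be at least 1")
    return {
        "load_in_low_bit": low_bit,           # low-bit quantization format (assumed spelling)
        "optimize_model": True,               # assumed flag, not confirmed by this page
        "trust_remote_code": True,            # assumed flag, not confirmed by this page
        "use_cache": True,                    # assumed flag, not confirmed by this page
        "pipeline_parallel_stages": gpu_num,  # one pipeline stage per GPU
    }


def load_pipeline_parallel_model(model_path: str, gpu_num: int):
    """Initialize pipeline parallelism first, then load the partitioned model."""
    # Imported lazily so this sketch can be imported without ipex-llm installed.
    from ipex_llm.transformers import AutoModelForCausalLM, init_pipeline_parallel

    init_pipeline_parallel()  # must run before from_pretrained
    return AutoModelForCausalLM.from_pretrained(
        model_path, **build_load_kwargs(gpu_num)
    )
```

Keeping the keyword arguments in one place makes it easy to see that gpu_num is the only value that changes when moving between machines with different GPU counts.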

Theoretical Basis

# Abstract layer distribution (NOT real implementation)
# For a model with 40 layers on 2 GPUs:
# GPU 0: embedding + layers 0-19
# GPU 1: layers 20-39 + lm_head
# Forward: GPU0 computes first half, sends activations to GPU1
# GPU1 computes second half and produces logits
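The layout above can be reproduced with a small, library-independent sketch. It assumes a simple near-equal contiguous split, which is not necessarily the rule IPEX-LLM applies internally. The names partition_layers and describe_stages are hypothetical, and the placement of the embedding on the first stage and lm_head on the last simply mirrors the comments above.

```python
from typing import List


def partition_layers(num_layers: int, num_stages: int) -> List[range]:
    """Split layer indices 0..num_layers-1 into contiguous, near-equal stages."""
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages: List[range] = []
    start = 0
    for stage in range(num_stages):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


def describe_stages(num_layers: int, num_stages: int) -> List[str]:
    """Render the split as text, with embedding first and lm_head last."""
    lines = []
    stages = partition_layers(num_layers, num_stages)
    for gpu, layers in enumerate(stages):
        parts = []
        if gpu == 0:
            parts.append("embedding")
        parts.append(f"layers {layers.start}-{layers.stop - 1}")
        if gpu == len(stages) - 1:
            parts.append("lm_head")
        lines.append(f"GPU {gpu}: " + " + ".join(parts))
    return lines


if __name__ == "__main__":
    for line in describe_stages(40, 2):
        print(line)
```

Because each stage is contiguous, a forward pass needs only one activation hand-off per stage boundary, that is, num_stages - 1 transfers in total.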

Related Pages

Implemented By

Implementation:Intel_Ipex_llm_AutoModelForCausalLM_From_Pretrained_PP
