
Environment:Intel Ipex llm Pipeline Parallel Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Distributed_Inference
Last Updated: 2026-02-09 12:00 GMT

Overview

A multi-GPU Intel XPU environment that combines PyTorch distributed with the IPEX-LLM pipeline parallel API to distribute the layers of a large model across multiple Intel GPUs.

Description

This environment provides an Intel XPU-accelerated context for pipeline parallel inference, in which a single large model is split across multiple Intel GPUs by distributing its transformer layers. It uses `ipex_llm.transformers.init_pipeline_parallel()` to initialize the distributed runtime and the `pipeline_parallel_stages` parameter of `from_pretrained()` to set the number of pipeline stages, which equals the number of GPUs. The environment requires PyTorch distributed with an XPU backend and multiple Intel GPUs accessible via `torch.distributed`.
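
To make the idea of distributing transformer layers concrete, here is a minimal sketch of an even, contiguous split of layer indices across stages. The function `split_layers` is hypothetical and written only for illustration; how IPEX-LLM actually assigns layers to GPUs is not described on this page and may differ.

```python
from typing import List


def split_layers(num_layers: int, num_stages: int) -> List[range]:
    """Assign contiguous layer index ranges to stages as evenly as possible.

    Illustrative only: this is an assumed partitioning scheme, not the
    one IPEX-LLM is documented to use.
    """
    if num_stages < 1 or num_layers < num_stages:
        raise ValueError("need at least one layer per stage")
    base, extra = divmod(num_layers, num_stages)
    stages = []
    start = 0
    for stage in range(num_stages):
        # The first `extra` stages each take one additional layer.
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


# Example: a hypothetical 32-layer model on 2 GPUs.
for stage, layers in enumerate(split_layers(32, 2)):
    print(f"stage {stage} -> xpu:{stage}, layers {layers.start}..{layers.stop - 1}")
```

With this scheme each stage holds roughly `num_layers / num_stages` layers, so per-GPU weight memory shrinks approximately in proportion to the stage count.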

Usage

Use this environment for Pipeline Parallel Inference when a model is too large to fit on a single Intel GPU. It is a prerequisite for IPEX-LLM pipeline parallel initialization, multi-GPU model loading, and distributed generation with rank-based output collection.
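
A rough way to judge whether a model is "too large" is to estimate its weight memory and compare it with per-GPU memory. The sketch below is a back-of-envelope calculation under loud assumptions: it counts weights only (no KV cache, activations, or runtime overhead), assumes an even split across stages, and uses an arbitrary 80% headroom factor. The helper names and the 16 GiB figure are hypothetical.

```python
import math


def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory taken by the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


def min_stages(num_params: float, bits_per_weight: int,
               gpu_memory_gib: float, headroom: float = 0.8) -> int:
    """Smallest stage count whose even weight share fits in the usable memory.

    `headroom` is an assumed safety fraction, not a measured value.
    """
    usable = gpu_memory_gib * headroom
    total = weight_memory_gib(num_params, bits_per_weight)
    return max(1, math.ceil(total / usable))


# Hypothetical example: 70B parameters at 4 bits per weight on 16 GiB GPUs.
print(round(weight_memory_gib(70e9, 4), 1), "GiB of weights")
print(min_stages(70e9, 4, 16), "stages needed under these assumptions")
```

Treat the result as a lower bound on `--gpu-num`; real memory use during generation is higher than the weight footprint.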

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| OS | Ubuntu 22.04 LTS | Intel OneAPI base toolkit required |
| Hardware | Multiple Intel GPUs (2+) | Arc/Flex/Max; number specified via `--gpu-num` |
| GPU Driver | Intel GPU drivers | Level Zero runtime required for all GPUs |
| Distributed | PyTorch distributed + XPU | `torch.distributed` with XPU backend |

Dependencies

System Packages

  • Intel OneAPI Base Toolkit
  • `intel-opencl-icd`
  • `intel-level-zero-gpu`
  • `level-zero`

Python Packages

  • `ipex-llm[xpu]` (pre-release)
  • `torch` (XPU variant, with distributed support)
  • `intel_extension_for_pytorch` (XPU variant)
  • `transformers`
  • `oneccl_bind_pt` (for inter-GPU communication)
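
Before launching, it can help to confirm that the packages above are importable. The following sketch uses only the standard library, so it runs even when nothing is installed. The import names are assumptions, since pip names and import names can differ; in particular `oneccl_bindings_for_pytorch` is assumed to be the import name for `oneccl_bind_pt`.

```python
import importlib.util
from typing import Iterable, List

# Assumed import names for the Python packages listed above.
REQUIRED_MODULES = (
    "ipex_llm",
    "torch",
    "intel_extension_for_pytorch",
    "transformers",
    "oneccl_bindings_for_pytorch",  # assumed import name for oneccl_bind_pt
)


def find_missing(modules: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the module names that cannot be located, without importing them."""
    missing = []
    for name in modules:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing


missing = find_missing()
print("missing modules:", ", ".join(missing) if missing else "none")
```

This only checks that the modules can be found; it does not verify that the installed `torch` build has XPU or distributed support.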

Credentials

No credentials are required for local pipeline parallel inference. The following runtime configuration is needed:

  • Launch via `torchrun` or an equivalent distributed launcher, with `--nproc_per_node` set to the same value as `--gpu-num`.
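
The launch requirement above can be checked early in a script. `torchrun` exports `RANK` and `WORLD_SIZE` to every process it starts, so a mismatch with `--gpu-num` is detectable before any model is loaded. The sketch below assumes a single-node launch (where `WORLD_SIZE` equals `--nproc_per_node`); `check_launch` is a hypothetical helper and not part of `generate.py`.

```python
import os
from typing import Mapping, Optional


def check_launch(gpu_num: int, env: Optional[Mapping[str, str]] = None) -> int:
    """Validate launcher variables against gpu_num and return this process's rank.

    Hypothetical helper; assumes a single-node launch.
    """
    env = os.environ if env is None else env
    if "RANK" not in env or "WORLD_SIZE" not in env:
        raise RuntimeError("RANK/WORLD_SIZE are not set; launch with torchrun")
    world_size = int(env["WORLD_SIZE"])
    if world_size != gpu_num:
        raise ValueError(
            f"--gpu-num is {gpu_num} but the launcher started {world_size} processes"
        )
    return int(env["RANK"])


# Example with an explicit environment, as torchrun --nproc_per_node 2 would set it.
print(check_launch(2, {"RANK": "1", "WORLD_SIZE": "2"}))
```

Passing the environment as a mapping keeps the check easy to exercise without actually starting a distributed job.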

Quick Install

# Source Intel OneAPI environment
source /opt/intel/oneapi/setvars.sh

# Install IPEX-LLM with XPU support (quotes keep the shell from globbing the brackets)
pip install --pre --upgrade "ipex-llm[xpu]" --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# Install distributed communication
pip install oneccl_bind_pt -f https://developer.intel.com/ipex-whl-stable

# Launch with 2 GPUs
torchrun --nproc_per_node 2 generate.py --gpu-num 2

Code Evidence

Pipeline parallel initialization from `generate.py:22-25`:

from ipex_llm.transformers import AutoModel, AutoModelForCausalLM, init_pipeline_parallel
from transformers import AutoTokenizer

init_pipeline_parallel()

Multi-GPU model loading from `generate.py:47-53`:

model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_low_bit=low_bit,
                                             optimize_model=True,
                                             trust_remote_code=True,
                                             use_cache=True,
                                             torch_dtype=torch.float16,
                                             pipeline_parallel_stages=args.gpu_num)

Rank-based device placement and output from `generate.py:64-83`:

local_rank = torch.distributed.get_rank()
# ...
input_ids = tokenizer.encode(args.prompt, return_tensors="pt").to(f'xpu:{local_rank}')
# ...
output = model.generate(input_ids, max_new_tokens=args.n_predict)
torch.xpu.synchronize()
# ...
if local_rank == args.gpu_num - 1:
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
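
The rank conventions in the excerpt above can be summarized in two small helpers. This is a sketch for illustration only; `device_for_rank` and `is_output_rank` are hypothetical names that mirror the `f'xpu:{local_rank}'` and `local_rank == args.gpu_num - 1` expressions in the excerpt.

```python
def device_for_rank(rank: int) -> str:
    """Device string a process of the given rank sends its inputs to."""
    return f"xpu:{rank}"


def is_output_rank(rank: int, gpu_num: int) -> bool:
    """True only for the last rank, which decodes the final output."""
    if not 0 <= rank < gpu_num:
        raise ValueError("rank must be in [0, gpu_num)")
    return rank == gpu_num - 1


# Example: list each rank's device and whether it decodes, for 4 GPUs.
for rank in range(4):
    print(rank, device_for_rank(rank), is_output_rank(rank, 4))
```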

Warmup requirement from `generate.py:69-71`:

# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
                        max_new_tokens=args.n_predict)
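
The warmup pattern generalizes to any timed call: run it once or more and discard the result, then time the later runs. Below is a minimal sketch using only the standard library; `time_after_warmup` is a hypothetical helper, and the callable passed in stands in for a `model.generate(...)` call.

```python
import time
from typing import Callable, List


def time_after_warmup(fn: Callable[[], object],
                      warmup: int = 1, runs: int = 3) -> List[float]:
    """Call fn `warmup` times untimed, then return wall-clock seconds for `runs` calls."""
    for _ in range(warmup):
        fn()  # result and duration are discarded
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


# Example with a stand-in workload instead of a real generate() call.
print(time_after_warmup(lambda: sum(range(100_000))))
```

On XPU, the callable would also need to wait for the device before returning (the excerpt from `generate.py` calls `torch.xpu.synchronize()` after generation); otherwise the measured time may not cover the whole generation.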

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: distributed not initialized` | `init_pipeline_parallel()` not called | Call `init_pipeline_parallel()` before model loading |
| `gpu_num mismatch with nproc_per_node` | Mismatched GPU count | Ensure `--gpu-num` matches `torchrun --nproc_per_node` |
| `XPU synchronize timeout` | Inter-GPU communication stall | Check GPU connectivity and OneCCL installation |

Compatibility Notes

  • Warmup Required: The first `model.generate()` call is a warmup; timing measurements should use subsequent calls.
  • Rank-Based Output: Only the last rank (`local_rank == gpu_num - 1`) produces the final decoded output.
  • IPEX-LLM Timing Attributes: After generation, `model.first_token_time` and `model.rest_cost_mean` provide profiling data. These are IPEX-LLM extensions and are not part of the standard Hugging Face `transformers` API.
  • Fallback Loading: The code attempts `AutoModelForCausalLM` first, then falls back to `AutoModel` for non-causal architectures.
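
The fallback loading note can be sketched as a generic try-in-order helper. This is an illustration under assumptions: `load_with_fallback` is a hypothetical function, and which exception types the real `generate.py` catches is not stated on this page, so the sketch catches any `Exception`.

```python
from typing import Any, Callable, Sequence


def load_with_fallback(loaders: Sequence[Callable[..., Any]],
                       *args: Any, **kwargs: Any) -> Any:
    """Try each loader in order and return the first successful result."""
    last_error = None
    for loader in loaders:
        try:
            return loader(*args, **kwargs)
        except Exception as exc:  # assumed: the real code may catch a narrower type
            last_error = exc
    raise RuntimeError("all loaders failed") from last_error


# Stand-in loaders; in practice these would be
# AutoModelForCausalLM.from_pretrained and AutoModel.from_pretrained.
def causal_loader(path: str) -> str:
    raise ValueError("not a causal LM architecture")


def generic_loader(path: str) -> str:
    return f"generic model from {path}"


print(load_with_fallback([causal_loader, generic_loader], "some/model/path"))
```

Chaining the final `RuntimeError` from the last failure keeps the original error visible in the traceback when every loader fails.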
