Environment:Intel Ipex llm Pipeline Parallel Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Distributed_Inference |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Multi-GPU Intel XPU environment with PyTorch distributed and IPEX-LLM pipeline parallel API for distributing large model layers across multiple Intel GPUs.
Description
This environment provides an Intel XPU-accelerated context for pipeline parallel inference, in which a single large model is split across multiple Intel GPUs by distributing its transformer layers. It uses `ipex_llm.transformers.init_pipeline_parallel()` to initialize the distributed runtime and the `pipeline_parallel_stages` parameter of `from_pretrained()` to specify the number of GPUs. The environment requires PyTorch distributed with the XPU backend and multiple Intel GPUs accessible via `torch.distributed`.
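As a rough illustration of what distributing transformer layers across stages means, the sketch below splits a layer count into contiguous per-stage ranges. The helper name `partition_layers` and the near-even split are assumptions made for illustration; the way IPEX-LLM actually assigns layers to devices may differ.

```python
from typing import List


def partition_layers(num_layers: int, num_stages: int) -> List[range]:
    """Split layer indices into contiguous, near-even chunks, one per stage.

    Hypothetical helper for illustration only; not part of IPEX-LLM.
    """
    if num_stages < 1 or num_layers < num_stages:
        raise ValueError("need at least one stage and one layer per stage")
    base, extra = divmod(num_layers, num_stages)
    stages: List[range] = []
    start = 0
    for stage in range(num_stages):
        # The first `extra` stages each take one additional layer.
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


if __name__ == "__main__":
    # Example: a 32-layer model over 2 stages (one stage per GPU).
    for rank, layers in enumerate(partition_layers(32, 2)):
        print(f"stage {rank}: layers {layers.start}..{layers.stop - 1}")
```

With this kind of split, each process holds only its own slice of the layers, which is why the per-GPU memory requirement drops as the number of stages grows.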
Usage
Use this environment for Pipeline Parallel Inference when a model is too large to fit on a single Intel GPU. It is the mandatory prerequisite for running the IPEX-LLM pipeline parallel initialization, multi-GPU model loading, and distributed generation with rank-based output collection.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Ubuntu 22.04 LTS | Intel oneAPI Base Toolkit required |
| Hardware | Multiple Intel GPUs (2+) | Arc/Flex/Max; number specified via `--gpu-num` |
| GPU Driver | Intel GPU drivers | Level Zero runtime required for all GPUs |
| Distributed | PyTorch distributed + XPU | `torch.distributed` with XPU backend |
Dependencies
System Packages
- Intel oneAPI Base Toolkit
- `intel-opencl-icd`
- `intel-level-zero-gpu`
- `level-zero`
Python Packages
- `ipex-llm[xpu]` (pre-release)
- `torch` (XPU variant, with distributed support)
- `intel_extension_for_pytorch` (XPU variant)
- `transformers`
- `oneccl_bind_pt` (for inter-GPU communication)
Credentials
No credentials are required for local pipeline parallel inference. The following runtime configuration is needed:
- Launch via `torchrun` or an equivalent distributed launcher, with `--nproc_per_node` set to the same value as `--gpu-num`.
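One way to catch a mismatch between the launcher and `--gpu-num` early is to compare against the `WORLD_SIZE` variable that `torchrun` exports to each worker. The sketch below is an assumption about how such a check could look; `check_world_size` is a hypothetical helper and is not part of `generate.py`.

```python
import os
from typing import Mapping, Optional


def check_world_size(gpu_num: int, env: Optional[Mapping[str, str]] = None) -> int:
    """Return WORLD_SIZE if it equals gpu_num, otherwise raise ValueError.

    Hypothetical pre-flight check; `env` defaults to the process environment.
    """
    env = os.environ if env is None else env
    raw = env.get("WORLD_SIZE")
    if raw is None:
        raise ValueError(
            "WORLD_SIZE is not set; was the script started by a distributed launcher?"
        )
    world_size = int(raw)
    if world_size != gpu_num:
        raise ValueError(
            f"--gpu-num is {gpu_num} but the launcher started {world_size} processes"
        )
    return world_size


if __name__ == "__main__":
    # Simulated environment for a 2-process launch.
    print(check_world_size(2, {"WORLD_SIZE": "2"}))
```

Calling a check like this before model loading turns a confusing runtime stall into an immediate, readable error.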
Quick Install
# Source the Intel oneAPI environment
source /opt/intel/oneapi/setvars.sh
# Install IPEX-LLM with XPU support (the quotes keep shells such as zsh from globbing the brackets)
pip install --pre --upgrade "ipex-llm[xpu]" --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
# Install distributed communication
pip install oneccl_bind_pt -f https://developer.intel.com/ipex-whl-stable
# Launch with 2 GPUs
torchrun --nproc_per_node 2 generate.py --gpu-num 2
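Before launching, it can help to confirm that the Python packages listed above are importable in the active environment. The following is a minimal sketch; `missing_modules` is a hypothetical helper, and the oneCCL bindings are left out because their import name is not shown in this document.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Top-level import names taken from the dependency list and code excerpts.
REQUIRED_MODULES = ("torch", "intel_extension_for_pytorch", "ipex_llm", "transformers")


def missing_modules(names: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the top-level module names that cannot be found.

    Only locates the modules; it does not import them, so it will not
    detect driver or runtime problems that appear at import time.
    """
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```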
Code Evidence
Pipeline parallel initialization from `generate.py:22-25`:
from ipex_llm.transformers import AutoModel, AutoModelForCausalLM, init_pipeline_parallel
from transformers import AutoTokenizer
init_pipeline_parallel()
Multi-GPU model loading from `generate.py:47-53`:
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_low_bit=low_bit,
                                             optimize_model=True,
                                             trust_remote_code=True,
                                             use_cache=True,
                                             torch_dtype=torch.float16,
                                             pipeline_parallel_stages=args.gpu_num)
Rank-based device placement and output from `generate.py:64-83`:
local_rank = torch.distributed.get_rank()
# ...
input_ids = tokenizer.encode(args.prompt, return_tensors="pt").to(f'xpu:{local_rank}')
# ...
output = model.generate(input_ids, max_new_tokens=args.n_predict)
torch.xpu.synchronize()
# ...
if local_rank == args.gpu_num - 1:
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
Warmup requirement from `generate.py:69-71`:
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
                        max_new_tokens=args.n_predict)
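The warmup pattern generalizes to any timed call: run it once (or more) untimed, then measure the following runs. The sketch below shows this with a plain callable; `time_after_warmup` is a hypothetical helper, not an IPEX-LLM API. When timing XPU work, the callable would need to include `torch.xpu.synchronize()` as in the excerpt above so the timer covers completed device work.

```python
import time
from typing import Callable, List


def time_after_warmup(fn: Callable[[], object], runs: int = 3, warmup: int = 1) -> List[float]:
    """Call `fn` `warmup` times untimed, then return the duration of each of `runs` timed calls."""
    for _ in range(warmup):
        fn()  # warmup call: result and timing are discarded
    durations: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Stand-in workload; in practice this would wrap model.generate(...).
    timings = time_after_warmup(lambda: sum(range(100_000)))
    print(f"mean over {len(timings)} runs: {sum(timings) / len(timings):.6f}s")
```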
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: distributed not initialized` | `init_pipeline_parallel()` not called | Call `init_pipeline_parallel()` before model loading |
| `gpu_num mismatch with nproc_per_node` | Mismatched GPU count | Ensure `--gpu-num` matches `torchrun --nproc_per_node` |
| `XPU synchronize timeout` | Inter-GPU communication stall | Check GPU connectivity and the oneCCL installation |
Compatibility Notes
- Warmup Required: The first `model.generate()` call is a warmup; timing measurements should use subsequent calls.
- Rank-Based Output: Only the last rank (`local_rank == gpu_num - 1`) produces the final decoded output.
- IPEX-LLM Timing Attributes: After generation, `model.first_token_time` and `model.rest_cost_mean` provide profiling data. These are IPEX-LLM extensions and are not available on standard Hugging Face `transformers` models.
- Fallback Loading: The code attempts `AutoModelForCausalLM` first, then falls back to `AutoModel` for non-causal architectures.
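The fallback loading note describes a try-in-order pattern, which can be sketched generically as follows. `load_with_fallback` is a hypothetical helper, and the broad `except Exception` is an assumption, since the exception type raised for unsupported architectures is not stated here.

```python
from typing import Any, Callable, List, Sequence


def load_with_fallback(loaders: Sequence[Callable[[], Any]]) -> Any:
    """Try each zero-argument loader in order and return the first successful result."""
    errors: List[Exception] = []
    for loader in loaders:
        try:
            return loader()
        except Exception as exc:  # broad on purpose; see the assumption noted above
            errors.append(exc)
    raise RuntimeError(f"all {len(loaders)} loaders failed: {errors!r}")


if __name__ == "__main__":
    def causal_loader():
        # Stand-in for AutoModelForCausalLM.from_pretrained(...)
        raise ValueError("architecture has no causal LM head")

    def generic_loader():
        # Stand-in for AutoModel.from_pretrained(...)
        return "generic model"

    print(load_with_fallback([causal_loader, generic_loader]))
```

In the pipeline parallel script, the two loaders would wrap `AutoModelForCausalLM.from_pretrained(...)` and `AutoModel.from_pretrained(...)` with the same keyword arguments, including `pipeline_parallel_stages`.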