
Heuristic: Intel IPEX-LLM DeepSpeed Tensor Parallel Tips

From Leeroopedia




Knowledge Sources
Domains: Serving, Tensor_Parallelism, Optimization
Last Updated: 2026-02-09 04:00 GMT

Overview

Best practices for DeepSpeed Automatic Tensor Parallelism with IPEX-LLM on Intel XPU.

Description

When using DeepSpeed's Automatic Tensor Parallelism (AutoTP) to distribute LLM inference across multiple Intel XPU devices, several practices improve stability and performance. The model should first be loaded on CPU and quantized to a low-bit format with IPEX-LLM's `optimize_model()` before being distributed via `deepspeed.init_inference()`. The `replace_with_kernel_inject=False` flag is required because IPEX-LLM applies its own custom kernels, which are incompatible with DeepSpeed's default kernel injection. The tensor parallel degree (`mp_size`) should match the number of available XPU devices.

Usage

Use this heuristic when deploying LLMs with DeepSpeed AutoTP on Intel XPU hardware, particularly when using the FastAPI serving pattern or standalone DeepSpeed inference scripts.

The Insight (Rule of Thumb)

  • Action: Load the model on CPU first, apply `optimize_model()`, then distribute with `deepspeed.init_inference()`.
  • Action: Set `replace_with_kernel_inject=False` to avoid conflicts with IPEX-LLM kernels.
  • Action: Set `mp_size` to match the number of XPU devices.
  • Action: Use `dtype=torch.float16` for the distributed model.
  • Trade-off: DeepSpeed AutoTP distributes entire linear layers across devices, which increases inter-device communication compared to pipeline parallelism.
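A quick sanity check on the `mp_size` rule above: tensor parallelism shards each attention block across ranks, so the model's attention head count must divide evenly by `mp_size`. A minimal sketch of that check (the helper name `check_autotp_compat` is hypothetical, not part of IPEX-LLM or DeepSpeed):

```python
def check_autotp_compat(num_attention_heads: int, mp_size: int) -> bool:
    """Return True if attention heads shard evenly across mp_size ranks.

    AutoTP splits attention column-wise across ranks; an uneven number of
    heads per rank would break the partitioning.
    """
    if mp_size <= 0:
        raise ValueError("mp_size must be a positive integer")
    return num_attention_heads % mp_size == 0


# e.g. a 32-head model shards cleanly across 2 or 4 XPUs, but not across 3
assert check_autotp_compat(32, 4)
assert not check_autotp_compat(32, 3)
```

Running this check before calling `deepspeed.init_inference()` turns an obscure sharding error into an early, readable failure.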

Reasoning

DeepSpeed's kernel injection replaces standard PyTorch modules with optimized CUDA kernels. Since IPEX-LLM already applies its own XPU-specific optimizations via `optimize_model()`, enabling DeepSpeed kernel injection causes conflicts. Loading on CPU first ensures the model can be properly optimized before distribution. The `mp_size` parameter must exactly match the available XPU devices to ensure proper tensor splitting.
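In a launcher-spawned job, each rank must also map itself to its own XPU device before distribution. A minimal sketch, assuming the launcher (e.g. mpirun or torchrun, possibly via a wrapper script) exports `LOCAL_RANK` and `WORLD_SIZE`; the helper name `rank_config` is hypothetical:

```python
import os


def rank_config(env=None):
    """Read per-process rank info exported by the distributed launcher and
    map this rank to its XPU device string (e.g. rank 1 -> "xpu:1")."""
    env = os.environ if env is None else env
    local_rank = int(env.get("LOCAL_RANK", "0"))
    world_size = int(env.get("WORLD_SIZE", "1"))
    return {
        "local_rank": local_rank,
        "world_size": world_size,  # passed as mp_size to init_inference
        "device": f"xpu:{local_rank}",
    }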

Code Evidence

Model loading and optimization from `serving.py`:

import torch
import deepspeed
from transformers import AutoModelForCausalLM
from ipex_llm import optimize_model

# Load on CPU in fp16, quantize with IPEX-LLM, then distribute via AutoTP
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True)
model = optimize_model(model, low_bit=low_bit)
model = deepspeed.init_inference(model, mp_size=world_size, dtype=torch.float16, replace_with_kernel_inject=False)
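The snippet above runs once per rank, so it must be launched with one process per XPU device. A hedged invocation sketch for two devices (the script name follows the source; the launcher choice and any environment exports such as `WORLD_SIZE` depend on the local setup and wrapper scripts):

```shell
# One process per XPU; the wrapper environment is assumed to export
# LOCAL_RANK / WORLD_SIZE for each spawned rank.
mpirun -np 2 python serving.py
```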
