Heuristic: Intel IPEX-LLM DeepSpeed Tensor Parallel Tips
| Knowledge Sources | |
|---|---|
| Domains | Serving, Tensor_Parallelism, Optimization |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
Best practices for DeepSpeed Automatic Tensor Parallelism with IPEX-LLM on Intel XPU.
Description
When using DeepSpeed's Automatic Tensor Parallelism to distribute LLM inference across multiple Intel XPU devices, several practices improve stability and performance. The model should first be loaded on CPU with IPEX-LLM's `optimize_model()` for low-bit quantization before being distributed via `deepspeed.init_inference()`. The `replace_with_kernel_inject=False` flag is required because IPEX-LLM uses custom kernels that are incompatible with DeepSpeed's default kernel injection. Tensor parallel degree should match the number of available XPU devices.
Usage
Use this heuristic when deploying LLMs with DeepSpeed AutoTP on Intel XPU hardware, particularly when using the FastAPI serving pattern or standalone DeepSpeed inference scripts.
The Insight (Rule of Thumb)
- Action: Load the model on CPU first, apply `optimize_model()`, then distribute with `deepspeed.init_inference()`.
- Action: Set `replace_with_kernel_inject=False` to avoid conflicts with IPEX-LLM kernels.
- Action: Set `mp_size` to match the number of XPU devices.
- Action: Use `dtype=torch.float16` for the distributed model.
- Trade-off: DeepSpeed AutoTP shards each linear layer's weights across devices, so every sharded layer requires an all-reduce to recombine partial results; this increases inter-device communication compared to pipeline parallelism.
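The action items above amount to a fixed set of `deepspeed.init_inference()` arguments. A minimal sketch that collects them in one place, assuming a hypothetical `autotp_kwargs` helper (the keys are DeepSpeed's real `init_inference` parameter names; `dtype` is shown as a string here but should be `torch.float16` in real code):

```python
def autotp_kwargs(world_size: int) -> dict:
    """Build the deepspeed.init_inference() keyword arguments
    recommended for IPEX-LLM AutoTP on Intel XPU."""
    return {
        "mp_size": world_size,                # one shard per XPU device
        "dtype": "float16",                   # pass torch.float16 in real code
        "replace_with_kernel_inject": False,  # keep IPEX-LLM's XPU kernels
    }
```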
Reasoning
DeepSpeed's kernel injection replaces standard PyTorch modules with optimized CUDA kernels, which are unavailable on XPU. Since IPEX-LLM already applies its own XPU-specific optimizations via `optimize_model()`, enabling kernel injection on top of them causes conflicts. Loading on CPU first ensures the model is fully quantized before its weights are split across devices. Finally, `mp_size` must exactly match the number of XPU devices so that each device receives exactly one tensor-parallel shard.
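Because `mp_size` must match the device count, it is safer to derive it from the launcher's environment than to hard-code it. A minimal sketch, assuming the standard `WORLD_SIZE` variable that DeepSpeed and MPI-style launchers export (`xpu_count` is a hypothetical fallback for standalone runs, where you would pass `torch.xpu.device_count()`):

```python
import os

def autotp_world_size(xpu_count: int = 1) -> int:
    # Prefer the launcher-provided WORLD_SIZE; outside a launcher,
    # fall back to the detected XPU device count.
    return int(os.environ.get("WORLD_SIZE", xpu_count))
```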
Code Evidence
Model loading and optimization from `serving.py`:
```python
import deepspeed
import torch
from ipex_llm import optimize_model
from transformers import AutoModelForCausalLM

# Load on CPU in fp16, quantize with IPEX-LLM, then distribute.
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16,
                                             trust_remote_code=True)
model = optimize_model(model, low_bit=low_bit)  # e.g. low_bit="sym_int4"
model = deepspeed.init_inference(model, mp_size=world_size, dtype=torch.float16,
                                 replace_with_kernel_inject=False)
```
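After `init_inference()`, each rank still has to move its shard onto its own XPU device before generating. A hedged sketch, assuming the `LOCAL_RANK` variable that DeepSpeed and MPI-style launchers export (`place_on_xpu` is a hypothetical helper name):

```python
import os

def place_on_xpu(model):
    # Each rank moves its tensor-parallel shard to its own XPU device,
    # e.g. rank 0 -> "xpu:0", rank 1 -> "xpu:1".
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    return model.to(f"xpu:{local_rank}")
```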