Implementation:Intel Ipex llm Deepspeed AutoTP Inference
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Inference, Tensor_Parallelism, DeepSpeed |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
A concrete script for distributed LLM inference that combines DeepSpeed Automatic Tensor Parallelism (AutoTP) with IPEX-LLM optimization on Intel XPU devices.
Description
This script implements distributed inference by loading a model on CPU, applying IPEX-LLM low-bit quantization via optimize_model, and then sharding the model across multiple XPU devices with deepspeed.init_distributed and deepspeed.init_inference. It registers the Intel XPU accelerator with DeepSpeed and wraps the model in BenchmarkWrapper to measure performance.
Usage
Use this script for models that require tensor parallelism across multiple Intel GPUs (XPU devices). Launch it with the deepspeed launcher or with mpirun, starting one process per GPU.
Code Reference
Source Location
- Repository: Intel IPEX-LLM
- File: python/llm/example/GPU/Deepspeed-AutoTP/deepspeed_autotp.py
- Lines: 1-145
Signature
def get_int_from_env(env_keys, default):
    """Retrieve integer environment variable with fallback."""
# Main flow:
# 1. Set XPU Accelerator
# 2. deepspeed.init_distributed('ccl')
# 3. Load model on CPU
# 4. optimize_model(model, low_bit=args.low_bit)
# 5. deepspeed.init_inference(model, ...)
# 6. model.generate(...)
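The helper in the signature only shows its docstring. A minimal sketch of how such a helper might behave is given below, assuming it returns the first non-negative integer found among the listed environment variables and otherwise the default. The variable names in the usage lines (LOCAL_RANK, WORLD_SIZE, PMI_RANK, PMI_SIZE) are assumptions about what the deepspeed launcher and mpirun export, not values confirmed by this page.

```python
import os


def get_int_from_env(env_keys, default):
    """Return the first non-negative integer found under env_keys, else default."""
    for key in env_keys:
        # An unset variable falls back to -1, which is treated as "not found".
        value = int(os.environ.get(key, -1))
        if value >= 0:
            return value
    return int(default)


# Assumed variable names: one pair for the deepspeed launcher, one for MPI launchers.
local_rank = get_int_from_env(["LOCAL_RANK", "PMI_RANK"], 0)
world_size = get_int_from_env(["WORLD_SIZE", "PMI_SIZE"], 1)
```

Checking several keys in order lets the same script run unchanged under either launcher, since each one may expose the rank and world size under different names.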
Import
from ipex_llm import optimize_model
from ipex_llm.utils import BenchmarkWrapper
import deepspeed
from intel_extension_for_deepspeed import XPU_Accelerator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| repo-id-or-model-path | str | Yes | Hugging Face model ID or local path |
| prompt | str | No | Input text for generation |
| n-predict | int | No | Max tokens to generate (default: 32) |
| low-bit | str | No | Quantization type (default: sym_int4) |
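The input table above maps naturally onto a command-line parser. The sketch below uses argparse with the names, types, and defaults listed in the table; the prompt default is a placeholder chosen for illustration, and the help strings are paraphrased, so treat this as an approximation rather than the script's actual parser.

```python
import argparse


def build_parser():
    """Build a parser mirroring the Inputs table (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="DeepSpeed AutoTP inference (sketch)")
    parser.add_argument("--repo-id-or-model-path", type=str, required=True,
                        help="Hugging Face model ID or local path")
    # Placeholder default: the table does not state the script's default prompt.
    parser.add_argument("--prompt", type=str, default="What is AI?",
                        help="Input text for generation")
    parser.add_argument("--n-predict", type=int, default=32,
                        help="Max tokens to generate")
    parser.add_argument("--low-bit", type=str, default="sym_int4",
                        help="Quantization type")
    return parser


# argparse converts dashes to underscores in attribute names.
args = build_parser().parse_args(["--repo-id-or-model-path", "path/to/model"])
print(args.repo_id_or_model_path, args.n_predict, args.low_bit)
```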
Outputs
| Name | Type | Description |
|---|---|---|
| Generated text | Console | Generated completion on rank 0 |
| Benchmark metrics | Console | Tokens/s throughput via BenchmarkWrapper |
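The output contract can be illustrated with a small stand-in: time one generation call, derive tokens per second, and emit results only on rank 0. This is not BenchmarkWrapper's implementation, whose internals are not shown on this page; timed_generate and report are hypothetical helpers, and generate_fn stands for any callable that performs generation.

```python
import time


def timed_generate(generate_fn, n_new_tokens):
    """Run generate_fn once and return (result, tokens_per_second)."""
    start = time.perf_counter()
    result = generate_fn()
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very fast calls.
    tokens_per_second = n_new_tokens / elapsed if elapsed > 0 else float("inf")
    return result, tokens_per_second


def report(local_rank, text, tokens_per_second):
    """Return the lines rank 0 would print; every other rank stays silent."""
    if local_rank != 0:
        return []
    return [text, f"throughput: {tokens_per_second:.2f} tokens/s"]


# Usage with a dummy generator standing in for model.generate(...).
output, tps = timed_generate(lambda: "generated text", n_new_tokens=32)
for line in report(0, output, tps):
    print(line)
```

Restricting console output to rank 0 avoids duplicated text, since every tensor-parallel process produces the same completion.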
Usage Examples
Multi-GPU Inference
deepspeed --num_gpus 2 deepspeed_autotp.py \
--repo-id-or-model-path "meta-llama/Llama-2-70b-chat-hf" \
--low-bit "sym_int4" \
--prompt "What is deep learning?" \
--n-predict 64