Implementation:Intel Ipex llm Deepspeed AutoTP Inference

Knowledge Sources	Intel IPEX-LLM DeepSpeed Inference
Domains	Distributed_Inference, Tensor_Parallelism, DeepSpeed
Last Updated	2026-02-09 04:00 GMT

Overview

A concrete tool for distributed LLM inference on Intel XPU, combining DeepSpeed Automatic Tensor Parallelism (AutoTP) with IPEX-LLM optimization.

Description

This script implements distributed inference in three stages: it loads the model on CPU, applies IPEX-LLM low-bit quantization via optimize_model, and then distributes the model across multiple XPU devices using deepspeed.init_distributed and deepspeed.init_inference. It also configures the Intel XPU accelerator for DeepSpeed and wraps the model in BenchmarkWrapper to measure performance.

Usage

Use this script for inference with models that need tensor parallelism across multiple Intel XPU devices, typically because the weights do not fit on a single GPU. Launch it with DeepSpeed or mpirun, using the number of processes that matches the number of GPUs.
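As a rough guide to when multiple devices are needed, the per-device weight footprint can be estimated with simple arithmetic. The sketch below is an illustration only: the helper name is hypothetical, it assumes about 4 bits per weight for sym_int4, and it ignores quantization scales, layers left unquantized, activations, and the KV cache, so real usage will be higher.

```python
def weight_gb_per_device(num_params, bits_per_weight, num_devices):
    """Estimate weight memory per device in GB (1 GB = 1e9 bytes).

    Hypothetical helper: assumes weights are split evenly across devices
    and counts only the quantized weights themselves.
    """
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / num_devices / 1e9


# A 70-billion-parameter model at ~4 bits per weight, split over 2 devices.
estimate = weight_gb_per_device(70e9, 4, 2)
print(f"{estimate:.1f} GB of weights per device")
```

Under these assumptions the same model on a single device would need twice as much weight memory, which is the usual reason for choosing tensor parallelism.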

Code Reference

Source Location

Repository: Intel IPEX-LLM
File: python/llm/example/GPU/Deepspeed-AutoTP/deepspeed_autotp.py
Lines: 1-145

Signature

def get_int_from_env(env_keys, default):
    """Retrieve integer environment variable with fallback."""

# Main flow:
# 1. Set XPU Accelerator
# 2. deepspeed.init_distributed('ccl')
# 3. Load model on CPU
# 4. optimize_model(model, low_bit=args.low_bit)
# 5. deepspeed.init_inference(model, ...)
# 6. model.generate(...)
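The signature above shows only the docstring of get_int_from_env. A minimal sketch of one way such a helper could work is given below, assuming it returns the first non-negative integer found among the listed variables. The variable names LOCAL_RANK, PMI_RANK, WORLD_SIZE, and PMI_SIZE are assumptions based on what common DeepSpeed and MPI launchers export; they are not confirmed by this page.

```python
import os


def get_int_from_env(env_keys, default):
    """Return the first non-negative integer found among env_keys, else default."""
    for key in env_keys:
        value = int(os.environ.get(key, -1))
        if value >= 0:
            return value
    return int(default)


# Assumed variable names: the first of each pair is typical of torch-style
# launchers, the second of MPI launchers.
local_rank = get_int_from_env(["LOCAL_RANK", "PMI_RANK"], 0)
world_size = get_int_from_env(["WORLD_SIZE", "PMI_SIZE"], 1)
```

Checking several keys in order lets the same script run under either launcher without extra flags.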

Import

from ipex_llm import optimize_model
from ipex_llm.utils import BenchmarkWrapper
import deepspeed
from intel_extension_for_deepspeed import XPU_Accelerator

I/O Contract

Inputs

Name	Type	Required	Description
repo-id-or-model-path	str	Yes	HuggingFace model ID or local path
prompt	str	No	Input text for generation
n-predict	int	No	Max tokens to generate (default: 32)
low-bit	str	No	Quantization type (default: sym_int4)
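The inputs in the table map naturally onto an argparse parser. The following is a sketch under stated assumptions rather than the script's actual parser: the flag names and the defaults for n-predict and low-bit come from the table, while the prompt default and the help strings are placeholders.

```python
import argparse


def build_parser():
    """Build a parser mirroring the inputs table (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="DeepSpeed AutoTP inference (sketch)")
    parser.add_argument("--repo-id-or-model-path", type=str, required=True,
                        help="HuggingFace model ID or local path")
    # The default prompt below is a placeholder, not taken from the source.
    parser.add_argument("--prompt", type=str, default="What is AI?",
                        help="Input text for generation")
    parser.add_argument("--n-predict", type=int, default=32,
                        help="Max tokens to generate")
    parser.add_argument("--low-bit", type=str, default="sym_int4",
                        help="Quantization type")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # argparse converts dashes to underscores in attribute names.
    print(args.repo_id_or_model_path, args.n_predict, args.low_bit)
```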

Outputs

Name	Type	Description
Generated text	Console	Generated completion on rank 0
Benchmark metrics	Console	Tokens-per-second throughput reported by BenchmarkWrapper
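The two outputs can be illustrated with a small stand-in. This sketch does not reproduce BenchmarkWrapper, whose internals are not shown on this page; the function names are hypothetical and the code only demonstrates the throughput arithmetic and the convention that rank 0 alone prints results.

```python
def tokens_per_second(num_new_tokens, elapsed_seconds):
    """Throughput as generated tokens divided by wall-clock seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_new_tokens / elapsed_seconds


def report(rank, text, num_new_tokens, elapsed_seconds):
    """Return the lines rank 0 would print; every other rank returns nothing."""
    if rank != 0:
        return []
    rate = tokens_per_second(num_new_tokens, elapsed_seconds)
    return [text, f"{rate:.2f} tokens/s"]


# Example: rank 0 generated 64 tokens in 2 seconds.
for line in report(0, "Deep learning is ...", 64, 2.0):
    print(line)
```

Restricting output to one rank avoids duplicated console text, since every process in a tensor-parallel run produces the same completion.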

Usage Examples

Multi-GPU Inference

deepspeed --num_gpus 2 deepspeed_autotp.py \
    --repo-id-or-model-path "meta-llama/Llama-2-70b-chat-hf" \
    --low-bit "sym_int4" \
    --prompt "What is deep learning?" \
    --n-predict 64

Related Pages

Environment:Intel_Ipex_llm_XPU_Inference_Environment
