Implementation:Intel Ipex llm Deepspeed AutoTP Inference

From Leeroopedia


Knowledge Sources
Domains: Distributed_Inference, Tensor_Parallelism, DeepSpeed
Last Updated: 2026-02-09 04:00 GMT

Overview

A concrete tool for distributed LLM inference that combines DeepSpeed Automatic Tensor Parallelism (AutoTP) with IPEX-LLM low-bit optimization on Intel XPU devices.

Description

This script implements distributed inference by loading a model on CPU, applying IPEX-LLM low-bit quantization via optimize_model, and distributing the model across multiple XPU devices with deepspeed.init_distributed (process-group setup) and deepspeed.init_inference (tensor-parallel sharding). It also registers the Intel XPU accelerator with DeepSpeed and wraps the model in BenchmarkWrapper for performance measurement.

Usage

Use this when a model needs tensor parallelism across multiple Intel XPU devices, for example because it does not fit on a single device. Launch the script with the DeepSpeed launcher or mpirun, starting one process per XPU.

Code Reference

Source Location

Signature

def get_int_from_env(env_keys, default):
    """Retrieve integer environment variable with fallback."""

# Main flow:
# 1. Set XPU Accelerator
# 2. deepspeed.init_distributed('ccl')
# 3. Load model on CPU
# 4. optimize_model(model, low_bit=args.low_bit)
# 5. deepspeed.init_inference(model, ...)
# 6. model.generate(...)
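The signature above shows only the docstring of get_int_from_env. A minimal sketch of how such a helper might be written is given below; the body, the handling of non-integer values, and the example key lists (LOCAL_RANK / PMI_RANK and WORLD_SIZE / PMI_SIZE, which the DeepSpeed launcher and Hydra-based MPI launchers commonly export) are assumptions rather than the script's confirmed implementation.

```python
import os


def get_int_from_env(env_keys, default):
    """Return the first non-negative integer found under `env_keys`, else `default`.

    Sketch only: the real helper may treat invalid values differently.
    """
    for key in env_keys:
        raw = os.environ.get(key)
        if raw is None:
            continue  # variable not set by this launcher
        try:
            value = int(raw)
        except ValueError:
            continue  # assumption: silently skip non-integer values
        if value >= 0:
            return value
    return int(default)


# Example: resolve rank and world size under either launcher.
# The key names are assumptions about what the launcher exports.
local_rank = get_int_from_env(["LOCAL_RANK", "PMI_RANK"], 0)
world_size = get_int_from_env(["WORLD_SIZE", "PMI_SIZE"], 1)
```

Checking several keys in order lets the same script run unchanged whether it is started by the DeepSpeed launcher or by mpirun.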

Import

from ipex_llm import optimize_model
from ipex_llm.utils import BenchmarkWrapper
import deepspeed
from intel_extension_for_deepspeed import XPU_Accelerator

I/O Contract

Inputs

Name | Type | Required | Description
repo-id-or-model-path | str | Yes | HuggingFace model ID or local path
prompt | str | No | Input text for generation
n-predict | int | No | Max tokens to generate (default: 32)
low-bit | str | No | Quantization type (default: sym_int4)
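The inputs above map naturally onto command-line flags. The following is a minimal argparse sketch under that assumption; the function name build_parser, the help strings, and the default prompt text are hypothetical, while the flag names, types, and the defaults for n-predict and low-bit come from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented inputs."""
    parser = argparse.ArgumentParser(description="DeepSpeed AutoTP inference (sketch)")
    parser.add_argument("--repo-id-or-model-path", type=str, required=True,
                        help="HuggingFace model ID or local path")
    # The default prompt below is a placeholder, not the script's actual default.
    parser.add_argument("--prompt", type=str, default="What is AI?",
                        help="Input text for generation")
    parser.add_argument("--n-predict", type=int, default=32,
                        help="Max tokens to generate")
    parser.add_argument("--low-bit", type=str, default="sym_int4",
                        help="Quantization type")
    return parser


# argparse converts dashes to underscores in attribute names.
args = build_parser().parse_args(
    ["--repo-id-or-model-path", "meta-llama/Llama-2-70b-chat-hf"]
)
print(args.repo_id_or_model_path, args.n_predict, args.low_bit)
```

Only the model path is required; omitting the other flags falls back to the documented defaults.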

Outputs

Name | Type | Description
Generated text | Console | Generated completion on rank 0
Benchmark metrics | Console | Token/s throughput via BenchmarkWrapper
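To make the two outputs concrete, here is a small sketch of rank-0-only reporting with a tokens-per-second figure. It does not use or reproduce BenchmarkWrapper, whose internal measurements may differ; the function names tokens_per_second and rank0_report and the report format are hypothetical.

```python
def tokens_per_second(num_new_tokens: int, elapsed_seconds: float) -> float:
    """Throughput as generated tokens divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_new_tokens / elapsed_seconds


def rank0_report(local_rank, text, num_new_tokens, elapsed_seconds):
    """Return the console report on rank 0 and None on every other rank.

    Every rank produces the same completion under tensor parallelism,
    so printing from a single rank avoids duplicated output.
    """
    if local_rank != 0:
        return None
    rate = tokens_per_second(num_new_tokens, elapsed_seconds)
    return f"{text}\n[{num_new_tokens} tokens in {elapsed_seconds:.2f} s, {rate:.1f} tokens/s]"


# Example with made-up numbers: 64 new tokens generated in 4 seconds.
report = rank0_report(0, "Deep learning is ...", 64, 4.0)
if report is not None:
    print(report)
```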

Usage Examples

Multi-GPU Inference

deepspeed --num_gpus 2 deepspeed_autotp.py \
    --repo-id-or-model-path "meta-llama/Llama-2-70b-chat-hf" \
    --low-bit "sym_int4" \
    --prompt "What is deep learning?" \
    --n-predict 64

Related Pages
