Implementation: Intel IPEX-LLM DeepSpeed AutoTP FastAPI Serving
| Knowledge Sources | |
|---|---|
| Domains | Serving, Tensor_Parallelism, DeepSpeed |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
Concrete tool for serving LLMs via FastAPI with DeepSpeed Automatic Tensor Parallelism (AutoTP) and IPEX-LLM low-bit optimizations on Intel XPU.
Description
This serving application provides REST API endpoints for LLM inference using DeepSpeed's Automatic Tensor Parallelism for distributing model weights across multiple XPU devices. The model is first loaded on CPU, optimized with IPEX-LLM low-bit quantization, then distributed via DeepSpeed Inference. It supports both streaming and non-streaming generation with request batching through an async queue processor.
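The request-batching idea described above can be sketched with plain asyncio: requests accumulate on a queue, a background processor drains up to a batch-size limit, and each caller awaits a future that the processor resolves. This is a minimal illustration of the pattern only, not the actual serving.py code; the names `process_batches`, `submit`, and `MAX_BATCH` are hypothetical, and the generation step is mocked.

```python
import asyncio

# Hypothetical sketch of the async queue-processor batching pattern.
# The real serving.py batches prompts for the tensor-parallel model;
# here the model call is replaced by a string stand-in.
MAX_BATCH = 4

async def process_batches(queue: asyncio.Queue) -> None:
    while True:
        prompt, fut = await queue.get()
        batch = [(prompt, fut)]
        # Opportunistically drain more pending requests into the batch.
        while not queue.empty() and len(batch) < MAX_BATCH:
            batch.append(queue.get_nowait())
        # Stand-in for one batched model.generate() call.
        outputs = [f"generated<{p}>" for p, _ in batch]
        for (_, f), out in zip(batch, outputs):
            f.set_result(out)

async def submit(queue: asyncio.Queue, prompt: str) -> str:
    fut = asyncio.get_running_loop().create_future()
    await queue.put((prompt, fut))
    return await fut  # resolved by the batch processor

async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(process_batches(queue))
    results = await asyncio.gather(*(submit(queue, p) for p in ["a", "b", "c"]))
    worker.cancel()
    try:
        await worker
    except asyncio.CancelledError:
        pass
    return results

if __name__ == "__main__":
    print(asyncio.run(main()))
```

The future-per-request design lets many HTTP handlers await concurrently while a single processor owns the model, which is the usual shape when the model itself is not thread-safe.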
Usage
Use this when deploying large language models that require multi-GPU tensor parallelism on Intel XPU hardware. It exposes streaming and non-streaming REST endpoints suitable for production serving with DeepSpeed-managed distributed inference.
Code Reference
Source Location
- Repository: Intel IPEX-LLM
- File: python/llm/example/GPU/Deepspeed-AutoTP-FastAPI/serving.py
- Lines: 1-419
Signature
```python
class PromptRequest(BaseModel):
    prompt: str
    n_predict: int = 32

def load_model(model_path, low_bit):
    """Load model with IPEX-LLM optimization and DeepSpeed inference."""

async def generate_stream_gate(
    prompt: List[str],
    n_predict: int = 32,
    request_ids: list = [],
):
    """Async batched generation with streaming support."""

@app.post("/generate/")
async def generate(prompt_request: PromptRequest):
    """Non-streaming generation endpoint."""

@app.post("/generate_stream/")
async def generate_stream(prompt_request: PromptRequest):
    """Streaming generation endpoint."""
```
Import
```python
from ipex_llm import optimize_model
import deepspeed
from fastapi import FastAPI
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| prompt | str | Yes | Input text for generation |
| n_predict | int | No | Maximum tokens to generate (default: 32) |
| model_path | str (CLI) | Yes | HuggingFace model ID or local path |
| low_bit | str (CLI) | No | Quantization type (default: sym_int4) |
Outputs
| Name | Type | Description |
|---|---|---|
| /generate/ response | JSON | Complete generated text |
| /generate_stream/ response | SSE stream | Token-by-token streaming response |
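Clients consuming the `/generate_stream/` output typically filter the `data:` fields out of the SSE byte stream. A minimal parsing helper is sketched below; the function name `parse_sse_lines` and the `[DONE]` end-of-stream sentinel are assumptions for illustration, since the exact framing of the server's stream may differ.

```python
def parse_sse_lines(lines):
    """Yield payload strings from raw SSE lines (bytes or str).

    Keeps only `data:` fields and stops at a hypothetical [DONE]
    sentinel; blank keep-alive lines and comments are skipped.
    """
    for raw in lines:
        line = raw.decode() if isinstance(raw, bytes) else raw
        if not line.startswith("data:"):
            continue  # blank lines, ":" comments, other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield payload

if __name__ == "__main__":
    demo = [b"data: Hello", b"", b"data: world", b"data: [DONE]"]
    print(list(parse_sse_lines(demo)))
```

Such a helper would slot into the `response.iter_lines()` loop shown in the usage examples below.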
Usage Examples
Starting the Server
```bash
# Launch with DeepSpeed on 2 XPU devices:
deepspeed --num_gpus 2 serving.py \
  --repo-id-or-model-path "meta-llama/Llama-2-7b-chat-hf" \
  --low-bit "sym_int4" \
  --port 8000
```
Sending Requests
```python
import requests

# Non-streaming
response = requests.post(
    "http://localhost:8000/generate/",
    json={"prompt": "What is AI?", "n_predict": 64},
)
print(response.json())

# Streaming
response = requests.post(
    "http://localhost:8000/generate_stream/",
    json={"prompt": "What is AI?", "n_predict": 64},
    stream=True,
)
for chunk in response.iter_lines():
    print(chunk.decode())
```