Implementation: Intel IPEX-LLM DeepSpeed AutoTP FastAPI Serving
| Knowledge Sources | |
|---|---|
| Domains | Serving, Tensor_Parallelism, DeepSpeed |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
Concrete tool for serving LLMs via FastAPI with DeepSpeed Automatic Tensor Parallelism (AutoTP) and IPEX-LLM low-bit optimizations on Intel XPU.
Description
This serving application provides REST API endpoints for LLM inference using DeepSpeed's Automatic Tensor Parallelism for distributing model weights across multiple XPU devices. The model is first loaded on CPU, optimized with IPEX-LLM low-bit quantization, then distributed via DeepSpeed Inference. It supports both streaming and non-streaming generation with request batching through an async queue processor.
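The request-batching idea described above can be sketched with plain asyncio: requests accumulate on a queue, a background processor drains up to a batch-size limit, and each caller awaits a future that the processor resolves. This is a minimal illustration of the pattern only, not the actual serving.py code; the names `process_batches`, `submit`, and `MAX_BATCH` are hypothetical, and the generation step is mocked.

```python
import asyncio

# Hypothetical sketch of the async queue-processor batching pattern.
# The real serving.py batches prompts for the tensor-parallel model;
# here the model call is replaced by a string stand-in.
MAX_BATCH = 4

async def process_batches(queue: asyncio.Queue) -> None:
    while True:
        prompt, fut = await queue.get()
        batch = [(prompt, fut)]
        # Opportunistically drain more pending requests into the batch.
        while not queue.empty() and len(batch) < MAX_BATCH:
            batch.append(queue.get_nowait())
        # Stand-in for one batched model.generate() call.
        outputs = [f"generated<{p}>" for p, _ in batch]
        for (_, f), out in zip(batch, outputs):
            f.set_result(out)

async def submit(queue: asyncio.Queue, prompt: str) -> str:
    fut = asyncio.get_running_loop().create_future()
    await queue.put((prompt, fut))
    return await fut  # resolved by the batch processor

async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(process_batches(queue))
    results = await asyncio.gather(*(submit(queue, p) for p in ["a", "b", "c"]))
    worker.cancel()
    try:
        await worker
    except asyncio.CancelledError:
        pass
    return results

if __name__ == "__main__":
    print(asyncio.run(main()))
```

The future-per-request design lets many HTTP handlers await concurrently while a single processor owns the model, which is the usual shape when the model itself is not thread-safe.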
Usage
Use this when deploying large language models that require multi-GPU tensor parallelism on Intel XPU hardware. It exposes streaming and non-streaming REST endpoints suitable for production serving with DeepSpeed-managed distributed inference.
Code Reference
Source Location
- Repository: Intel IPEX-LLM
- File: python/llm/example/GPU/Deepspeed-AutoTP-FastAPI/serving.py
- Lines: 1-419
Signature
```python
class PromptRequest(BaseModel):
    prompt: str
    n_predict: int = 32

def load_model(model_path, low_bit):
    """Load model with IPEX-LLM optimization and DeepSpeed inference."""

async def generate_stream_gate(
    prompt: List[str],
    n_predict: int = 32,
    request_ids: list = [],
):
    """Async batched generation with streaming support."""

@app.post("/generate/")
async def generate(prompt_request: PromptRequest):
    """Non-streaming generation endpoint."""

@app.post("/generate_stream/")
async def generate_stream(prompt_request: PromptRequest):
    """Streaming generation endpoint."""
```
Import
```python
from ipex_llm import optimize_model
import deepspeed
from fastapi import FastAPI
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| prompt | str | Yes | Input text for generation |
| n_predict | int | No | Maximum tokens to generate (default: 32) |
| model_path | str (CLI) | Yes | HuggingFace model ID or local path |
| low_bit | str (CLI) | No | Quantization type (default: sym_int4) |
Outputs
| Name | Type | Description |
|---|---|---|
| /generate/ response | JSON | Complete generated text |
| /generate_stream/ response | SSE stream | Token-by-token streaming response |
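Clients consuming the `/generate_stream/` output typically filter the `data:` fields out of the SSE byte stream. A minimal parsing helper is sketched below; the function name `parse_sse_lines` and the `[DONE]` end-of-stream sentinel are assumptions for illustration, since the exact framing of the server's stream may differ.

```python
def parse_sse_lines(lines):
    """Yield payload strings from raw SSE lines (bytes or str).

    Keeps only `data:` fields and stops at a hypothetical [DONE]
    sentinel; blank keep-alive lines and comments are skipped.
    """
    for raw in lines:
        line = raw.decode() if isinstance(raw, bytes) else raw
        if not line.startswith("data:"):
            continue  # blank lines, ":" comments, other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield payload

if __name__ == "__main__":
    demo = [b"data: Hello", b"", b"data: world", b"data: [DONE]"]
    print(list(parse_sse_lines(demo)))
```

Such a helper would slot into the `response.iter_lines()` loop shown in the usage examples below.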
Usage Examples
Starting the Server
```bash
# Launch with DeepSpeed on 2 XPU devices:
deepspeed --num_gpus 2 serving.py \
  --repo-id-or-model-path "meta-llama/Llama-2-7b-chat-hf" \
  --low-bit "sym_int4" \
  --port 8000
```
Sending Requests
```python
import requests

# Non-streaming
response = requests.post(
    "http://localhost:8000/generate/",
    json={"prompt": "What is AI?", "n_predict": 64},
)
print(response.json())

# Streaming
response = requests.post(
    "http://localhost:8000/generate_stream/",
    json={"prompt": "What is AI?", "n_predict": 64},
    stream=True,
)
for chunk in response.iter_lines():
    print(chunk.decode())
```