
Implementation:Intel IPEX-LLM DeepSpeed AutoTP FastAPI Serving

From Leeroopedia


Knowledge Sources
Domains Serving, Tensor_Parallelism, DeepSpeed
Last Updated 2026-02-09 04:00 GMT

Overview

A FastAPI application that serves LLMs with DeepSpeed Automatic Tensor Parallelism (AutoTP) and IPEX-LLM low-bit optimizations on Intel XPU devices.

Description

This serving application provides REST API endpoints for LLM inference using DeepSpeed's Automatic Tensor Parallelism for distributing model weights across multiple XPU devices. The model is first loaded on CPU, optimized with IPEX-LLM low-bit quantization, then distributed via DeepSpeed Inference. It supports both streaming and non-streaming generation with request batching through an async queue processor.
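The request batching mentioned above can be sketched with a plain asyncio queue. This is a minimal illustration, not the module's actual implementation: the queue name, batch size, and poll interval are assumptions, and the model call is stood in for by a caller-supplied function.

```python
import asyncio

MAX_BATCH = 4          # assumed maximum batch size
POLL_INTERVAL = 0.01   # assumed queue-poll interval (seconds)

request_queue: asyncio.Queue = asyncio.Queue()

async def process_queue(generate_fn):
    """Drain up to MAX_BATCH pending requests and run them as one batch."""
    while True:
        batch = []
        while not request_queue.empty() and len(batch) < MAX_BATCH:
            batch.append(await request_queue.get())
        if batch:
            prompts = [prompt for prompt, _ in batch]
            outputs = generate_fn(prompts)  # one batched forward pass
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)
        await asyncio.sleep(POLL_INTERVAL)

async def submit(prompt: str) -> str:
    """Enqueue a prompt and wait for its batched result."""
    fut = asyncio.get_running_loop().create_future()
    await request_queue.put((prompt, fut))
    return await fut
```

Each HTTP handler would call `submit()` and await its future, so concurrent requests that arrive within one poll interval share a single forward pass.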

Usage

Use this when deploying large language models that require multi-GPU tensor parallelism on Intel XPU hardware. It provides OpenAI-compatible streaming and non-streaming endpoints suitable for production serving with DeepSpeed-managed distributed inference.
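The load path from the Description (CPU load, IPEX-LLM low-bit optimization, DeepSpeed AutoTP distribution) can be sketched as below. This is a sketch under assumptions, not the module's exact code: `deepspeed.init_inference` argument names vary across DeepSpeed versions, and the environment-variable handling is illustrative.

```python
import os

def load_model(model_path: str, low_bit: str = "sym_int4"):
    """Sketch: load on CPU, apply IPEX-LLM low-bit optimization, then shard
    with DeepSpeed AutoTP and move this rank's shard to its XPU device.
    Requires torch, transformers, deepspeed, and ipex_llm at call time."""
    import torch
    import deepspeed
    from transformers import AutoModelForCausalLM
    from ipex_llm import optimize_model

    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))

    # 1. Load the full model on CPU to avoid exhausting a single device.
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True
    )

    # 2. IPEX-LLM low-bit quantization (e.g. sym_int4).
    model = optimize_model(model, low_bit=low_bit)

    # 3. DeepSpeed Automatic Tensor Parallelism: shard weights across ranks.
    model = deepspeed.init_inference(
        model,
        tensor_parallel={"tp_size": world_size},  # arg name varies by version
        replace_with_kernel_inject=False,         # AutoTP path, no kernel injection
    )

    # 4. Move this rank's shard to its XPU device.
    return model.module.to(f"xpu:{local_rank}")
```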

Code Reference

Source Location

Signature

def load_model(model_path, low_bit):
    """Load model with IPEX-LLM optimization and DeepSpeed inference."""

async def generate_stream_gate(
    prompt: List[str],
    n_predict: int = 32,
    request_ids: list = [],
):
    """Async batched generation with streaming support."""

@app.post("/generate/")
async def generate(prompt_request: PromptRequest):
    """Non-streaming generation endpoint."""

@app.post("/generate_stream/")
async def generate_stream(prompt_request: PromptRequest):
    """Streaming generation endpoint."""

class PromptRequest(BaseModel):
    prompt: str
    n_predict: int = 32

Import

from ipex_llm import optimize_model
import deepspeed
from fastapi import FastAPI

I/O Contract

Inputs

Name         Type       Required  Description
prompt       str        Yes       Input text for generation
n_predict    int        No        Maximum tokens to generate (default: 32)
model_path   str (CLI)  Yes       HuggingFace model ID or local path
low_bit      str (CLI)  No        Quantization type (default: sym_int4)
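The CLI inputs above would typically be parsed with argparse; a minimal sketch using the flag names from the launch example in this page (the defaults shown are assumptions):

```python
import argparse

def parse_args(argv=None):
    """Parse the serving script's CLI flags (defaults here are assumptions)."""
    parser = argparse.ArgumentParser(
        description="IPEX-LLM + DeepSpeed AutoTP FastAPI server"
    )
    parser.add_argument("--repo-id-or-model-path", type=str, required=True,
                        help="HuggingFace model ID or local path")
    parser.add_argument("--low-bit", type=str, default="sym_int4",
                        help="IPEX-LLM quantization type, e.g. sym_int4")
    parser.add_argument("--port", type=int, default=8000,
                        help="Port for the FastAPI server")
    return parser.parse_args(argv)
```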

Outputs

Endpoint                    Type        Description
/generate/ response         JSON        Complete generated text
/generate_stream/ response  SSE stream  Token-by-token streaming response
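The streaming endpoint emits Server-Sent Events, where each message is a `data:` line followed by a blank line. A minimal framing sketch (the `token` field name and the OpenAI-style `[DONE]` sentinel are assumptions about the payload schema, not confirmed from the source):

```python
import json

def sse_event(token: str) -> str:
    """Frame one generated token as a Server-Sent Events message."""
    return f"data: {json.dumps({'token': token})}\n\n"

def sse_done() -> str:
    """Terminal sentinel, mirroring the OpenAI-style '[DONE]' marker."""
    return "data: [DONE]\n\n"
```

Clients read the stream line by line, strip the `data: ` prefix, and stop at the sentinel.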

Usage Examples

Starting the Server

# Launch with DeepSpeed on 2 XPU devices:
deepspeed --num_gpus 2 serving.py \
    --repo-id-or-model-path "meta-llama/Llama-2-7b-chat-hf" \
    --low-bit "sym_int4" \
    --port 8000

Sending Requests

import requests

# Non-streaming
response = requests.post(
    "http://localhost:8000/generate/",
    json={"prompt": "What is AI?", "n_predict": 64}
)
print(response.json())

# Streaming
response = requests.post(
    "http://localhost:8000/generate_stream/",
    json={"prompt": "What is AI?", "n_predict": 64},
    stream=True,
)
for chunk in response.iter_lines():
    print(chunk.decode())
