Implementation:Intel Ipex llm Pipeline Parallel Serving
| Knowledge Sources | |
|---|---|
| Domains | Serving, Pipeline_Parallelism, Distributed_Inference |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
Concrete tool for distributed FastAPI serving using pipeline parallelism with IPEX-LLM's PPModelWorker across multiple Intel XPU devices.
Description
This script sets up a distributed serving application where rank 0 runs a FastAPI HTTP server using FastApp and PPModelWorker, while other ranks process model computation. It uses init_pipeline_parallel for distributed setup and distributes the model across GPUs with pipeline_parallel_stages. The PPModelWorker handles batched request processing with configurable sequence limits.
Usage
Use this for production serving of models that require multi-GPU pipeline parallelism. It provides OpenAI-compatible endpoints with built-in batching and is suitable for models too large for a single GPU.
Code Reference
Source Location
- Repository: Intel IPEX-LLM
- File: python/llm/example/GPU/Pipeline-Parallel-Serving/pipeline_serving.py
- Lines: 1-78
Signature
async def main():
"""Async main function setting up distributed FastAPI server."""
# Key API:
init_pipeline_parallel()
worker = PPModelWorker(
model_path, low_bit,
pipeline_parallel_stages=world_size,
max_num_seqs=args.max_num_seqs,
max_prefilled_seqs=args.max_prefilled_seqs,
)
app = FastApp(worker)
Import
from ipex_llm.transformers import init_pipeline_parallel, PPModelWorker
from ipex_llm.serving.fastapi import FastApp
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| repo-id-or-model-path | str | Yes | HuggingFace model ID or local path |
| low-bit | str | No | Quantization type (default: sym_int4) |
| port | int | No | Server port (default: 8000) |
| max-num-seqs | int | No | Maximum batch sequences (default: 8) |
| max-prefilled-seqs | int | No | Maximum prefilled sequences (default: 0) |
Outputs
| Name | Type | Description |
|---|---|---|
| REST API | HTTP endpoints | OpenAI-compatible generation endpoints on rank 0 |
Usage Examples
Distributed Serving
python -m torch.distributed.run --nproc_per_node 2 \
pipeline_serving.py \
--repo-id-or-model-path "meta-llama/Llama-2-7b-chat-hf" \
--low-bit "sym_int4" \
--port 8000 \
--max-num-seqs 8