Implementation:Intel Ipex llm Pipeline Parallel Serving

Knowledge Sources	Intel IPEX-LLM
Domains	Serving, Pipeline_Parallelism, Distributed_Inference
Last Updated	2026-02-09 04:00 GMT

Overview

Concrete tool for distributed FastAPI serving using pipeline parallelism with IPEX-LLM's PPModelWorker across multiple Intel XPU devices.

Description

This script sets up a distributed serving application where rank 0 runs a FastAPI HTTP server using FastApp and PPModelWorker, while other ranks process model computation. It uses init_pipeline_parallel for distributed setup and distributes the model across GPUs with pipeline_parallel_stages. The PPModelWorker handles batched request processing with configurable sequence limits.

Usage

Use this for production serving of models that require multi-GPU pipeline parallelism. It provides OpenAI-compatible endpoints with built-in batching and is suitable for models too large for a single GPU.

Code Reference

Source Location

Repository: Intel IPEX-LLM
File: python/llm/example/GPU/Pipeline-Parallel-Serving/pipeline_serving.py
Lines: 1-78

Signature

async def main():
    """Async main function setting up distributed FastAPI server."""

# Key API:
init_pipeline_parallel()
worker = PPModelWorker(
    model_path, low_bit,
    pipeline_parallel_stages=world_size,
    max_num_seqs=args.max_num_seqs,
    max_prefilled_seqs=args.max_prefilled_seqs,
)
app = FastApp(worker)

Import

from ipex_llm.transformers import init_pipeline_parallel, PPModelWorker
from ipex_llm.serving.fastapi import FastApp

I/O Contract

Inputs

Name	Type	Required	Description
repo-id-or-model-path	str	Yes	HuggingFace model ID or local path
low-bit	str	No	Quantization type (default: sym_int4)
port	int	No	Server port (default: 8000)
max-num-seqs	int	No	Maximum batch sequences (default: 8)
max-prefilled-seqs	int	No	Maximum prefilled sequences (default: 0)

Outputs

Name	Type	Description
REST API	HTTP endpoints	OpenAI-compatible generation endpoints on rank 0

Usage Examples

Distributed Serving

python -m torch.distributed.run --nproc_per_node 2 \
    pipeline_serving.py \
    --repo-id-or-model-path "meta-llama/Llama-2-7b-chat-hf" \
    --low-bit "sym_int4" \
    --port 8000 \
    --max-num-seqs 8

Related Pages

Environment:Intel_Ipex_llm_Pipeline_Parallel_Environment

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment