Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Intel Ipex llm Pipeline Parallel Serving

From Leeroopedia


Knowledge Sources
Domains Serving, Pipeline_Parallelism, Distributed_Inference
Last Updated 2026-02-09 04:00 GMT

Overview

Concrete tool for distributed FastAPI serving using pipeline parallelism with IPEX-LLM's PPModelWorker across multiple Intel XPU devices.

Description

This script sets up a distributed serving application where rank 0 runs a FastAPI HTTP server using FastApp and PPModelWorker, while other ranks process model computation. It uses init_pipeline_parallel for distributed setup and distributes the model across GPUs with pipeline_parallel_stages. The PPModelWorker handles batched request processing with configurable sequence limits.

Usage

Use this for production serving of models that require multi-GPU pipeline parallelism. It provides OpenAI-compatible endpoints with built-in batching and is suitable for models too large for a single GPU.

Code Reference

Source Location

Signature

async def main():
    """Async main function setting up distributed FastAPI server."""

# Key API:
init_pipeline_parallel()
worker = PPModelWorker(
    model_path, low_bit,
    pipeline_parallel_stages=world_size,
    max_num_seqs=args.max_num_seqs,
    max_prefilled_seqs=args.max_prefilled_seqs,
)
app = FastApp(worker)

Import

from ipex_llm.transformers import init_pipeline_parallel, PPModelWorker
from ipex_llm.serving.fastapi import FastApp

I/O Contract

Inputs

Name Type Required Description
repo-id-or-model-path str Yes HuggingFace model ID or local path
low-bit str No Quantization type (default: sym_int4)
port int No Server port (default: 8000)
max-num-seqs int No Maximum batch sequences (default: 8)
max-prefilled-seqs int No Maximum prefilled sequences (default: 0)

Outputs

Name Type Description
REST API HTTP endpoints OpenAI-compatible generation endpoints on rank 0

Usage Examples

Distributed Serving

python -m torch.distributed.run --nproc_per_node 2 \
    pipeline_serving.py \
    --repo-id-or-model-path "meta-llama/Llama-2-7b-chat-hf" \
    --low-bit "sym_int4" \
    --port 8000 \
    --max-num-seqs 8

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment