Implementation: Intel IPEX-LLM Lightweight Serving
| Knowledge Sources | Details |
|---|---|
| Domains | Serving, FastAPI, REST_API |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
Concrete tool for lightweight FastAPI-based LLM serving using IPEX-LLM's built-in FastApp and ModelWorker components.
Description
This script provides a minimal FastAPI serving application built on IPEX-LLM's FastApp and ModelWorker classes. It loads a model with low-bit quantization, wraps it in a ModelWorker, and serves it through FastApp with OpenAI-compatible REST endpoints. Audio models such as Whisper are also supported, with automatic audio processor detection.
Usage
Use this for quick deployment of a single LLM as a REST API endpoint with minimal configuration. It is a simpler alternative to the full vLLM or DeepSpeed serving stacks when multi-GPU execution or advanced batching is not required.
Code Reference
Source Location
- Repository: Intel IPEX-LLM
- File: python/llm/example/GPU/Lightweight-Serving/lightweight_serving.py
- Lines: 1-60
Signature
```python
async def main():
    """Async main function setting up the FastAPI server."""
    # Key API:
    worker = ModelWorker(model_path, low_bit, tokenizer=tokenizer)
    app = FastApp(worker)
```
Import
```python
from ipex_llm.serving.fastapi import FastApp, ModelWorker
from transformers import AutoTokenizer
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| repo-id-or-model-path | str | Yes | HuggingFace model ID or local path |
| low-bit | str | No | Quantization type (default: sym_int4) |
| port | int | No | Server port (default: 8000) |
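The inputs above map directly onto command-line flags. A minimal argparse sketch of that CLI surface (flag names and defaults are taken from this table, not from the upstream source, so treat them as assumptions):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Sketch of the CLI described in the Inputs table; flag names and
    # defaults mirror the table, not the actual lightweight_serving.py.
    parser = argparse.ArgumentParser(description="Lightweight IPEX-LLM serving")
    parser.add_argument("--repo-id-or-model-path", type=str, required=True,
                        help="HuggingFace model ID or local path")
    parser.add_argument("--low-bit", type=str, default="sym_int4",
                        help="Quantization type (e.g. sym_int4)")
    parser.add_argument("--port", type=int, default=8000,
                        help="Server port")
    return parser

# Only the required flag is passed; the other two fall back to defaults.
args = build_parser().parse_args(
    ["--repo-id-or-model-path", "meta-llama/Llama-2-7b-chat-hf"]
)
print(args.low_bit, args.port)  # sym_int4 8000
```

Note that argparse exposes `--repo-id-or-model-path` as `args.repo_id_or_model_path` (dashes become underscores).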
Outputs
| Name | Type | Description |
|---|---|---|
| REST API | HTTP endpoints | OpenAI-compatible text generation endpoints |
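Because the endpoints are OpenAI-compatible, responses can be consumed with only the standard library. The JSON shape below is the generic OpenAI chat-completions format and is an assumption; the exact fields FastApp returns may differ:

```python
import json

# Example response body in the generic OpenAI chat-completions shape
# (assumed format, not taken from the IPEX-LLM source).
raw = json.dumps({
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "choices": [
        {"index": 0,
         "message": {"role": "assistant", "content": "Hello!"},
         "finish_reason": "stop"}
    ],
})

response = json.loads(raw)
reply = response["choices"][0]["message"]["content"]
print(reply)  # Hello!
```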
Usage Examples
Start Lightweight Server
```bash
python lightweight_serving.py \
  --repo-id-or-model-path "meta-llama/Llama-2-7b-chat-hf" \
  --low-bit "sym_int4" \
  --port 8000
```
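Once the server is up, it can be queried like any OpenAI-compatible endpoint. A stdlib-only client sketch; the `/v1/chat/completions` path and payload fields follow the OpenAI convention and are assumptions, not confirmed against the IPEX-LLM source:

```python
import json
import urllib.request

# Build an OpenAI-style chat request for the locally started server.
# The /v1/chat/completions path is the OpenAI convention and is assumed
# to be what FastApp exposes; verify against your running server.
payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "What is IPEX-LLM?"}],
    "max_tokens": 128,
}
request = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(request) would send it once the server is running;
# here we only inspect the prepared request.
print(request.full_url, request.get_method())
```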