Implementation: Intel IPEX-LLM Lightweight Serving
| Knowledge Sources | Details |
|---|---|
| Domains | Serving, FastAPI, REST_API |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
Concrete tool for lightweight FastAPI-based LLM serving using IPEX-LLM's built-in FastApp and ModelWorker components.
Description
This script provides a minimal FastAPI serving application built on IPEX-LLM's FastApp and ModelWorker classes. It loads a model with low-bit quantization, wraps it in a ModelWorker, and serves it through FastApp with OpenAI-compatible REST endpoints. Audio models such as Whisper are also supported, with automatic audio processor detection.
Usage
Use this for quick deployment of a single LLM as a REST API endpoint with minimal configuration. It is a simpler alternative to the full vLLM or DeepSpeed serving stacks when multi-GPU execution or advanced batching is not required.
Code Reference
Source Location
- Repository: Intel IPEX-LLM
- File: python/llm/example/GPU/Lightweight-Serving/lightweight_serving.py
- Lines: 1-60
Signature
```python
async def main():
    """Async main function setting up the FastAPI server."""
    # Key API:
    worker = ModelWorker(model_path, low_bit, tokenizer=tokenizer)
    app = FastApp(worker)
```
Import
```python
from ipex_llm.serving.fastapi import FastApp, ModelWorker
from transformers import AutoTokenizer
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| repo-id-or-model-path | str | Yes | HuggingFace model ID or local path |
| low-bit | str | No | Quantization type (default: sym_int4) |
| port | int | No | Server port (default: 8000) |
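The inputs above map directly onto command-line flags. A minimal argparse sketch of that CLI surface (flag names and defaults are taken from this table, not from the upstream source, so treat them as assumptions):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Sketch of the CLI described in the Inputs table; flag names and
    # defaults mirror the table, not the actual lightweight_serving.py.
    parser = argparse.ArgumentParser(description="Lightweight IPEX-LLM serving")
    parser.add_argument("--repo-id-or-model-path", type=str, required=True,
                        help="HuggingFace model ID or local path")
    parser.add_argument("--low-bit", type=str, default="sym_int4",
                        help="Quantization type (e.g. sym_int4)")
    parser.add_argument("--port", type=int, default=8000,
                        help="Server port")
    return parser

# Only the required flag is passed; the other two fall back to defaults.
args = build_parser().parse_args(
    ["--repo-id-or-model-path", "meta-llama/Llama-2-7b-chat-hf"]
)
print(args.low_bit, args.port)  # sym_int4 8000
```

Note that argparse exposes `--repo-id-or-model-path` as `args.repo_id_or_model_path` (dashes become underscores).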
Outputs
| Name | Type | Description |
|---|---|---|
| REST API | HTTP endpoints | OpenAI-compatible text generation endpoints |
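Because the endpoints are OpenAI-compatible, responses can be consumed with only the standard library. The JSON shape below is the generic OpenAI chat-completions format and is an assumption; the exact fields FastApp returns may differ:

```python
import json

# Example response body in the generic OpenAI chat-completions shape
# (assumed format, not taken from the IPEX-LLM source).
raw = json.dumps({
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "choices": [
        {"index": 0,
         "message": {"role": "assistant", "content": "Hello!"},
         "finish_reason": "stop"}
    ],
})

response = json.loads(raw)
reply = response["choices"][0]["message"]["content"]
print(reply)  # Hello!
```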
Usage Examples
Start Lightweight Server
```bash
python lightweight_serving.py \
  --repo-id-or-model-path "meta-llama/Llama-2-7b-chat-hf" \
  --low-bit "sym_int4" \
  --port 8000
```
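Once the server is up, it can be queried like any OpenAI-compatible endpoint. A stdlib-only client sketch; the `/v1/chat/completions` path and payload fields follow the OpenAI convention and are assumptions, not confirmed against the IPEX-LLM source:

```python
import json
import urllib.request

# Build an OpenAI-style chat request for the locally started server.
# The /v1/chat/completions path is the OpenAI convention and is assumed
# to be what FastApp exposes; verify against your running server.
payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "What is IPEX-LLM?"}],
    "max_tokens": 128,
}
request = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(request) would send it once the server is running;
# here we only inspect the prepared request.
print(request.full_url, request.get_method())
```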