
Implementation:Intel Ipex llm Lightweight Serving

From Leeroopedia


Knowledge Sources
Domains Serving, FastAPI, REST_API
Last Updated 2026-02-09 04:00 GMT

Overview

A concrete tool for lightweight FastAPI-based LLM serving, built on IPEX-LLM's FastApp and ModelWorker components.

Description

This script provides a minimal FastAPI serving application using IPEX-LLM's FastApp and ModelWorker classes. It loads a model with low-bit quantization, wraps it in a ModelWorker, and serves it via FastApp with OpenAI-compatible REST endpoints. It also supports audio models (Whisper) with automatic audio processor detection.

Usage

Use this for quick deployment of a single LLM model as a REST API endpoint with minimal configuration. It provides a simpler alternative to the full vLLM or DeepSpeed serving stacks when multi-GPU or advanced batching is not required.

Code Reference

Source Location

Signature

async def main():
    """Async main function setting up FastAPI server."""

# Key API:
worker = ModelWorker(model_path, low_bit, tokenizer=tokenizer)
app = FastApp(worker)

Import

from ipex_llm.serving.fastapi import FastApp, ModelWorker
from transformers import AutoTokenizer
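Putting the signature and imports together, a minimal sketch of the entrypoint. Only the `ModelWorker(model_path, low_bit, tokenizer=tokenizer)` and `FastApp(worker)` calls come from the reference above; the uvicorn handoff, the `getattr` fallback for the ASGI app attribute, and the example model path are assumptions, not the script's actual implementation:

```python
async def main():
    # Third-party imports are kept inside the function so the sketch can be
    # defined and read without ipex-llm installed; values mirror the CLI
    # defaults documented in the I/O Contract below.
    import uvicorn
    from transformers import AutoTokenizer
    from ipex_llm.serving.fastapi import FastApp, ModelWorker

    model_path = "meta-llama/Llama-2-7b-chat-hf"  # --repo-id-or-model-path
    low_bit = "sym_int4"                          # --low-bit
    port = 8000                                   # --port

    # Load the tokenizer, wrap the quantized model in a worker, and build
    # the OpenAI-compatible FastAPI app around it (per the Key API above).
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    worker = ModelWorker(model_path, low_bit, tokenizer=tokenizer)
    app = FastApp(worker)

    # Assumption: FastApp exposes (or is) an ASGI app that uvicorn can serve;
    # the exact attribute may differ across ipex-llm versions.
    config = uvicorn.Config(app=getattr(app, "app", app), host="0.0.0.0", port=port)
    await uvicorn.Server(config).serve()

# To launch (blocks until the server stops): import asyncio; asyncio.run(main())
```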

I/O Contract

Inputs

Name Type Required Description
repo-id-or-model-path str Yes HuggingFace model ID or local path
low-bit str No Quantization type (default: sym_int4)
port int No Server port (default: 8000)
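The inputs above map directly onto command-line flags. A minimal argparse sketch of that CLI, with the flag names and defaults (`--low-bit sym_int4`, `--port 8000`) taken from the table; the example model ID is illustrative only:

```python
import argparse

# CLI mirroring the I/O contract: one required model flag, two optional flags.
parser = argparse.ArgumentParser(description="IPEX-LLM lightweight serving")
parser.add_argument("--repo-id-or-model-path", type=str, required=True,
                    help="HuggingFace model ID or local path")
parser.add_argument("--low-bit", type=str, default="sym_int4",
                    help="Low-bit quantization type")
parser.add_argument("--port", type=int, default=8000,
                    help="Server port")

# Parse an example invocation; argparse turns dashes into underscores,
# so --repo-id-or-model-path becomes args.repo_id_or_model_path.
args = parser.parse_args(["--repo-id-or-model-path",
                          "meta-llama/Llama-2-7b-chat-hf"])
print(args.repo_id_or_model_path, args.low_bit, args.port)
```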

Outputs

Name Type Description
REST API HTTP endpoints OpenAI-compatible text generation endpoints
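Since the output is an OpenAI-compatible REST API, any OpenAI-style client can query it. A stdlib-only client sketch; the `/v1/chat/completions` route is the standard OpenAI-compatible path and is assumed here rather than taken from the source:

```python
import json
from urllib import request

# OpenAI-style chat completion request body.
payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}

def query(base_url="http://localhost:8000"):
    # POST the JSON payload to the (assumed) chat completions route.
    req = request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# query() needs a running server; here we only round-trip the payload
# through JSON to validate its shape.
body = json.loads(json.dumps(payload))
```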

Usage Examples

Start Lightweight Server

python lightweight_serving.py \
    --repo-id-or-model-path "meta-llama/Llama-2-7b-chat-hf" \
    --low-bit "sym_int4" \
    --port 8000
