Implementation: Intel IPEX-LLM Portable Chat
| Knowledge Sources | |
|---|---|
| Domains | Inference, Chat_Interface, Streaming |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
Concrete tool for interactive multi-model chat with streaming token generation and KV-cache management provided by the IPEX-LLM portable distribution.
Description
This chat interface supports multiple LLM architectures (Llama, Qwen, ChatGLM3, Yi, Baichuan) with model-specific prompt templates and streaming token generation. It uses IPEX-LLM's optimize_model for low-bit quantization and a custom StartRecentKVCache for bounded memory usage during long conversations. The interface provides colored console output and handles model-specific chat formats including system prompts and conversation history.
Usage
Use this as a standalone interactive chat application when running IPEX-LLM optimized models from the portable ZIP distribution. It automatically detects the model architecture and applies the appropriate chat template and streaming strategy.
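The bounded-memory behavior described above follows a start-plus-recent eviction policy. Below is a minimal sketch of that policy on plain token lists; this is a simplification for illustration only, since the real StartRecentKVCache in kv_cache.py operates on per-layer key/value tensors rather than token lists.

```python
class StartRecentCacheSketch:
    """Illustrative start+recent eviction policy: keep the first
    `start_size` tokens (the "attention sink" prefix) and the most
    recent `recent_size` tokens, dropping everything in between.
    Simplified stand-in for the tensor-based StartRecentKVCache."""

    def __init__(self, start_size=4, recent_size=512):
        self.start_size = start_size
        self.recent_size = recent_size

    def evict(self, tokens):
        # Under the combined budget, nothing is evicted.
        if len(tokens) <= self.start_size + self.recent_size:
            return tokens
        # Keep the initial sink tokens plus the most recent window,
        # so cache size stays bounded during long conversations.
        return tokens[: self.start_size] + tokens[-self.recent_size:]
```

The defaults here mirror the CLI flags documented below (`--kv-cache-start-size 4`, `--kv-cache-recent-size 512`).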
Code Reference
Source Location
- Repository: Intel IPEX-LLM
- File: python/llm/portable-zip/chat.py
- Lines: 1-360
Signature
def greedy_generate(model, tokenizer, input_ids, past_key_values, max_gen_len, stop_words=[]):
"""Greedy token generation with stop words checking."""
def stream_chat(model, tokenizer, kv_cache, prompt, history):
"""Interactive streaming chat for generic models."""
def chatglm3_stream_chat(model, tokenizer, kv_cache, prompt, history):
"""ChatGLM3-specific streaming chat with stopping criteria."""
def qwen_stream_chat(model, tokenizer, kv_cache, prompt, history):
"""Qwen model streaming chat."""
def llama_stream_chat(model, tokenizer, kv_cache, prompt, history):
"""Llama model streaming chat."""
def auto_select_model(model_path):
"""Load model using AutoModelForCausalLM or AutoModel fallback."""
Import
from ipex_llm import optimize_model
from kv_cache import StartRecentKVCache
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model-path | str | Yes | Path to the local model directory |
| chat-format | str | No | Chat format: chatglm3, qwen, llama, yi, baichuan, or auto |
| n-predict | int | No | Max tokens per response (default: 512) |
| kv-cache-start-size | int | No | KV cache initial tokens (default: 4) |
| kv-cache-recent-size | int | No | KV cache recent tokens (default: 512) |
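The inputs table maps directly onto a command-line surface. The following argparse sketch is an illustrative reconstruction of that surface from the table; the real chat.py may wire its arguments differently.

```python
import argparse

def build_parser():
    """Hypothetical CLI parser matching the documented inputs table."""
    p = argparse.ArgumentParser(
        description="IPEX-LLM portable chat (sketch)")
    p.add_argument("--model-path", required=True,
                   help="Path to the local model directory")
    p.add_argument("--chat-format", default="auto",
                   choices=["chatglm3", "qwen", "llama", "yi",
                            "baichuan", "auto"],
                   help="Model-specific chat template to apply")
    p.add_argument("--n-predict", type=int, default=512,
                   help="Max tokens per response")
    p.add_argument("--kv-cache-start-size", type=int, default=4,
                   help="KV cache initial (sink) tokens")
    p.add_argument("--kv-cache-recent-size", type=int, default=512,
                   help="KV cache recent-window tokens")
    return p
```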
Outputs
| Name | Type | Description |
|---|---|---|
| Streaming text | Console | Colored streaming token-by-token output |
| Conversation history | In-memory | Multi-turn conversation context |
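The colored console output listed above is typically produced with ANSI escape codes. A minimal helper in that style is sketched below; the actual color scheme used by chat.py is an assumption here.

```python
def colorize(text, color="cyan"):
    """Wrap text in ANSI escape codes, as a console chat UI might do
    to distinguish the user prompt from streamed model output
    (illustrative; chat.py's actual colors may differ)."""
    codes = {"red": 31, "green": 32, "yellow": 33, "blue": 34, "cyan": 36}
    return f"\033[{codes[color]}m{text}\033[0m"
```

During streaming, each decoded token would be printed with `print(colorize(tok), end="", flush=True)` so output appears token by token.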
Usage Examples
Interactive Chat
python chat.py \
--model-path "./models/Llama-2-7b-chat" \
--chat-format llama \
--n-predict 512
Auto-detect Model Format
python chat.py --model-path "./models/Qwen-7B-Chat"
# Automatically detects Qwen format and applies appropriate template