
Implementation: Intel IPEX-LLM Portable Chat

From Leeroopedia


Knowledge Sources
Domains Inference, Chat_Interface, Streaming
Last Updated 2026-02-09 04:00 GMT

Overview

Concrete tool, provided by the IPEX-LLM portable distribution, for interactive multi-model chat with streaming token generation and bounded KV-cache management.

Description

This chat interface supports multiple LLM architectures (Llama, Qwen, ChatGLM3, Yi, Baichuan) with model-specific prompt templates and streaming token generation. It uses IPEX-LLM's optimize_model for low-bit quantization and a custom StartRecentKVCache for bounded memory usage during long conversations. The interface provides colored console output and handles model-specific chat formats including system prompts and conversation history.
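The bounded-memory behavior described above follows the attention-sink pattern: keep a few initial tokens plus a sliding window of recent tokens, evicting the middle. The real StartRecentKVCache operates on per-layer key/value tensors; the following is a minimal pure-Python sketch of the eviction policy only, using the defaults from the I/O contract below.

```python
def bound_kv_cache(cache, start_size=4, recent_size=512):
    """Keep the first `start_size` entries (attention sinks) plus the
    `recent_size` most recent entries; evict everything in between.
    Pure-Python sketch of the StartRecentKVCache eviction policy."""
    if len(cache) <= start_size + recent_size:
        return cache  # still within budget, nothing to evict
    return cache[:start_size] + cache[-recent_size:]

# With the defaults, a 1000-entry cache is bounded to 4 + 512 = 516 entries,
# and the four "sink" entries at the front are always preserved.
```

This keeps per-turn memory constant regardless of conversation length, at the cost of forgetting mid-conversation context outside the recent window.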

Usage

Use this as a standalone interactive chat application when running IPEX-LLM optimized models from the portable ZIP distribution. It automatically detects the model architecture and applies the appropriate chat template and streaming strategy.

Code Reference

Source Location

Signature

def greedy_generate(model, tokenizer, input_ids, past_key_values, max_gen_len, stop_words=[]):
    """Greedy token generation with stop words checking."""

def stream_chat(model, tokenizer, kv_cache, prompt, history):
    """Interactive streaming chat for generic models."""

def chatglm3_stream_chat(model, tokenizer, kv_cache, prompt, history):
    """ChatGLM3-specific streaming chat with stopping criteria."""

def qwen_stream_chat(model, tokenizer, kv_cache, prompt, history):
    """Qwen model streaming chat."""

def llama_stream_chat(model, tokenizer, kv_cache, prompt, history):
    """Llama model streaming chat."""

def auto_select_model(model_path):
    """Load model using AutoModelForCausalLM or AutoModel fallback."""

Import

from ipex_llm import optimize_model
from kv_cache import StartRecentKVCache
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
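The stop-word checking that greedy_generate performs during decoding can be sketched as a suffix test against the accumulated text. This is an illustrative helper, not the script's actual implementation; the example stop tokens are assumptions.

```python
def ends_with_stop_word(text, stop_words):
    """Return the matched stop word if `text` ends with one, else None.
    Illustrative sketch of the stop-word check in greedy_generate."""
    for word in stop_words:
        if word and text.endswith(word):
            return word
    return None
```

Checking the decoded text rather than raw token ids handles stop sequences that span multiple tokens.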

I/O Contract

Inputs

Name                  Type  Required  Description
model-path            str   Yes       Path to the local model directory
chat-format           str   No        Chat format: chatglm3, qwen, llama, yi, baichuan, or auto
n-predict             int   No        Max tokens per response (default: 512)
kv-cache-start-size   int   No        KV cache initial tokens (default: 4)
kv-cache-recent-size  int   No        KV cache recent tokens (default: 512)

Outputs

Name                  Type       Description
Streaming text        Console    Colored streaming token-by-token output
Conversation history  In-memory  Multi-turn conversation context
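The input contract above maps directly onto an argparse command line. A hedged sketch of such a parser, with flag names taken from the contract and usage examples; the `auto` default for --chat-format is an assumption based on the auto-detection behavior described earlier.

```python
import argparse

def build_parser():
    """Sketch of a CLI matching the documented input contract;
    defaults follow the inputs table above."""
    p = argparse.ArgumentParser(description="IPEX-LLM portable chat")
    p.add_argument("--model-path", type=str, required=True,
                   help="Path to the local model directory")
    p.add_argument("--chat-format", type=str, default="auto",
                   choices=["chatglm3", "qwen", "llama", "yi", "baichuan", "auto"],
                   help="Model chat format (default assumed: auto)")
    p.add_argument("--n-predict", type=int, default=512,
                   help="Max tokens per response")
    p.add_argument("--kv-cache-start-size", type=int, default=4,
                   help="KV cache initial (sink) tokens")
    p.add_argument("--kv-cache-recent-size", type=int, default=512,
                   help="KV cache recent-window tokens")
    return p
```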

Usage Examples

Interactive Chat

python chat.py \
    --model-path "./models/Llama-2-7b-chat" \
    --chat-format llama \
    --n-predict 512

Auto-detect Model Format

python chat.py --model-path "./models/Qwen-7B-Chat"
# Automatically detects Qwen format and applies appropriate template
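Auto-detection as in the example above amounts to mapping the model's architecture identifier (e.g. the `model_type` field in a transformers config) to one of the supported chat formats. The mapping below is an illustrative sketch, not the script's actual detection logic; the generic fallback to the llama format is an assumption.

```python
def detect_chat_format(model_type):
    """Map a transformers config `model_type` string to a chat format.
    Illustrative mapping only; the real detection logic may differ."""
    mapping = {
        "chatglm": "chatglm3",
        "qwen": "qwen",
        "llama": "llama",
        "yi": "yi",
        "baichuan": "baichuan",
    }
    lowered = model_type.lower()
    for key, fmt in mapping.items():
        if key in lowered:
            return fmt
    return "llama"  # assumed generic fallback
```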
