
Implementation: Intel IPEX-LLM Portable Chat

From Leeroopedia


Knowledge Sources
Domains Inference, Chat_Interface, Streaming
Last Updated 2026-02-09 04:00 GMT

Overview

Concrete tool, provided by the IPEX-LLM portable distribution, for interactive multi-model chat with streaming token generation and bounded KV-cache management.

Description

This chat interface supports multiple LLM architectures (Llama, Qwen, ChatGLM3, Yi, Baichuan) with model-specific prompt templates and streaming token generation. It uses IPEX-LLM's optimize_model for low-bit quantization and a custom StartRecentKVCache for bounded memory usage during long conversations. The interface provides colored console output and handles model-specific chat formats including system prompts and conversation history.
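The bounded-memory behavior described above follows the attention-sink pattern: keep a few initial tokens plus a sliding window of recent tokens, evicting the middle. The real StartRecentKVCache operates on per-layer key/value tensors; the following is a minimal pure-Python sketch of the eviction policy only, using the defaults from the I/O contract below.

```python
def bound_kv_cache(cache, start_size=4, recent_size=512):
    """Keep the first `start_size` entries (attention sinks) plus the
    `recent_size` most recent entries; evict everything in between.
    Pure-Python sketch of the StartRecentKVCache eviction policy."""
    if len(cache) <= start_size + recent_size:
        return cache  # still within budget, nothing to evict
    return cache[:start_size] + cache[-recent_size:]

# With the defaults, a 1000-entry cache is bounded to 4 + 512 = 516 entries,
# and the four "sink" entries at the front are always preserved.
```

This keeps per-turn memory constant regardless of conversation length, at the cost of forgetting mid-conversation context outside the recent window.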

Usage

Use this as a standalone interactive chat application when running IPEX-LLM optimized models from the portable ZIP distribution. It automatically detects the model architecture and applies the appropriate chat template and streaming strategy.

Code Reference

Source Location

Signature

def greedy_generate(model, tokenizer, input_ids, past_key_values, max_gen_len, stop_words=[]):
    """Greedy token generation with stop words checking."""

def stream_chat(model, tokenizer, kv_cache, prompt, history):
    """Interactive streaming chat for generic models."""

def chatglm3_stream_chat(model, tokenizer, kv_cache, prompt, history):
    """ChatGLM3-specific streaming chat with stopping criteria."""

def qwen_stream_chat(model, tokenizer, kv_cache, prompt, history):
    """Qwen model streaming chat."""

def llama_stream_chat(model, tokenizer, kv_cache, prompt, history):
    """Llama model streaming chat."""

def auto_select_model(model_path):
    """Load model using AutoModelForCausalLM or AutoModel fallback."""

Import

from ipex_llm import optimize_model
from kv_cache import StartRecentKVCache
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
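The stop-word checking that greedy_generate performs during decoding can be sketched as a suffix test against the accumulated text. This is an illustrative helper, not the script's actual implementation; the example stop tokens are assumptions.

```python
def ends_with_stop_word(text, stop_words):
    """Return the matched stop word if `text` ends with one, else None.
    Illustrative sketch of the stop-word check in greedy_generate."""
    for word in stop_words:
        if word and text.endswith(word):
            return word
    return None
```

Checking the decoded text rather than raw token ids handles stop sequences that span multiple tokens.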

I/O Contract

Inputs

Name                  Type  Required  Description
model-path            str   Yes       Path to the local model directory
chat-format           str   No        Chat format: chatglm3, qwen, llama, yi, baichuan, or auto
n-predict             int   No        Max tokens per response (default: 512)
kv-cache-start-size   int   No        KV cache initial tokens (default: 4)
kv-cache-recent-size  int   No        KV cache recent tokens (default: 512)

Outputs

Name                  Type       Description
Streaming text        Console    Colored streaming token-by-token output
Conversation history  In-memory  Multi-turn conversation context
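The input contract above maps directly onto an argparse command line. A hedged sketch of such a parser, with flag names taken from the contract and usage examples; the `auto` default for --chat-format is an assumption based on the auto-detection behavior described earlier.

```python
import argparse

def build_parser():
    """Sketch of a CLI matching the documented input contract;
    defaults follow the inputs table above."""
    p = argparse.ArgumentParser(description="IPEX-LLM portable chat")
    p.add_argument("--model-path", type=str, required=True,
                   help="Path to the local model directory")
    p.add_argument("--chat-format", type=str, default="auto",
                   choices=["chatglm3", "qwen", "llama", "yi", "baichuan", "auto"],
                   help="Model chat format (default assumed: auto)")
    p.add_argument("--n-predict", type=int, default=512,
                   help="Max tokens per response")
    p.add_argument("--kv-cache-start-size", type=int, default=4,
                   help="KV cache initial (sink) tokens")
    p.add_argument("--kv-cache-recent-size", type=int, default=512,
                   help="KV cache recent-window tokens")
    return p
```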

Usage Examples

Interactive Chat

python chat.py \
    --model-path "./models/Llama-2-7b-chat" \
    --chat-format llama \
    --n-predict 512

Auto-detect Model Format

python chat.py --model-path "./models/Qwen-7B-Chat"
# Automatically detects Qwen format and applies appropriate template
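Auto-detection as in the example above amounts to mapping the model's architecture identifier (e.g. the `model_type` field in a transformers config) to one of the supported chat formats. The mapping below is an illustrative sketch, not the script's actual detection logic; the generic fallback to the llama format is an assumption.

```python
def detect_chat_format(model_type):
    """Map a transformers config `model_type` string to a chat format.
    Illustrative mapping only; the real detection logic may differ."""
    mapping = {
        "chatglm": "chatglm3",
        "qwen": "qwen",
        "llama": "llama",
        "yi": "yi",
        "baichuan": "baichuan",
    }
    lowered = model_type.lower()
    for key, fmt in mapping.items():
        if key in lowered:
            return fmt
    return "llama"  # assumed generic fallback
```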
