Implementation:Intel Ipex llm Hybrid Inference

Knowledge Sources	Intel IPEX-LLM
Domains	Inference, Hybrid_Computing, Quantization
Last Updated	2026-02-09 04:00 GMT

Overview

Concrete tool for hybrid CPU/GPU inference on large language models, built on IPEX-LLM's convert_model_hybrid API.

Description

This script demonstrates the hybrid inference pattern where a model is loaded with IPEX-LLM low-bit quantization and then converted using convert_model_hybrid to distribute computation across CPU and GPU. It supports loading pre-converted low-bit models for faster startup and generates tokens using chain-of-thought formatting with reasoning and answer tags.
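The chain-of-thought formatting mentioned above can be illustrated with a small prompt helper. This is a minimal sketch, not the script's actual template: the wording of PROMPT_TEMPLATE, the `<think>`/`<answer>` tag names, and the helper names build_prompt and split_completion are all assumptions made for illustration.

```python
import re
from typing import Tuple

# Hypothetical template: the wording and the tag names are assumptions,
# not copied from generate_hybrid.py.
PROMPT_TEMPLATE = (
    "A conversation between User and Assistant. The Assistant first reasons "
    "inside <think> </think> tags, then answers inside <answer> </answer> tags.\n"
    "User: {prompt}\n"
    "Assistant: <think>"
)


def build_prompt(question: str) -> str:
    """Wrap a user question in the reasoning/answer template."""
    return PROMPT_TEMPLATE.format(prompt=question.strip())


def split_completion(completion: str) -> Tuple[str, str]:
    """Split text generated after the opening <think> tag into (reasoning, answer)."""
    reasoning, _, rest = completion.partition("</think>")
    match = re.search(r"<answer>(.*?)</answer>", rest, flags=re.DOTALL)
    # Fall back to whatever follows </think> if the answer tags are missing.
    answer = match.group(1) if match else rest
    return reasoning.strip(), answer.strip()
```

Ending the prompt on an open reasoning tag nudges the model to start with its reasoning, and the parser tolerates completions that are cut off before the answer tags appear.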

Usage

Use this to run models that exceed the memory of a single GPU by splitting computation between CPU and GPU. It is particularly suited to large models such as DeepSeek-R1, where some layers can be offloaded to the CPU while critical layers run on the GPU.

Code Reference

Source Location

Repository: Intel IPEX-LLM
File: python/llm/example/GPU/DeepSeek-R1/generate_hybrid.py
Lines: 1-92

Signature

# Script-based execution with argparse
# Key API calls:
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_low_bit=args.low_bit,
    optimize_model=True,
    trust_remote_code=True,
    use_cache=True,
    torch_dtype=torch.float16,
)
model = convert_model_hybrid(model)

Import

import torch  # needed for torch.float16 in the from_pretrained call
from ipex_llm.transformers import AutoModelForCausalLM, convert_model_hybrid
from transformers import AutoTokenizer, GenerationConfig
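The two loading paths described on this page (quantize on load, or reuse a pre-converted low-bit checkpoint) could be combined roughly as follows. This is a sketch under stated assumptions: the helper name load_hybrid_model, the "sym_int4" default, and applying convert_model_hybrid after load_low_bit are illustrative choices rather than confirmed behavior of generate_hybrid.py. The import path for convert_model_hybrid follows the Import section of this page.

```python
from typing import Optional


def load_hybrid_model(model_path: str,
                      low_bit: str = "sym_int4",  # assumed default, not from the script
                      load_path: Optional[str] = None):
    """Load a low-bit model and convert it for hybrid CPU/GPU execution."""
    # Imported lazily so the sketch can be imported without ipex-llm installed.
    import torch
    from ipex_llm.transformers import AutoModelForCausalLM, convert_model_hybrid

    if load_path:
        # Reuse a checkpoint previously written with model.save_low_bit(load_path).
        model = AutoModelForCausalLM.load_low_bit(
            load_path,
            trust_remote_code=True,
        )
    else:
        # Quantize while loading, mirroring the call shown under Signature.
        model = AutoModelForCausalLM.from_pretrained(
            model_path,
            load_in_low_bit=low_bit,
            optimize_model=True,
            trust_remote_code=True,
            use_cache=True,
            torch_dtype=torch.float16,
        )
    # Assumption: the hybrid conversion is applied in both branches.
    return convert_model_hybrid(model)
```

Loading from a saved low-bit checkpoint skips the quantization pass, which is where the faster startup comes from.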

I/O Contract

Inputs

Name	Type	Required	Description
repo-id-or-model-path	str	Yes	HuggingFace model ID or local path
prompt	str	No	Input prompt for generation
n-predict	int	No	Maximum tokens to generate (default: 32)
load-path	str	No	Path to pre-converted low-bit model
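The inputs in the table map naturally onto an argparse parser. The following is a minimal sketch: the flag names and the n-predict default of 32 come from the table, the --low-bit flag is inferred from args.low_bit in the signature, and the defaults for --prompt and --low-bit are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the input table (defaults partly assumed)."""
    parser = argparse.ArgumentParser(
        description="Hybrid CPU/GPU generation (illustrative sketch)")
    parser.add_argument("--repo-id-or-model-path", type=str, required=True,
                        help="HuggingFace model ID or local path")
    parser.add_argument("--prompt", type=str, default="What is AI?",  # assumed default
                        help="Input prompt for generation")
    parser.add_argument("--n-predict", type=int, default=32,
                        help="Maximum tokens to generate")
    parser.add_argument("--load-path", type=str, default=None,
                        help="Path to pre-converted low-bit model")
    parser.add_argument("--low-bit", type=str, default="sym_int4",  # assumed default
                        help="Low-bit format passed to load_in_low_bit")
    return parser


# argparse converts dashes in flag names to underscores on the namespace.
args = build_parser().parse_args(
    ["--repo-id-or-model-path", "deepseek-ai/DeepSeek-R1", "--n-predict", "128"])
print(args.repo_id_or_model_path, args.n_predict, args.load_path)
```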

Outputs

Name	Type	Description
Generated text	Console	Model completion with timing information
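The timing information in the output can be gathered by wrapping the generation call in a wall-clock timer. This is a minimal sketch: timed_call and fake_generate are hypothetical names, and fake_generate stands in for model.generate so the example runs without a model.

```python
import time
from typing import Any, Callable, Tuple


def timed_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    # With a real GPU model, device work may be asynchronous; synchronizing
    # the device before reading the clock gives a more accurate measurement.
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_generate(prompt: str, max_new_tokens: int) -> str:
    """Stand-in for model.generate: appends one placeholder per requested token."""
    return prompt + " ..." * max_new_tokens


text, seconds = timed_call(fake_generate, "What is quantum computing?", max_new_tokens=2)
print(f"Inference time: {seconds:.4f} s")
print(text)
```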

Usage Examples

Hybrid Inference

python generate_hybrid.py \
    --repo-id-or-model-path "deepseek-ai/DeepSeek-R1" \
    --prompt "What is quantum computing?" \
    --n-predict 128

Related Pages

Environment:Intel_Ipex_llm_XPU_Inference_Environment
