Implementation:Intel Ipex llm Hybrid Inference
| Knowledge Sources | |
|---|---|
| Domains | Inference, Hybrid_Computing, Quantization |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
Concrete tool for running hybrid CPU/GPU inference on large language models with IPEX-LLM's convert_model_hybrid API.
Description
This script demonstrates the hybrid inference pattern: a model is loaded with IPEX-LLM low-bit quantization and then passed through convert_model_hybrid, which distributes computation across CPU and GPU. The script can also load a pre-converted low-bit model for faster startup. Generation uses a chain-of-thought prompt format with reasoning and answer tags.
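The chain-of-thought formatting mentioned above can be sketched as a small prompt builder plus a parser for the completion. This is a minimal illustration, not the script's actual template: the tag names `<think>` and `<answer>`, the template wording, and the helper names `build_prompt` and `split_completion` are all assumptions.

```python
import re
from typing import Optional, Tuple

# Hypothetical template: the exact wording and tag names are assumptions.
PROMPT_TEMPLATE = (
    "A conversation between User and Assistant. The Assistant first reasons "
    "inside think tags and then gives the final result inside answer tags.\n"
    "User: {question}\n"
    "Assistant: <think>\n"
)


def build_prompt(question: str) -> str:
    """Wrap a user question so the model starts inside the reasoning tag."""
    return PROMPT_TEMPLATE.format(question=question.strip())


def split_completion(completion: str) -> Tuple[str, Optional[str]]:
    """Split generated text (without the prompt) into (reasoning, answer)."""
    reasoning, closed, rest = completion.partition("</think>")
    if not closed:
        # Generation stopped before the reasoning ended, e.g. n-predict too small.
        return completion.strip(), None
    reasoning = reasoning.replace("<think>", "").strip()
    match = re.search(r"<answer>(.*?)(?:</answer>|$)", rest, flags=re.DOTALL)
    answer = match.group(1).strip() if match else (rest.strip() or None)
    return reasoning, answer
```

Because the prompt ends with an open reasoning tag, the parser treats everything before the closing tag as reasoning and returns `None` for the answer when the token budget runs out first.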
Usage
Use this when a model does not fit in the memory of a single GPU and must be split across CPU and GPU. It is particularly suited to large models such as DeepSeek-R1, where some layers can be offloaded to the CPU while critical layers run on the GPU.
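A rough way to judge whether a model exceeds a single device is to estimate the weight footprint from the parameter count and the bit width. The sketch below is back-of-the-envelope arithmetic only: it ignores quantization scales, the KV cache, and activations, and the helper names and the 16 GB device size are hypothetical.

```python
def weight_bytes(num_params: float, bits_per_weight: float) -> float:
    """Approximate bytes needed to hold the weights alone."""
    return num_params * bits_per_weight / 8


def exceeds_device(num_params: float, bits_per_weight: float, device_gb: float) -> bool:
    """True if the weights alone are larger than the device memory (decimal GB)."""
    return weight_bytes(num_params, bits_per_weight) > device_gb * 1e9


if __name__ == "__main__":
    # DeepSeek-R1 has 671B total parameters; 4-bit weights are assumed here.
    size_gb = weight_bytes(671e9, 4) / 1e9
    print(f"approx. weights: {size_gb:.1f} GB")
    print("exceeds a 16 GB device:", exceeds_device(671e9, 4, 16))
```

Even at 4 bits per weight, the estimate is far above the memory of a single consumer GPU, which is the situation hybrid execution targets.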
Code Reference
Source Location
- Repository: Intel IPEX-LLM
- File: python/llm/example/GPU/DeepSeek-R1/generate_hybrid.py
- Lines: 1-92
Signature
# Script-based execution with argparse
# Key API calls:
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_low_bit=args.low_bit,
    optimize_model=True,
    trust_remote_code=True,
    use_cache=True,
    torch_dtype=torch.float16,
)
model = convert_model_hybrid(model)
Import
from ipex_llm.transformers import AutoModelForCausalLM, convert_model_hybrid
from transformers import AutoTokenizer, GenerationConfig
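The calls and imports above can be combined into one loading helper that also covers the pre-converted path. This is a sketch under stated assumptions rather than the script's code: the helper name `load_hybrid_model`, its injectable `auto_cls`, `convert`, and `dtype` parameters, and the use of `load_low_bit` (with these keyword arguments) for the pre-converted branch are assumptions.

```python
from typing import Any, Callable, Optional


def load_hybrid_model(
    model_path: str,
    low_bit: str,
    load_path: Optional[str] = None,
    *,
    auto_cls: Any = None,
    convert: Optional[Callable[[Any], Any]] = None,
    dtype: Any = None,
) -> Any:
    """Load a low-bit model, then apply the hybrid CPU/GPU conversion."""
    if auto_cls is None or convert is None:
        # Deferred import: only needed when the real IPEX-LLM objects are used.
        from ipex_llm.transformers import AutoModelForCausalLM, convert_model_hybrid
        auto_cls = auto_cls or AutoModelForCausalLM
        convert = convert or convert_model_hybrid

    if load_path:
        # Assumption: load_path holds a checkpoint saved earlier in low-bit form.
        model = auto_cls.load_low_bit(
            load_path, optimize_model=True, trust_remote_code=True
        )
    else:
        if dtype is None:
            import torch  # deferred so the helper can be imported without torch
            dtype = torch.float16
        model = auto_cls.from_pretrained(
            model_path,
            load_in_low_bit=low_bit,
            optimize_model=True,
            trust_remote_code=True,
            use_cache=True,
            torch_dtype=dtype,
        )
    # In both branches the hybrid conversion is applied last.
    return convert(model)
```

The injectable parameters exist only so the control flow can be exercised without IPEX-LLM or a GPU; in normal use they are left at their defaults.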
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| repo-id-or-model-path | str | Yes | HuggingFace model ID or local path |
| prompt | str | No | Input prompt for generation |
| n-predict | int | No | Maximum tokens to generate (default: 32) |
| load-path | str | No | Path to pre-converted low-bit model |
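
The inputs in the table map naturally onto an argparse parser. The sketch below mirrors the table; the `--low-bit` flag (implied by `args.low_bit` in the signature), its default value, the default prompt, and the help strings are assumptions rather than values taken from the script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented inputs (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Hybrid CPU/GPU generation (illustrative sketch)"
    )
    parser.add_argument("--repo-id-or-model-path", type=str, required=True,
                        help="HuggingFace model ID or local path")
    # Assumption: placeholder default prompt, not taken from the script.
    parser.add_argument("--prompt", type=str, default="What is AI?",
                        help="Input prompt for generation")
    parser.add_argument("--n-predict", type=int, default=32,
                        help="Maximum tokens to generate")
    parser.add_argument("--load-path", type=str, default=None,
                        help="Path to a pre-converted low-bit model")
    # Assumption: flag name inferred from args.low_bit; default is hypothetical.
    parser.add_argument("--low-bit", type=str, default="sym_int4",
                        help="Low-bit quantization format")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.repo_id_or_model_path, args.n_predict, args.load_path)
```

argparse converts dashes in flag names to underscores, so `--n-predict` is read back as `args.n_predict`.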
Outputs
| Name | Type | Description |
|---|---|---|
| Generated text | Console | Model completion with timing information |
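
The timing information in the output can be collected with a small wrapper around the generation call. This is a generic sketch, not the script's measurement code: the helper name `timed` and its optional `sync` hook are hypothetical. On an Intel GPU one would presumably pass a device synchronization function such as `torch.xpu.synchronize` so that asynchronous kernels finish before the clock is read.

```python
import time
from typing import Any, Callable, Optional, Tuple


def timed(
    fn: Callable[..., Any],
    *args: Any,
    sync: Optional[Callable[[], None]] = None,
    **kwargs: Any,
) -> Tuple[Any, float]:
    """Run fn and return (result, elapsed seconds)."""
    if sync is not None:
        sync()  # drain pending device work before starting the clock
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if sync is not None:
        sync()  # wait for device work launched by fn to finish
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in workload; a real run would wrap model.generate(...).
    total, seconds = timed(sum, range(1_000_000))
    print(f"result={total} elapsed={seconds:.4f}s")
```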
Usage Examples
Hybrid Inference
python generate_hybrid.py \
    --repo-id-or-model-path "deepseek-ai/DeepSeek-R1" \
    --prompt "What is quantum computing?" \
    --n-predict 128