Principle:Intel Ipex llm Hybrid CPU GPU Inference
| Knowledge Sources | |
|---|---|
| Domains | Inference, Memory_Optimization, Heterogeneous_Computing |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
An inference technique that distributes model layers between the CPU and the GPU so that models exceeding the memory capacity of a single GPU can still run.
Description
Hybrid CPU/GPU inference addresses the problem of running large models on hardware whose GPU VRAM is too small to hold them. The technique partitions the model's layers so that some execute on the GPU (for speed) while the rest execute on the CPU (for memory capacity). IPEX-LLM's convert_model_hybrid API automates this partitioning based on the available hardware resources, and activations are transferred between devices as needed during the forward pass.
Usage
Use this principle when the model size exceeds available GPU memory but full CPU inference is too slow. It provides a practical middle ground for consumer-grade hardware where large models (70B+) cannot fit entirely in GPU VRAM.
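To make the memory gap concrete, the following minimal sketch estimates the memory needed for the weights alone of a 70B-parameter model at two precisions. The helper name `weight_memory_gib` and the 16 GiB VRAM figure are assumptions chosen for illustration; the estimate ignores activations, the KV cache, and runtime overhead, so real requirements are higher.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GiB (hypothetical helper)."""
    return num_params * bits_per_weight / 8 / 2**30


params_70b = 70e9
fp16_gib = weight_memory_gib(params_70b, 16)  # 2 bytes per weight
int4_gib = weight_memory_gib(params_70b, 4)   # 0.5 bytes per weight
gpu_vram_gib = 16.0  # assumed consumer GPU, for illustration only

print(f"FP16 weights:  {fp16_gib:.1f} GiB")
print(f"4-bit weights: {int4_gib:.1f} GiB")
print(f"Fits in {gpu_vram_gib:.0f} GiB of VRAM at 4-bit: {int4_gib <= gpu_vram_gib}")
```

Under these assumptions even the 4-bit weights exceed the GPU's capacity, which is the situation where splitting layers across CPU and GPU becomes relevant.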
Theoretical Basis
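Given a model's layers and a GPU memory capacity $M_{gpu}$ (M_gpu in the pseudo-code), the layers are partitioned into a GPU-resident set whose combined memory footprint fits within $M_{gpu}$ and a CPU-resident set that holds the remaining layers.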
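One way to state the partitioning constraint formally is sketched below. The notation is an assumption introduced here and does not come from the original page: $N$ is the number of layers, $m_i$ is the memory footprint of layer $i$, $G$ is the set of layers placed on the GPU, $C$ is the set placed on the CPU, and $M_{gpu}$ is the GPU memory capacity.

```latex
\begin{aligned}
& G \cup C = \{1, \dots, N\}, \qquad G \cap C = \emptyset, \\
& \sum_{i \in G} m_i \le M_{gpu}
\end{aligned}
```

Any partition that satisfies these conditions is feasible; which feasible partition performs best depends on how often activations have to cross the CPU/GPU boundary.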
Pseudo-code Logic:
# Abstract hybrid partitioning
gpu_layers = select_layers_for_gpu(model, memory_budget=M_gpu)
cpu_layers = remaining_layers(model, gpu_layers)
# During inference:
for layer in model.layers:
    if layer in gpu_layers:
        x = layer.forward(x.to('xpu'))
    else:
        x = layer.forward(x.to('cpu'))
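The pseudo-code can be made runnable as a device-placement planner in plain Python. This is a sketch under stated assumptions, not IPEX-LLM's implementation: the `Layer` class, `plan_devices`, and `count_transfers` are hypothetical, and the first-fit body given to `select_layers_for_gpu` is only one possible selection policy. No tensors are moved; the code only decides where each layer would run and counts how many times activations would have to cross devices.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """Hypothetical stand-in for a model layer; only its size matters here."""
    name: str
    size_bytes: int


def select_layers_for_gpu(layers, memory_budget):
    """First-fit selection (an assumed policy): walk the layers in order and
    place each one on the GPU if it still fits in the remaining budget."""
    gpu_names = set()
    used = 0
    for layer in layers:
        if used + layer.size_bytes <= memory_budget:
            gpu_names.add(layer.name)
            used += layer.size_bytes
    return gpu_names


def plan_devices(layers, memory_budget):
    """Return the device ('xpu' or 'cpu') on which each layer would execute."""
    gpu_names = select_layers_for_gpu(layers, memory_budget)
    return ["xpu" if layer.name in gpu_names else "cpu" for layer in layers]


def count_transfers(devices, start_device="cpu"):
    """Count device switches the activations make during one forward pass."""
    transfers = 0
    current = start_device
    for device in devices:
        if device != current:
            transfers += 1
            current = device
    return transfers


GIB = 2**30
layers = [Layer(f"layer_{i}", 1 * GIB) for i in range(8)]  # toy model
devices = plan_devices(layers, memory_budget=5 * GIB)      # assumed budget
transfers = count_transfers(devices)

for layer, device in zip(layers, devices):
    print(f"{layer.name}: {device}")
print(f"device transfers per forward pass: {transfers}")
```

Keeping GPU-resident layers contiguous, as first-fit does when all layers have the same size, keeps the number of transfers low; a policy that interleaves CPU and GPU layers would pay a transfer at every boundary.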