Implementation:Intel Ipex llm AutoModelForCausalLM From Pretrained PP
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, NLP, Model_Loading |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A concrete tool for loading language models with their layers automatically distributed across multiple Intel GPUs for pipeline-parallel inference.
Description
AutoModelForCausalLM.from_pretrained, when called with the pipeline_parallel_stages parameter, loads a model and automatically distributes its layers across the specified number of Intel XPU devices. Combined with load_in_low_bit quantization, this enables inference on models that are too large to fit in the memory of a single GPU.
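The idea of distributing layers across stages can be illustrated with a minimal sketch. It assumes a contiguous, near-even split of decoder layers. The helper partition_layers is hypothetical and is not part of IPEX-LLM, whose actual placement logic may differ.

```python
def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    """Split layer indices into contiguous, near-equal chunks, one per stage.

    Hypothetical helper for illustration only; not an IPEX-LLM API.
    """
    if num_stages < 1 or num_layers < num_stages:
        raise ValueError("need at least one layer per stage")
    base, extra = divmod(num_layers, num_stages)
    chunks, start = [], 0
    for stage in range(num_stages):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if stage < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


# Illustrative numbers: 40 decoder layers split over 2 devices.
for stage, layers in enumerate(partition_layers(40, 2)):
    print(f"stage {stage} (xpu:{stage}): layers {layers.start}..{layers.stop - 1}")
```

In a pipeline-parallel setup of this kind, each stage runs only its own chunk of layers and passes the resulting activations on to the next stage.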
Usage
Call init_pipeline_parallel() first, then use this method to load a model distributed across multiple GPUs.
Code Reference
Source Location
- Repository: IPEX-LLM
- File: python/llm/example/GPU/Pipeline-Parallel-Inference/generate.py
- Lines: 47-53
Signature
model = AutoModelForCausalLM.from_pretrained(
    model_path: str,
    load_in_low_bit: str = "sym_int4",
    optimize_model: bool = True,
    trust_remote_code: bool = True,
    use_cache: bool = True,
    torch_dtype = torch.float16,
    pipeline_parallel_stages: int = N,
) -> PreTrainedModel
Import
from ipex_llm.transformers import AutoModelForCausalLM
I/O Contract
Inputs
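| Name | Type | Required | Description |
|---|---|---|---|
| model_path | str | Yes | HuggingFace model ID or local path |
| load_in_low_bit | str | No | Quantization type (default "sym_int4") |
| pipeline_parallel_stages | int | Yes | Number of GPUs (pipeline stages) to distribute layers across |
| optimize_model | bool | No | Enable XPU optimizations (default True) |
| trust_remote_code | bool | No | Allow custom model code from the model repository to run (True in the signature above) |
| use_cache | bool | No | Enable KV cache for generation (default True) |
| torch_dtype | torch.dtype | No | Tensor data type for the loaded model (torch.float16 in the signature above) |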
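To see how load_in_low_bit and pipeline_parallel_stages interact, a rough back-of-the-envelope estimate of weight memory per device may help. This is a sketch under stated assumptions: weights are split evenly across stages, "sym_int4" is treated as roughly 4 bits per weight, and quantization metadata, activations, and the KV cache are ignored. The function weight_gib_per_stage is hypothetical, not an IPEX-LLM API.

```python
def weight_gib_per_stage(num_params: float, bits_per_weight: float, num_stages: int) -> float:
    """Approximate weight memory (GiB) held by each pipeline stage.

    Hypothetical estimate: assumes an even split and ignores quantization
    scales, activations, and the KV cache.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / num_stages / 2**30


# Illustrative comparison for a model with about 13 billion parameters.
fp16_single = weight_gib_per_stage(13e9, 16, 1)   # unquantized, one device
int4_two_stage = weight_gib_per_stage(13e9, 4, 2)  # ~4-bit weights, two stages
print(f"fp16 on 1 device:  {fp16_single:.1f} GiB")
print(f"int4 on 2 devices: {int4_two_stage:.1f} GiB per device")
```

Real usage will be higher than this estimate, so it is best read as a lower bound when deciding how many stages a model needs.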
Outputs
| Name | Type | Description |
|---|---|---|
| model | PreTrainedModel | Model with layers distributed across GPUs, ready for generation |
Usage Examples
# Run with one process per GPU (for example via a distributed launcher such as torchrun).
import torch
from ipex_llm.transformers import AutoModelForCausalLM, init_pipeline_parallel
from transformers import AutoTokenizer

# Must be called before loading the model
init_pipeline_parallel()

# Load model distributed across 2 GPUs
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-chat-hf",
    load_in_low_bit="sym_int4",
    optimize_model=True,
    trust_remote_code=True,
    use_cache=True,
    torch_dtype=torch.float16,
    pipeline_parallel_stages=2,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf")
local_rank = torch.distributed.get_rank()

# Generate text; each process places its inputs on its own XPU
input_ids = tokenizer.encode("Once upon a time", return_tensors="pt").to(f'xpu:{local_rank}')
output = model.generate(input_ids, max_new_tokens=32)
torch.xpu.synchronize()

# Only the last rank produces output (rank 1 when pipeline_parallel_stages=2)
if local_rank == torch.distributed.get_world_size() - 1:
    print(tokenizer.decode(output[0], skip_special_tokens=True))
Related Pages
Implements Principle
Requires Environment