Implementation:Intel Ipex llm AutoModelForCausalLM From Pretrained PP

Knowledge Sources	IPEX-LLM
Domains	Distributed_Computing, NLP, Model_Loading
Last Updated	2026-02-09 00:00 GMT

Overview

Concrete tool for loading language models with automatic layer distribution across Intel GPUs for pipeline parallel inference.

Description

Calling AutoModelForCausalLM.from_pretrained with the pipeline_parallel_stages parameter loads a model and automatically distributes its layers across the specified number of Intel XPU devices. Combined with load_in_low_bit quantization, this enables inference on models that are too large to fit in the memory of a single GPU.
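To make the idea of layer distribution concrete, the following is a minimal sketch of one way to split decoder layers into contiguous per-stage chunks, together with a rough weight-memory estimate per stage. Both helpers (split_layers, weight_bytes_per_stage) are hypothetical names introduced here for illustration; IPEX-LLM's actual layer assignment may differ, and the estimate ignores quantization scales, activations, and the KV cache.

```python
def split_layers(num_layers: int, num_stages: int) -> list[range]:
    """Split layer indices into contiguous, near-equal chunks, one per stage.

    Illustrative only: an even split is an assumption, not IPEX-LLM's
    documented behavior.
    """
    if num_stages < 1 or num_stages > num_layers:
        raise ValueError("num_stages must be between 1 and num_layers")
    base, extra = divmod(num_layers, num_stages)
    chunks, start = [], 0
    for stage in range(num_stages):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if stage < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def weight_bytes_per_stage(num_params: float, bits_per_weight: int, num_stages: int) -> float:
    """Approximate weight bytes held by each stage, assuming an even split."""
    return num_params * bits_per_weight / 8 / num_stages


if __name__ == "__main__":
    # A 40-layer model on 2 stages: layers 0-19 on stage 0, 20-39 on stage 1.
    print(split_layers(40, 2))
    # Roughly 13 billion parameters at 4 bits per weight, split over 2 stages.
    print(weight_bytes_per_stage(13e9, 4, 2))
```

Under these assumptions, a 13-billion-parameter model at 4 bits per weight needs about 6.5 GB for weights in total, so each of two stages holds roughly half of that.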

Usage

Call init_pipeline_parallel() first, then use this method to load a model whose layers are distributed across multiple GPUs.
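Because each pipeline stage runs in its own process, a mismatch between the number of launched processes and pipeline_parallel_stages is an easy mistake to make. The sketch below shows a possible preflight check. It assumes a launcher such as torchrun that exports the WORLD_SIZE and RANK environment variables, and it assumes that one process is launched per stage; check_world_size is a hypothetical helper, not part of IPEX-LLM.

```python
import os
from typing import Mapping


def check_world_size(pipeline_parallel_stages: int,
                     env: Mapping[str, str] = os.environ) -> int:
    """Return this process's rank after checking the process count.

    Assumption: the launcher sets WORLD_SIZE and RANK (torchrun does),
    and the number of processes should equal the number of stages.
    """
    world_size = int(env.get("WORLD_SIZE", "1"))
    rank = int(env.get("RANK", "0"))
    if world_size != pipeline_parallel_stages:
        raise RuntimeError(
            f"launched {world_size} process(es) but "
            f"pipeline_parallel_stages={pipeline_parallel_stages}"
        )
    return rank
```

Such a check could run at the top of the script, before the model is loaded, so that a misconfigured launch fails early with a readable message.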

Code Reference

Source Location

Repository: IPEX-LLM
File: python/llm/example/GPU/Pipeline-Parallel-Inference/generate.py
Lines: 47-53

Signature

model = AutoModelForCausalLM.from_pretrained(
    model_path: str,
    load_in_low_bit: str = "sym_int4",
    optimize_model: bool = True,
    trust_remote_code: bool = True,
    use_cache: bool = True,
    torch_dtype = torch.float16,
    pipeline_parallel_stages: int = N,
) -> PreTrainedModel

Import

from ipex_llm.transformers import AutoModelForCausalLM

I/O Contract

Inputs

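Name	Type	Required	Description
model_path	str	Yes	HuggingFace model ID or local path
load_in_low_bit	str	No	Quantization type (default "sym_int4")
optimize_model	bool	No	Enable XPU optimizations (default True)
trust_remote_code	bool	No	Allow custom model code from the model repository to run (True in the signature above)
use_cache	bool	No	Enable KV cache for generation (default True)
torch_dtype	torch.dtype	No	Data type for the loaded model (torch.float16 in the signature above)
pipeline_parallel_stages	int	Yes	Number of GPUs to distribute layers across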

Outputs

Name	Type	Description
model	PreTrainedModel	Model with layers distributed across GPUs, ready for generation

Usage Examples

import torch
from ipex_llm.transformers import AutoModelForCausalLM, init_pipeline_parallel
from transformers import AutoTokenizer

# Run this script with one process per pipeline stage
# (for example via a distributed launcher such as torchrun).
init_pipeline_parallel()

model_path = "meta-llama/Llama-2-13b-chat-hf"
num_stages = 2  # number of GPUs (pipeline stages)

# Load model distributed across num_stages GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_low_bit="sym_int4",
    optimize_model=True,
    trust_remote_code=True,
    use_cache=True,
    torch_dtype=torch.float16,
    pipeline_parallel_stages=num_stages,
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
# Rank of this process; also used to select the XPU device
local_rank = torch.distributed.get_rank()

# Generate text
input_ids = tokenizer.encode("Once upon a time", return_tensors="pt").to(f'xpu:{local_rank}')
output = model.generate(input_ids, max_new_tokens=32)
torch.xpu.synchronize()

# Only the last rank produces output
if local_rank == num_stages - 1:
    print(tokenizer.decode(output[0], skip_special_tokens=True))

Related Pages

Implements Principle

Principle:Intel_Ipex_llm_Pipeline_Parallel_Model_Loading

Requires Environment

Environment:Intel_Ipex_llm_Pipeline_Parallel_Environment
