

Implementation:Intel Ipex llm AutoModelForCausalLM From Pretrained PP

From Leeroopedia


Knowledge Sources
Domains: Distributed_Computing, NLP, Model_Loading
Last Updated: 2026-02-09 00:00 GMT

Overview

Concrete tool for loading language models with automatic layer distribution across Intel GPUs for pipeline parallel inference.

Description

Calling AutoModelForCausalLM.from_pretrained with the pipeline_parallel_stages parameter loads a model and automatically distributes its layers across the specified number of Intel XPU devices. Combined with load_in_low_bit quantization, this enables inference on models that do not fit in the memory of a single GPU.
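As a rough illustration of why quantization plus layer distribution helps with memory, the sketch below estimates weight storage per pipeline stage. The helper name `weight_gb` is hypothetical. The estimate rests on several assumptions: an even split of weights across stages, roughly 4 bits per weight for "sym_int4", and no accounting for quantization scales, the KV cache, or activations.

```python
def weight_gb(num_params: float, bits_per_weight: float, num_stages: int = 1) -> float:
    """Estimate weight storage per stage in decimal gigabytes.

    Hypothetical helper for back-of-the-envelope sizing; not part of ipex_llm.
    Assumes weights are split evenly across stages and ignores all overheads.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    total_bytes = num_params * bits_per_weight / 8  # bits -> bytes
    return total_bytes / num_stages / 1e9           # bytes -> GB per stage


# A nominal 13-billion-parameter model, as in the usage example's model name.
fp16_single_gpu = weight_gb(13e9, bits_per_weight=16)
int4_two_stages = weight_gb(13e9, bits_per_weight=4, num_stages=2)
```

Under these assumptions, the float16 weights alone come to about 26 GB on one device, while 4-bit weights split over two stages come to about 3.25 GB per device. Real memory use will be higher.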

Usage

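Call init_pipeline_parallel() first, then use this call to load a model distributed across multiple GPUs. The usage example reads each process's rank from torch.distributed, which implies the script runs as one process per pipeline stage.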

Code Reference

Source Location

  • Repository: IPEX-LLM
  • File: python/llm/example/GPU/Pipeline-Parallel-Inference/generate.py
  • Lines: 47-53

Signature

model = AutoModelForCausalLM.from_pretrained(
    model_path: str,
    load_in_low_bit: str = "sym_int4",
    optimize_model: bool = True,
    trust_remote_code: bool = True,
    use_cache: bool = True,
    torch_dtype = torch.float16,
    pipeline_parallel_stages: int = N,
) -> PreTrainedModel

Import

from ipex_llm.transformers import AutoModelForCausalLM

I/O Contract

Inputs

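Name Type Required Description
model_path str Yes HuggingFace model ID or local path
load_in_low_bit str No Quantization type (default "sym_int4")
optimize_model bool No Enable XPU optimizations (default True)
trust_remote_code bool No Allow custom model code from the model repository to run (default True)
use_cache bool No Enable KV cache for generation (default True)
torch_dtype torch.dtype No Tensor data type (torch.float16 in the signature)
pipeline_parallel_stages int Yes Number of GPUs (pipeline stages) to distribute layers across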
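The inputs above map directly onto keyword arguments. The following sketch collects them in a plain dictionary and checks the stage count before any model is loaded. `build_load_kwargs` is a hypothetical helper, not part of ipex_llm, and the validation rule (a positive integer stage count) is an assumption rather than documented library behavior.

```python
from typing import Any, Dict


def build_load_kwargs(
    pipeline_parallel_stages: int,
    load_in_low_bit: str = "sym_int4",
    optimize_model: bool = True,
    use_cache: bool = True,
) -> Dict[str, Any]:
    """Assemble keyword arguments for from_pretrained (hypothetical helper)."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(pipeline_parallel_stages, bool) or not isinstance(
        pipeline_parallel_stages, int
    ):
        raise TypeError("pipeline_parallel_stages must be an int")
    if pipeline_parallel_stages < 1:
        raise ValueError("pipeline_parallel_stages must be at least 1")
    return {
        "load_in_low_bit": load_in_low_bit,
        "optimize_model": optimize_model,
        "use_cache": use_cache,
        "pipeline_parallel_stages": pipeline_parallel_stages,
    }


kwargs = build_load_kwargs(pipeline_parallel_stages=2)
```

The resulting dictionary could then be passed as `AutoModelForCausalLM.from_pretrained(model_path, **kwargs)`, with `torch_dtype` and `trust_remote_code` added as needed.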

Outputs

Name Type Description
model PreTrainedModel Model with layers distributed across GPUs, ready for generation
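To picture what "layers distributed across GPUs" can look like, here is a minimal sketch that splits a stack of decoder layers into contiguous ranges, one per pipeline stage. The function `partition_layers` and its balanced-split rule are assumptions made for illustration; the source does not describe the partitioning rule IPEX-LLM actually uses.

```python
from typing import List


def partition_layers(num_layers: int, num_stages: int) -> List[range]:
    """Split layer indices 0..num_layers-1 into contiguous per-stage ranges.

    Hypothetical helper, not part of ipex_llm. When the layers do not divide
    evenly, the earliest stages each receive one extra layer.
    """
    if num_stages < 1 or num_layers < num_stages:
        raise ValueError("need at least one layer per stage")
    base, extra = divmod(num_layers, num_stages)
    ranges = []
    start = 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


# Example: a 40-layer model over 2 stages.
stages = partition_layers(40, 2)
```

In this sketch, stage 0 would hold layers 0 through 19 and stage 1 would hold layers 20 through 39, so each device stores only its own slice of the model.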

Usage Examples

import torch
from ipex_llm.transformers import AutoModelForCausalLM, init_pipeline_parallel
from transformers import AutoTokenizer

init_pipeline_parallel()

# Load model distributed across 2 GPUs
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-chat-hf",
    load_in_low_bit="sym_int4",
    optimize_model=True,
    trust_remote_code=True,
    use_cache=True,
    torch_dtype=torch.float16,
    pipeline_parallel_stages=2,
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf")
local_rank = torch.distributed.get_rank()

# Generate text
input_ids = tokenizer.encode("Once upon a time", return_tensors="pt").to(f'xpu:{local_rank}')
output = model.generate(input_ids, max_new_tokens=32)
torch.xpu.synchronize()

# Only the last rank (pipeline_parallel_stages - 1, here rank 1) produces output
if local_rank == 1:
    print(tokenizer.decode(output[0], skip_special_tokens=True))
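
The rank check in the example hard-codes 1, which is the index of the last of two stages. A minimal sketch of deriving that index from the stage count instead is shown below, so the check keeps working if the number of stages changes. The helper `is_output_rank` is hypothetical and not part of ipex_llm.

```python
def is_output_rank(rank: int, num_stages: int) -> bool:
    """Return True if this rank is the last pipeline stage.

    Hypothetical helper; in the example above, only that rank prints output.
    """
    if not 0 <= rank < num_stages:
        raise ValueError("rank must be in [0, num_stages)")
    return rank == num_stages - 1
```

In the example, `if local_rank == 1:` could then be written as `if is_output_rank(local_rank, 2):`, with 2 replaced by whatever value is passed as pipeline_parallel_stages.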

Related Pages

Implements Principle

Requires Environment
