Implementation:Deepspeedai DeepSpeed AutoModel For Inference

Overview

Loads a pretrained HuggingFace model with AutoModelForCausalLM as the first step of the DeepSpeed inference optimization pipeline.

Implementation Type

Wrapper Doc (external HuggingFace API used in DeepSpeed inference context)

Detailed Description

This is a Wrapper Doc for the external transformers.AutoModelForCausalLM.from_pretrained() API used as the first step in DeepSpeed inference workflows. The loaded model is subsequently passed to deepspeed.init_inference() for kernel injection and tensor parallelism.

The AutoModelForCausalLM class (and related AutoModel variants) is part of the HuggingFace Transformers library. Given a model name or path, it reads the checkpoint's configuration to determine the correct model architecture, then instantiates the corresponding PyTorch nn.Module with pretrained weights.
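The dispatch idea can be illustrated with a toy, standard-library-only sketch. It is not the Transformers implementation: the three-entry table and the helper names resolve_causal_lm_class_name and demo are hypothetical, and it assumes the architecture is identified by the model_type field in the checkpoint's config.json.

```python
import json
import tempfile
from pathlib import Path

# Illustrative subset only; the real mapping lives inside transformers
# and covers many more architectures.
MODEL_TYPE_TO_CLASS_NAME = {
    "llama": "LlamaForCausalLM",
    "gpt2": "GPT2LMHeadModel",
    "opt": "OPTForCausalLM",
}


def resolve_causal_lm_class_name(model_dir):
    """Return the class name a local checkpoint directory would map to."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    model_type = config["model_type"]
    if model_type not in MODEL_TYPE_TO_CLASS_NAME:
        raise ValueError(f"Unrecognized model_type: {model_type!r}")
    return MODEL_TYPE_TO_CLASS_NAME[model_type]


def demo():
    """Write a minimal config.json to a temp directory and resolve it."""
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "config.json").write_text(json.dumps({"model_type": "llama"}))
        return resolve_causal_lm_class_name(tmp)
```

In practice the caller never performs this lookup by hand; from_pretrained() does the resolution and returns the instantiated model.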

DeepSpeed-specific usage guidelines:

Load model without device_map — DeepSpeed handles device placement internally during init_inference().
Use torch_dtype=torch.float16 for inference to reduce memory footprint before DeepSpeed optimization.
Avoid quantization flags like load_in_8bit or load_in_4bit — DeepSpeed provides its own quantization via QuantizationConfig.
The returned model is a standard torch.nn.Module compatible with DeepSpeed's InferenceEngine.
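The guidelines above can be turned into a small pre-flight check on the keyword arguments before they reach from_pretrained(). This is a minimal sketch in plain Python: check_load_kwargs and DISCOURAGED_KWARGS are hypothetical names, not part of DeepSpeed or Transformers, and the check only mirrors the rules listed on this page.

```python
# Keyword arguments this page advises against when the model will be
# handed to deepspeed.init_inference(). Hypothetical helper, not a library API.
DISCOURAGED_KWARGS = {
    "device_map": "DeepSpeed handles device placement during init_inference()",
    "load_in_8bit": "use DeepSpeed's own quantization instead",
    "load_in_4bit": "use DeepSpeed's own quantization instead",
}


def check_load_kwargs(kwargs):
    """Return one warning string per guideline the given kwargs violate."""
    warnings = []
    for name, reason in DISCOURAGED_KWARGS.items():
        value = kwargs.get(name)
        # Treat an absent, None, or False value as "not requested".
        if value is not None and value is not False:
            warnings.append(f"{name}={value!r}: {reason}")
    if "torch_dtype" not in kwargs:
        warnings.append("torch_dtype not set: consider torch.float16 for inference")
    return warnings
```

A caller might run check_load_kwargs(load_kwargs) and log or raise on a non-empty result before calling AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs).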

Code Reference

Repository: https://github.com/huggingface/transformers (external)
External reference: https://huggingface.co/docs/transformers/model_doc/auto
Signature: AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs) -> PreTrainedModel
Import: from transformers import AutoModelForCausalLM

Parameters

Parameter	Type	Required	Description
pretrained_model_name_or_path	str or os.PathLike	Yes	HuggingFace model ID (e.g., "meta-llama/Llama-2-7b-hf") or local path to a model directory
torch_dtype	torch.dtype	No	Data type for model weights. Use `torch.float16` for DeepSpeed inference.
trust_remote_code	bool	No	Whether to allow custom model code from the Hub. Default False.

I/O

Direction	Name	Type	Description
Input	pretrained_model_name_or_path	str or os.PathLike	HuggingFace model identifier or local checkpoint path
Output	model	PreTrainedModel (torch.nn.Module)	Loaded pretrained model ready for DeepSpeed inference

Usage Example

from transformers import AutoModelForCausalLM
import torch

# Load a pretrained model for DeepSpeed inference
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16
)

# Next step: pass to deepspeed.init_inference()
# engine = deepspeed.init_inference(model, ...)
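
The hand-off to DeepSpeed might look like the sketch below. It assumes torch, transformers, and deepspeed are installed and a CUDA device is available. The function name build_inference_engine and its tp_size parameter are hypothetical; the init_inference() arguments shown (dtype, tensor_parallel, replace_with_kernel_inject) are one plausible configuration, and whether kernel injection is supported depends on the model architecture and DeepSpeed version.

```python
def build_inference_engine(model_id, tp_size=1):
    """Load a causal LM in fp16 and wrap it with DeepSpeed's inference engine."""
    # Imports are deferred so that defining this function does not require
    # the GPU stack to be installed.
    import torch
    import deepspeed
    from transformers import AutoModelForCausalLM

    # No device_map and no quantization flags: DeepSpeed takes over from here.
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

    engine = deepspeed.init_inference(
        model,
        dtype=torch.float16,
        tensor_parallel={"tp_size": tp_size},  # number of GPUs to shard across
        replace_with_kernel_inject=True,       # assumption: architecture is supported
    )
    return engine
```

With tp_size greater than 1, the script is typically started through the deepspeed launcher (for example, deepspeed --num_gpus 2 script.py) so that each rank runs the same loading code.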

Knowledge Sources

Relationships

Principle:Deepspeedai_DeepSpeed_Inference_Model_Loading

Metadata

Workflow: Inference_Engine_Optimization
Type: Implementation (Wrapper Doc)
Last Updated: 2026-02-09 00:00 GMT
