Implementation:Deepspeedai DeepSpeed AutoModel For Inference
Overview
Loads a pretrained HuggingFace model with AutoModelForCausalLM as the first step of the DeepSpeed inference optimization pipeline.
Implementation Type
Wrapper Doc (external HuggingFace API used in DeepSpeed inference context)
Detailed Description
This is a Wrapper Doc for the external transformers.AutoModelForCausalLM.from_pretrained() API used as the first step in DeepSpeed inference workflows. The loaded model is subsequently passed to deepspeed.init_inference() for kernel injection and tensor parallelism.
The AutoModelForCausalLM class (and related AutoModel variants) is part of the HuggingFace Transformers library. It automatically detects the correct model architecture from the model name or path and instantiates the corresponding PyTorch nn.Module with pretrained weights.
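The architecture detection described above can be pictured as a lookup from the checkpoint's config.json to a concrete model class. The sketch below is a toy illustration only, not the Transformers implementation: the registry, the Toy* classes, and toy_auto_from_pretrained are hypothetical names, and the real library also handles Hub downloads, weight loading, and many more architectures.

```python
# Toy illustration of Auto-class dispatch; NOT the Transformers implementation.
import json
import tempfile
from pathlib import Path


class ToyLlamaForCausalLM:
    """Hypothetical stand-in for one concrete architecture class."""


class ToyGPT2LMHeadModel:
    """Hypothetical stand-in for a second architecture class."""


# Hypothetical registry mapping the config's "model_type" to a class.
TOY_CAUSAL_LM_REGISTRY = {
    "llama": ToyLlamaForCausalLM,
    "gpt2": ToyGPT2LMHeadModel,
}


def toy_auto_from_pretrained(model_dir):
    """Read config.json in model_dir and instantiate the matching class."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    model_type = config["model_type"]
    try:
        model_cls = TOY_CAUSAL_LM_REGISTRY[model_type]
    except KeyError:
        raise ValueError(f"Unrecognized model_type: {model_type!r}") from None
    return model_cls()


def demo():
    """Write a minimal config.json to a temp dir and dispatch on it."""
    with tempfile.TemporaryDirectory() as tmp:
        config_path = Path(tmp) / "config.json"
        config_path.write_text(json.dumps({"model_type": "llama"}))
        return type(toy_auto_from_pretrained(tmp)).__name__
```

Calling demo() resolves the "llama" entry and returns the name of the class that was instantiated; an unknown model_type raises ValueError instead of guessing.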
DeepSpeed-specific usage guidelines:
- Load the model without device_map; DeepSpeed handles device placement internally during init_inference().
- Use torch_dtype=torch.float16 for inference to reduce the memory footprint before DeepSpeed optimization.
- Avoid quantization flags like load_in_8bit or load_in_4bit; DeepSpeed provides its own quantization via QuantizationConfig.
- The returned model is a standard torch.nn.Module compatible with DeepSpeed's InferenceEngine.
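One way to enforce the guidelines above is to screen the keyword arguments before they reach from_pretrained(). This is a minimal sketch: check_load_kwargs and DISALLOWED_KWARGS are hypothetical names introduced for illustration and are not part of DeepSpeed or Transformers.

```python
# Hypothetical helper; not a DeepSpeed or Transformers API.
DISALLOWED_KWARGS = ("device_map", "load_in_8bit", "load_in_4bit")


def check_load_kwargs(**kwargs):
    """Return kwargs unchanged, or raise if any conflict with the guidelines.

    A disallowed key only counts as a conflict when its value is truthy,
    so device_map=None and load_in_8bit=False pass through.
    """
    conflicts = [key for key in DISALLOWED_KWARGS if kwargs.get(key)]
    if conflicts:
        raise ValueError(
            "Remove these arguments before calling deepspeed.init_inference(): "
            + ", ".join(conflicts)
        )
    return kwargs
```

The checked dictionary can then be unpacked into the load call, for example AutoModelForCausalLM.from_pretrained(model_id, **check_load_kwargs(torch_dtype=torch.float16)).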
Code Reference
- Repository: https://github.com/huggingface/transformers (external)
- External reference: https://huggingface.co/docs/transformers/model_doc/auto
- Signature: AutoModelForCausalLM.from_pretrained(model_name_or_path: str, **kwargs) -> PreTrainedModel
- Import: from transformers import AutoModelForCausalLM
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| model_name_or_path | str | Yes | HuggingFace model ID (e.g., "meta-llama/Llama-2-7b-hf") or local path to model directory. Pass it positionally; in Transformers the parameter is named pretrained_model_name_or_path. |
| torch_dtype | torch.dtype | No | Data type for model weights. Use torch.float16 for DeepSpeed inference. |
| trust_remote_code | bool | No | Whether to allow custom model code from the Hub. Default False. |
I/O
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | model_name_or_path | str | HuggingFace model identifier or local checkpoint path |
| Output | model | PreTrainedModel (torch.nn.Module) | Loaded pretrained model ready for DeepSpeed inference |
Usage Example
import torch
from transformers import AutoModelForCausalLM

# Load a pretrained model for DeepSpeed inference
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
)

# Next step: pass to deepspeed.init_inference()
# engine = deepspeed.init_inference(model, ...)
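The follow-on step mentioned in the comments above might look like the sketch below. deepspeed.init_inference() and its dtype, tensor_parallel, and replace_with_kernel_inject options are real DeepSpeed arguments; the helper names build_inference_kwargs and to_inference_engine, and the default tp_size of 1, are assumptions made for illustration. Kernel injection is not available for every architecture, so treat replace_with_kernel_inject=True as a setting to verify for your model.

```python
def build_inference_kwargs(dtype, tp_size=1):
    """Collect keyword arguments for deepspeed.init_inference() (hypothetical helper)."""
    return {
        "dtype": dtype,                           # e.g. torch.float16
        "tensor_parallel": {"tp_size": tp_size},  # number of GPUs to shard across
        "replace_with_kernel_inject": True,       # request DeepSpeed's fused kernels
    }


def to_inference_engine(model, tp_size=1):
    """Wrap an already loaded model in a DeepSpeed inference engine."""
    # Imported lazily so this module can be imported without a GPU stack installed.
    import deepspeed
    import torch

    kwargs = build_inference_kwargs(torch.float16, tp_size)
    return deepspeed.init_inference(model, **kwargs)
```

With the model loaded as in the example above, engine = to_inference_engine(model) returns the engine. For tp_size greater than 1, the script is normally started with the deepspeed launcher so that one process runs per GPU.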
Knowledge Sources
- https://github.com/deepspeedai/DeepSpeed
- https://huggingface.co/docs/transformers/model_doc/auto
- https://www.deepspeed.ai/tutorials/inference-tutorial/
Relationships
Principle:Deepspeedai_DeepSpeed_Inference_Model_Loading
Metadata
- Workflow: Inference_Engine_Optimization
- Type: Implementation (Wrapper Doc)
- Last Updated: 2026-02-09 00:00 GMT