Implementation: Microsoft DeepSpeedExamples Create HF Model
Overview
A concrete utility from the DeepSpeed-Chat library for initializing HuggingFace causal language models for SFT training.
Description
The create_hf_model function is a model initialization utility in the DeepSpeed-Chat pipeline. It handles the full lifecycle of preparing a HuggingFace pre-trained model for distributed fine-tuning:
- Configuration loading: Retrieves the model's `AutoConfig` from the specified model name or local path.
- Dropout override: Optionally reconfigures dropout rates (`dropout`, `attention_dropout`, `hidden_dropout`, `activation_dropout`) across the model configuration via the `configure_dropout` helper.
- DeepSpeed ZeRO-3 integration: When a DeepSpeed config specifying ZeRO Stage 3 is provided, instantiates an `HfDeepSpeedConfig` object. This ensures that model weights are partitioned across GPUs during loading rather than being fully materialized on each device.
- Model loading: Loads pre-trained weights via `from_pretrained` for standard training, or uses `from_config` with `no_init_weights` when `rlhf_training=True` (weight loading is deferred to a separate checkpoint-loading step in the RLHF pipeline).
- Tokenizer alignment: Sets the model's `end_token_id` and `pad_token_id` to the tokenizer's EOS token, ensuring consistent special-token handling.
- Embedding resizing: Resizes the token embedding matrix to the nearest multiple of 8 that accommodates the tokenizer vocabulary; aligning to multiples of 8 improves GPU tensor-core utilization.
Usage
Import create_hf_model when initializing any HuggingFace CausalLM model for RLHF training phases. It is used in:
- SFT (Step 1): Loads the pre-trained base model for supervised fine-tuning on instruction-following data.
- Reward Model creation: Called internally by `create_critic_model` to initialize the base transformer before wrapping it in a `RewardModel` head.
- RLHF (Step 3): Initializes actor and reference models for PPO training.
Code Reference
Source
| Repository | File |
|---|---|
| DeepSpeedExamples | `applications/DeepSpeed-Chat/dschat/utils/model/model_utils.py` |
Signature
```python
def create_hf_model(
    model_class,
    model_name_or_path,
    tokenizer,
    ds_config=None,
    rlhf_training=False,
    dropout=None
) -> nn.Module:
```
Import
```python
from dschat.utils.model.model_utils import create_hf_model
```
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| `model_class` | `type` | Yes | HuggingFace model class, e.g. `AutoModelForCausalLM` |
| `model_name_or_path` | `str` | Yes | HuggingFace model identifier or local filesystem path to pre-trained weights |
| `tokenizer` | `AutoTokenizer` | Yes | HuggingFace tokenizer instance (used for EOS/pad token IDs and vocabulary size) |
| `ds_config` | `dict` | No | DeepSpeed configuration dictionary; triggers ZeRO-3 integration when `zero_optimization.stage == 3` |
| `rlhf_training` | `bool` | No | When `True`, skips weight loading (uses `from_config` with `no_init_weights`); enables deferred checkpoint loading for RLHF |
| `dropout` | `float` | No | Override value for all dropout rates in the model configuration |
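The `ds_config` trigger condition can be sketched as a small predicate over the standard DeepSpeed config layout (the helper name and sample configs below are illustrative, not part of the library):

```python
def uses_zero_stage3(ds_config):
    # True when the DeepSpeed config requests ZeRO Stage 3 weight partitioning
    return (ds_config is not None
            and ds_config.get("zero_optimization", {}).get("stage") == 3)

# Hypothetical configs illustrating the trigger
zero3_cfg = {"zero_optimization": {"stage": 3}}
zero2_cfg = {"zero_optimization": {"stage": 2}}
```

Only `zero3_cfg` would cause `create_hf_model` to instantiate `HfDeepSpeedConfig`; a Stage 2 config or `ds_config=None` loads the model conventionally.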
Outputs
| Name | Type | Description |
|---|---|---|
| `model` | `nn.Module` | Initialized HuggingFace model with resized token embeddings, aligned pad/EOS tokens, and optional ZeRO-3 partitioning |
Usage Examples
The following example demonstrates loading OPT-1.3B for SFT training with DeepSpeed ZeRO-2, matching the configuration used in the Step 1 training scripts:
```python
import deepspeed
import torch
from transformers import AutoModelForCausalLM

from dschat.utils.model.model_utils import create_hf_model
from dschat.utils.utils import load_hf_tokenizer
from dschat.utils.ds_utils import get_train_ds_config

# Load tokenizer
tokenizer = load_hf_tokenizer("facebook/opt-1.3b", fast_tokenizer=True)

# Configure DeepSpeed with ZeRO Stage 2
world_size = torch.distributed.get_world_size() if torch.distributed.is_initialized() else 1
gradient_accumulation_steps = 1
ds_config = get_train_ds_config(offload=False, dtype="fp16", stage=2)
ds_config["train_micro_batch_size_per_gpu"] = 8
ds_config["train_batch_size"] = 8 * world_size * gradient_accumulation_steps

# Create the SFT model
model = create_hf_model(
    model_class=AutoModelForCausalLM,
    model_name_or_path="facebook/opt-1.3b",
    tokenizer=tokenizer,
    ds_config=ds_config,
    rlhf_training=False,
    dropout=None,
)

# The model is now ready for deepspeed.initialize() and SFT training;
# `optimizer` and `lr_scheduler` are assumed to be constructed beforehand,
# as in the Step 1 training script.
model, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config,
    lr_scheduler=lr_scheduler,
    dist_init_required=True,
)
```
Key points in this example:
- `rlhf_training=False` ensures full pre-trained weights are loaded via `from_pretrained`.
- ZeRO Stage 2 partitions optimizer states and gradients across GPUs but keeps the full model on each device.
- The returned model has its embedding table resized to a multiple of 8, and `pad_token_id` set to `eos_token_id`.
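The batch-size bookkeeping in the example follows DeepSpeed's consistency rule: `train_batch_size` must equal `train_micro_batch_size_per_gpu` × gradient accumulation steps × world size. A minimal arithmetic sketch (the helper name is illustrative):

```python
def global_batch_size(micro_batch_per_gpu, grad_accum_steps, world_size):
    # DeepSpeed checks that train_batch_size equals this product at initialize()
    return micro_batch_per_gpu * grad_accum_steps * world_size

# With the example's settings on an assumed 8-GPU node:
print(global_batch_size(8, 1, 8))  # 64
```

If the three values are inconsistent with `train_batch_size`, `deepspeed.initialize()` raises a configuration error, so computing the product explicitly (as the example does) is the safest pattern.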