Implementation: Microsoft DeepSpeedExamples Create HF Model
Overview
A concrete utility from the DeepSpeed-Chat library for initializing HuggingFace causal language models for SFT training.
Description
The create_hf_model function is a model initialization utility in the DeepSpeed-Chat pipeline. It handles the full lifecycle of preparing a HuggingFace pre-trained model for distributed fine-tuning:
- Configuration loading: Retrieves the model's `AutoConfig` from the specified model name or local path.
- Dropout override: Optionally reconfigures dropout rates (`dropout`, `attention_dropout`, `hidden_dropout`, `activation_dropout`) across the model configuration via the `configure_dropout` helper.
- DeepSpeed ZeRO-3 integration: When a DeepSpeed config specifying ZeRO Stage 3 is provided, instantiates an `HfDeepSpeedConfig` object. This ensures that model weights are partitioned across GPUs during loading rather than being fully materialized on each device.
- Model loading: Loads pre-trained weights via `from_pretrained` for standard training, or uses `from_config` with `no_init_weights` when `rlhf_training=True` (weight loading is deferred to a separate checkpoint-loading step in the RLHF pipeline).
- Tokenizer alignment: Sets the model's `end_token_id` and `pad_token_id` to the tokenizer's EOS token, ensuring consistent special-token handling.
- Embedding resizing: Resizes the token embedding matrix to the nearest multiple of 8 that accommodates the tokenizer vocabulary; aligning to multiples of 8 improves GPU tensor-core utilization.
Usage
Import create_hf_model when initializing any HuggingFace CausalLM model for RLHF training phases. It is used in:
- SFT (Step 1): Loads the pre-trained base model for supervised fine-tuning on instruction-following data.
- Reward Model creation: Called internally by `create_critic_model` to initialize the base transformer before wrapping it in a `RewardModel` head.
- RLHF (Step 3): Initializes actor and reference models for PPO training.
Code Reference
Source
| Repository | File |
|---|---|
| DeepSpeedExamples | `applications/DeepSpeed-Chat/dschat/utils/model/model_utils.py` |
Signature
```python
def create_hf_model(
    model_class,
    model_name_or_path,
    tokenizer,
    ds_config=None,
    rlhf_training=False,
    dropout=None
) -> nn.Module:
```
Import
```python
from dschat.utils.model.model_utils import create_hf_model
```
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| `model_class` | `type` | Yes | HuggingFace model class, e.g. `AutoModelForCausalLM` |
| `model_name_or_path` | `str` | Yes | HuggingFace model identifier or local filesystem path to pre-trained weights |
| `tokenizer` | `AutoTokenizer` | Yes | HuggingFace tokenizer instance (used for EOS/pad token IDs and vocabulary size) |
| `ds_config` | `dict` | No | DeepSpeed configuration dictionary; triggers ZeRO-3 integration when `zero_optimization.stage == 3` |
| `rlhf_training` | `bool` | No | When `True`, skips weight loading (uses `from_config` with `no_init_weights`); enables deferred checkpoint loading for RLHF |
| `dropout` | `float` | No | Override value for all dropout rates in the model configuration |
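The `ds_config` trigger condition can be sketched as a small predicate over the standard DeepSpeed config layout (the helper name and sample configs below are illustrative, not part of the library):

```python
def uses_zero_stage3(ds_config):
    # True when the DeepSpeed config requests ZeRO Stage 3 weight partitioning
    return (ds_config is not None
            and ds_config.get("zero_optimization", {}).get("stage") == 3)

# Hypothetical configs illustrating the trigger
zero3_cfg = {"zero_optimization": {"stage": 3}}
zero2_cfg = {"zero_optimization": {"stage": 2}}
```

Only `zero3_cfg` would cause `create_hf_model` to instantiate `HfDeepSpeedConfig`; a Stage 2 config or `ds_config=None` loads the model conventionally.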
Outputs
| Name | Type | Description |
|---|---|---|
| `model` | `nn.Module` | Initialized HuggingFace model with resized token embeddings, aligned pad/EOS tokens, and optional ZeRO-3 partitioning |
Usage Examples
The following example demonstrates loading OPT-1.3B for SFT training with DeepSpeed ZeRO-2, matching the configuration used in the Step 1 training scripts:
```python
import deepspeed
import torch
from transformers import AutoModelForCausalLM

from dschat.utils.model.model_utils import create_hf_model
from dschat.utils.utils import load_hf_tokenizer
from dschat.utils.ds_utils import get_train_ds_config

# Load tokenizer
tokenizer = load_hf_tokenizer("facebook/opt-1.3b", fast_tokenizer=True)

# Configure DeepSpeed with ZeRO Stage 2
world_size = torch.distributed.get_world_size() if torch.distributed.is_initialized() else 1
gradient_accumulation_steps = 1
ds_config = get_train_ds_config(offload=False, dtype="fp16", stage=2)
ds_config["train_micro_batch_size_per_gpu"] = 8
ds_config["train_batch_size"] = 8 * world_size * gradient_accumulation_steps

# Create the SFT model
model = create_hf_model(
    model_class=AutoModelForCausalLM,
    model_name_or_path="facebook/opt-1.3b",
    tokenizer=tokenizer,
    ds_config=ds_config,
    rlhf_training=False,
    dropout=None,
)

# The model is now ready for deepspeed.initialize() and SFT training;
# `optimizer` and `lr_scheduler` are assumed to be constructed beforehand,
# as in the Step 1 training script.
model, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config,
    lr_scheduler=lr_scheduler,
    dist_init_required=True,
)
```
Key points in this example:
- `rlhf_training=False` ensures full pre-trained weights are loaded via `from_pretrained`.
- ZeRO Stage 2 partitions optimizer states and gradients across GPUs but keeps the full model on each device.
- The returned model has its embedding table resized to a multiple of 8, and `pad_token_id` set to `eos_token_id`.
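The batch-size bookkeeping in the example follows DeepSpeed's consistency rule: `train_batch_size` must equal `train_micro_batch_size_per_gpu` × gradient accumulation steps × world size. A minimal arithmetic sketch (the helper name is illustrative):

```python
def global_batch_size(micro_batch_per_gpu, grad_accum_steps, world_size):
    # DeepSpeed checks that train_batch_size equals this product at initialize()
    return micro_batch_per_gpu * grad_accum_steps * world_size

# With the example's settings on an assumed 8-GPU node:
print(global_batch_size(8, 1, 8))  # 64
```

If the three values are inconsistent with `train_batch_size`, `deepspeed.initialize()` raises a configuration error, so computing the product explicitly (as the example does) is the safest pattern.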