
Implementation:Microsoft DeepSpeedExamples Create HF Model




Overview

Concrete tool, provided by the DeepSpeed-Chat library, for initializing HuggingFace causal language models across the SFT, reward-model, and RLHF training phases.

Description

The create_hf_model function is a model initialization utility in the DeepSpeed-Chat pipeline. It handles the full lifecycle of preparing a HuggingFace pre-trained model for distributed fine-tuning:

  1. Configuration loading: Retrieves the model's AutoConfig from the specified model name or local path.
  2. Dropout override: Optionally reconfigures dropout rates (dropout, attention_dropout, hidden_dropout, activation_dropout) across the model configuration via the configure_dropout helper.
  3. DeepSpeed ZeRO-3 integration: When a DeepSpeed config specifying ZeRO Stage 3 is provided, instantiates an HfDeepSpeedConfig object. This ensures that model weights are partitioned across GPUs during loading rather than being fully materialized on each device.
  4. Model loading: Loads pre-trained weights via from_pretrained for standard training, or uses from_config with no_init_weights when rlhf_training=True (weight loading is deferred to a separate checkpoint-loading step in the RLHF pipeline).
  5. Tokenizer alignment: Sets the model's end_token_id and pad_token_id to the tokenizer's EOS token, ensuring consistent special token handling.
  6. Embedding resizing: Resizes the token embedding matrix to the nearest multiple of 8 that accommodates the tokenizer vocabulary. This alignment to multiples of 8 optimizes GPU tensor core utilization.
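The dropout override in step 2 can be sketched in a few lines. The behavior below is assumed from the description above (set every dropout-style attribute the config actually has when an override is given); a SimpleNamespace stands in for the real transformers AutoConfig object:

```python
from types import SimpleNamespace

def configure_dropout(model_config, dropout):
    # Sketch of the helper described in step 2 (behavior assumed from the
    # description): when an override is given, set each dropout-style
    # attribute that exists on the config.
    if dropout is not None:
        for key in ("dropout", "attention_dropout",
                    "hidden_dropout", "activation_dropout"):
            if hasattr(model_config, key):
                setattr(model_config, key, dropout)

# Stand-in config; real code passes a transformers AutoConfig.
cfg = SimpleNamespace(dropout=0.1, attention_dropout=0.1)
configure_dropout(cfg, 0.0)
print(cfg.dropout, cfg.attention_dropout)  # 0.0 0.0
```

Passing dropout=None (the default) leaves the pre-trained configuration untouched, which is why SFT scripts that want the model's stock regularization simply omit the argument.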

Usage

Import create_hf_model to initialize any HuggingFace CausalLM model used in the DeepSpeed-Chat training phases. It is used in:

  • SFT (Step 1): Loads the pre-trained base model for supervised fine-tuning on instruction-following data.
  • Reward Model creation: Called internally by create_critic_model to initialize the base transformer before wrapping it in a RewardModel head.
  • RLHF (Step 3): Initializes actor and reference models for PPO training.

Code Reference

Source

Repository: DeepSpeedExamples
File: applications/DeepSpeed-Chat/dschat/utils/model/model_utils.py

Signature

def create_hf_model(
    model_class,
    model_name_or_path,
    tokenizer,
    ds_config=None,
    rlhf_training=False,
    dropout=None
) -> nn.Module:

Import

from dschat.utils.model.model_utils import create_hf_model

I/O Contract

Inputs

Parameter | Type | Required | Description
model_class | type | Yes | HuggingFace model class, e.g. AutoModelForCausalLM
model_name_or_path | str | Yes | HuggingFace model identifier or local filesystem path to pre-trained weights
tokenizer | AutoTokenizer | Yes | HuggingFace tokenizer instance (used for EOS/pad token IDs and vocabulary size)
ds_config | dict | No | DeepSpeed configuration dictionary; triggers ZeRO-3 integration when zero_optimization.stage == 3
rlhf_training | bool | No | When True, skips weight loading (uses from_config with no_init_weights); enables deferred checkpoint loading for RLHF
dropout | float | No | Override value for all dropout rates in the model configuration
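The ds_config trigger condition operates on a plain configuration dictionary. A minimal sketch of the check (the helper name is mine, not part of the library; only Stage 3 needs HfDeepSpeedConfig so that from_pretrained partitions weights across GPUs at load time):

```python
def needs_zero3_init(ds_config) -> bool:
    # Mirror of the condition described above: ZeRO-3 weight partitioning
    # is enabled only when a config is given and its stage is 3.
    return (ds_config is not None
            and ds_config.get("zero_optimization", {}).get("stage") == 3)

print(needs_zero3_init({"zero_optimization": {"stage": 3}}))  # True
print(needs_zero3_init({"zero_optimization": {"stage": 2}}))  # False
print(needs_zero3_init(None))                                 # False
```

Stages 0-2 take the ordinary loading path, where every rank materializes the full model before DeepSpeed shards optimizer state and gradients.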

Outputs

Name | Type | Description
model | nn.Module | Initialized HuggingFace model with resized token embeddings, aligned pad/EOS tokens, and optional ZeRO-3 partitioning

Usage Examples

The following example demonstrates loading OPT-1.3B for SFT training with DeepSpeed ZeRO-2, matching the configuration used in the Step 1 training scripts:

import deepspeed
import torch
from transformers import AutoModelForCausalLM, get_scheduler
from deepspeed.ops.adam import FusedAdam
from dschat.utils.model.model_utils import create_hf_model
from dschat.utils.utils import load_hf_tokenizer
from dschat.utils.ds_utils import get_train_ds_config

# Load tokenizer
tokenizer = load_hf_tokenizer("facebook/opt-1.3b", fast_tokenizer=True)

# Configure DeepSpeed with ZeRO Stage 2
world_size = torch.distributed.get_world_size() if torch.distributed.is_initialized() else 1
gradient_accumulation_steps = 1  # match the script's --gradient_accumulation_steps
ds_config = get_train_ds_config(offload=False, dtype="fp16", stage=2)
ds_config["train_micro_batch_size_per_gpu"] = 8
ds_config["train_batch_size"] = 8 * world_size * gradient_accumulation_steps

# Create the SFT model
model = create_hf_model(
    model_class=AutoModelForCausalLM,
    model_name_or_path="facebook/opt-1.3b",
    tokenizer=tokenizer,
    ds_config=ds_config,
    rlhf_training=False,
    dropout=None
)

# Build the optimizer and LR scheduler (illustrative hyperparameters;
# the Step 1 script derives these from its command-line arguments)
optimizer = FusedAdam(model.parameters(), lr=1e-5, betas=(0.9, 0.95))
lr_scheduler = get_scheduler("cosine", optimizer=optimizer,
                             num_warmup_steps=0, num_training_steps=1000)

# The model is now ready for deepspeed.initialize() and SFT training
model, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config,
    lr_scheduler=lr_scheduler,
    dist_init_required=True
)

Key points in this example:

  • rlhf_training=False ensures full pre-trained weights are loaded via from_pretrained.
  • ZeRO Stage 2 partitions optimizer states and gradients across GPUs but keeps the full model on each device.
  • The returned model has its embedding table resized to a multiple of 8, and pad_token_id set to eos_token_id.
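The multiple-of-8 rounding in the last point can be sketched directly (the function name below is mine, not part of the library):

```python
import math

def round_vocab_to_multiple_of_8(vocab_size: int) -> int:
    # Smallest multiple of 8 that holds the full vocabulary; multiples
    # of 8 keep the embedding matmuls tensor-core friendly.
    return 8 * math.ceil(vocab_size / 8)

# OPT's tokenizer has 50,265 entries, so the embedding table grows to 50,272.
print(round_vocab_to_multiple_of_8(50265))  # 50272
```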
