
Implementation:Huggingface Trl PPO Model Loading Pattern

From Leeroopedia


Property Value
Implementation Name PPO Model Loading Pattern
Technology Huggingface TRL, Transformers
Type Pattern Doc
Workflow PPO RLHF Training
Principle Principle:Huggingface_Trl_PPO_Multi_Model_Loading

Overview

Description

The PPO model loading pattern loads the components required for the full RLHF pipeline: a tokenizer with left-side padding, a value model and a reward model (both sequence classifiers with num_labels=1), a policy (a causal language model), and a reference policy (a frozen copy of the policy). When PEFT is enabled, the reference policy is omitted, since the adapter-free base model weights serve as the reference.

Usage

This pattern is implemented in the PPO training script. The models are loaded independently and then passed to PPOTrainer.

Code Reference

Source Location

examples/scripts/ppo/ppo.py lines 97-121

Pattern

# Tokenizer with left-side padding for generation
tokenizer = AutoTokenizer.from_pretrained(
    model_args.model_name_or_path,
    padding_side="left",
    trust_remote_code=model_args.trust_remote_code,
)
tokenizer.add_special_tokens({"pad_token": "[PAD]"})

# Value model: initialized from reward model checkpoint
value_model = AutoModelForSequenceClassification.from_pretrained(
    training_args.reward_model_path,
    trust_remote_code=model_args.trust_remote_code,
    num_labels=1,
    **model_kwargs,
)

# Reward model: frozen scoring function
reward_model = AutoModelForSequenceClassification.from_pretrained(
    training_args.reward_model_path,
    trust_remote_code=model_args.trust_remote_code,
    num_labels=1,
    **model_kwargs,
)

# Policy: the causal LM to be optimized
policy = AutoModelForCausalLM.from_pretrained(
    training_args.sft_model_path,
    trust_remote_code=model_args.trust_remote_code,
    **model_kwargs,
)

# Reference policy: frozen copy for KL divergence (skipped when using PEFT)
peft_config = get_peft_config(model_args)
if peft_config is None:
    ref_policy = AutoModelForCausalLM.from_pretrained(
        training_args.sft_model_path,
        trust_remote_code=model_args.trust_remote_code,
        **model_kwargs,
    )
else:
    ref_policy = None
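The PEFT branch at the end of the pattern reduces to a simple rule: load a frozen copy only when no adapter config is present. A minimal pure-Python sketch of that gating, where load_model stands in for the AutoModelForCausalLM.from_pretrained call:

```python
def maybe_load_ref_policy(peft_config, load_model):
    """Load a frozen reference copy only when PEFT is disabled.

    With a PEFT config, the trainer can disable the adapters to recover
    the base model's outputs, so no separate reference copy is needed.
    """
    if peft_config is None:
        return load_model()
    return None
```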

Import

from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSequenceClassification
from trl import ModelConfig, get_peft_config, get_quantization_config, get_kbit_device_map

I/O Contract

Inputs

Parameter Type Source Description
model_args.model_name_or_path str ModelConfig Base model identifier (for tokenizer)
training_args.sft_model_path str PPOConfig Path to the supervised fine-tuned model (for policy and ref_policy)
training_args.reward_model_path str PPOConfig Path to the trained reward model (for reward_model and value_model)
model_kwargs dict Computed Contains revision, attn_implementation, dtype, and optionally quantization_config and device_map
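As a hedged sketch, a model_kwargs dict of this shape could be assembled as follows. The field values here are illustrative placeholders; in the real script they come from ModelConfig and the get_quantization_config / get_kbit_device_map helpers listed in the imports above.

```python
def build_model_kwargs(revision, attn_implementation, dtype, quantization_config=None):
    # Base kwargs shared by all four from_pretrained calls.
    kwargs = {
        "revision": revision,
        "attn_implementation": attn_implementation,
        "torch_dtype": dtype,
    }
    if quantization_config is not None:
        # A device map is only needed when loading quantized weights;
        # {"": 0} is a placeholder single-device map.
        kwargs["quantization_config"] = quantization_config
        kwargs["device_map"] = {"": 0}
    return kwargs
```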

Outputs

Output Type Description
tokenizer PreTrainedTokenizerBase Tokenizer with padding_side="left" and explicit pad_token
policy AutoModelForCausalLM Trainable causal language model (actor)
ref_policy AutoModelForCausalLM or None Frozen reference policy; None when using PEFT
reward_model AutoModelForSequenceClassification Frozen reward scoring model (num_labels=1)
value_model AutoModelForSequenceClassification Trainable value estimator (critic, num_labels=1)

Usage Examples

Standard Four-Model Loading

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSequenceClassification

sft_path = "my-sft-model"
reward_path = "my-reward-model"

tokenizer = AutoTokenizer.from_pretrained(sft_path, padding_side="left")
tokenizer.add_special_tokens({"pad_token": "[PAD]"})

model_kwargs = {"torch_dtype": torch.bfloat16}

value_model = AutoModelForSequenceClassification.from_pretrained(
    reward_path, num_labels=1, **model_kwargs
)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    reward_path, num_labels=1, **model_kwargs
)
policy = AutoModelForCausalLM.from_pretrained(sft_path, **model_kwargs)
ref_policy = AutoModelForCausalLM.from_pretrained(sft_path, **model_kwargs)
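One caveat after the snippet above: add_special_tokens may introduce a genuinely new [PAD] token, in which case every model sharing the tokenizer needs its embedding matrix grown to match (resize_token_embeddings is the standard transformers PreTrainedModel method). A sketch of that sync step, with a stub standing in for the loaded models:

```python
def sync_vocab_sizes(tokenizer_len, models):
    # Grow each model's input embeddings to cover the enlarged vocabulary.
    for model in models:
        model.resize_token_embeddings(tokenizer_len)

# Hypothetical usage with the four models loaded above:
# sync_vocab_sizes(len(tokenizer), [policy, ref_policy, value_model, reward_model])
```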

With PEFT (Three Models)

from peft import LoraConfig

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    task_type="CAUSAL_LM",
)

# No reference policy needed when using PEFT
ref_policy = None
# The PPOTrainer will wrap the policy with LoRA adapters
# and use the base model weights as the reference

With Quantization

import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig

reward_path = "my-reward-model"
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model_kwargs = {
    "torch_dtype": torch.bfloat16,
    "quantization_config": quantization_config,
    "device_map": {"": 0},
}

# Load all models with 4-bit quantization
value_model = AutoModelForSequenceClassification.from_pretrained(
    reward_path, num_labels=1, **model_kwargs
)
