
Principle:OpenRLHF Model Checkpointing

From Leeroopedia


Knowledge Sources
Domains Training_Infrastructure, Distributed_Computing
Last Updated 2026-02-07 00:00 GMT

Overview

A persistence pattern that saves trained model weights from distributed DeepSpeed processes to a unified HuggingFace-compatible format on disk.

Description

Model Checkpointing in distributed training requires gathering sharded model parameters from all processes and saving them in a format usable for inference or further training. This principle handles ZeRO-3 parameter gathering, LoRA adapter extraction, and conversion to HuggingFace model format. It supports saving the full model or only LoRA adapter weights.
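To illustrate the full-model versus adapter-only choice, the sketch below filters a state dict down to LoRA adapter weights. This is a minimal stand-in for what PEFT's saving machinery does internally; the key names and the plain-list "tensors" are invented for illustration.

```python
# Illustrative sketch only: real code calls the PEFT model's
# save_pretrained(), which extracts adapter weights itself.
# Key names below are hypothetical examples.

def extract_lora_weights(state_dict):
    """Keep only LoRA adapter tensors (keys containing 'lora_')."""
    return {k: v for k, v in state_dict.items() if "lora_" in k}

full_state = {
    "model.layers.0.self_attn.q_proj.weight": [0.1, 0.2],
    "model.layers.0.self_attn.q_proj.lora_A.weight": [0.01],
    "model.layers.0.self_attn.q_proj.lora_B.weight": [0.02],
}

# Adapter-only save keeps the two LoRA tensors and drops the base weight.
adapter_state = extract_lora_weights(full_state)
print(sorted(adapter_state))
```

Saving only the adapter keeps checkpoints small (megabytes instead of gigabytes), at the cost of needing the base model available at load time.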

Usage

Use at the end of any training workflow to persist the trained model, or at intermediate checkpoints for fault tolerance. The saved model is compatible with HuggingFace's from_pretrained loading.
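The save/load round trip can be sketched as below. JSON stands in for the real serialization format, and the file name is illustrative; an actual workflow would call `save_pretrained` on the model and tokenizer and later reload with `from_pretrained`.

```python
# Minimal sketch of checkpoint persistence, assuming JSON in place of the
# real HuggingFace serialization. Paths and file names are illustrative.
import json
import os
import tempfile

def save_checkpoint(state_dict, output_dir):
    """Write a state dict to disk under output_dir."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, "checkpoint.json")
    with open(path, "w") as f:
        json.dump(state_dict, f)
    return path

def load_checkpoint(output_dir):
    """Read the state dict back, as from_pretrained would reload weights."""
    with open(os.path.join(output_dir, "checkpoint.json")) as f:
        return json.load(f)

with tempfile.TemporaryDirectory() as out:
    saved = {"layer.weight": [1.0, 2.0]}
    save_checkpoint(saved, out)
    restored = load_checkpoint(out)
```

The round trip is what makes intermediate checkpoints useful for fault tolerance: a restored checkpoint must reproduce the saved state exactly.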

Theoretical Basis

In ZeRO-3 training, model parameters are sharded across all processes. Saving requires:

  1. Parameter gathering: Each rank gathers full parameters from all other ranks
  2. State dict construction: Only rank 0 constructs the full state dict
  3. LoRA extraction: If using LoRA, only adapter weights are saved via PEFT
  4. Disk writing: Model and tokenizer saved to output directory
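The steps above can be sketched with plain Python dicts standing in for sharded tensors. The world size, shard layout, and rank variable are invented for illustration; real code relies on DeepSpeed's parameter-gathering utilities rather than this hand-rolled loop.

```python
# Pure-Python illustration of ZeRO-3 checkpoint assembly.
# Real code uses DeepSpeed's gathering machinery; the shard
# layout here (one slice per rank) is invented.

WORLD_SIZE = 4

# Each rank holds one shard of every parameter.
shards = {
    rank: {"w": [float(rank)]}  # rank r owns slice r of parameter "w"
    for rank in range(WORLD_SIZE)
}

def gather_parameter(name):
    """Step 1: concatenate the shards of one parameter across all ranks."""
    full = []
    for rank in range(WORLD_SIZE):
        full.extend(shards[rank][name])
    return full

# Step 2: only rank 0 constructs the full state dict.
rank = 0
if rank == 0:
    state_dict = {"w": gather_parameter("w")}
    # Steps 3-4 (optional LoRA filtering, then writing model and
    # tokenizer to the output directory) would follow here.
```

On ranks other than 0, the gathered parameters are discarded after the collective completes, so only one process pays the disk-write cost.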

Related Pages

Implemented By
