Principle: OpenRLHF Model Checkpointing
| Knowledge Sources | |
|---|---|
| Domains | Training_Infrastructure, Distributed_Computing |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A persistence pattern that saves trained model weights from distributed DeepSpeed processes to a unified HuggingFace-compatible format on disk.
Description
Model Checkpointing in distributed training requires gathering sharded model parameters from all processes and saving them in a format usable for inference or further training. This principle handles ZeRO-3 parameter gathering, LoRA adapter extraction, and conversion to HuggingFace model format. It supports saving the full model or only LoRA adapter weights.
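The full-model vs. adapter-only choice can be sketched as a simple filter over the state dict. This is a minimal illustration, not OpenRLHF's actual code: the `select_weights` helper is hypothetical, and it assumes LoRA adapter entries follow PEFT's convention of containing "lora_" in their key. Plain lists stand in for tensors.

```python
# Sketch (hypothetical helper): choose which weights to persist.
# Assumes PEFT-style naming where adapter keys contain "lora_".

def select_weights(state_dict, lora_only=False):
    """Return the weights to save: the full dict, or only LoRA adapters."""
    if not lora_only:
        return dict(state_dict)
    return {k: v for k, v in state_dict.items() if "lora_" in k}

# Plain lists stand in for torch tensors.
full = {
    "model.layers.0.self_attn.q_proj.weight": [0.1],
    "model.layers.0.self_attn.q_proj.lora_A.weight": [0.2],
    "model.layers.0.self_attn.q_proj.lora_B.weight": [0.3],
}

# With lora_only=True, only the two adapter entries are kept.
adapters = select_weights(full, lora_only=True)
```

Saving only the adapters keeps checkpoints small (megabytes instead of gigabytes), at the cost of needing the base model available at load time.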
Usage
Use at the end of any training workflow to persist the trained model, or at intermediate checkpoints for fault tolerance. The saved model is compatible with HuggingFace's from_pretrained loading.
Theoretical Basis
In ZeRO-3 training, model parameters are sharded across all processes. Saving requires:
- Parameter gathering: each parameter's shards are all-gathered across ranks so the full tensor can be materialized
- State dict construction: only rank 0 assembles the full state dict, typically moving tensors to CPU to bound GPU memory
- LoRA extraction: if using LoRA, only the adapter weights are extracted and saved via PEFT
- Disk writing: rank 0 writes the model weights, config, and tokenizer to the output directory
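The gathering and rank-0 construction steps above can be simulated in a single process. This is a conceptual sketch only: real code would use DeepSpeed's gathering utilities and torch.distributed, whereas here each rank's shard is a plain list of floats and "gathering" is concatenation.

```python
# Single-process simulation of ZeRO-3 checkpointing steps.
# Each parameter is stored as a list of per-rank shards.

def gather_full_param(shards):
    """All-gather: concatenate per-rank shards into the full parameter."""
    full = []
    for shard in shards:
        full.extend(shard)
    return full

def build_state_dict(sharded_params, rank):
    """Only rank 0 materializes the full state dict; other ranks skip it."""
    if rank != 0:
        return None
    return {
        name: gather_full_param(shards)
        for name, shards in sharded_params.items()
    }

# Two ranks, each holding half of every parameter.
sharded = {
    "embed.weight": [[1.0, 2.0], [3.0, 4.0]],
    "lm_head.weight": [[5.0], [6.0]],
}

state_dict = build_state_dict(sharded, rank=0)   # full tensors on rank 0
nothing = build_state_dict(sharded, rank=1)      # non-zero ranks hold nothing
```

In the real distributed setting, all ranks must participate in the gather collectives even though only rank 0 keeps the result, which is why the gathering step cannot be guarded by a rank check.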