Heuristic: Zai org CogVideo DeepSpeed Checkpoint Conversion Tips
| Knowledge Sources | |
|---|---|
| Domains | DeepSpeed, Checkpointing, Model_Conversion |
| Last Updated | 2026-02-10 02:00 GMT |
Overview
DeepSpeed checkpoint conversion requires substantial CPU RAM (usage roughly doubles during ZeRO-2 merging), shard-by-shard processing with explicit `gc.collect()` calls, and a final bf16 cast to halve storage. Run `zero_to_fp32.py` on SFT checkpoints before inference.
Description
Converting DeepSpeed ZeRO checkpoints to HuggingFace format is a memory-intensive process with several non-obvious pitfalls. The conversion tool must handle ZeRO Stage 1/2/3 checkpoint formats, merge sharded optimizer states, convert buffers from fp16 back to fp32 for correctness, and manage RAM usage carefully. The codebase includes warnings about memory doubling during ZeRO-2 merging and the need for out-of-core computing for models too large to fit in RAM.
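A back-of-envelope RAM check for the ZeRO-2 case can be sketched as follows. The helper name, the 5B parameter count, and the fp32 assumption are illustrative; only the 2x factor comes from the rule of thumb above.

```python
def peak_merge_ram_gb(num_params, bytes_per_param=4, safety=2.0):
    # Rule of thumb: ZeRO-2 merging roughly doubles memory, so budget
    # ~2x the fp32 model size in host RAM (safety=2.0 is that factor).
    return num_params * bytes_per_param * safety / 1024**3

# e.g. a ~5e9-parameter model in fp32 needs roughly 37 GB of CPU RAM to merge.
budget = peak_merge_ram_gb(5e9)
```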
Usage
Apply these tips when converting trained CogVideoX checkpoints from DeepSpeed format to HuggingFace Diffusers format for inference or distribution.
The Insight (Rule of Thumb)
- SFT models: Run `zero_to_fp32.py` in the `checkpoint-*/` directory to merge weights before inference.
- RAM requirement: Ensure CPU RAM is at least 2x the model size for ZeRO-2 conversion (memory doubles during merging).
- Shard processing: Delete each shard and call `gc.collect()` after saving to prevent OOM on large models.
- Buffer precision: Buffers (normalization running stats) saved in fp16 must be recovered to fp32 for correct behavior.
- Storage format: Convert final checkpoint to bf16 for storage, halving the checkpoint size.
- Multi-node: Checkpoint glob patterns may not work correctly for multi-node setups (acknowledged as untested).
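The bf16 storage step above can be sketched as follows; `to_bf16_for_storage` is a hypothetical helper, not a function from the conversion tool.

```python
import torch

def to_bf16_for_storage(state_dict):
    # Cast floating-point tensors to bf16, halving checkpoint size on disk;
    # integer tensors (e.g. step counters) are left untouched.
    return {k: v.to(torch.bfloat16) if v.is_floating_point() else v
            for k, v in state_dict.items()}

sd = {"proj.weight": torch.randn(4, 4), "step": torch.tensor(100)}
compact = to_bf16_for_storage(sd)
```

bf16 keeps fp32's exponent range, so the cast trades mantissa precision (fine for storage) rather than risking overflow the way an fp16 cast would.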
Reasoning
Memory doubling warning from `tools/convert_weight_deepspeed2hf.py:282`:
```python
# XXX: memory usage doubles here (zero2)
```
Out-of-core computing note from `tools/convert_weight_deepspeed2hf.py:306-307`:
```python
# XXX: for huge models that can't fit into the host's RAM we will have to
# recode this to support out-of-core computing solution
```
Shard-by-shard cleanup from `tools/convert_weight_deepspeed2hf.py:729-734`:
```python
# release the memory of current shard
for tensor_name in list(shard_state_dict.keys()):
    del state_dict[tensor_name]
    del shard_state_dict[tensor_name]
del shard_state_dict
gc.collect()
```
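Put together, the shard-by-shard pattern looks roughly like this. This is a minimal sketch: `save_in_shards`, `shard_plan`, and the `save_fn` callback are hypothetical names, while the real tool persists each shard with `torch.save`.

```python
import gc

def save_in_shards(state_dict, shard_plan, save_fn):
    # shard_plan: list of tensor-name lists, one list per shard (assumed input).
    # save_fn(shard, idx): persists one shard; torch.save in the real tool.
    for idx, names in enumerate(shard_plan):
        shard = {name: state_dict[name] for name in names}
        save_fn(shard, idx)
        # Release the memory of the current shard before building the next one,
        # so peak RAM stays near one shard above the remaining state dict.
        for name in names:
            del state_dict[name]
            del shard[name]
        del shard
        gc.collect()
```

Deleting entries from both `state_dict` and the shard dict is what lets the garbage collector actually reclaim the tensors; keeping either reference alive would defeat the cleanup.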
Buffer fp32 recovery from `tools/convert_weight_deepspeed2hf.py:122`:
```python
# recover just the buffers while restoring them to fp32 if they were saved in fp16
buffers = {k: v.float() for k, v in state_dict["module"].items() if k in buffer_names}
```
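The same recovery pattern in a self-contained form; the tensor names below are illustrative, not taken from a real CogVideoX checkpoint.

```python
import torch

# A fp16-saved module state dict containing one buffer and one parameter.
state_dict = {"module": {
    "norm.running_mean": torch.zeros(8, dtype=torch.float16),
    "proj.weight": torch.zeros(8, 8, dtype=torch.float16),
}}
buffer_names = {"norm.running_mean"}

# Pick out only the buffers and upcast them to fp32, as the tool does.
buffers = {k: v.float() for k, v in state_dict["module"].items() if k in buffer_names}
```

Buffers such as normalization running statistics accumulate small updates, so leaving them in fp16 quietly degrades inference accuracy even when the weights themselves are fine in half precision.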
Multi-node warning from `tools/convert_weight_deepspeed2hf.py:94`:
```python
# XXX: need to test that this simple glob rule works for multi-node setup too
```
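For context, the glob in question resembles DeepSpeed's per-rank optimizer-file discovery. The sketch below is an assumption about that pattern, not the tool's exact code; the point is that on multi-node runs every rank must have written into (or been synced to) the same directory for one glob to see all shards.

```python
import glob
import os

def find_optim_files(ckpt_dir):
    # DeepSpeed writes one *_optim_states.pt file per rank; a single-directory
    # glob only finds them all if all nodes saved to shared storage.
    return sorted(glob.glob(os.path.join(ckpt_dir, "*_optim_states.pt")))
```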