Principle:FMInference FlexLLMGen Checkpoint Format Abstraction
| Field | Value |
|---|---|
| Sources | Upstream: DeepSpeed, Paper: FlexGen |
| Domains | Checkpointing, Model_Loading |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A factory-based abstraction that decouples checkpoint loading from the specific checkpoint format and parallelism configuration, enabling transparent loading of model weights across different training frameworks, parallelism strategies, and deployment configurations.
Description
Checkpoint format abstraction addresses a common mismatch: pre-trained models are saved in many different formats and parallelism configurations, while the system that loads them often runs with a different configuration. For example, a model trained with 8-way tensor parallelism on Megatron-LM should be loadable for inference with 4-way parallelism on DeepSpeed without manual conversion.
The abstraction consists of three layers:
- Format detection -- A JSON manifest describes the checkpoint type (Megatron, BLOOM, DS model), the list of checkpoint files, the parallelization strategy (tensor parallel, pipeline parallel), and the original model-parallel size. The factory reads this manifest and selects the appropriate loader.
- Parallelism adaptation -- The loader handles three cases transparently:
  - Same MP size -- Direct load from the matching checkpoint shard.
  - Fewer shards than ranks (split) -- A single checkpoint is split across multiple runtime ranks by partitioning weight tensors along the appropriate dimension.
  - More shards than ranks (merge) -- Multiple checkpoint shards are merged by concatenating weight tensors, reconstructing the full parameter before re-partitioning.
- Post-processing -- After loading and adapting the parallelism, optional post-processing steps include weight quantization (for inference efficiency) and module key resolution (auto-detecting the model submodule within the checkpoint's nested dictionary structure).
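The three parallelism cases above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the loader's actual code: the helper names (`split_rows`, `merge_rows`, `adapt_shards`) are hypothetical, weights are plain lists of rows, every tensor is partitioned along dimension 0, and shard counts divide evenly. A real loader picks the dimension per weight and leaves replicated parameters untouched.

```python
from typing import Dict, List

Matrix = List[List[float]]  # a weight stored as a list of rows (assumption)


def split_rows(weight: Matrix, parts: int) -> List[Matrix]:
    """Partition a weight into `parts` equal row blocks (dimension 0)."""
    rows = len(weight)
    if rows % parts != 0:
        raise ValueError(f"{rows} rows cannot be split evenly into {parts} parts")
    step = rows // parts
    return [weight[i * step:(i + 1) * step] for i in range(parts)]


def merge_rows(blocks: List[Matrix]) -> Matrix:
    """Concatenate row blocks back together along dimension 0."""
    merged: Matrix = []
    for block in blocks:
        merged.extend(block)
    return merged


def adapt_shards(shards: List[Dict[str, Matrix]], rank: int, world_size: int) -> Dict[str, Matrix]:
    """Return the state dict that runtime rank `rank` should load."""
    num_shards = len(shards)

    if num_shards == world_size:
        # Same MP size: load the matching shard directly.
        return dict(shards[rank])

    if num_shards < world_size:
        # Split: several ranks share one shard, each taking one slice of it.
        if world_size % num_shards != 0:
            raise ValueError("world size must be a multiple of the shard count")
        ranks_per_shard = world_size // num_shards
        source = shards[rank // ranks_per_shard]
        offset = rank % ranks_per_shard
        return {key: split_rows(w, ranks_per_shard)[offset] for key, w in source.items()}

    # Merge: each rank concatenates the contiguous group of shards it owns.
    if num_shards % world_size != 0:
        raise ValueError("shard count must be a multiple of the world size")
    shards_per_rank = num_shards // world_size
    group = shards[rank * shards_per_rank:(rank + 1) * shards_per_rank]
    return {key: merge_rows([shard[key] for shard in group]) for key in group[0]}
```

In the merge branch the sketch concatenates only the shards a rank needs. For contiguous, evenly sized partitions this yields the same slice as rebuilding the full parameter and re-partitioning it, so an 8-shard checkpoint loaded on 4 ranks gives each rank two consecutive shards joined together.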
The key design principle is that the consumer of the loaded state dict does not need to know anything about the original checkpoint format or parallelism configuration. The factory handles all adaptation internally, presenting a uniform state dict interface.
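A rough sketch of the factory step follows, to show how a manifest can select a loader without the caller naming a concrete class. Everything here is illustrative: the class names, the registry, the manifest keys (`type`, `checkpoints`, `parallelization`, `mp_size`), and the shard file names are assumptions chosen to mirror the description above, not a confirmed schema or API.

```python
import json
from typing import Dict, List, Type


class CheckpointLoader:
    """Hypothetical base class; one subclass per checkpoint format."""

    def __init__(self, files: List[str], mp_size: int) -> None:
        self.files = files
        self.mp_size = mp_size

    def describe(self) -> str:
        return f"{type(self).__name__}: {len(self.files)} shard(s), original MP size {self.mp_size}"


class MegatronLoader(CheckpointLoader):
    pass


class BloomLoader(CheckpointLoader):
    pass


class DSModelLoader(CheckpointLoader):
    pass


# Registry keyed by the manifest's (assumed) "type" field.
LOADERS: Dict[str, Type[CheckpointLoader]] = {
    "megatron": MegatronLoader,
    "bloom": BloomLoader,
    "ds_model": DSModelLoader,
}


def loader_from_manifest(manifest_text: str) -> CheckpointLoader:
    """Parse a JSON manifest and build the loader that matches its type."""
    manifest = json.loads(manifest_text)
    kind = manifest["type"].lower()
    if kind not in LOADERS:
        raise ValueError(f"unsupported checkpoint type: {kind!r}")
    files = manifest["checkpoints"]
    # Fall back to the shard count when the manifest omits the MP size.
    mp_size = manifest.get("mp_size", len(files))
    return LOADERS[kind](files, mp_size)


# Example manifest with hypothetical shard file names.
example = json.dumps({
    "type": "Megatron",
    "checkpoints": ["mp_rank_00.pt", "mp_rank_01.pt"],
    "parallelization": "tp",
    "mp_size": 2,
})
loader = loader_from_manifest(example)
print(loader.describe())
```

The caller only ever holds a `CheckpointLoader`, so supporting a new format in this sketch means registering one more subclass rather than changing any consumer code.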
Usage
Use checkpoint format abstraction when building systems that must load models from multiple sources (Hugging Face Hub, Megatron checkpoints, custom training runs) whose parallelism configurations may differ from the runtime's. This is essential for FlexLLMGen's benchmark suite, which evaluates models trained with various frameworks.
Theoretical Basis
The factory pattern provides polymorphic object creation: the correct loader implementation is selected at runtime based on the checkpoint metadata, without the caller needing to know the concrete class. The parallelism adaptation logic relies on the fact that a tensor-parallel partition is a set of contiguous slices along one dimension: the shards can be concatenated back into the original tensor and re-split along that same dimension. This allows conversion between MP sizes whenever the partitioned dimension divides evenly by the target size.
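The recombination property can be written out for one simple case. The notation below is introduced here for illustration only: a weight $W$ with $d$ rows and $k$ columns is partitioned along its rows into $n$ equal shards and converted to $m$ shards, assuming $d$ is divisible by both $n$ and $m$.

```latex
% Original checkpoint: n row shards of equal height d/n
W = \begin{bmatrix} W_1 \\ \vdots \\ W_n \end{bmatrix},
\qquad W_i \in \mathbb{R}^{(d/n) \times k}

% Target shard j of m is a contiguous row range of the full weight
W'_j = W\!\left[\tfrac{(j-1)\,d}{m} + 1 : \tfrac{j\,d}{m},\; :\right],
\qquad j = 1, \dots, m

% Merge case (n = g m): each target shard stacks g consecutive source shards
W'_j = \begin{bmatrix} W_{(j-1)g+1} \\ \vdots \\ W_{jg} \end{bmatrix}

% Split case (m = s n): each source shard is cut into s consecutive target shards
W_i = \begin{bmatrix} W'_{(i-1)s+1} \\ \vdots \\ W'_{is} \end{bmatrix}
```

In both cases stacking the target shards reproduces the same $W$ as stacking the source shards, so no information is lost by the conversion. The same argument applies column-wise for weights partitioned along their second dimension.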