Principle:FMInference FlexLLMGen Checkpoint Format Abstraction

Field	Value
Sources	Upstream: DeepSpeed, Paper: FlexGen
Domains	Checkpointing, Model_Loading
Last Updated	2026-02-09 00:00 GMT

Overview

A factory-based abstraction that decouples checkpoint loading from the specific checkpoint format and parallelism configuration, enabling transparent loading of model weights across different training frameworks, parallelism strategies, and deployment configurations.

Description

Checkpoint format abstraction addresses a common mismatch: pre-trained models are saved in many different formats and parallelism configurations, while the system loading them may use a different configuration. For example, a model trained with 8-way tensor parallelism on Megatron-LM should be loadable for inference with 4-way parallelism on DeepSpeed without manual conversion.

The abstraction consists of three layers:

Format detection -- A JSON manifest describes the checkpoint type (Megatron, BLOOM, DS model), the list of checkpoint files, the parallelization strategy (tensor parallel, pipeline parallel), and the original model-parallel size. The factory reads this manifest and selects the appropriate loader.

Parallelism adaptation -- The loader handles three cases transparently:
- Same MP size -- Direct load from the matching checkpoint shard.
- Fewer shards than ranks (split) -- A single checkpoint is split across multiple runtime ranks by partitioning weight tensors along the appropriate dimension.
- More shards than ranks (merge) -- Multiple checkpoint shards are merged by concatenating weight tensors along the partitioned dimension, so that each runtime rank ends up with the larger partition it needs.

Post-processing -- After loading and adapting the parallelism, optional post-processing steps include weight quantization (for inference efficiency) and module key resolution (auto-detecting the model submodule within the checkpoint's nested dictionary structure).
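The three parallelism cases described above reduce to index arithmetic over shard and rank counts. The following is a minimal sketch of that arithmetic, not the upstream implementation: the function name `plan_shard_load` and its return tuple are hypothetical, and it assumes the shard count and the runtime MP size divide evenly.

```python
from typing import List, Tuple


def plan_shard_load(
    num_shards: int, mp_world_size: int, mp_rank: int
) -> Tuple[str, List[int], int, int]:
    """Decide which checkpoint shard(s) a runtime rank should read.

    Returns (mode, shard_indices, num_pieces, piece_offset).
    Hypothetical helper for illustration only; assumes even divisibility.
    """
    if not 0 <= mp_rank < mp_world_size:
        raise ValueError("mp_rank must be in [0, mp_world_size)")

    if num_shards == mp_world_size:
        # Same MP size: rank i reads shard i unchanged.
        return ("direct", [mp_rank], 1, 0)

    if num_shards < mp_world_size:
        # Split: several ranks share one shard, each taking one slice of it.
        if mp_world_size % num_shards != 0:
            raise ValueError("mp_world_size must be a multiple of num_shards")
        ranks_per_shard = mp_world_size // num_shards
        return (
            "split",
            [mp_rank // ranks_per_shard],   # the shard this rank reads
            ranks_per_shard,                # how many slices the shard is cut into
            mp_rank % ranks_per_shard,      # which slice this rank keeps
        )

    # Merge: each rank concatenates a contiguous group of shards.
    if num_shards % mp_world_size != 0:
        raise ValueError("num_shards must be a multiple of mp_world_size")
    shards_per_rank = num_shards // mp_world_size
    start = mp_rank * shards_per_rank
    return ("merge", list(range(start, start + shards_per_rank)), 1, 0)
```

For example, with 8 checkpoint shards and 4 runtime ranks, rank 1 would merge shards 2 and 3; with 2 shards and 4 ranks, rank 3 would read shard 1 and keep the second of its two slices.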

The key design principle is that the consumer of the loaded state dict does not need to know anything about the original checkpoint format or parallelism configuration. The factory handles all adaptation internally, presenting a uniform state dict interface.
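As an illustration of how a factory can hide the format from the caller, here is a minimal sketch. The manifest keys (`type`, `checkpoints`, `mp_size`), the class names, and the `get_loader` function are assumptions made for this example and are not taken from the FlexLLMGen or DeepSpeed sources; the format-specific reading logic is deliberately left unimplemented.

```python
import json
from typing import Dict, List, Type


class CheckpointLoader:
    """Common interface every format-specific loader exposes (hypothetical)."""

    def __init__(self, files: List[str], mp_size: int) -> None:
        self.files = files      # checkpoint shard paths listed in the manifest
        self.mp_size = mp_size  # model-parallel size the checkpoint was saved with

    def load(self, mp_world_size: int, mp_rank: int) -> Dict[str, object]:
        # Format-specific reading and parallelism adaptation would go here.
        raise NotImplementedError


class MegatronLoader(CheckpointLoader):
    pass


class BloomLoader(CheckpointLoader):
    pass


# Registry mapping the manifest's checkpoint type to a loader class.
LOADERS: Dict[str, Type[CheckpointLoader]] = {
    "megatron": MegatronLoader,
    "bloom": BloomLoader,
}


def get_loader(manifest_text: str) -> CheckpointLoader:
    """Parse a JSON manifest and return the matching loader instance."""
    manifest = json.loads(manifest_text)
    kind = manifest["type"].lower()
    if kind not in LOADERS:
        raise ValueError(f"unsupported checkpoint type: {kind!r}")
    return LOADERS[kind](manifest["checkpoints"], manifest["mp_size"])


# Assumed manifest layout, for illustration only.
example_manifest = json.dumps(
    {"type": "Megatron", "checkpoints": ["shard_00.pt", "shard_01.pt"], "mp_size": 2}
)
loader = get_loader(example_manifest)
```

The caller only ever holds a `CheckpointLoader`; supporting another format in this sketch means registering one more class rather than changing the call site.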

Usage

Use checkpoint format abstraction when building systems that must load models from multiple sources (HuggingFace Hub, Megatron checkpoints, custom training runs) with potentially different parallelism configurations. This is essential for FlexLLMGen's benchmark suite, which evaluates models trained with various frameworks.

Theoretical Basis

The factory pattern provides polymorphic object creation: the correct loader implementation is selected at runtime based on the checkpoint metadata, without the caller needing to know the concrete class. The parallelism adaptation logic relies on the property that tensor-parallel weight partitions can be recombined and re-split along the same dimensions, which is what makes conversion between MP sizes possible (most simply when one size evenly divides the other).
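The recombine-and-re-split property can be checked on a toy weight matrix. This sketch uses plain nested lists rather than real tensors, and the helper names (`split_columns`, `concat_columns`, `convert_mp`) are hypothetical. It partitions along columns only; in a real model the partitioned dimension depends on the layer.

```python
from typing import List

Matrix = List[List[float]]


def split_columns(weight: Matrix, parts: int) -> List[Matrix]:
    """Partition a matrix into `parts` equal-width column blocks."""
    cols = len(weight[0])
    if cols % parts != 0:
        raise ValueError("column count must be divisible by parts")
    width = cols // parts
    return [
        [row[p * width:(p + 1) * width] for row in weight]
        for p in range(parts)
    ]


def concat_columns(shards: List[Matrix]) -> Matrix:
    """Rebuild a matrix by concatenating column blocks in shard order."""
    num_rows = len(shards[0])
    return [sum((shard[r] for shard in shards), []) for r in range(num_rows)]


def convert_mp(shards: List[Matrix], new_mp_size: int) -> List[Matrix]:
    """Convert between MP sizes by merging, then re-splitting on the same axis."""
    return split_columns(concat_columns(shards), new_mp_size)


# A 2 x 8 toy weight saved with 8-way partitioning, converted to 4-way.
full = [[float(r * 8 + c) for c in range(8)] for r in range(2)]
eight_way = split_columns(full, 8)
four_way = convert_mp(eight_way, 4)
```

Because splitting and concatenating along the same axis are inverse operations, merging `four_way` back together yields the original `full` matrix, whichever intermediate MP size was used.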
