Principle:FMInference FlexLLMGen Model Weight Downloading
Metadata
| Field | Value |
|---|---|
| source Repo | FlexLLMGen |
| source Doc | HuggingFace Hub |
Domains
- Model_Preparation
- Data_Pipeline
Last Updated
2026-02-09 00:00 GMT
Overview
A model preparation pipeline that downloads pre-trained weights from HuggingFace Hub and converts them from PyTorch checkpoint format into per-tensor NumPy files that can be loaded efficiently through memory mapping.
Description
Large language models are commonly distributed on HuggingFace Hub as PyTorch .bin checkpoint files. FlexLLMGen converts these into individual NumPy .npy files, one per parameter tensor, so that weights can be memory-mapped and loaded without the full model having to fit in memory. The download_opt_weights function calls snapshot_download to fetch the checkpoint from HuggingFace Hub, iterates over the checkpoint shards, renames parameter keys (e.g., removing the "model." prefix), and saves each tensor as a separate .npy file. It also handles shared embeddings by copying embed_tokens.weight to lm_head.weight.
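The conversion step described above can be sketched as follows. This is a simplified illustration, not the FlexLLMGen source: the function name convert_state_dict is hypothetical, plain NumPy arrays stand in for the tensors that torch.load would return for one checkpoint shard, and the download itself (snapshot_download) is left out so the sketch runs offline.

```python
import os
import shutil
import tempfile

import numpy as np


def convert_state_dict(state_dict, out_dir):
    """Write every parameter of one shard to its own .npy file.

    `state_dict` maps parameter names to array-like values. In a real run
    these would be PyTorch tensors converted with
    `param.cpu().detach().numpy()`; that step is assumed, not shown.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for name, param in state_dict.items():
        # Rename the key: drop a leading "model." prefix if present.
        if name.startswith("model."):
            name = name[len("model."):]
        param_path = os.path.join(out_dir, name + ".npy")
        np.save(param_path, np.asarray(param))
        written.append(name)
        # Shared embeddings: reuse the token embedding as the LM head.
        if name.endswith("embed_tokens.weight"):
            shutil.copy(param_path, os.path.join(out_dir, "lm_head.weight.npy"))
            written.append("lm_head.weight")
    return written


# Toy shard with two parameters (shapes are arbitrary).
toy_shard = {
    "model.decoder.embed_tokens.weight": np.arange(12, dtype=np.float16).reshape(4, 3),
    "model.decoder.layers.0.fc1.bias": np.zeros(3, dtype=np.float16),
}
toy_dir = tempfile.mkdtemp()
print(convert_state_dict(toy_shard, toy_dir))
print(sorted(os.listdir(toy_dir)))
```

Writing one file per tensor means a later loader can open only the parameters it needs for the layer it is about to run.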
Usage
Run download_opt_weights before the first inference with a new model. The converted weights are cached locally and reused on subsequent runs.
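One way to get this run-once behavior is to check for a known converted file before converting. The sketch below is an assumption about how such a guard could look, not FlexLLMGen's actual logic: ensure_converted, its marker argument, and fake_convert are hypothetical, and convert_fn stands in for a call to download_opt_weights.

```python
import os
import tempfile


def ensure_converted(weight_dir, convert_fn, marker="decoder.embed_tokens.weight.npy"):
    """Run `convert_fn(weight_dir)` only if the marker file is missing.

    Returns True when the conversion ran and False when the cache was reused.
    """
    if os.path.exists(os.path.join(weight_dir, marker)):
        return False
    os.makedirs(weight_dir, exist_ok=True)
    convert_fn(weight_dir)
    return True


def fake_convert(weight_dir):
    # Stand-in for the real download and conversion: only creates the marker.
    with open(os.path.join(weight_dir, "decoder.embed_tokens.weight.npy"), "wb") as f:
        f.write(b"")


cache_dir = os.path.join(tempfile.mkdtemp(), "opt-weights")
first_run = ensure_converted(cache_dir, fake_convert)   # conversion runs
second_run = ensure_converted(cache_dir, fake_convert)  # cache is reused
print(first_run, second_run)
```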
Theoretical Basis
The NumPy .npy format supports memory-mapped file access, which lets the system load individual layer weights on demand without reading the entire checkpoint into memory. This is essential for models whose weights exceed available RAM.
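A minimal sketch of on-demand loading with NumPy's memory mapping is shown below. The helper load_weight_lazily and the file name are hypothetical; np.load with mmap_mode="r" is standard NumPy and returns an array backed by the file, so data is read from disk only when it is accessed.

```python
import os
import tempfile

import numpy as np


def load_weight_lazily(weight_dir, name):
    """Open one converted parameter as a read-only memory-mapped array."""
    path = os.path.join(weight_dir, name + ".npy")
    return np.load(path, mmap_mode="r")


# Write a small stand-in weight file, then map it instead of reading it fully.
demo_dir = tempfile.mkdtemp()
np.save(os.path.join(demo_dir, "decoder.layers.0.fc1.weight.npy"),
        np.arange(20, dtype=np.float32).reshape(5, 4))

weight = load_weight_lazily(demo_dir, "decoder.layers.0.fc1.weight")
first_row = np.array(weight[0])  # copies only the first row into memory
print(type(weight).__name__, weight.shape, first_row)
```

Because each parameter lives in its own file, a loader built this way touches only the files for the layer it currently needs.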