Principle:FMInference FlexLLMGen Model Weight Downloading

Metadata

Field	Value
Source Repo	FlexLLMGen
Source Doc	HuggingFace Hub

Domains

Model_Preparation
Data_Pipeline

Last Updated

2026-02-09 00:00 GMT

Overview

A model preparation pipeline that downloads pre-trained weights from the HuggingFace Hub and converts them from PyTorch checkpoints into per-tensor NumPy files that can be memory-mapped at load time.

Description

Large language models are typically distributed on the HuggingFace Hub as PyTorch .bin checkpoint files, often split into several shards. FlexLLMGen converts these into individual NumPy .npy files, one per parameter tensor, so that weights can be memory-mapped and loaded without the full model having to fit in memory. The download_opt_weights function fetches the checkpoint with snapshot_download from the huggingface_hub library, iterates over the checkpoint shards, renames parameter keys (e.g., removing the "model." prefix), and saves each tensor as a separate .npy file. It also handles shared embeddings by copying embed_tokens.weight to lm_head.weight, since the input embedding and the output projection use the same weights.
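The conversion steps described above can be sketched roughly as follows. This is a minimal illustration, not the actual FlexLLMGen source: the helper names rename_key, save_state_as_npy, and download_and_convert are hypothetical, the output file naming is an assumption, and the download function assumes that huggingface_hub and torch are installed and that the checkpoint dtype can be converted to NumPy.

```python
import glob
import os
import shutil

import numpy as np


def rename_key(name):
    """Drop a leading "model." prefix from a parameter key (hypothetical helper)."""
    prefix = "model."
    return name[len(prefix):] if name.startswith(prefix) else name


def save_state_as_npy(state, out_dir):
    """Write every entry of `state` (name -> array-like) as its own .npy file.

    Returns the list of file paths written. The file naming used here is an
    assumption made for illustration.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for name, param in state.items():
        new_name = rename_key(name)
        param_path = os.path.join(out_dir, new_name + ".npy")
        np.save(param_path, np.asarray(param))
        written.append(param_path)
        # Shared embedding: reuse the input embedding as the output projection.
        if new_name.endswith("embed_tokens.weight"):
            tied_path = os.path.join(out_dir, "lm_head.weight.npy")
            shutil.copy(param_path, tied_path)
            written.append(tied_path)
    return written


def download_and_convert(repo_id, out_dir):
    """Download *.bin shards for `repo_id` and convert each shard in turn."""
    # Imported lazily so the helpers above work without these packages.
    from huggingface_hub import snapshot_download
    import torch

    folder = snapshot_download(repo_id, allow_patterns="*.bin")
    for bin_file in sorted(glob.glob(os.path.join(folder, "*.bin"))):
        # Only one shard is held in memory at a time.
        state = torch.load(bin_file, map_location="cpu")
        arrays = {k: v.detach().numpy() for k, v in state.items()}
        save_state_as_npy(arrays, out_dir)
```

Processing one shard at a time keeps peak memory close to the size of the largest shard rather than the size of the whole model.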

Usage

Run download_opt_weights before first inference with a new model. The converted weights are cached locally and reused for subsequent runs.
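The run-once behaviour can be approximated with a simple existence check before conversion. The sketch below is an assumption about how such a cache check might look, not the check FlexLLMGen itself performs: ensure_converted, the convert callback, and the sentinel file name are all hypothetical.

```python
import os


def ensure_converted(weights_dir, convert, sentinel="decoder.embed_tokens.weight.npy"):
    """Call `convert(weights_dir)` only if the sentinel file is missing.

    Returns True if the conversion ran, False if cached weights were reused.
    The sentinel name is an illustrative assumption.
    """
    if os.path.exists(os.path.join(weights_dir, sentinel)):
        return False
    os.makedirs(weights_dir, exist_ok=True)
    convert(weights_dir)
    return True
```

A real converter would be passed as the convert argument; a second call with the same directory then returns immediately.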

Theoretical Basis

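The .npy format stores a short header followed by the raw array data, so a file can be memory-mapped (for example with NumPy's mmap_mode option) rather than read in full. With one file per tensor, the system can load individual layer weights on demand without reading the entire checkpoint into memory. This is essential for models that exceed available RAM.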
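As a small illustration of memory-mapped access, the following sketch writes a toy tensor to a temporary .npy file and reads back a single row through np.load with mmap_mode="r". The file name and the demo function are made up for the example; only the NumPy calls are real API.

```python
import os
import tempfile

import numpy as np


def demo():
    """Save a toy weight, memory-map it, and copy out one row."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file name, chosen only for this example.
        path = os.path.join(tmp, "decoder.layers.0.fc1.weight.npy")
        np.save(path, np.arange(12, dtype=np.float16).reshape(3, 4))

        # mmap_mode="r" maps the file read-only instead of loading it.
        w = np.load(path, mmap_mode="r")
        is_memmap = isinstance(w, np.memmap)
        shape = w.shape

        # Slicing touches only the requested region; np.array copies it to RAM.
        row = np.array(w[1])

        # Release the mapping before the temporary directory is removed.
        del w
    return is_memmap, shape, row


if __name__ == "__main__":
    print(demo())
```

The same pattern applies to real weights: each layer's file is mapped when needed, and only the slices actually read are paged in by the operating system.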

Related Pages

Implementation:FMInference_FlexLLMGen_Download_Opt_Weights
