Principle:Huggingface Optimum Model Download and Configuration
Overview
The process of downloading model artifacts from the Hugging Face Hub and loading the model configuration required for tensor parallelization.
Description
Before parallelizing a model, its weights and configuration must be available locally. The download process handles Hub authentication, revision selection, caching, and optional weight skipping (for meta-device initialization). The model configuration (AutoConfig) provides architecture details needed for parallelization decisions (number of layers, hidden size, attention heads, etc.).
The download pipeline performs the following steps:
- Resolve the model identifier (local path or Hub repository name).
- Authenticate with the Hugging Face Hub if required.
- Download the model index file (model.safetensors.index.json) to determine shard layout.
- Download individual safetensors shard files into the local cache directory.
- Return the local snapshot path for downstream consumption.
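The steps above can be sketched with `huggingface_hub.snapshot_download`, which handles authentication, revision resolution, caching, and file locking internally. The wrapper name and its parameters mirror this page's table; they are illustrative and not the actual Optimum implementation, and the config-only pattern list is an assumption.

```python
import os
from typing import Optional

from huggingface_hub import snapshot_download

# Patterns kept when weights are skipped: config/index JSON and tokenizer
# files, but no *.safetensors shards. (Illustrative choice, not Optimum's.)
CONFIG_ONLY_PATTERNS = ["*.json", "*.txt", "*.model"]


def download_model(
    model_name_or_path: str,
    cache_dir: Optional[str] = None,
    revision: Optional[str] = None,
    local_files_only: bool = False,
    skip_download_weights: bool = False,
) -> str:
    """Resolve a model to a local snapshot path, downloading if needed."""
    # A local directory needs no Hub access at all.
    if os.path.isdir(model_name_or_path):
        return model_name_or_path

    allow_patterns = CONFIG_ONLY_PATTERNS if skip_download_weights else None
    # snapshot_download authenticates with the cached Hub token, resolves
    # the revision, and returns the local snapshot path.
    return snapshot_download(
        repo_id=model_name_or_path,
        revision=revision,
        cache_dir=cache_dir,
        local_files_only=local_files_only,
        allow_patterns=allow_patterns,
    )
```

The returned path can be passed directly to downstream weight-loading code.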
When skip_download_weights is enabled, only the configuration and index files are downloaded, avoiding the cost of transferring large weight files. This is used in conjunction with meta-device initialization, where the model structure is created without real weight data.
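As a minimal illustration of the meta-device idea (using plain PyTorch rather than the Optimum code path), a module can be instantiated with full structure but no weight storage:

```python
import torch.nn as nn

# Construct a layer directly on the meta device: parameter shapes and
# dtypes exist, but no memory is allocated for the weight data.
layer = nn.Linear(4096, 4096, device="meta")

assert layer.weight.is_meta              # placeholder tensor, no data
assert layer.weight.shape == (4096, 4096)
# Real weights can later be materialized per tensor-parallel rank by
# reading only the relevant slices from the downloaded shards.
```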
Usage
Use as the first step in the tensor parallelization pipeline. This function must be called before any model construction or weight loading can occur.
| Parameter | Description |
|---|---|
| model_name_or_path | A Hugging Face Hub model identifier (e.g., meta-llama/Llama-2-7b-hf) or a local filesystem path. |
| cache_dir | Local directory for caching downloaded files. |
| revision | Optional Git revision (branch, tag, or commit hash) to download from. |
| local_files_only | If True, only look for files in the local cache without contacting the Hub. |
| skip_download_weights | If True, skip downloading the actual weight files (useful for meta-device initialization). |
Theoretical Basis
Repository-based model distribution: models are stored in Hub repositories as safetensors shard files together with a JSON index that maps parameter names to shards. The download process uses file locking for concurrency safety and supports selective downloading (configuration-only or full weights).
The safetensors format provides:
- Memory-mapped access to individual tensors without loading the full file.
- Partial reads for loading only the slice of a tensor relevant to a specific tensor-parallel rank.
- Zero-copy deserialization for efficient weight loading.
The index file (model.safetensors.index.json) contains a weight_map dictionary that maps each parameter name to the shard file that contains it, enabling targeted file downloads and reads.
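A sketch of how such an index can be inverted to plan per-shard downloads and reads; the index structure matches the real format, but the parameter names, shard names, and the grouping helper are illustrative:

```python
import json
from collections import defaultdict

# A miniature model.safetensors.index.json payload (structure matches the
# real format; the parameter and shard names are made up).
index_json = """
{
  "metadata": {"total_size": 1024},
  "weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.1.self_attn.q_proj.weight": "model-00002-of-00002.safetensors"
  }
}
"""


def group_by_shard(index: dict) -> dict:
    """Invert weight_map: shard file -> list of parameter names it holds."""
    shards = defaultdict(list)
    for param_name, shard_file in index["weight_map"].items():
        shards[shard_file].append(param_name)
    return dict(shards)


shards = group_by_shard(json.loads(index_json))
# Each shard can now be downloaded or opened exactly once and read for
# precisely the parameters it contains.
```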
Metadata
| Key | Value |
|---|---|
| Source Repository | Huggingface Optimum |
| Domains | Model_Loading, Distributed_Computing |
Related
- Implemented by: Implementation:Huggingface_Optimum_Download_Model_From_HF
- Used by: Principle:Huggingface_Optimum_Meta_Device_Initialization
- Used by: Principle:Huggingface_Optimum_Sharded_Weight_Loading