Principle:Ggml org Llama cpp Model Acquisition
| Field | Value |
|---|---|
| Principle Name | Model Acquisition |
| Category | Data Sourcing |
| Scope | Obtaining pre-trained model weights from model hubs |
| Status | Active |
Overview
Description
Before a model can be converted from one format to another, its weights, configuration files, and tokenizer assets must be obtained from a source repository. In the modern ML ecosystem, model hubs serve as centralized registries for pre-trained models. The dominant hub is the Hugging Face Hub, which hosts hundreds of thousands of models in a largely standardized repository layout.
Model acquisition involves downloading the following artifacts:
- Model weights: Serialized tensor data in formats such as SafeTensors (.safetensors) or PyTorch checkpoints (.bin). These may be split across multiple shard files for large models.
- Configuration files: JSON files (config.json, generation_config.json) that describe the model architecture, hyperparameters, and generation settings.
- Tokenizer files: Vocabulary files (tokenizer.json, tokenizer.model, tokenizer_config.json) that define how text is segmented into tokens.
- Metadata: License files, model cards (README.md), and other documentation.
A key design decision in acquisition is whether to download all model files or only a subset. For conversion pipelines that read tensors remotely (streaming from the hub without full download), only configuration and tokenizer files are needed locally. For fully local conversion, all weight files must be present.
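The choice between the two modes can be expressed as a small lookup from conversion mode to the file patterns that must exist locally. The sketch below is illustrative only: the ConversionMode enum, the pattern lists, and patterns_for_mode are hypothetical names, and the patterns assume a typical hub repository layout.

```python
from enum import Enum


class ConversionMode(Enum):
    LOCAL = "local"    # weights are read from local disk
    REMOTE = "remote"  # tensors are streamed from the hub


# Hypothetical pattern sets for illustration; adjust to the repository layout.
METADATA_PATTERNS = ["*.json", "*.txt", "tokenizer.model"]
WEIGHT_PATTERNS = ["*.safetensors", "*.bin"]


def patterns_for_mode(mode: ConversionMode) -> list[str]:
    """Return the file patterns that must be present locally for a mode."""
    if mode is ConversionMode.REMOTE:
        # Remote streaming needs only configuration and tokenizer files.
        return list(METADATA_PATTERNS)
    # Fully local conversion additionally needs every weight file.
    return METADATA_PATTERNS + WEIGHT_PATTERNS
```

Keeping the decision in one function means the download step and any later completeness check can share the same definition of "required files".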
Usage
Model acquisition is the first step in any conversion workflow. The process follows this general pattern:
- Identify the model by its hub repository ID (e.g., meta-llama/Llama-3.1-8B-Instruct)
- Determine which files are needed based on the conversion mode (full local vs. remote streaming)
- Download the required files to a local directory, optionally filtering by file pattern
- Verify that the download is complete and the directory structure matches expectations
For gated or restricted models, authentication via an API token is required before download.
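The download and verification steps might look like the following sketch. It assumes the huggingface_hub Python package and its snapshot_download function (with the repo_id, local_dir, allow_patterns, and revision parameters); the helper names build_download_kwargs, download, and missing_files are hypothetical. The completeness check assumes sharded SafeTensors checkpoints ship a model.safetensors.index.json file whose weight_map lists the shard file names.

```python
import json
from pathlib import Path


def build_download_kwargs(repo_id, local_dir, allow_patterns=None, revision=None):
    """Assemble keyword arguments for huggingface_hub.snapshot_download."""
    kwargs = {"repo_id": repo_id, "local_dir": str(local_dir)}
    if allow_patterns is not None:
        kwargs["allow_patterns"] = allow_patterns  # optional file filter
    if revision is not None:
        kwargs["revision"] = revision  # branch, tag, or commit hash
    return kwargs


def download(repo_id, local_dir, **options):
    """Download a repository snapshot (requires network access)."""
    # Imported lazily so the other helpers work without the library installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(**build_download_kwargs(repo_id, local_dir, **options))


def missing_files(model_dir, need_weights=True):
    """List expected files that are absent from model_dir."""
    model_dir = Path(model_dir)
    missing = [] if (model_dir / "config.json").is_file() else ["config.json"]
    if need_weights:
        index = model_dir / "model.safetensors.index.json"
        if index.is_file():
            # Sharded checkpoint: every shard named in the index must exist.
            shards = set(json.loads(index.read_text())["weight_map"].values())
            missing += sorted(s for s in shards if not (model_dir / s).is_file())
        elif not any(model_dir.glob("*.safetensors")) and not any(model_dir.glob("*.bin")):
            missing.append("<weight files>")
    return missing
```

A caller would run download(...) and then treat a non-empty result from missing_files(...) as an incomplete acquisition.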
Theoretical Basis
Model acquisition draws on principles from artifact management and content-addressable storage:
Snapshot consistency: A model repository may be updated at any time (new revisions, corrected weights, updated tokenizers). Acquisition should capture a consistent snapshot, meaning all files correspond to the same revision. Hub APIs typically support revision pinning via commit hashes or tags.
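One way to enforce pinning is to refuse mutable references such as branch names before starting a download. The sketch below assumes the hub identifies revisions by Git-style 40-character hexadecimal commit hashes; is_pinned and require_pinned are hypothetical helper names.

```python
import re

# A full Git commit hash: exactly 40 lowercase hexadecimal characters.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def is_pinned(revision: str) -> bool:
    """True when revision is a full commit hash rather than a branch or tag."""
    return _FULL_SHA.fullmatch(revision) is not None


def require_pinned(revision: str) -> str:
    """Return revision unchanged, or raise if it could move over time."""
    if not is_pinned(revision):
        raise ValueError(f"revision {revision!r} is not a full commit hash")
    return revision
```

The validated hash can then be passed as the revision argument of the download call, so every file comes from the same commit.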
Selective download: The weights of a large language model can occupy hundreds of gigabytes. Downloading only the files needed for a specific task (e.g., configuration and tokenizer for remote conversion) reduces bandwidth, storage, and time. Pattern-based filtering (e.g., allow_patterns=["*.json", "*.txt", "tokenizer.model"]) provides this selectivity.
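The effect of such a filter can be previewed locally before any transfer. This sketch uses shell-style matching from Python's fnmatch module, which is assumed here to approximate how a hub client interprets the patterns; the file listing and its sizes are made up for illustration.

```python
from fnmatch import fnmatchcase


def select_files(listing, allow_patterns):
    """Keep entries whose file name matches at least one pattern."""
    return {
        name: size
        for name, size in listing.items()
        if any(fnmatchcase(name, pattern) for pattern in allow_patterns)
    }


# Hypothetical repository listing: file name -> size in bytes.
listing = {
    "config.json": 1_000,
    "tokenizer.json": 9_000_000,
    "tokenizer.model": 500_000,
    "model-00001-of-00002.safetensors": 5_000_000_000,
    "model-00002-of-00002.safetensors": 5_000_000_000,
    "README.md": 40_000,
}

selected = select_files(listing, ["*.json", "*.txt", "tokenizer.model"])
print(sorted(selected))
print(sum(selected.values()), "of", sum(listing.values()), "bytes selected")
```

Note that a broad pattern such as *.json also matches a shard index file if one is present, which is usually harmless because index files are small.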
Authentication and access control: Some models require acceptance of license terms or organizational membership before download. The acquisition mechanism must integrate with the hub's authentication system, typically by reading an access token from an environment variable and sending it as a bearer token with each request.
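A minimal sketch of token handling, assuming the token is stored in the HF_TOKEN environment variable (the name the huggingface_hub client reads) and is sent in a standard HTTP Authorization header. The auth_headers helper is hypothetical.

```python
import os


def auth_headers(env=None):
    """Build HTTP headers carrying a bearer token, if one is configured."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    # No token configured: return no headers, which suffices for public models.
    return {"Authorization": f"Bearer {token}"} if token else {}
```

Accepting the environment as a parameter keeps the function easy to test and avoids writing the token into source code or logs.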
Caching and deduplication: Repeated downloads of the same model version waste resources. Hub client libraries typically maintain a local cache keyed by repository ID and revision, allowing subsequent runs to reuse previously downloaded files.
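The idea of a cache keyed by repository ID and revision can be sketched as a path computation. The directory naming below (a models--org--name folder containing a snapshots directory with one subdirectory per commit hash) is modeled on the layout used by the huggingface_hub cache, but it should be treated as an assumption rather than a stable contract; both function names are hypothetical.

```python
from pathlib import Path


def cache_snapshot_dir(cache_root, repo_id, commit_hash):
    """Compute where a snapshot of repo_id at commit_hash would be cached."""
    # Slashes in the repository ID are flattened so it fits in one folder name.
    folder = "models--" + repo_id.replace("/", "--")
    return Path(cache_root) / folder / "snapshots" / commit_hash


def is_cached(cache_root, repo_id, commit_hash):
    """True when a snapshot directory for this exact revision already exists."""
    return cache_snapshot_dir(cache_root, repo_id, commit_hash).is_dir()
```

Because the key includes the commit hash, a new revision of the same repository produces a separate snapshot directory instead of overwriting the old one.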
Integrity verification: Downloaded files should be verified against checksums provided by the hub to detect corruption or tampering during transit.
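A checksum comparison can be done with the standard library alone. This sketch assumes the hub publishes a SHA-256 digest for the file in question, as is common for large files tracked with Git LFS; sha256_of and verify are hypothetical helper names.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Chunked reads keep memory use flat even for multi-gigabyte shards.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_sha256):
    """True when the file's digest matches the expected hex digest."""
    return sha256_of(path) == expected_sha256.strip().lower()
```

A mismatch indicates a truncated or altered file, and the usual response is to delete the file and download it again.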