Implementation:Ggml org Llama cpp Convert Legacy Llama
| Knowledge Sources | |
|---|---|
| Domains | Model_Conversion |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Python script for converting legacy LLaMA model weights (PyTorch checkpoints) to the GGUF format used by llama.cpp.
Description
This module defines data type classes (DataType, UnquantizedDataType, QuantizedDataType, Q8_0QuantizedDataType) covering the F16, F32, BF16, and Q8_0 formats. It implements model parameter parsing via Params; lazy tensor loading from PyTorch pickle/zip files or memory-mapped storage via LazyTensor and LazyUnpickler; and merging of sharded checkpoints. The OutputFile class uses the gguf Python library to write GGUF files with the required metadata (architecture and tokenizer vocabulary from SentencePiece, BPE, or HuggingFace formats), while VocabFactory handles loading the vocabulary from those sources. Tensor conversion runs concurrently (via concurrent.futures) for performance.
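For orientation, Q8_0 stores each block of 32 weights as one float16 scale plus 32 signed 8-bit values. Below is a minimal sketch of that block quantization (quantize_q8_0_block is a hypothetical helper for illustration, not the module's actual routine):

import numpy as np

def quantize_q8_0_block(block):
    # block: 32 float32 weights -> (float16 scale, 32 int8 quants)
    scale = np.abs(block).max() / np.float32(127)
    if scale == 0:
        return np.float16(0), np.zeros(32, dtype=np.int8)
    return np.float16(scale), np.round(block / scale).astype(np.int8)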
Usage
Run this script to convert original LLaMA model weights from Meta's PyTorch checkpoint format into GGUF format. It handles legacy formats that the newer convert_hf_to_gguf.py may not support, including sharded checkpoints and original SentencePiece tokenizers.
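Because main() takes an optional argument list (see Signature below), the conversion can also be driven programmatically. A hedged sketch with hypothetical paths, assuming the script is importable as a module from the repository root:

import sys
sys.path.append("examples")   # assumption: run from the llama.cpp repository root
from convert_legacy_llama import main

main(["/path/to/llama-7b/", "--outtype", "f16", "--outfile", "llama-7b-f16.gguf"])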
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: examples/convert_legacy_llama.py
- Lines: 1-1462
Signature
@dataclass(frozen=True)
class DataType:
    name: str
    dtype: np.dtype
    valid_conversions: list[str]

    def elements_to_bytes(self, n_elements: int) -> int: ...

@dataclass(frozen=True)
class UnquantizedDataType(DataType):
    pass

DT_F16 = UnquantizedDataType('F16', dtype=np.dtype(np.float16), ...)
DT_F32 = UnquantizedDataType('F32', dtype=np.dtype(np.float32), ...)
DT_BF16 = UnquantizedDataType('BF16', dtype=np.dtype(np.uint16), ...)

class Params:
    @staticmethod
    def loadOriginalParamsJson(model, config_path) -> Params: ...
class OutputFile:
    @staticmethod
    def write_vocab_only(fname_out, params, vocab, ...) -> None: ...
    @staticmethod
    def write_all(fname_out, ftype, params, model, vocab, ...) -> None: ...

class VocabFactory:
    def load_vocab(self, vocab_types, model_parent_path) -> Vocab: ...

def main(args_in=None) -> None: ...
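The elements_to_bytes helper is what makes output sizes predictable. A back-of-the-envelope sketch using the standard GGUF layouts (2 bytes per F16 element; a 32-element Q8_0 block is 32 int8 values plus one float16 scale, i.e. 34 bytes), with an illustrative 4096x4096 tensor:

n = 4096 * 4096                  # elements in one illustrative weight matrix
f16_bytes = n * 2                # F16: 2 bytes per element
q8_0_bytes = (n // 32) * 34      # Q8_0: 34 bytes per 32-element block
print(f16_bytes, q8_0_bytes)     # 33554432 vs 17825792 (~47% smaller)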
Import
from __future__ import annotations
import argparse
import concurrent.futures
import enum
import numpy as np
import gguf
from gguf import BaseVocab, Vocab, NoVocab, BpeVocab, SentencePieceVocab, LlamaHfVocab
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_dir | Path | Yes | Directory containing PyTorch checkpoint files (consolidated.*.pth) |
| --outtype | str | No | Output data type: f32, f16, or q8_0 (default: f16 or f32, based on the input model) |
| --outfile | Path | No | Custom output filename (default: derived from model name) |
| --vocab-type | str | No | Comma-separated list of vocabulary types to try, in order: spm, bpe, hfft |
| --vocab-dir | Path | No | Directory containing tokenizer files |
| --concurrency | int | No | Number of concurrent workers (default: 8) |
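--concurrency bounds the worker pool used while tensors are converted and quantized. A minimal sketch of the bounded-executor pattern with concurrent.futures (convert_tensors and convert_one are hypothetical names, not the module's own helpers):

import concurrent.futures

def convert_tensors(tensors, convert_one, concurrency=8):
    # Run at most `concurrency` conversions at a time; yield results in input order.
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(convert_one, t) for t in tensors]
        for fut in futures:
            yield fut.result()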
Outputs
| Name | Type | Description |
|---|---|---|
| output_file | .gguf file | Converted model in GGUF format with metadata, vocabulary, and quantized/unquantized tensors |
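The resulting file can be sanity-checked with the same gguf Python package the writer uses; a hedged sketch assuming gguf.GGUFReader (available in recent gguf releases) and a hypothetical output path:

from gguf import GGUFReader

reader = GGUFReader("llama-7b-f32.gguf")
print(len(reader.tensors), "tensors")
for tensor in reader.tensors[:3]:
    print(tensor.name, tensor.shape, tensor.tensor_type)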
Usage Examples
# Convert original LLaMA weights to GGUF (f16)
python examples/convert_legacy_llama.py /path/to/llama-7b/

# Convert with a specific output type and vocabulary
python examples/convert_legacy_llama.py /path/to/llama-7b/ \
    --outtype f32 \
    --vocab-type spm \
    --outfile llama-7b-f32.gguf

# Convert with Q8_0 quantization
python examples/convert_legacy_llama.py /path/to/llama-7b/ --outtype q8_0