
Implementation:Ggml org Llama cpp Convert Legacy Llama

From Leeroopedia
Knowledge Sources
Domains: Model_Conversion
Last Updated: 2026-02-15 00:00 GMT

Overview

Python script for converting legacy LLaMA model weights (PyTorch checkpoints) to the GGUF format used by llama.cpp.

Description

This module defines data type classes (DataType, UnquantizedDataType, QuantizedDataType, Q8_0QuantizedDataType) covering the F16, F32, BF16, and Q8_0 formats. Model hyperparameters are parsed by Params; tensors are loaded lazily from PyTorch pickle/zip checkpoints or memory-mapped storage via LazyTensor and LazyUnpickler, and sharded checkpoints are merged into a single model. The OutputFile class uses the gguf Python library to write the output file with the proper metadata (architecture, plus the tokenizer vocabulary in SentencePiece, BPE, or HuggingFace format), while VocabFactory selects and loads the vocabulary from the available sources. Tensor conversion runs concurrently for performance.
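The Q8_0 format mentioned above stores each block of 32 values as one fp16 scale plus 32 int8 quants, so 32 elements cost 2 + 32 = 34 bytes. The sketch below illustrates that layout and the corresponding elements_to_bytes arithmetic; the function and constant names are illustrative, not the script's actual API.

```python
# Illustrative sketch of the Q8_0 block layout: per 32-element block,
# one little-endian fp16 scale followed by 32 signed int8 quants.
import struct

BLOCK_SIZE = 32           # elements per Q8_0 block
BYTES_PER_BLOCK = 2 + 32  # fp16 scale + 32 int8 values = 34 bytes

def elements_to_bytes(n_elements: int) -> int:
    # Mirrors the DataType.elements_to_bytes contract for Q8_0.
    assert n_elements % BLOCK_SIZE == 0, "tensor size must be a multiple of 32"
    return n_elements // BLOCK_SIZE * BYTES_PER_BLOCK

def quantize_block(values: list[float]) -> bytes:
    # Scale maps the largest magnitude onto the int8 range [-127, 127].
    amax = max(abs(v) for v in values)
    d = amax / 127.0 if amax else 0.0
    quants = [round(v / d) if d else 0 for v in values]
    return struct.pack('<e', d) + struct.pack('<32b', *quants)

block = quantize_block([i / 10 for i in range(32)])
print(len(block))  # 34 bytes per 32-element block
```

The 34/32 bytes-per-element ratio is why Q8_0 output is roughly half the size of an F16 conversion.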

Usage

Run this script to convert original LLaMA model weights from Meta's PyTorch checkpoint format into GGUF format. It handles legacy formats that the newer convert_hf_to_gguf.py may not support, including sharded checkpoints and original SentencePiece tokenizers.
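The lazy-loading strategy that makes large sharded checkpoints tractable can be sketched as follows: a tensor is described by its shape plus a deferred loader, and the bytes are only materialized when the writer needs them. The class and method names here are hypothetical, not the script's real LazyTensor API.

```python
# Minimal sketch of deferred tensor loading, assuming a callable that reads
# the data on demand. Real code pulls from a memory-mapped pickle/zip file.
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class LazyTensorSketch:
    shape: tuple
    load: Callable[[], list]                 # deferred read from disk
    _cache: Optional[list] = field(default=None, repr=False)

    def materialize(self) -> list:
        # Load at most once; repeated writes reuse the cached data.
        if self._cache is None:
            self._cache = self.load()
        return self._cache

calls = []
t = LazyTensorSketch(shape=(2, 2),
                     load=lambda: calls.append('read') or [1.0, 2.0, 3.0, 4.0])
t.materialize()
t.materialize()
print(calls)  # ['read'] -- the checkpoint is touched only once
```

Deferring the read keeps peak memory bounded: only the tensor currently being converted and written needs to be resident.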

Code Reference

Source Location

Signature

@dataclass(frozen=True)
class DataType:
    name: str
    dtype: np.dtype
    valid_conversions: list[str]
    def elements_to_bytes(self, n_elements: int) -> int: ...

@dataclass(frozen=True)
class UnquantizedDataType(DataType):
    pass

DT_F16  = UnquantizedDataType('F16',  dtype=np.dtype(np.float16), ...)
DT_F32  = UnquantizedDataType('F32',  dtype=np.dtype(np.float32), ...)
DT_BF16 = UnquantizedDataType('BF16', dtype=np.dtype(np.uint16), ...)

class Params:
    @staticmethod
    def loadOriginalParamsJson(model, ftype) -> Params: ...

class OutputFile:
    def write_vocab_only(fname_out, params, vocab, ...) -> None: ...
    def write_all(fname_out, ftype, params, model, vocab, ...) -> None: ...

class VocabFactory:
    def load_vocab(self, vocab_types, model_parent_path) -> Vocab: ...

def main(args_in=None) -> None: ...

Import

from __future__ import annotations
import argparse
import concurrent.futures
import enum
import numpy as np
import gguf
from gguf import BaseVocab, Vocab, NoVocab, BpeVocab, SentencePieceVocab, LlamaHfVocab

I/O Contract

Inputs

Name Type Required Description
model_dir Path Yes Directory containing PyTorch checkpoint files (consolidated.*.pth)
--outtype str No Output data type: f32, f16, or q8_0 (default: f16)
--outfile Path No Custom output filename (default: derived from model name)
--vocab-type str No Vocabulary type: spm, bpe, or hfft
--vocab-dir Path No Directory containing tokenizer files
--concurrency int No Number of concurrent workers (default: 8)
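A minimal argparse sketch mirroring the flags in the table above; the real script's parser accepts more options (and a comma-separated vocab-type list), so treat defaults and choices here as assumptions for illustration.

```python
# Hypothetical reconstruction of the CLI surface from the I/O table;
# not copied from the script itself.
import argparse
from pathlib import Path

parser = argparse.ArgumentParser(
    description="Convert legacy LLaMA weights to GGUF")
parser.add_argument("model_dir", type=Path,
                    help="directory containing consolidated.*.pth checkpoints")
parser.add_argument("--outtype", choices=["f32", "f16", "q8_0"], default="f16")
parser.add_argument("--outfile", type=Path, default=None)
parser.add_argument("--vocab-type", default="spm")   # spm, bpe, or hfft
parser.add_argument("--vocab-dir", type=Path, default=None)
parser.add_argument("--concurrency", type=int, default=8)

args = parser.parse_args(["/models/llama-7b", "--outtype", "q8_0"])
print(args.outtype, args.concurrency)  # q8_0 8
```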

Outputs

Name Type Description
output_file .gguf file Converted model in GGUF format with metadata, vocabulary, and quantized/unquantized tensors
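A back-of-envelope estimate of the output file size per --outtype, using the per-element byte cost of each format (f32: 4 B, f16: 2 B, q8_0: 34 B per 32 elements). The ~6.74B parameter count is an approximation for LLaMA-7B, and real GGUF files add metadata and vocabulary on top of tensor data.

```python
# Rough tensor-data sizes for a ~6.74B-parameter model under each outtype.
N_PARAMS = 6_740_000_000

def size_gib(n_elements: int, outtype: str) -> float:
    bytes_per_elem = {"f32": 4.0, "f16": 2.0, "q8_0": 34.0 / 32.0}[outtype]
    return n_elements * bytes_per_elem / 1024**3

for t in ("f32", "f16", "q8_0"):
    print(f"{t:5s} ~{size_gib(N_PARAMS, t):.1f} GiB")
```

This is why q8_0 is attractive for consumer hardware: it roughly halves the f16 file size at modest quality cost.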

Usage Examples

# Convert original LLaMA weights to GGUF (f16)
python examples/convert_legacy_llama.py /path/to/llama-7b/

# Convert with specific output type and vocabulary
python examples/convert_legacy_llama.py /path/to/llama-7b/ \
    --outtype f32 \
    --vocab-type spm \
    --outfile llama-7b-f32.gguf

# Convert with Q8_0 quantization
python examples/convert_legacy_llama.py /path/to/llama-7b/ --outtype q8_0
