Implementation:Ggml org Llama cpp Convert Legacy Llama
| Knowledge Sources | |
|---|---|
| Domains | Model_Conversion |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Python script for converting legacy LLaMA model weights (PyTorch checkpoints) to the GGUF format used by llama.cpp.
Description
This module defines data type classes (DataType, UnquantizedDataType, QuantizedDataType, Q8_0QuantizedDataType) covering the F16, F32, BF16, and Q8_0 formats. It implements model parameter parsing via Params; lazy tensor loading from PyTorch pickle/zip files or memory-mapped storage via LazyTensor and LazyUnpickler; and merging of sharded checkpoints. The OutputFile class uses the gguf Python library to write GGUF files with the required metadata (architecture and tokenizer vocabulary from SentencePiece, BPE, or HuggingFace formats), while VocabFactory handles loading the vocabulary from those sources. Tensor conversion runs concurrently (via concurrent.futures) for performance.
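For orientation, Q8_0 stores each block of 32 weights as one float16 scale plus 32 signed 8-bit values. Below is a minimal sketch of that block quantization (quantize_q8_0_block is a hypothetical helper for illustration, not the module's actual routine):

import numpy as np

def quantize_q8_0_block(block):
    # block: 32 float32 weights -> (float16 scale, 32 int8 quants)
    scale = np.abs(block).max() / np.float32(127)
    if scale == 0:
        return np.float16(0), np.zeros(32, dtype=np.int8)
    return np.float16(scale), np.round(block / scale).astype(np.int8)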
Usage
Run this script to convert original LLaMA model weights from Meta's PyTorch checkpoint format into GGUF format. It handles legacy formats that the newer convert_hf_to_gguf.py may not support, including sharded checkpoints and original SentencePiece tokenizers.
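Because main() takes an optional argument list (see Signature below), the conversion can also be driven programmatically. A hedged sketch with hypothetical paths, assuming the script is importable as a module from the repository root:

import sys
sys.path.append("examples")   # assumption: run from the llama.cpp repository root
from convert_legacy_llama import main

main(["/path/to/llama-7b/", "--outtype", "f16", "--outfile", "llama-7b-f16.gguf"])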
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: examples/convert_legacy_llama.py
- Lines: 1-1462
Signature
@dataclass(frozen=True)
class DataType:
    name: str
    dtype: np.dtype
    valid_conversions: list[str]

    def elements_to_bytes(self, n_elements: int) -> int: ...

@dataclass(frozen=True)
class UnquantizedDataType(DataType):
    pass

DT_F16 = UnquantizedDataType('F16', dtype=np.dtype(np.float16), ...)
DT_F32 = UnquantizedDataType('F32', dtype=np.dtype(np.float32), ...)
DT_BF16 = UnquantizedDataType('BF16', dtype=np.dtype(np.uint16), ...)

class Params:
    @staticmethod
    def loadOriginalParamsJson(model, config_path) -> Params: ...
class OutputFile:
    @staticmethod
    def write_vocab_only(fname_out, params, vocab, ...) -> None: ...
    @staticmethod
    def write_all(fname_out, ftype, params, model, vocab, ...) -> None: ...

class VocabFactory:
    def load_vocab(self, vocab_types, model_parent_path) -> Vocab: ...

def main(args_in=None) -> None: ...
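The elements_to_bytes helper is what makes output sizes predictable. A back-of-the-envelope sketch using the standard GGUF layouts (2 bytes per F16 element; a 32-element Q8_0 block is 32 int8 values plus one float16 scale, i.e. 34 bytes), with an illustrative 4096x4096 tensor:

n = 4096 * 4096                  # elements in one illustrative weight matrix
f16_bytes = n * 2                # F16: 2 bytes per element
q8_0_bytes = (n // 32) * 34      # Q8_0: 34 bytes per 32-element block
print(f16_bytes, q8_0_bytes)     # 33554432 vs 17825792 (~47% smaller)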
Import
from __future__ import annotations
import argparse
import concurrent.futures
import enum
import numpy as np
import gguf
from gguf import BaseVocab, Vocab, NoVocab, BpeVocab, SentencePieceVocab, LlamaHfVocab
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_dir | Path | Yes | Directory containing PyTorch checkpoint files (consolidated.*.pth) |
| --outtype | str | No | Output data type: f32, f16, or q8_0 (default: f16 or f32, based on the input model) |
| --outfile | Path | No | Custom output filename (default: derived from model name) |
| --vocab-type | str | No | Comma-separated list of vocabulary types to try, in order: spm, bpe, hfft |
| --vocab-dir | Path | No | Directory containing tokenizer files |
| --concurrency | int | No | Number of concurrent workers (default: 8) |
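--concurrency bounds the worker pool used while tensors are converted and quantized. A minimal sketch of the bounded-executor pattern with concurrent.futures (convert_tensors and convert_one are hypothetical names, not the module's own helpers):

import concurrent.futures

def convert_tensors(tensors, convert_one, concurrency=8):
    # Run at most `concurrency` conversions at a time; yield results in input order.
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(convert_one, t) for t in tensors]
        for fut in futures:
            yield fut.result()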
Outputs
| Name | Type | Description |
|---|---|---|
| output_file | .gguf file | Converted model in GGUF format with metadata, vocabulary, and quantized/unquantized tensors |
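The resulting file can be sanity-checked with the same gguf Python package the writer uses; a hedged sketch assuming gguf.GGUFReader (available in recent gguf releases) and a hypothetical output path:

from gguf import GGUFReader

reader = GGUFReader("llama-7b-f32.gguf")
print(len(reader.tensors), "tensors")
for tensor in reader.tensors[:3]:
    print(tensor.name, tensor.shape, tensor.tensor_type)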
Usage Examples
# Convert original LLaMA weights to GGUF (f16)
python examples/convert_legacy_llama.py /path/to/llama-7b/

# Convert with a specific output type and vocabulary
python examples/convert_legacy_llama.py /path/to/llama-7b/ \
    --outtype f32 \
    --vocab-type spm \
    --outfile llama-7b-f32.gguf

# Convert with Q8_0 quantization
python examples/convert_legacy_llama.py /path/to/llama-7b/ --outtype q8_0