Implementation: gpt2_model_load
Summary
gpt2_model_load is a C++ function defined in the GGML GPT-2 example that reads a GGML-format binary file containing a pre-trained GPT-2 model and populates in-memory structures with hyperparameters, BPE vocabulary, and weight tensors distributed across CPU and (optionally) GPU backends. It is the concrete loader that bridges a serialized GPT-2 checkpoint and the GGML inference runtime.
Import
N/A (defined in example, not a library export)
The function is defined directly in the example application source and is not part of the public GGML library API.
Dependencies
- ggml.h -- core tensor types, context management, and tensor creation functions.
- ggml-cpu.h -- CPU backend initialization and buffer allocation.
- ggml-alloc.h -- tensor memory allocation helpers (ggml_tallocr).
- ggml-backend.h -- backend-agnostic buffer and tensor transfer API.
- ggml-cuda.h (optional) -- CUDA backend for NVIDIA GPU offloading.
- ggml-metal.h (optional) -- Metal backend for Apple GPU offloading.
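When GPU offloading is requested, the example initializes a device backend first and falls back to the CPU backend otherwise. A minimal sketch of that selection, assuming the compile-time guards GGML_USE_CUDA and GGML_USE_METAL used by GGML builds; init_backend is a hypothetical helper, not part of the example:

#include "ggml-backend.h"
#include "ggml-cpu.h"
#ifdef GGML_USE_CUDA
#include "ggml-cuda.h"
#endif
#ifdef GGML_USE_METAL
#include "ggml-metal.h"
#endif

// Hypothetical helper: pick a GPU backend when offloading is requested
// and that backend is compiled in; otherwise fall back to the CPU.
static ggml_backend_t init_backend(int n_gpu_layers) {
    ggml_backend_t backend = nullptr;
#ifdef GGML_USE_CUDA
    if (n_gpu_layers > 0) {
        backend = ggml_backend_cuda_init(0); // device 0
    }
#endif
#ifdef GGML_USE_METAL
    if (n_gpu_layers > 0 && backend == nullptr) {
        backend = ggml_backend_metal_init();
    }
#endif
    if (backend == nullptr) {
        backend = ggml_backend_cpu_init();
    }
    return backend;
}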
Function Signature
bool gpt2_model_load(
    const std::string & fname,
    gpt2_model & model,
    gpt_vocab & vocab,
    int n_ctx,
    int n_gpu_layers);
Source: examples/gpt-2/main-backend.cpp:L103-443
Parameters
| Parameter | Type | Description |
|---|---|---|
| fname | const std::string & | Path to the GGML binary file containing the serialized GPT-2 model weights, vocabulary, and hyperparameters. |
| model | gpt2_model & | Output struct that receives all loaded weight tensors (stored in buffer_w), the KV cache allocation (stored in buffer_kv), and the backend handle. |
| vocab | gpt_vocab & | Output vocabulary struct populated with bidirectional token maps (string-to-id and id-to-string) for the BPE tokenizer. |
| n_ctx | int | Context size override. If greater than zero, replaces the context length read from the file header, allowing the caller to increase or decrease the sequence length at load time. |
| n_gpu_layers | int | Number of transformer layers to offload to the GPU backend. Layers with index less than this value are placed in GPU memory; remaining layers stay on the CPU. |
Return Value
Returns bool:
- true -- the model was loaded successfully. The model struct contains fully populated weight tensors in buffer_w, a pre-allocated KV cache in buffer_kv, and a valid backend handle. The vocab struct contains the bidirectional token maps.
- false -- loading failed (e.g., file not found, magic number mismatch, read error, or unsupported format).
I/O Contract
Inputs (File Format)
The function expects a GGML binary file with the following sequential layout:
| Section | Format | Details |
|---|---|---|
| Magic number | 4 bytes, little-endian | Must equal 0x67676d6c (ASCII "ggml"). |
| Hyperparameters | 6 x int32_t | n_vocab, n_ctx, n_embd, n_head, n_layer, ftype. |
| Vocabulary | Repeated: 4-byte length + UTF-8 bytes | n_vocab BPE tokens, each preceded by a 32-bit length. |
| Weight tensors | Repeated: n_dims, name_length, ftype, dims[], name, data | Each tensor record encodes its dimensionality, name, quantization type, shape, and raw weight bytes. |
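A minimal sketch of parsing the header and vocabulary sections, mirroring the layout above (read_header_and_vocab is a hypothetical helper; error handling is trimmed for brevity):

#include <cstdint>
#include <fstream>
#include <map>
#include <string>

// Hypothetical helper that mirrors the file layout described above.
static bool read_header_and_vocab(std::ifstream & fin,
                                  std::map<std::string, int32_t> & token_to_id,
                                  std::map<int32_t, std::string> & id_to_token) {
    // Magic number: 4 bytes, must equal 0x67676d6c ("ggml")
    uint32_t magic = 0;
    fin.read((char *) &magic, sizeof(magic));
    if (magic != 0x67676d6c) {
        return false; // not a GGML file
    }

    // Hyperparameters: six consecutive int32_t values
    int32_t n_vocab, n_ctx, n_embd, n_head, n_layer, ftype;
    fin.read((char *) &n_vocab, sizeof(n_vocab));
    fin.read((char *) &n_ctx,   sizeof(n_ctx));
    fin.read((char *) &n_embd,  sizeof(n_embd));
    fin.read((char *) &n_head,  sizeof(n_head));
    fin.read((char *) &n_layer, sizeof(n_layer));
    fin.read((char *) &ftype,   sizeof(ftype));

    // Vocabulary: n_vocab records of (uint32_t length, UTF-8 bytes)
    for (int32_t i = 0; i < n_vocab; i++) {
        uint32_t len = 0;
        fin.read((char *) &len, sizeof(len));
        std::string word(len, '\0');
        fin.read(&word[0], len);
        token_to_id[word] = i;
        id_to_token[i]    = word;
    }
    return fin.good();
}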
Outputs (In-Memory Structures)
| Output | Type | Contents |
|---|---|---|
| model.buffer_w | ggml_backend_buffer_t | Backend buffer holding all weight tensors. Tensors for GPU-offloaded layers reside in device memory; others in CPU memory. |
| model.buffer_kv | ggml_backend_buffer_t | Pre-allocated buffer for the key-value cache used during autoregressive generation. |
| model.backend | ggml_backend_t | Handle to the selected compute backend (GPU or CPU). |
| model.hparams | struct | Parsed hyperparameters (n_vocab, n_ctx, n_embd, n_head, n_layer, ftype). |
| model.tensors | named tensor map | All weight tensors (embeddings, layer norms, attention projections, MLP weights) accessible by name. |
| vocab.token_to_id | std::map<std::string, int32_t> | Forward mapping from token string to integer ID. |
| vocab.id_to_token | std::map<int32_t, std::string> | Reverse mapping from integer ID to token string. |
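Once loading succeeds, the vocab maps round-trip tokens and individual weights can be fetched from the name-keyed tensor map. A short illustrative fragment, assuming model and vocab were populated by a successful gpt2_model_load call (the name "model/wte" follows the naming convention of the example's conversion script; verify it against your file):

// Round-trip a token through the vocabulary maps
int32_t id = vocab.token_to_id.at("the");
const std::string & tok = vocab.id_to_token.at(id); // "the" again

// Fetch the token-embedding matrix from the named tensor map;
// "model/wte" is illustrative, per the example's naming convention
struct ggml_tensor * wte = model.tensors.at("model/wte");
printf("wte: %lld x %lld\n", (long long) wte->ne[0], (long long) wte->ne[1]);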
Code Reference
Source Location
- Repository: GGML
- File: examples/gpt-2/main-backend.cpp
- Lines: 103-443
Loading Sequence (Pseudocode)
1. Open binary file at fname
2. Read and validate magic number (0x67676d6c)
3. Read hyperparameters into model.hparams
4. Override n_ctx if caller-supplied value > 0
5. Read n_vocab tokens into vocab maps
6. Determine ftype -> ggml_type mapping for weight precision
7. Create ggml_context for tensor metadata (no_alloc = true)
8. Define all model tensors (wte, wpe, per-layer weights, ln_f)
9. Initialize backend (GPU if n_gpu_layers > 0, else CPU)
10. Allocate buffer_w on backend, assign tensors to buffer
11. For each tensor in file:
a. Read tensor header (n_dims, name_length, ftype)
b. Read dimension sizes and tensor name
c. Look up corresponding ggml_tensor by name
d. Read raw data into backend buffer via ggml_backend_tensor_set (see the sketch after this list)
12. Allocate buffer_kv for KV cache
13. Return true on success
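Steps 11a-11d correspond to a read loop like the following sketch, assuming fin and model from the surrounding loader (error handling and shape checks trimmed; ggml_backend_tensor_set is the GGML call that copies host bytes into CPU and device buffers alike):

#include <vector>

// Per-tensor records are read until EOF: header, dims, name, raw bytes
while (true) {
    int32_t n_dims, name_len, ttype;
    fin.read((char *) &n_dims, sizeof(n_dims));
    if (fin.eof()) {
        break;
    }
    fin.read((char *) &name_len, sizeof(name_len));
    fin.read((char *) &ttype,    sizeof(ttype));

    int64_t ne[2] = { 1, 1 };
    for (int i = 0; i < n_dims; i++) {
        int32_t dim;
        fin.read((char *) &dim, sizeof(dim));
        ne[i] = dim;
    }

    std::string name(name_len, '\0');
    fin.read(&name[0], name_len);

    // Look up the tensor that was pre-created with a matching name
    struct ggml_tensor * tensor = model.tensors.at(name);

    // Stage the raw bytes, then copy them into the backend buffer;
    // this works whether the tensor lives in CPU or GPU memory
    std::vector<char> buf(ggml_nbytes(tensor));
    fin.read(buf.data(), buf.size());
    ggml_backend_tensor_set(tensor, buf.data(), 0, buf.size());
}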
Usage Example
#include "ggml.h"
#include "ggml-backend.h"
#include <string>
// Model and vocabulary structs (defined in example)
gpt2_model model;
gpt_vocab vocab;
// Load GPT-2 model from GGML binary
// - Use default context size from file (n_ctx = 0)
// - Offload first 6 layers to GPU
bool ok = gpt2_model_load("models/gpt-2-117M/ggml-model.bin",
model, vocab,
/*n_ctx=*/0,
/*n_gpu_layers=*/6);
if (!ok) {
fprintf(stderr, "Failed to load model\n");
return 1;
}
// model.backend, model.buffer_w, model.buffer_kv are now ready
// vocab.token_to_id and vocab.id_to_token are populated
// Proceed to build computation graph and run inference...
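When the model is no longer needed, the buffers and backend acquired by the loader should be released with the matching GGML teardown calls (a brief sketch; the example performs equivalent cleanup on exit):

// Free weight and KV-cache buffers, then the backend handle itself
ggml_backend_buffer_free(model.buffer_w);
ggml_backend_buffer_free(model.buffer_kv);
ggml_backend_free(model.backend);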