Implementation: gpt2_model_load
Summary
gpt2_model_load is a C++ function defined in the GGML GPT-2 example that reads a GGML-format binary file containing a pre-trained GPT-2 model and populates in-memory structures with hyperparameters, BPE vocabulary, and weight tensors distributed across CPU and (optionally) GPU backends. It is the concrete loader that bridges a serialized GPT-2 checkpoint and the GGML inference runtime.
Import
N/A (defined in example, not a library export)
The function is defined directly in the example application source and is not part of the public GGML library API.
Dependencies
- ggml.h -- core tensor types, context management, and tensor creation functions.
- ggml-cpu.h -- CPU backend initialization and buffer allocation.
- ggml-alloc.h -- tensor memory allocation helpers (ggml_tallocr).
- ggml-backend.h -- backend-agnostic buffer and tensor transfer API.
- ggml-cuda.h (optional) -- CUDA backend for NVIDIA GPU offloading.
- ggml-metal.h (optional) -- Metal backend for Apple GPU offloading.
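When GPU offloading is requested, the example initializes a device backend first and falls back to the CPU backend otherwise. A minimal sketch of that selection, assuming the compile-time guards GGML_USE_CUDA and GGML_USE_METAL used by GGML builds; init_backend is a hypothetical helper, not part of the example:

#include "ggml-backend.h"
#include "ggml-cpu.h"
#ifdef GGML_USE_CUDA
#include "ggml-cuda.h"
#endif
#ifdef GGML_USE_METAL
#include "ggml-metal.h"
#endif

// Hypothetical helper: pick a GPU backend when offloading is requested
// and that backend is compiled in; otherwise fall back to the CPU.
static ggml_backend_t init_backend(int n_gpu_layers) {
    ggml_backend_t backend = nullptr;
#ifdef GGML_USE_CUDA
    if (n_gpu_layers > 0) {
        backend = ggml_backend_cuda_init(0); // device 0
    }
#endif
#ifdef GGML_USE_METAL
    if (n_gpu_layers > 0 && backend == nullptr) {
        backend = ggml_backend_metal_init();
    }
#endif
    if (backend == nullptr) {
        backend = ggml_backend_cpu_init();
    }
    return backend;
}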
Function Signature
bool gpt2_model_load(
    const std::string & fname,
    gpt2_model & model,
    gpt_vocab & vocab,
    int n_ctx,
    int n_gpu_layers);
Source: examples/gpt-2/main-backend.cpp:L103-443
Parameters
| Parameter | Type | Description |
|---|---|---|
| fname | const std::string & | Path to the GGML binary file containing the serialized GPT-2 model weights, vocabulary, and hyperparameters. |
| model | gpt2_model & | Output struct that receives all loaded weight tensors (stored in buffer_w), the KV cache allocation (stored in buffer_kv), and the backend handle. |
| vocab | gpt_vocab & | Output vocabulary struct populated with bidirectional token maps (string-to-id and id-to-string) for the BPE tokenizer. |
| n_ctx | int | Context size override. If greater than zero, replaces the context length read from the file header, allowing the caller to increase or decrease the sequence length at load time. |
| n_gpu_layers | int | Number of transformer layers to offload to the GPU backend. Layers with index less than this value are placed in GPU memory; remaining layers stay on the CPU. |
Return Value
Returns bool:
- true -- the model was loaded successfully. The model struct contains fully populated weight tensors in buffer_w, a pre-allocated KV cache in buffer_kv, and a valid backend handle. The vocab struct contains the bidirectional token maps.
- false -- loading failed (e.g., file not found, magic number mismatch, read error, or unsupported format).
I/O Contract
Inputs (File Format)
The function expects a GGML binary file with the following sequential layout:
| Section | Format | Details |
|---|---|---|
| Magic number | 4 bytes, little-endian | Must equal 0x67676d6c (ASCII "ggml"). |
| Hyperparameters | 6 x int32_t | n_vocab, n_ctx, n_embd, n_head, n_layer, ftype. |
| Vocabulary | Repeated: 4-byte length + UTF-8 bytes | n_vocab BPE tokens, each preceded by a 32-bit length. |
| Weight tensors | Repeated: n_dims, name_length, ftype, dims[], name, data | Each tensor record encodes its dimensionality, name, quantization type, shape, and raw weight bytes. |
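A minimal sketch of parsing the header and vocabulary sections, mirroring the layout above (read_header_and_vocab is a hypothetical helper; error handling is trimmed for brevity):

#include <cstdint>
#include <fstream>
#include <map>
#include <string>

// Hypothetical helper that mirrors the file layout described above.
static bool read_header_and_vocab(std::ifstream & fin,
                                  std::map<std::string, int32_t> & token_to_id,
                                  std::map<int32_t, std::string> & id_to_token) {
    // Magic number: 4 bytes, must equal 0x67676d6c ("ggml")
    uint32_t magic = 0;
    fin.read((char *) &magic, sizeof(magic));
    if (magic != 0x67676d6c) {
        return false; // not a GGML file
    }

    // Hyperparameters: six consecutive int32_t values
    int32_t n_vocab, n_ctx, n_embd, n_head, n_layer, ftype;
    fin.read((char *) &n_vocab, sizeof(n_vocab));
    fin.read((char *) &n_ctx,   sizeof(n_ctx));
    fin.read((char *) &n_embd,  sizeof(n_embd));
    fin.read((char *) &n_head,  sizeof(n_head));
    fin.read((char *) &n_layer, sizeof(n_layer));
    fin.read((char *) &ftype,   sizeof(ftype));

    // Vocabulary: n_vocab records of (uint32_t length, UTF-8 bytes)
    for (int32_t i = 0; i < n_vocab; i++) {
        uint32_t len = 0;
        fin.read((char *) &len, sizeof(len));
        std::string word(len, '\0');
        fin.read(&word[0], len);
        token_to_id[word] = i;
        id_to_token[i]    = word;
    }
    return fin.good();
}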
Outputs (In-Memory Structures)
| Output | Type | Contents |
|---|---|---|
| model.buffer_w | ggml_backend_buffer_t | Backend buffer holding all weight tensors. Tensors for GPU-offloaded layers reside in device memory; others in CPU memory. |
| model.buffer_kv | ggml_backend_buffer_t | Pre-allocated buffer for the key-value cache used during autoregressive generation. |
| model.backend | ggml_backend_t | Handle to the selected compute backend (GPU or CPU). |
| model.hparams | struct | Parsed hyperparameters (n_vocab, n_ctx, n_embd, n_head, n_layer, ftype). |
| model.tensors | named tensor map | All weight tensors (embeddings, layer norms, attention projections, MLP weights) accessible by name. |
| vocab.token_to_id | std::map<std::string, int32_t> | Forward mapping from token string to integer ID. |
| vocab.id_to_token | std::map<int32_t, std::string> | Reverse mapping from integer ID to token string. |
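Once loading succeeds, the vocab maps round-trip tokens and individual weights can be fetched from the name-keyed tensor map. A short illustrative fragment, assuming model and vocab were populated by a successful gpt2_model_load call (the name "model/wte" follows the naming convention of the example's conversion script; verify it against your file):

// Round-trip a token through the vocabulary maps
int32_t id = vocab.token_to_id.at("the");
const std::string & tok = vocab.id_to_token.at(id); // "the" again

// Fetch the token-embedding matrix from the named tensor map;
// "model/wte" is illustrative, per the example's naming convention
struct ggml_tensor * wte = model.tensors.at("model/wte");
printf("wte: %lld x %lld\n", (long long) wte->ne[0], (long long) wte->ne[1]);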
Code Reference
Source Location
- Repository: GGML
- File: examples/gpt-2/main-backend.cpp
- Lines: 103-443
Loading Sequence (Pseudocode)
1. Open binary file at fname
2. Read and validate magic number (0x67676d6c)
3. Read hyperparameters into model.hparams
4. Override n_ctx if caller-supplied value > 0
5. Read n_vocab tokens into vocab maps
6. Determine ftype -> ggml_type mapping for weight precision
7. Create ggml_context for tensor metadata (no_alloc = true)
8. Define all model tensors (wte, wpe, per-layer weights, ln_f)
9. Initialize backend (GPU if n_gpu_layers > 0, else CPU)
10. Allocate buffer_w on backend, assign tensors to buffer
11. For each tensor in file:
a. Read tensor header (n_dims, name_length, ftype)
b. Read dimension sizes and tensor name
c. Look up corresponding ggml_tensor by name
d. Read raw data into backend buffer via ggml_backend_tensor_set (see the sketch after this list)
12. Allocate buffer_kv for KV cache
13. Return true on success
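Steps 11a-11d correspond to a read loop like the following sketch, assuming fin and model from the surrounding loader (error handling and shape checks trimmed; ggml_backend_tensor_set is the GGML call that copies host bytes into CPU and device buffers alike):

#include <vector>

// Per-tensor records are read until EOF: header, dims, name, raw bytes
while (true) {
    int32_t n_dims, name_len, ttype;
    fin.read((char *) &n_dims, sizeof(n_dims));
    if (fin.eof()) {
        break;
    }
    fin.read((char *) &name_len, sizeof(name_len));
    fin.read((char *) &ttype,    sizeof(ttype));

    int64_t ne[2] = { 1, 1 };
    for (int i = 0; i < n_dims; i++) {
        int32_t dim;
        fin.read((char *) &dim, sizeof(dim));
        ne[i] = dim;
    }

    std::string name(name_len, '\0');
    fin.read(&name[0], name_len);

    // Look up the tensor that was pre-created with a matching name
    struct ggml_tensor * tensor = model.tensors.at(name);

    // Stage the raw bytes, then copy them into the backend buffer;
    // this works whether the tensor lives in CPU or GPU memory
    std::vector<char> buf(ggml_nbytes(tensor));
    fin.read(buf.data(), buf.size());
    ggml_backend_tensor_set(tensor, buf.data(), 0, buf.size());
}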
Usage Example
#include "ggml.h"
#include "ggml-backend.h"
#include <string>
// Model and vocabulary structs (defined in example)
gpt2_model model;
gpt_vocab vocab;
// Load GPT-2 model from GGML binary
// - Use default context size from file (n_ctx = 0)
// - Offload first 6 layers to GPU
bool ok = gpt2_model_load("models/gpt-2-117M/ggml-model.bin",
model, vocab,
/*n_ctx=*/0,
/*n_gpu_layers=*/6);
if (!ok) {
fprintf(stderr, "Failed to load model\n");
return 1;
}
// model.backend, model.buffer_w, model.buffer_kv are now ready
// vocab.token_to_id and vocab.id_to_token are populated
// Proceed to build computation graph and run inference...
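When the model is no longer needed, the buffers and backend acquired by the loader should be released with the matching GGML teardown calls (a brief sketch; the example performs equivalent cleanup on exit):

// Free weight and KV-cache buffers, then the backend handle itself
ggml_backend_buffer_free(model.buffer_w);
ggml_backend_buffer_free(model.buffer_kv);
ggml_backend_free(model.backend);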