
Implementation:Ggml_org_Ggml_Gpt2_model_load

From Leeroopedia



Summary

gpt2_model_load is a C++ function defined in the GGML GPT-2 example that reads a GGML-format binary file containing a pre-trained GPT-2 model and populates in-memory structures with hyperparameters, BPE vocabulary, and weight tensors distributed across CPU and (optionally) GPU backends. It is the concrete loader that bridges a serialized GPT-2 checkpoint and the GGML inference runtime.

Import

N/A (defined in example, not a library export)

The function is defined directly in the example application source and is not part of the public GGML library API.

Dependencies

  • ggml.h -- core tensor types, context management, and tensor creation functions.
  • ggml-cpu.h -- CPU backend initialization and buffer allocation.
  • ggml-alloc.h -- tensor memory allocation helpers (ggml_tallocr).
  • ggml-backend.h -- backend-agnostic buffer and tensor transfer API.
  • ggml-cuda.h (optional) -- CUDA backend for NVIDIA GPU offloading.
  • ggml-metal.h (optional) -- Metal backend for Apple GPU offloading.

Function Signature

bool gpt2_model_load(
    const std::string & fname,
    gpt2_model        & model,
    gpt_vocab         & vocab,
    int                 n_ctx,
    int                 n_gpu_layers);

Source: examples/gpt-2/main-backend.cpp:L103-443

Parameters

  • fname (const std::string &) -- Path to the GGML binary file containing the serialized GPT-2 model weights, vocabulary, and hyperparameters.
  • model (gpt2_model &) -- Output struct that receives all loaded weight tensors (stored in buffer_w), the KV cache allocation (stored in buffer_kv), and the backend handle.
  • vocab (gpt_vocab &) -- Output vocabulary struct populated with bidirectional token maps (string-to-id and id-to-string) for the BPE tokenizer.
  • n_ctx (int) -- Context size override. If greater than zero, it replaces the context length read from the file header, allowing the caller to increase or decrease the sequence length at load time.
  • n_gpu_layers (int) -- Number of transformer layers to offload to the GPU backend. Layers with index less than this value are placed in GPU memory; the remaining layers stay on the CPU.

Return Value

Returns bool:

  • true -- the model was loaded successfully. The model struct contains fully populated weight tensors in buffer_w, a pre-allocated KV cache in buffer_kv, and a valid backend handle. The vocab struct contains the bidirectional token maps.
  • false -- loading failed (e.g., file not found, magic number mismatch, read error, or unsupported format).

I/O Contract

Inputs (File Format)

The function expects a GGML binary file with the following sequential layout:

  • Magic number -- 4 bytes, little-endian; must equal 0x67676d6c (ASCII "ggml").
  • Hyperparameters -- 6 x int32_t: n_vocab, n_ctx, n_embd, n_head, n_layer, ftype.
  • Vocabulary -- n_vocab BPE tokens, each stored as a 32-bit length followed by the token's UTF-8 bytes.
  • Weight tensors -- repeated records, each encoding the tensor's dimensionality (n_dims), name length, quantization type (ftype), dimension sizes, name, and raw weight bytes.

Outputs (In-Memory Structures)

  • model.buffer_w (ggml_backend_buffer_t) -- Backend buffer holding all weight tensors. Tensors for GPU-offloaded layers reside in device memory; the rest in CPU memory.
  • model.buffer_kv (ggml_backend_buffer_t) -- Pre-allocated buffer for the key-value cache used during autoregressive generation.
  • model.backend (ggml_backend_t) -- Handle to the selected compute backend (GPU or CPU).
  • model.hparams (struct) -- Parsed hyperparameters (n_vocab, n_ctx, n_embd, n_head, n_layer, ftype).
  • model.tensors (named tensor map) -- All weight tensors (embeddings, layer norms, attention projections, MLP weights), accessible by name.
  • vocab.token_to_id (std::map<std::string, int32_t>) -- Forward mapping from token string to integer ID.
  • vocab.id_to_token (std::map<int32_t, std::string>) -- Reverse mapping from integer ID to token string.

Code Reference

Source Location

  • Repository: GGML
  • File: examples/gpt-2/main-backend.cpp
  • Lines: 103-443

Loading Sequence (Pseudocode)

1. Open binary file at fname
2. Read and validate magic number (0x67676d6c)
3. Read hyperparameters into model.hparams
4. Override n_ctx if caller-supplied value > 0
5. Read n_vocab tokens into vocab maps
6. Determine ftype -> ggml_type mapping for weight precision
7. Create ggml_context for tensor metadata (no_alloc = true)
8. Define all model tensors (wte, wpe, per-layer weights, ln_f)
9. Initialize backend (GPU if n_gpu_layers > 0, else CPU)
10. Allocate buffer_w on backend, assign tensors to buffer
11. For each tensor in file:
      a. Read tensor header (n_dims, name_length, ftype)
      b. Read dimension sizes and tensor name
      c. Look up corresponding ggml_tensor by name
      d. Read raw data into backend buffer via ggml_backend_tensor_set
12. Allocate buffer_kv for KV cache
13. Return true on success

Usage Example

#include "ggml.h"
#include "ggml-backend.h"
#include <cstdio>
#include <string>

// gpt2_model and gpt_vocab are defined in the example sources,
// not in the public GGML headers.

int main() {
    gpt2_model model;
    gpt_vocab  vocab;

    // Load GPT-2 model from a GGML binary:
    // - n_ctx = 0 keeps the context size from the file header
    // - offload the first 6 transformer layers to the GPU backend
    bool ok = gpt2_model_load("models/gpt-2-117M/ggml-model.bin",
                              model, vocab,
                              /*n_ctx=*/0,
                              /*n_gpu_layers=*/6);
    if (!ok) {
        fprintf(stderr, "Failed to load model\n");
        return 1;
    }

    // model.backend, model.buffer_w, model.buffer_kv are now ready;
    // vocab.token_to_id and vocab.id_to_token are populated.
    // Build the computation graph and run inference from here...
    return 0;
}
