Implementation:Ggml org Ggml Gpt2 graph
| Attribute | Value |
|---|---|
| Page Type | Implementation |
| Full Name | Ggml_org_Ggml_Gpt2_graph |
| Short Name | Gpt2_graph |
| Repository | https://github.com/ggml-org/ggml |
| Language | C++ |
| Domain Tags | NLP, Transformer_Architecture |
| Knowledge Sources | |
| Last Updated | 2025-05-15 12:00 GMT |
Overview
Description
gpt2_graph constructs a full GPT-2 forward-pass computation graph from a loaded model and inference parameters. It translates the Transformer architecture — embedding, positional encoding, repeated self-attention and MLP blocks with residual connections, and a language-model head — into a GGML computation graph that can be dispatched to any backend for execution.
The function builds the entire forward pass as a directed acyclic graph of GGML tensor operations. The graph is built once per inference call and encodes every operation from token embedding lookup through to the final logit projection. Input tensors (token IDs and position offsets) are embedded in the graph as named input nodes, allowing the caller to populate them before each execution.
Usage
gpt2_graph is called once per generation step in the GPT-2 inference loop. The caller provides the loaded model, the number of previously generated tokens (for KV cache positioning), and the number of new tokens to process. The returned graph is then allocated and executed via the GGML backend scheduler.
Code Reference
Source Location
| Attribute | Value |
|---|---|
| File | examples/gpt-2/main-backend.cpp |
| Lines | L446-721 |
| Repository | https://github.com/ggml-org/ggml |
Signature
struct ggml_cgraph * gpt2_graph(
const gpt2_model & model,
const int n_past,
const int n_tokens);
Import
#include "ggml.h"
#include "ggml-backend.h"
Dependencies: ggml.h, ggml-backend.h
I/O Contract
Inputs
| Parameter | Type | Description |
|---|---|---|
| model | const gpt2_model & | A loaded GPT-2 model containing all weight tensors (token embeddings, positional embeddings, per-layer attention and MLP weights, layer normalization parameters, and the language-model head projection). Also carries the model hyperparameters (number of layers, heads, embedding dimension, vocabulary size). |
| n_past | const int | Context position offset for the KV cache. Indicates how many tokens have been previously processed and stored in the key-value cache. New key and value entries are written starting at this offset. |
| n_tokens | const int | Number of new tokens to process in this forward pass. During prompt ingestion this may be the full prompt length; during autoregressive generation this is typically 1. |
Outputs
| Return | Type | Description |
|---|---|---|
| computation graph | struct ggml_cgraph * | A complete computation graph encoding the GPT-2 forward pass. The graph's terminal tensor is the logits tensor of shape [n_vocab, n_tokens]. The graph also contains two named input tensors: "embd" (token IDs, 1-D integer tensor of length n_tokens) and "position" (position indices, 1-D integer tensor of length n_tokens). The caller must populate these inputs before executing the graph. |
Graph Construction Detail
The function builds the computation graph in the following stages:
1. Input Embedding
Token IDs and position indices are created as named input tensors. The token embedding and positional embedding vectors are looked up via ggml_get_rows and summed with ggml_add to produce the initial hidden state (inpL).
2. Transformer Blocks (repeated per layer)
For each of the model's n_layer Transformer blocks, the following operations are applied:
- LayerNorm 1 -- The hidden state is normalized using ggml_norm, then scaled and shifted by learned parameters via ggml_mul and ggml_add.
- QKV Projection -- The normalized hidden state is projected through the combined QKV weight matrix using ggml_mul_mat with a bias addition. The result is split into separate Q, K, and V tensors using ggml_view_2d.
- Head Reshaping -- Q, K, and V are reshaped into per-head 3-D tensors via ggml_cont_3d and reordered with ggml_permute to place the head dimension first.
- KV Cache Update -- The current K and V tensors are written into the KV cache at position n_past using ggml_cpy. Full cached K and V tensors are then read back as views for the attention computation.
- Scaled Dot-Product Attention -- Attention scores are computed via ggml_mul_mat (Q * K^T), scaled by 1/sqrt(d_head) with ggml_scale, causally masked with ggml_diag_mask_inf, normalized with ggml_soft_max, and multiplied by V via ggml_mul_mat.
- Output Projection -- Attention heads are concatenated, permuted back to the original layout, and projected through the output weight matrix with ggml_mul_mat.
- Residual Connection 1 -- The attention output is added to the original hidden state via ggml_add.
- LayerNorm 2 -- The updated hidden state is normalized again with ggml_norm, ggml_mul, and ggml_add.
- MLP (Feed-Forward Network) -- A fully connected layer (ggml_mul_mat + bias) expands the hidden dimension, followed by ggml_gelu activation, then a second fully connected layer projects back to the model dimension.
- Residual Connection 2 -- The MLP output is added to the post-attention hidden state via ggml_add.
3. Final Projection
After all Transformer blocks, a final LayerNorm is applied to the output hidden state. The normalized output is then projected through the language-model head weight matrix using ggml_mul_mat to produce logits over the full vocabulary.
4. Graph Finalization
ggml_build_forward_expand is called on the logits tensor, which recursively walks the dependency tree and populates the computation graph with all required nodes and leaves.
GGML Operations Used
| Operation | Purpose |
|---|---|
| ggml_get_rows | Embedding lookup (token and position embeddings) |
| ggml_add | Bias addition, residual connections, embedding summation |
| ggml_mul | Element-wise scaling in layer normalization |
| ggml_norm | Layer normalization (mean/variance computation) |
| ggml_mul_mat | Linear projections (QKV, attention output, MLP layers, LM head) |
| ggml_view_2d | Splitting combined QKV tensor into separate Q, K, V |
| ggml_permute | Reordering tensor dimensions for multi-head layout |
| ggml_cont_3d | Making contiguous 3-D copies for head reshaping |
| ggml_cpy | Writing K and V into the KV cache |
| ggml_scale | Scaling attention scores by 1/sqrt(d_head) |
| ggml_diag_mask_inf | Applying causal mask (upper triangle set to -inf) |
| ggml_soft_max | Softmax normalization of attention weights |
| ggml_gelu | GELU activation in the MLP sublayer |
| ggml_build_forward_expand | Finalizing the computation graph from the terminal tensor |
Usage Example
// Build the computation graph for one generation step
struct ggml_cgraph * gf = gpt2_graph(model, n_past, n_tokens);

// Allocate the graph on the backend scheduler
ggml_backend_sched_alloc_graph(sched, gf);

// Set input tensors
struct ggml_tensor * embd_tensor = ggml_graph_get_tensor(gf, "embd");
struct ggml_tensor * pos_tensor  = ggml_graph_get_tensor(gf, "position");
ggml_backend_tensor_set(embd_tensor, token_ids, 0, n_tokens * sizeof(int32_t));
ggml_backend_tensor_set(pos_tensor,  positions, 0, n_tokens * sizeof(int32_t));

// Execute the graph
ggml_backend_sched_graph_compute(sched, gf);

// Read logits from the terminal tensor
struct ggml_tensor * logits = ggml_graph_get_tensor(gf, "logits");
ggml_backend_tensor_get(logits, logits_buf, 0, n_vocab * n_tokens * sizeof(float));