Implementation:Ggml org Ggml Gpt2 graph
| Attribute | Value |
|---|---|
| Page Type | Implementation |
| Full Name | Ggml_org_Ggml_Gpt2_graph |
| Short Name | Gpt2_graph |
| Repository | https://github.com/ggml-org/ggml |
| Language | C++ |
| Domain Tags | NLP, Transformer_Architecture |
| Knowledge Sources | |
| Last Updated | 2025-05-15 12:00 GMT |
Overview
Description
gpt2_graph constructs a full GPT-2 forward-pass computation graph from a loaded model and inference parameters. It translates the Transformer architecture — embedding, positional encoding, repeated self-attention and MLP blocks with residual connections, and a language-model head — into a GGML computation graph that can be dispatched to any backend for execution.
The function builds the entire forward pass as a directed acyclic graph of GGML tensor operations. The graph is built once per inference call and encodes every operation from token embedding lookup through to the final logit projection. Input tensors (token IDs and position offsets) are embedded in the graph as named input nodes, allowing the caller to populate them before each execution.
Usage
gpt2_graph is called once per generation step in the GPT-2 inference loop. The caller provides the loaded model, the number of previously generated tokens (for KV cache positioning), and the number of new tokens to process. The returned graph is then allocated and executed via the GGML backend scheduler.
Code Reference
Source Location
| Attribute | Value |
|---|---|
| File | examples/gpt-2/main-backend.cpp |
| Lines | L446-721 |
| Repository | https://github.com/ggml-org/ggml |
Signature
struct ggml_cgraph * gpt2_graph(
const gpt2_model & model,
const int n_past,
const int n_tokens);
Import
#include "ggml.h"
#include "ggml-backend.h"
Dependencies: ggml.h, ggml-backend.h
I/O Contract
Inputs
| Parameter | Type | Description |
|---|---|---|
| model | const gpt2_model & | A loaded GPT-2 model containing all weight tensors (token embeddings, positional embeddings, per-layer attention and MLP weights, layer normalization parameters, and the language-model head projection). Also carries the model hyperparameters (number of layers, heads, embedding dimension, vocabulary size). |
| n_past | const int | Context position offset for the KV cache. Indicates how many tokens have been previously processed and stored in the key-value cache. New key and value entries are written starting at this offset. |
| n_tokens | const int | Number of new tokens to process in this forward pass. During prompt ingestion this may be the full prompt length; during autoregressive generation this is typically 1. |
Outputs
| Return | Type | Description |
|---|---|---|
| computation graph | struct ggml_cgraph * | A complete computation graph encoding the GPT-2 forward pass. The graph's terminal tensor is the logits tensor of shape [n_vocab, n_tokens]. The graph also contains two named input tensors: "embd" (token IDs, 1-D integer tensor of length n_tokens) and "position" (position indices, 1-D integer tensor of length n_tokens). The caller must populate these inputs before executing the graph. |
Graph Construction Detail
The function builds the computation graph in the following stages:
1. Input Embedding
Token IDs and position indices are created as named input tensors. The token embedding and positional embedding vectors are looked up via ggml_get_rows and summed with ggml_add to produce the initial hidden state (inpL).
2. Transformer Blocks (repeated per layer)
For each of the model's n_layer Transformer blocks, the following operations are applied:
- LayerNorm 1 -- The hidden state is normalized using ggml_norm, then scaled and shifted by learned parameters via ggml_mul and ggml_add.
- QKV Projection -- The normalized hidden state is projected through the combined QKV weight matrix using ggml_mul_mat with a bias addition. The result is split into separate Q, K, and V tensors using ggml_view_2d.
- Head Reshaping -- Q, K, and V are reshaped into per-head 3-D tensors via ggml_cont_3d and reordered with ggml_permute to place the head dimension first.
- KV Cache Update -- The current K and V tensors are written into the KV cache at position n_past using ggml_cpy. Full cached K and V tensors are then read back as views for the attention computation.
- Scaled Dot-Product Attention -- Attention scores are computed via ggml_mul_mat (Q * K^T), scaled by 1/sqrt(d_head) with ggml_scale, causally masked with ggml_diag_mask_inf, normalized with ggml_soft_max, and multiplied by V via ggml_mul_mat.
- Output Projection -- Attention heads are concatenated, permuted back to the original layout, and projected through the output weight matrix with ggml_mul_mat.
- Residual Connection 1 -- The attention output is added to the original hidden state via ggml_add.
- LayerNorm 2 -- The updated hidden state is normalized again with ggml_norm, ggml_mul, and ggml_add.
- MLP (Feed-Forward Network) -- A fully connected layer (ggml_mul_mat + bias) expands the hidden dimension, followed by ggml_gelu activation, then a second fully connected layer projects back to the model dimension.
- Residual Connection 2 -- The MLP output is added to the post-attention hidden state via ggml_add.
3. Final Projection
After all Transformer blocks, a final LayerNorm is applied to the output hidden state. The normalized output is then projected through the language-model head weight matrix using ggml_mul_mat to produce logits over the full vocabulary.
4. Graph Finalization
ggml_build_forward_expand is called on the logits tensor, which recursively walks the dependency tree and populates the computation graph with all required nodes and leaves.
GGML Operations Used
| Operation | Purpose |
|---|---|
| ggml_get_rows | Embedding lookup (token and position embeddings) |
| ggml_add | Bias addition, residual connections, embedding summation |
| ggml_mul | Element-wise scaling in layer normalization |
| ggml_norm | Layer normalization (mean/variance computation) |
| ggml_mul_mat | Linear projections (QKV, attention output, MLP layers, LM head) |
| ggml_view_2d | Splitting combined QKV tensor into separate Q, K, V |
| ggml_permute | Reordering tensor dimensions for multi-head layout |
| ggml_cont_3d | Making contiguous 3-D copies for head reshaping |
| ggml_cpy | Writing K and V into the KV cache |
| ggml_scale | Scaling attention scores by 1/sqrt(d_head) |
| ggml_diag_mask_inf | Applying causal mask (upper triangle set to -inf) |
| ggml_soft_max | Softmax normalization of attention weights |
| ggml_gelu | GELU activation in the MLP sublayer |
| ggml_build_forward_expand | Finalizing the computation graph from the terminal tensor |
Usage Example
// Build the computation graph for one generation step
struct ggml_cgraph * gf = gpt2_graph(model, n_past, n_tokens);

// Allocate the graph on the backend scheduler
ggml_backend_sched_alloc_graph(sched, gf);

// Set input tensors
struct ggml_tensor * embd_tensor = ggml_graph_get_tensor(gf, "embd");
struct ggml_tensor * pos_tensor  = ggml_graph_get_tensor(gf, "position");
ggml_backend_tensor_set(embd_tensor, token_ids, 0, n_tokens * sizeof(int32_t));
ggml_backend_tensor_set(pos_tensor,  positions, 0, n_tokens * sizeof(int32_t));

// Execute the graph
ggml_backend_sched_graph_compute(sched, gf);

// Read logits from the terminal tensor
struct ggml_tensor * logits = ggml_graph_get_tensor(gf, "logits");
ggml_backend_tensor_get(logits, logits_buf, 0, n_vocab * n_tokens * sizeof(float));