
Implementation:Ggml org Ggml Gpt2 graph

From Leeroopedia


Attribute Value
Page Type Implementation
Full Name Ggml_org_Ggml_Gpt2_graph
Short Name Gpt2_graph
Repository https://github.com/ggml-org/ggml
Language C++
Domain Tags NLP, Transformer_Architecture
Knowledge Sources
Last Updated 2025-05-15 12:00 GMT

Overview

Description

gpt2_graph constructs a full GPT-2 forward-pass computation graph from a loaded model and inference parameters. It translates the Transformer architecture — embedding, positional encoding, repeated self-attention and MLP blocks with residual connections, and a language-model head — into a GGML computation graph that can be dispatched to any backend for execution.

The function builds the entire forward pass as a directed acyclic graph of GGML tensor operations. The graph is built once per inference call and encodes every operation from token embedding lookup through to the final logit projection. Input tensors (token IDs and position offsets) are embedded in the graph as named input nodes, allowing the caller to populate them before each execution.

Usage

gpt2_graph is called once per generation step in the GPT-2 inference loop. The caller provides the loaded model, the number of previously generated tokens (for KV cache positioning), and the number of new tokens to process. The returned graph is then allocated and executed via the GGML backend scheduler.

Code Reference

Source Location

Attribute Value
File examples/gpt-2/main-backend.cpp
Lines L446-721
Repository https://github.com/ggml-org/ggml

Signature

struct ggml_cgraph * gpt2_graph(
    const gpt2_model & model,
    const int          n_past,
    const int          n_tokens);

Import

#include "ggml.h"
#include "ggml-backend.h"

Dependencies: ggml.h, ggml-backend.h

I/O Contract

Inputs

Parameter Type Description
model const gpt2_model & A loaded GPT-2 model containing all weight tensors (token embeddings, positional embeddings, per-layer attention and MLP weights, layer normalization parameters, and the language-model head projection). Also carries the model hyperparameters (number of layers, heads, embedding dimension, vocabulary size).
n_past const int Context position offset for the KV cache. Indicates how many tokens have been previously processed and stored in the key-value cache. New key and value entries are written starting at this offset.
n_tokens const int Number of new tokens to process in this forward pass. During prompt ingestion this may be the full prompt length; during autoregressive generation this is typically 1.

Outputs

Return Type Description
computation graph struct ggml_cgraph * A complete computation graph encoding the GPT-2 forward pass. The graph's terminal tensor is the logits tensor of shape [n_vocab, n_tokens]. The graph also contains two named input tensors: "embd" (token IDs, 1-D integer tensor of length n_tokens) and "position" (position indices, 1-D integer tensor of length n_tokens). The caller must populate these inputs before executing the graph.

Graph Construction Detail

The function builds the computation graph in the following stages:

1. Input Embedding

Token IDs and position indices are created as named input tensors. The token embedding and positional embedding vectors are looked up via ggml_get_rows and summed with ggml_add to produce the initial hidden state (inpL).

2. Transformer Blocks (repeated per layer)

For each of the model's n_layer Transformer blocks, the following operations are applied:

  1. LayerNorm 1 -- The hidden state is normalized using ggml_norm, then scaled and shifted by learned parameters via ggml_mul and ggml_add.
  2. QKV Projection -- The normalized hidden state is projected through the combined QKV weight matrix using ggml_mul_mat with a bias addition. The result is split into separate Q, K, and V tensors using ggml_view_2d.
  3. Head Reshaping -- Q, K, and V are reshaped into per-head 3-D tensors via ggml_cont_3d and reordered with ggml_permute to place the head dimension first.
  4. KV Cache Update -- The current K and V tensors are written into the KV cache at position n_past using ggml_cpy. Full cached K and V tensors are then read back as views for the attention computation.
  5. Scaled Dot-Product Attention -- Attention scores are computed via ggml_mul_mat (Q * K^T), scaled by 1/sqrt(d_head) with ggml_scale, causally masked with ggml_diag_mask_inf, normalized with ggml_soft_max, and multiplied by V via ggml_mul_mat.
  6. Output Projection -- Attention heads are concatenated, permuted back to the original layout, and projected through the output weight matrix with ggml_mul_mat.
  7. Residual Connection 1 -- The attention output is added to the original hidden state via ggml_add.
  8. LayerNorm 2 -- The updated hidden state is normalized again with ggml_norm, ggml_mul, and ggml_add.
  9. MLP (Feed-Forward Network) -- A fully connected layer (ggml_mul_mat + bias) expands the hidden dimension, followed by ggml_gelu activation, then a second fully connected layer projects back to the model dimension.
  10. Residual Connection 2 -- The MLP output is added to the post-attention hidden state via ggml_add.
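The scale / causal-mask / softmax sequence in step 5 can be sketched in plain C++ for a single head (GGML expresses it as ggml_scale, ggml_diag_mask_inf, and ggml_soft_max on the full per-head score tensor; the function below is an illustrative sketch, not the source code):

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Plain-C++ sketch of step 5 for one attention head. scores is a row-major
// [n_q x n_kv] matrix of raw Q*K^T dot products; query i may attend to cache
// positions 0..n_past+i. Each row is scaled by 1/sqrt(d_head), future
// positions are masked to -inf, and the row is softmax-normalized in place.
void attn_weights(std::vector<float> & scores, int n_q, int n_kv,
                  int n_past, int d_head) {
    const float scale = 1.0f / std::sqrt((float) d_head);
    for (int i = 0; i < n_q; ++i) {
        float maxv = -std::numeric_limits<float>::infinity();
        for (int j = 0; j < n_kv; ++j) {
            float & s = scores[i*n_kv + j];
            s = (j > n_past + i) ? -std::numeric_limits<float>::infinity() // causal mask
                                 : s * scale;                              // 1/sqrt(d_head)
            maxv = std::max(maxv, s);
        }
        float sum = 0.0f; // numerically stable softmax over the row
        for (int j = 0; j < n_kv; ++j) {
            float & s = scores[i*n_kv + j];
            s = std::exp(s - maxv);
            sum += s;
        }
        for (int j = 0; j < n_kv; ++j) scores[i*n_kv + j] /= sum;
    }
}
```

Masked entries become exactly zero after the softmax, so each query distributes its attention only over positions up to and including its own.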

3. Final Projection

After all Transformer blocks, a final LayerNorm is applied to the output hidden state. The normalized output is then projected through the language-model head weight matrix using ggml_mul_mat to produce logits over the full vocabulary.
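For a single token, this final stage can be sketched in plain C++ (a hedged sketch: the epsilon value and the tied token-embedding/LM-head weights follow the usual GPT-2 convention, and all names here are illustrative):

```cpp
#include <cmath>
#include <vector>

// Plain-C++ sketch of the final stage for one token: LayerNorm the last
// hidden state (normalize, then scale by ln_g and shift by ln_b), then
// project it through the LM-head matrix (row-major [n_vocab x n_embd])
// to produce logits over the vocabulary.
std::vector<float> lm_head(std::vector<float> h,            // [n_embd] hidden state
                           const std::vector<float> & ln_g, // [n_embd] LN scale
                           const std::vector<float> & ln_b, // [n_embd] LN shift
                           const std::vector<float> & lm_w, // [n_vocab * n_embd]
                           int n_vocab, float eps = 1e-5f) {
    const int n_embd = (int) h.size();
    float mean = 0.0f, var = 0.0f;
    for (float x : h) mean += x;
    mean /= n_embd;
    for (float x : h) var += (x - mean) * (x - mean);
    var /= n_embd;
    const float inv = 1.0f / std::sqrt(var + eps);
    for (int d = 0; d < n_embd; ++d) {
        h[d] = (h[d] - mean) * inv * ln_g[d] + ln_b[d]; // normalize, scale, shift
    }
    std::vector<float> logits(n_vocab, 0.0f); // logits = lm_w * h
    for (int v = 0; v < n_vocab; ++v) {
        for (int d = 0; d < n_embd; ++d) {
            logits[v] += lm_w[v*n_embd + d] * h[d];
        }
    }
    return logits;
}
```

In the graph, the same computation appears once as ggml_norm / ggml_mul / ggml_add followed by a final ggml_mul_mat over all n_tokens positions at once.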

4. Graph Finalization

ggml_build_forward_expand is called on the logits tensor; it recursively walks the tensor's dependency DAG and populates the computation graph with every required operation node and weight leaf in evaluation order.
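The expansion can be illustrated with a toy post-order traversal (this is a conceptual sketch, not GGML's implementation, which also distinguishes leaves from operation nodes and uses its own visited bookkeeping):

```cpp
#include <unordered_set>
#include <vector>

// Toy illustration of graph finalization: starting from the terminal tensor,
// walk the dependency DAG post-order so that every node is appended only
// after all of its inputs, yielding a valid evaluation order.
struct Node {
    std::vector<Node *> src; // input tensors this node depends on
};

void expand(Node * n, std::unordered_set<Node *> & visited,
            std::vector<Node *> & order) {
    if (!visited.insert(n).second) return; // already in the graph
    for (Node * s : n->src) expand(s, visited, order);
    order.push_back(n);                    // inputs first, then this node
}
```

Because shared dependencies (e.g. weight tensors used by several layers) are visited once, each tensor appears in the graph exactly once.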

GGML Operations Used

Operation Purpose
ggml_get_rows Embedding lookup (token and position embeddings)
ggml_add Bias addition, residual connections, embedding summation
ggml_mul Element-wise scaling in layer normalization
ggml_norm Layer normalization (mean/variance computation)
ggml_mul_mat Linear projections (QKV, attention output, MLP layers, LM head)
ggml_view_2d Splitting combined QKV tensor into separate Q, K, V
ggml_permute Reordering tensor dimensions for multi-head layout
ggml_cont_3d Making contiguous 3-D copies for head reshaping
ggml_cpy Writing K and V into the KV cache
ggml_scale Scaling attention scores by 1/sqrt(d_head)
ggml_diag_mask_inf Applying causal mask (upper triangle set to -inf)
ggml_soft_max Softmax normalization of attention weights
ggml_gelu GELU activation in the MLP sublayer
ggml_build_forward_expand Finalizing the computation graph from the terminal tensor
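Of these, ggml_gelu is the only nonlinearity. It corresponds to the tanh approximation of GELU used by GPT-2, which can be sketched directly (GGML itself evaluates the activation through an internal lookup table, so exact values may differ slightly):

```cpp
#include <cmath>

// Tanh approximation of GELU, the activation behind ggml_gelu:
// gelu(x) = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
float gelu(float x) {
    const float c = 0.7978845608f; // sqrt(2/pi)
    return 0.5f * x * (1.0f + std::tanh(c * (x + 0.044715f * x*x*x)));
}
```

The approximation is odd-symmetric-looking near zero (gelu(0) = 0) and approaches the identity for large positive inputs, which is why the MLP can expand and contract the hidden dimension without saturating.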

Usage Example

// Build the computation graph for one generation step
struct ggml_cgraph * gf = gpt2_graph(model, n_past, n_tokens);

// Allocate the graph on the backend scheduler
ggml_backend_sched_alloc_graph(sched, gf);

// Set input tensors
struct ggml_tensor * embd_tensor = ggml_graph_get_tensor(gf, "embd");
struct ggml_tensor * pos_tensor  = ggml_graph_get_tensor(gf, "position");
ggml_backend_tensor_set(embd_tensor, token_ids, 0, n_tokens * sizeof(int32_t));
ggml_backend_tensor_set(pos_tensor,  positions,  0, n_tokens * sizeof(int32_t));

// Execute the graph
ggml_backend_sched_graph_compute(sched, gf);

// Read logits from the terminal tensor
struct ggml_tensor * logits = ggml_graph_get_tensor(gf, "logits");
ggml_backend_tensor_get(logits, logits_buf, 0, n_vocab * n_tokens * sizeof(float));

Related Pages

Implements Principle
