Principle: Transformer Graph Construction (ggml-org/ggml)
| Knowledge Sources | |
|---|---|
| Domains | NLP, Transformer_Architecture |
| Last Updated | 2025-05-15 12:00 GMT |
Overview
Transformer Graph Construction is the process of building a complete computation graph for a Transformer architecture from a set of loaded model weights.
Description
A Transformer model is expressed in GGML as a single computation graph (ggml_cgraph) that encodes every tensor operation from token embedding through to the final logit projection. The graph is constructed once per inference call and captures the full forward pass of the network: embedding lookup, positional encoding, a repeated stack of Transformer blocks, and a language-model head that produces next-token logits.
The standard Transformer block follows this structure:
- LayerNorm -- Normalize the residual stream.
- Multi-Head Self-Attention -- Project the normalized activations into queries, keys, and values; compute scaled dot-product attention with causal masking; concatenate heads and project back to the model dimension.
- Residual Connection -- Add the attention output to the original residual stream.
- LayerNorm -- Normalize the updated residual stream.
- MLP (Feed-Forward Network) -- Apply a fully connected layer, a GELU activation, and a second fully connected projection.
- Residual Connection -- Add the MLP output to the residual stream.
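The block structure above can be sketched in NumPy. This is a minimal illustration of the math only, not ggml code; the parameter names (w_qkv, w_out, w_up, w_down) and shapes are hypothetical stand-ins for loaded model tensors, and learnable LayerNorm scale/shift parameters are omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row of the residual stream to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU, as used by GPT-2.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def transformer_block(x, params, n_head):
    """One pre-norm block: LN -> attention -> residual -> LN -> MLP -> residual."""
    n_tok, d = x.shape
    # --- multi-head self-attention sublayer ---
    h = layer_norm(x)
    qkv = h @ params["w_qkv"]                              # (n_tok, 3*d)
    q, k, v = np.split(qkv, 3, axis=-1)
    d_head = d // n_head
    # split into heads: (n_head, n_tok, d_head)
    q = q.reshape(n_tok, n_head, d_head).transpose(1, 0, 2)
    k = k.reshape(n_tok, n_head, d_head).transpose(1, 0, 2)
    v = v.reshape(n_tok, n_head, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)    # (n_head, n_tok, n_tok)
    # causal mask: token i may not attend to tokens j > i
    scores = scores + np.triu(np.full((n_tok, n_tok), -np.inf), k=1)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)                  # softmax over key positions
    attn = (w @ v).transpose(1, 0, 2).reshape(n_tok, d)    # concatenate heads
    x = x + attn @ params["w_out"]                         # residual connection
    # --- MLP sublayer ---
    h = layer_norm(x)
    x = x + gelu(h @ params["w_up"]) @ params["w_down"]    # residual connection
    return x
```

Because of the causal mask, perturbing a later token leaves the outputs at earlier positions unchanged, which is what makes incremental decoding with a KV cache possible.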
After all blocks have been applied, a final LayerNorm and a linear projection (the LM head) produce per-vocabulary logits. The same graph structure is constructed for each token generation step; only the input tensors (token IDs and position offsets) change between steps.
Usage
Apply this principle whenever a Transformer-based language model must be expressed as a GGML computation graph. Build the graph by chaining GGML operation helpers in the order prescribed by the architecture (embedding, positional encoding, repeated attention + MLP blocks, final projection), then finalize with ggml_build_forward_expand and dispatch to a backend for execution.
Theoretical Basis
Transformer Architecture (Vaswani et al. 2017)
The Transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), replaced recurrence with a purely attention-based mechanism. The architecture stacks identical blocks, each containing a multi-head self-attention sublayer and a position-wise feed-forward sublayer, with residual connections and layer normalization around each. This design enables massive parallelism during training and forms the foundation of modern large language models.
Self-Attention Mechanism
Scaled dot-product attention computes a weighted sum of value vectors, where the weights are determined by the compatibility (dot product) between query and key vectors:
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
Multi-head attention runs this operation in parallel across multiple "heads", each operating on a different learned linear projection of the input. The outputs of all heads are concatenated and linearly projected to produce the final attention output. This allows the model to jointly attend to information from different representation subspaces at different positions.
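The formula and the head-splitting scheme can be sketched as follows. This is a NumPy illustration of the math (causal masking is applied separately and is omitted here); the weight names w_q, w_k, w_v, w_o are hypothetical:

```python
import numpy as np

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_head):
    # Each head attends over its own learned projection of the input;
    # head outputs are concatenated and linearly projected back to d.
    n_tok, d = x.shape
    d_head = d // n_head
    heads = []
    for h in range(n_head):
        sl = slice(h * d_head, (h + 1) * d_head)
        heads.append(attention(x @ w_q[:, sl], x @ w_k[:, sl], x @ w_v[:, sl]))
    return np.concatenate(heads, axis=-1) @ w_o
```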
KV Cache for Autoregressive Generation
During autoregressive text generation, the model produces one token at a time. Recomputing attention over the entire sequence at each step would be prohibitively expensive. The KV cache stores the key and value projections of all previously processed tokens. At each generation step, only the new token's key and value are computed and appended to the cache, while the full cached keys and values are used for the attention computation. The n_past parameter tracks how many tokens are already in the cache, determining where new entries are written.
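The caching scheme can be sketched in NumPy for a single head (a math illustration, not ggml code; the function name and the combined w_qkv weight are hypothetical, and the scale uses the full dimension d since there is only one head):

```python
import numpy as np

def attend_with_cache(x_new, w_qkv, cache_k, cache_v, n_past):
    """Process one new token: compute its Q/K/V, write K and V into the cache
    at offset n_past, and attend over all cached positions (single head)."""
    d = x_new.shape[-1]
    q, k, v = np.split(x_new @ w_qkv, 3, axis=-1)
    cache_k[n_past] = k                # new entries are written at position n_past
    cache_v[n_past] = v
    K = cache_k[: n_past + 1]          # all keys seen so far, including the new one
    V = cache_v[: n_past + 1]
    scores = (K @ q) / np.sqrt(d)      # the new token attends to every past token
    w = np.exp(scores - scores.max())
    w = w / w.sum()                    # softmax
    return w @ V
```

Each step computes only one new K/V pair, so the per-token cost of attention grows linearly with sequence length instead of quadratically, while producing exactly the same result as recomputing attention from scratch.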
Causal Masking
In decoder-only models (such as GPT-2), each token must attend only to itself and to preceding tokens, not to future tokens. This is enforced by applying a causal mask — an upper-triangular matrix of negative infinity values — to the attention scores before the softmax. In GGML, this is implemented via ggml_diag_mask_inf.
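The effect of the mask can be sketched in NumPy. This mirrors the semantics of ggml_diag_mask_inf (rows are the new query positions offset by n_past, columns are key positions), not ggml's exact tensor layout; the function name here is a hypothetical analogue:

```python
import numpy as np

def diag_mask_inf(scores, n_past):
    """For an (n_new, n_past + n_new) score matrix, set every entry whose key
    position lies ahead of its query's absolute position to -inf, so those
    positions receive zero weight after the softmax."""
    n_new, n_kv = scores.shape
    masked = scores.copy()
    for i in range(n_new):
        # query i has absolute position n_past + i; keys beyond it are masked
        masked[i, n_past + i + 1:] = -np.inf
    return masked
```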
Key Operations
The following operations are central to constructing a Transformer computation graph in GGML:
- Embedding lookup (ggml_get_rows) -- Retrieves token embedding vectors from the weight matrix using token IDs as indices.
- Positional encoding (ggml_get_rows) -- Retrieves learned positional embedding vectors and adds them to the token embeddings.
- QKV projection (ggml_mul_mat) -- Multiplies the normalized input by the combined query-key-value weight matrix, then splits the result into separate Q, K, and V tensors via views and permutations.
- Scaled dot-product attention (ggml_mul_mat, ggml_scale, ggml_diag_mask_inf, ggml_soft_max) -- Computes attention scores, applies causal masking, normalizes with softmax, and produces the weighted value sum.
- Feed-forward network (ggml_mul_mat, ggml_gelu) -- Two linear transformations with a GELU activation in between.
- Graph finalization (ggml_build_forward_expand) -- Registers the terminal tensor (logits) and recursively walks its dependency tree to populate the computation graph.
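Putting the operations together, the full forward pass can be sketched end to end in NumPy for a single block with a single attention head. All weights below are random hypothetical stand-ins for loaded model tensors; each comment names the ggml operation the line corresponds to:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def gelu(t):
    return 0.5 * t * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (t + 0.044715 * t**3)))

rng = np.random.default_rng(0)
n_vocab, d, n_tok = 10, 8, 3

# hypothetical weights (random stand-ins for loaded model tensors)
tok_emb = rng.normal(size=(n_vocab, d))
pos_emb = rng.normal(size=(n_tok, d))
w_qkv   = 0.1 * rng.normal(size=(d, 3 * d))
w_out   = 0.1 * rng.normal(size=(d, d))
w_up    = 0.1 * rng.normal(size=(d, 4 * d))
w_down  = 0.1 * rng.normal(size=(4 * d, d))
lm_head = 0.1 * rng.normal(size=(d, n_vocab))

tokens = np.array([1, 4, 7])

x = tok_emb[tokens]                     # embedding lookup    (ggml_get_rows)
x = x + pos_emb[np.arange(n_tok)]       # positional encoding (ggml_get_rows + ggml_add)

h = layer_norm(x)                       # (ggml_norm; real models also scale/shift)
q, k, v = np.split(h @ w_qkv, 3, -1)    # QKV projection      (ggml_mul_mat)
scores = q @ k.T / np.sqrt(d)           # attention scores    (ggml_mul_mat, ggml_scale)
scores = scores + np.triu(np.full((n_tok, n_tok), -np.inf), 1)  # (ggml_diag_mask_inf)
x = x + softmax(scores) @ v @ w_out     # attention + residual (ggml_soft_max, ggml_mul_mat, ggml_add)

h = layer_norm(x)
x = x + gelu(h @ w_up) @ w_down         # feed-forward        (ggml_mul_mat, ggml_gelu)

logits = layer_norm(x) @ lm_head        # final norm + LM head (ggml_mul_mat)
```

In actual ggml code, the chain of calls above is expressed in C against a ggml_context, and the terminal logits tensor is passed to ggml_build_forward_expand to populate the graph before dispatch.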