# Implementation:Ggml org Llama cpp Llama Decode
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| ggml-org/llama.cpp | Transformer Forward Pass, Batch Processing, KV Cache Management | 2026-02-14 |
## Overview

### Description

llama_decode performs a forward pass through the transformer model for a batch of tokens. It updates the KV cache with new key-value pairs and computes output logits for the positions marked in the batch's output mask. This is the primary compute function in the inference pipeline: every generated token requires exactly one llama_decode call.

The function requires the context to have an initialized memory (KV cache). For encoder-decoder models, llama_decode runs the decoder; to run the encoder (which does not use a KV cache), use llama_encode instead.
### Usage

```cpp
// Process prompt tokens
llama_batch batch = llama_batch_get_one(prompt_tokens.data(), prompt_tokens.size());
if (llama_decode(ctx, batch)) {
    fprintf(stderr, "Failed to decode\n");
    return 1;
}

// Process a single generated token
llama_token new_token = llama_sampler_sample(smpl, ctx, -1);
batch = llama_batch_get_one(&new_token, 1);
if (llama_decode(ctx, batch)) {
    fprintf(stderr, "Failed to decode\n");
    return 1;
}
```
## Code Reference

### Source Location

| File | Line(s) | Type |
|---|---|---|
| include/llama.h | 930-932 | Declaration |
| src/llama-context.cpp | 3466-3475 | Implementation |
### Signature

```c
LLAMA_API int32_t llama_decode(
        struct llama_context * ctx,
        struct llama_batch     batch);
```

### Import

```c
#include "llama.h"
```
## Supporting Types

### llama_batch

The llama_batch struct (defined at include/llama.h:231-240) describes the input batch:

```c
typedef struct llama_batch {
    int32_t         n_tokens;

    llama_token  *  token;    // token IDs (used when embd is NULL)
    float        *  embd;     // token embeddings (used when token is NULL)
    llama_pos    *  pos;      // position of each token (NULL = auto-tracked)
    int32_t      *  n_seq_id; // number of sequence IDs per token
    llama_seq_id ** seq_id;   // sequence IDs for each token (NULL = seq 0)
    int8_t       *  logits;   // output mask: nonzero = produce logits for this position
                              // (NULL: only last token for generation, all for embeddings)
} llama_batch;
```
### llama_batch_get_one

A convenience function that creates a simple batch for single-sequence processing:

```c
// Declaration at include/llama.h:889-891
LLAMA_API struct llama_batch llama_batch_get_one(
        llama_token * tokens,
        int32_t       n_tokens);
```

This creates a batch where:

- All tokens belong to sequence ID 0
- Positions are tracked automatically by the context
- Only the last token produces output logits (appropriate for autoregressive generation)

For more complex scenarios (multi-sequence, explicit positions, custom output masks), use llama_batch_init to allocate a fully configurable batch:

```c
// Allocate a batch with room for n_tokens tokens and up to n_seq_max
// sequence IDs per token; embd = 0 allocates token IDs, while a nonzero
// embd allocates embeddings of that size per token instead
LLAMA_API struct llama_batch llama_batch_init(int32_t n_tokens, int32_t embd, int32_t n_seq_max);

// Free a batch allocated with llama_batch_init
LLAMA_API void llama_batch_free(struct llama_batch batch);
```
## I/O Contract

### Inputs

| Parameter | Type | Description |
|---|---|---|
| ctx | struct llama_context * | Inference context with an initialized KV cache. Must have been created with llama_init_from_model. |
| batch | struct llama_batch | Batch of tokens (or embeddings) to process. n_tokens must not exceed the context's n_batch setting. |
### Outputs

| Return | Type | Description |
|---|---|---|
| 0 | int32_t | Success. KV cache updated; logits available for positions marked in the output mask. |
| 1 | int32_t | Warning: could not find a KV slot for the batch. Try reducing the batch size or increasing the context size (n_ctx). Memory state is restored. |
| 2 | int32_t | Aborted by the abort callback. Processed micro-batches remain in memory. |
| -1 | int32_t | Invalid input batch. Memory state is restored. |
| < -1 | int32_t | Fatal error. Processed micro-batches remain in memory. |
### Side Effects

- Updates the KV cache with new key-value pairs for the processed tokens
- Computes and stores logits accessible via llama_get_logits(ctx) or llama_get_logits_ith(ctx, idx)
- Advances internal position tracking for auto-positioned batches
- Updates performance counters (if no_perf is false)
## Usage Examples

### Complete Generation Loop (from examples/simple/simple.cpp)

```cpp
// Prepare a batch for the prompt
llama_batch batch = llama_batch_get_one(prompt_tokens.data(), prompt_tokens.size());

// Handle encoder-decoder models
if (llama_model_has_encoder(model)) {
    if (llama_encode(ctx, batch)) {
        fprintf(stderr, "failed to encode\n");
        return 1;
    }

    llama_token decoder_start_token_id = llama_model_decoder_start_token(model);
    if (decoder_start_token_id == LLAMA_TOKEN_NULL) {
        decoder_start_token_id = llama_vocab_bos(vocab);
    }

    batch = llama_batch_get_one(&decoder_start_token_id, 1);
}

// Main generation loop
int n_decode = 0;
llama_token new_token_id;

for (int n_pos = 0; n_pos + batch.n_tokens < n_prompt + n_predict; ) {
    // Forward pass through the transformer
    if (llama_decode(ctx, batch)) {
        fprintf(stderr, "failed to decode\n");
        return 1;
    }

    n_pos += batch.n_tokens;

    // Sample the next token from the logits
    new_token_id = llama_sampler_sample(smpl, ctx, -1);

    // Check for end of generation
    if (llama_vocab_is_eog(vocab, new_token_id)) {
        break;
    }

    // Convert the token to text and print it
    char buf[128];
    int n = llama_token_to_piece(vocab, new_token_id, buf, sizeof(buf), 0, true);
    std::string s(buf, n);
    printf("%s", s.c_str());
    fflush(stdout);

    // Prepare the next batch with just the sampled token
    batch = llama_batch_get_one(&new_token_id, 1);
    n_decode += 1;
}
```
### Using llama_batch_init for Multi-Sequence Batches

```cpp
// Allocate a batch that can hold up to 512 tokens, each in up to 4 sequences
llama_batch batch = llama_batch_init(512, 0, 4);

// Fill in tokens manually
batch.n_tokens = 3;
batch.token[0] = token_a; batch.pos[0] = 0; batch.logits[0] = 0;
batch.token[1] = token_b; batch.pos[1] = 1; batch.logits[1] = 0;
batch.token[2] = token_c; batch.pos[2] = 2; batch.logits[2] = 1; // only produce logits here

// Set sequence IDs
batch.n_seq_id[0] = 1; batch.seq_id[0][0] = 0;
batch.n_seq_id[1] = 1; batch.seq_id[1][0] = 0;
batch.n_seq_id[2] = 1; batch.seq_id[2][0] = 0;

int result = llama_decode(ctx, batch);

// Free when done
llama_batch_free(batch);
```
## Related Pages
- Principle:Ggml_org_Llama_cpp_Batch_Decoding
- Implementation:Ggml_org_Llama_cpp_Llama_Init_From_Model -- creates the context used by decode
- Implementation:Ggml_org_Llama_cpp_Llama_Tokenize -- produces the token IDs fed into decode
- Implementation:Ggml_org_Llama_cpp_Llama_Sampler_Sample -- consumes the logits produced by decode
- Environment:Ggml_org_Llama_cpp_CUDA_GPU_Environment
- Environment:Ggml_org_Llama_cpp_Metal_GPU_Environment
- Environment:Ggml_org_Llama_cpp_Vulkan_GPU_Environment
- Heuristic:Ggml_org_Llama_cpp_Thread_Count_Tuning
- Heuristic:Ggml_org_Llama_cpp_Batch_Size_BLAS_Minimum