
Implementation: llama_decode (ggml-org/llama.cpp)

From Leeroopedia
Knowledge Sources: ggml-org/llama.cpp
Domains: Transformer Forward Pass, Batch Processing, KV Cache Management
Last Updated: 2026-02-14

Overview

Description

llama_decode performs a forward pass through the transformer model for a batch of tokens. It updates the KV cache with new key-value pairs and computes output logits for positions marked in the batch's output mask. This is the primary compute function in the inference pipeline -- every generated token requires exactly one llama_decode call.

The function requires the context to have an initialized memory (KV cache). For encoder-decoder models, it processes the batch using the decoder. For encoder processing (which does not use a KV cache), use llama_encode instead.

Usage

// Process prompt tokens
llama_batch batch = llama_batch_get_one(prompt_tokens.data(), prompt_tokens.size());
if (llama_decode(ctx, batch)) {
    fprintf(stderr, "Failed to decode\n");
    return 1;
}

// Process single generated token
llama_token new_token = llama_sampler_sample(smpl, ctx, -1);
batch = llama_batch_get_one(&new_token, 1);
if (llama_decode(ctx, batch)) {
    fprintf(stderr, "Failed to decode\n");
    return 1;
}

Code Reference

Source Location

File Line(s) Type
include/llama.h 930-932 Declaration
src/llama-context.cpp 3466-3475 Implementation

Signature

LLAMA_API int32_t llama_decode(
        struct llama_context * ctx,
          struct llama_batch   batch);

Import

#include "llama.h"

Supporting Types

llama_batch

The llama_batch struct (defined at include/llama.h:231-240) describes the input batch:

typedef struct llama_batch {
    int32_t n_tokens;

    llama_token  *  token;    // token IDs (used when embd is NULL)
    float        *  embd;     // token embeddings (used when token is NULL)
    llama_pos    *  pos;      // position of each token (NULL = auto-tracked)
    int32_t      *  n_seq_id; // number of sequence IDs per token
    llama_seq_id ** seq_id;   // sequence IDs for each token (NULL = seq 0)
    int8_t       *  logits;   // output mask: nonzero = produce logits for this position
                              // (NULL: only last token for generation, all for embeddings)
} llama_batch;

llama_batch_get_one

A convenience function that creates a simple batch for single-sequence processing:

// Declaration at include/llama.h:889-891
LLAMA_API struct llama_batch llama_batch_get_one(
              llama_token * tokens,
                  int32_t   n_tokens);

This creates a batch where:

  • All tokens belong to sequence ID 0
  • Positions are tracked automatically by the context
  • Only the last token produces output logits (appropriate for autoregressive generation)

For more complex scenarios (multi-sequence, explicit positions, custom output masks), use llama_batch_init to allocate a fully configurable batch:

// Allocate batch with room for n_tokens, no embeddings, up to n_seq_max sequences per token
LLAMA_API struct llama_batch llama_batch_init(int32_t n_tokens, int32_t embd, int32_t n_seq_max);

// Free a batch allocated with llama_batch_init
LLAMA_API void llama_batch_free(struct llama_batch batch);

I/O Contract

Inputs

Parameter Type Description
ctx struct llama_context * Inference context with initialized KV cache. Must have been created with llama_init_from_model.
batch struct llama_batch Batch of tokens (or embeddings) to process. n_tokens must not exceed the context's n_batch setting.

Outputs

Return Type Description
0 int32_t Success. KV cache updated, logits available for positions marked in the output mask.
1 int32_t Warning: could not find a KV slot for the batch. Try reducing batch size or increasing context size (n_ctx). Memory state is restored.
2 int32_t Aborted by the abort callback. Processed micro-batches remain in memory.
-1 int32_t Invalid input batch. Memory state is restored.
< -1 int32_t Fatal error. Processed micro-batches remain in memory.

Side Effects

  • Updates the KV cache with new key-value pairs for the processed tokens
  • Computes and stores logits accessible via llama_get_logits(ctx) or llama_get_logits_ith(ctx, idx)
  • Advances internal position tracking for auto-positioned batches
  • Updates performance counters (if no_perf is false)

Usage Examples

Complete Generation Loop (from examples/simple/simple.cpp)

// Prepare a batch for the prompt
llama_batch batch = llama_batch_get_one(prompt_tokens.data(), prompt_tokens.size());

// Handle encoder-decoder models
if (llama_model_has_encoder(model)) {
    if (llama_encode(ctx, batch)) {
        fprintf(stderr, "failed to encode\n");
        return 1;
    }
    llama_token decoder_start_token_id = llama_model_decoder_start_token(model);
    if (decoder_start_token_id == LLAMA_TOKEN_NULL) {
        decoder_start_token_id = llama_vocab_bos(vocab);
    }
    batch = llama_batch_get_one(&decoder_start_token_id, 1);
}

// Main generation loop
int n_decode = 0;
llama_token new_token_id;

for (int n_pos = 0; n_pos + batch.n_tokens < n_prompt + n_predict; ) {
    // Forward pass through the transformer
    if (llama_decode(ctx, batch)) {
        fprintf(stderr, "failed to decode\n");
        return 1;
    }

    n_pos += batch.n_tokens;

    // Sample the next token from the logits
    new_token_id = llama_sampler_sample(smpl, ctx, -1);

    // Check for end of generation
    if (llama_vocab_is_eog(vocab, new_token_id)) {
        break;
    }

    // Convert token to text and print
    char buf[128];
    int n = llama_token_to_piece(vocab, new_token_id, buf, sizeof(buf), 0, true);
    std::string s(buf, n);
    printf("%s", s.c_str());
    fflush(stdout);

    // Prepare the next batch with just the sampled token
    batch = llama_batch_get_one(&new_token_id, 1);
    n_decode += 1;
}

Using llama_batch_init for Multi-Sequence Batches

// Allocate a batch that can hold up to 512 tokens, each in up to 4 sequences
llama_batch batch = llama_batch_init(512, 0, 4);

// Fill in tokens manually
batch.n_tokens = 3;
batch.token[0] = token_a;  batch.pos[0] = 0;  batch.logits[0] = 0;
batch.token[1] = token_b;  batch.pos[1] = 1;  batch.logits[1] = 0;
batch.token[2] = token_c;  batch.pos[2] = 2;  batch.logits[2] = 1;  // only produce logits here

// Set sequence IDs
batch.n_seq_id[0] = 1;  batch.seq_id[0][0] = 0;
batch.n_seq_id[1] = 1;  batch.seq_id[1][0] = 0;
batch.n_seq_id[2] = 1;  batch.seq_id[2][0] = 0;

int result = llama_decode(ctx, batch);

// Free when done
llama_batch_free(batch);
