Implementation: ggml-org/llama.cpp Chat Generation Loop
| Aspect | Detail |
|---|---|
| Implementation Name | Chat Generation Loop |
| Doc Type | Pattern Doc |
| Category | Inference |
| Workflow | Interactive_Chat |
| Applies To | llama.cpp |
| Status | Active |
Overview
Description
This pattern documents the generation loop used for producing chat responses in llama.cpp: decode, sample, check for end-of-generation (EOG), emit. The loop first processes the entire formatted prompt as a single batch, then iteratively decodes one token at a time, samples from the resulting logits, checks for an end-of-generation token, converts the sampled token to text for streaming output, and feeds it back as the next single-token batch. Before each decode call, the loop also checks for context-window overflow.
Usage
This pattern is invoked once per assistant turn in the chat conversation. It receives a formatted prompt string (the incremental output from chat template application), tokenizes it, and generates a complete response. The response is returned as a string and subsequently added to the conversation history.
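The surrounding per-turn driver is outside this pattern, but the "incremental delta" input it supplies is easy to get wrong: each turn re-renders the full history and passes only the suffix that was not rendered before, since earlier tokens already sit in the KV cache. Below is a minimal, self-contained sketch of that contract; `render` is a hypothetical stand-in for `llama_chat_apply_template`, and `msg`/`next_prompt` are illustrative names, not llama.cpp API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for llama_chat_apply_template: renders the whole
// conversation into one formatted string (a real template adds role markers).
struct msg { std::string role, content; };

static std::string render(const std::vector<msg> & msgs) {
    std::string out;
    for (const auto & m : msgs) {
        out += "<" + m.role + ">" + m.content + "\n";
    }
    return out;
}

// Per-turn driver sketch: only the delta since the previous render is passed
// to the generation loop, because earlier tokens are already in the KV cache.
static std::string next_prompt(std::vector<msg> & history, size_t & prev_len,
                               const std::string & user_text) {
    history.push_back({"user", user_text});
    const std::string formatted = render(history);
    const std::string prompt = formatted.substr(prev_len);  // incremental delta
    prev_len = formatted.size();
    return prompt;
}
```

After the generation loop returns, the driver appends the assistant response to `history` and updates `prev_len` so the next turn again sends only the new suffix.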
Code Reference
| Attribute | Value |
|---|---|
| Source Location | examples/simple-chat/simple-chat.cpp:98-152 |
| Key Functions | llama_tokenize(), llama_batch_get_one(), llama_decode(), llama_sampler_sample(), llama_vocab_is_eog(), llama_token_to_piece(), llama_memory_seq_pos_max(), llama_n_ctx() |
| Import | #include "llama.h" |
Key function signatures used in the loop:

```c
// Tokenize a text string; returns the number of tokens on success, or the
// negative of the required token count if n_tokens_max is too small
int32_t llama_tokenize(const struct llama_vocab * vocab, const char * text, int32_t text_len,
                       llama_token * tokens, int32_t n_tokens_max, bool add_special, bool parse_special);

// Create a single-sequence batch from a token array
struct llama_batch llama_batch_get_one(llama_token * tokens, int32_t n_tokens);

// Process a batch through the model
int32_t llama_decode(struct llama_context * ctx, struct llama_batch batch);

// Sample a token from the logits using the sampler chain
llama_token llama_sampler_sample(struct llama_sampler * smpl, struct llama_context * ctx, int32_t idx);

// Check if a token is an end-of-generation marker
bool llama_vocab_is_eog(const struct llama_vocab * vocab, llama_token token);

// Convert a token to its text representation
int32_t llama_token_to_piece(const struct llama_vocab * vocab, llama_token token,
                             char * buf, int32_t length, int32_t lstrip, bool special);
```
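Note the sign convention on llama_tokenize: when the output buffer is too small, it returns the negative of the required token count, which the generation loop below exploits for two-pass tokenization (size first, then fill). The sketch here demonstrates that convention with a toy stand-in; `mock_tokenize` and `tokenize_two_pass` are illustrative names, not the real API.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy stand-in mimicking llama_tokenize's sizing convention: if the buffer is
// too small, return the negative of the required count; otherwise write the
// tokens and return how many were written. (One "token" per word here.)
static int32_t mock_tokenize(const std::string & text, int32_t * tokens, int32_t n_tokens_max) {
    std::vector<int32_t> out;
    size_t pos = 0;
    while (pos < text.size()) {
        while (pos < text.size() && text[pos] == ' ') pos++;
        size_t end = pos;
        while (end < text.size() && text[end] != ' ') end++;
        if (end > pos) out.push_back((int32_t)(end - pos));  // token id = word length
        pos = end;
    }
    if ((int32_t)out.size() > n_tokens_max) {
        return -(int32_t)out.size();  // buffer too small: report required size
    }
    for (size_t i = 0; i < out.size(); i++) tokens[i] = out[i];
    return (int32_t)out.size();
}

// Two-pass pattern used by the generation loop: first call sizes, second fills.
static std::vector<int32_t> tokenize_two_pass(const std::string & text) {
    const int32_t n = -mock_tokenize(text, nullptr, 0);  // pass 1: count only
    std::vector<int32_t> tokens(n);
    mock_tokenize(text, tokens.data(), n);               // pass 2: fill buffer
    return tokens;
}
```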
I/O Contract
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | prompt | std::string | The formatted prompt text (incremental delta from template application) |
| Input | vocab | const llama_vocab * | Vocabulary for tokenization and EOG detection |
| Input | ctx | llama_context * | Inference context with KV cache state |
| Input | smpl | llama_sampler * | Configured sampler chain |
| Output | response | std::string | The generated response text |
| Side Effect | stdout | (stream) | Tokens are streamed to stdout as they are generated |
Preconditions:
- Model, context, and sampler must be initialized
- The prompt must be a valid formatted string from llama_chat_apply_template
- Sufficient context-window space must be available for at least the prompt tokens

Postconditions:
- The KV cache contains all prompt and generated token positions
- The response string contains the full assistant output (without the EOG token)
- The sampler state has been updated (sampled tokens accepted)

Termination conditions:
- The model produces an EOG token (llama_vocab_is_eog returns true)
- The context window is exhausted (n_ctx_used + batch.n_tokens > n_ctx)
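The overflow condition above is plain arithmetic on KV cache occupancy: llama_memory_seq_pos_max returns the highest occupied position for a sequence (-1 when empty), so adding 1 gives the number of used slots. A small self-contained sketch of that check; `would_exceed_ctx` is a hypothetical helper for illustration, not a llama.cpp function.

```cpp
#include <cassert>
#include <cstdint>

// pos_max mirrors llama_memory_seq_pos_max(...) for sequence 0 (-1 when the
// cache is empty), so pos_max + 1 is the number of occupied KV cache
// positions. The next decode needs batch_n_tokens more; it must not push the
// total past the context size n_ctx.
static bool would_exceed_ctx(int32_t pos_max, int32_t batch_n_tokens, int32_t n_ctx) {
    const int32_t n_ctx_used = pos_max + 1;
    return n_ctx_used + batch_n_tokens > n_ctx;
}
```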
Usage Examples
Complete generation loop (from simple-chat):

```cpp
auto generate = [&](const std::string & prompt) {
    std::string response;

    // Detect if this is the first prompt (for BOS token handling)
    const bool is_first = llama_memory_seq_pos_max(llama_get_memory(ctx), 0) == -1;

    // Tokenize the prompt: the first call with a NULL buffer returns the
    // negative of the required token count, the second call fills the buffer
    const int n_prompt_tokens = -llama_tokenize(vocab, prompt.c_str(), prompt.size(), NULL, 0, is_first, true);
    std::vector<llama_token> prompt_tokens(n_prompt_tokens);
    if (llama_tokenize(vocab, prompt.c_str(), prompt.size(), prompt_tokens.data(),
                       prompt_tokens.size(), is_first, true) < 0) {
        GGML_ABORT("failed to tokenize the prompt\n");
    }

    // Prepare a batch for the prompt
    llama_batch batch = llama_batch_get_one(prompt_tokens.data(), prompt_tokens.size());
    llama_token new_token_id;

    while (true) {
        // Check context space before decoding
        int n_ctx      = llama_n_ctx(ctx);
        int n_ctx_used = llama_memory_seq_pos_max(llama_get_memory(ctx), 0) + 1;
        if (n_ctx_used + batch.n_tokens > n_ctx) {
            fprintf(stderr, "context size exceeded\n");
            exit(0);
        }

        // Decode the batch
        int ret = llama_decode(ctx, batch);
        if (ret != 0) {
            GGML_ABORT("failed to decode, ret = %d\n", ret);
        }

        // Sample the next token from the last position's logits
        new_token_id = llama_sampler_sample(smpl, ctx, -1);

        // Check for end of generation
        if (llama_vocab_is_eog(vocab, new_token_id)) {
            break;
        }

        // Convert to text and stream output
        char buf[256];
        int n = llama_token_to_piece(vocab, new_token_id, buf, sizeof(buf), 0, true);
        if (n < 0) {
            GGML_ABORT("failed to convert token to piece\n");
        }
        std::string piece(buf, n);
        printf("%s", piece.c_str());
        fflush(stdout);
        response += piece;

        // Feed the sampled token back as the next single-token batch
        batch = llama_batch_get_one(&new_token_id, 1);
    }

    return response;
};
```