
Implementation:Ggml org Llama cpp Chat Generation Loop

From Leeroopedia
Implementation Name: Chat Generation Loop
Doc Type: Pattern Doc
Category: Inference
Workflow: Interactive_Chat
Applies To: llama.cpp
Status: Active

Overview

Description

This pattern documents the decode / sample / EOG-check / output loop used for generating chat responses in llama.cpp. The loop first processes the entire formatted prompt as a single batch, then iteratively decodes one token at a time: it samples the next token from the logits, checks for end-of-generation, converts the token to text for streaming output, and feeds the sampled token back in as the next single-token batch. Before each decode call it also checks for context-window overflow.

Usage

This pattern is invoked once per assistant turn in the chat conversation. It receives a formatted prompt string (the incremental output from chat template application), tokenizes it, and generates a complete response. The response is returned as a string and subsequently added to the conversation history.

Code Reference

Source Location: examples/simple-chat/simple-chat.cpp:98-152
Key Functions: llama_tokenize(), llama_batch_get_one(), llama_decode(), llama_sampler_sample(), llama_vocab_is_eog(), llama_token_to_piece(), llama_memory_seq_pos_max(), llama_n_ctx()
Import: #include "llama.h"

Key function signatures used in the loop:

// Tokenize a text string
int32_t llama_tokenize(const struct llama_vocab * vocab, const char * text, int32_t text_len,
                       llama_token * tokens, int32_t n_tokens_max, bool add_special, bool parse_special);

// Create a single-sequence batch from token array
struct llama_batch llama_batch_get_one(llama_token * tokens, int32_t n_tokens);

// Process a batch through the model
int32_t llama_decode(struct llama_context * ctx, struct llama_batch batch);

// Sample a token from logits using the sampler chain
llama_token llama_sampler_sample(struct llama_sampler * smpl, struct llama_context * ctx, int32_t idx);

// Check if a token is an end-of-generation marker
bool llama_vocab_is_eog(const struct llama_vocab * vocab, llama_token token);

// Convert a token to its text representation
int32_t llama_token_to_piece(const struct llama_vocab * vocab, llama_token token,
                             char * buf, int32_t length, int32_t lstrip, bool special);

I/O Contract

  • Input prompt (std::string): The formatted prompt text (incremental delta from template application)
  • Input vocab (const llama_vocab *): Vocabulary for tokenization and EOG detection
  • Input ctx (llama_context *): Inference context with KV cache state
  • Input smpl (llama_sampler *): Configured sampler chain
  • Output response (std::string): The generated response text
  • Side effect stdout: Tokens are streamed to stdout as they are generated

Preconditions:

  • Model, context, and sampler must be initialized
  • The prompt must be a valid formatted string from llama_chat_apply_template
  • Sufficient context window space must be available for at least the prompt tokens

Postconditions:

  • The KV cache contains all prompt and generated token positions
  • The response string contains the full assistant output (the EOG token is not appended)
  • The sampler state has been updated (tokens accepted)

Termination conditions:

  • The model produces an EOG token (llama_vocab_is_eog returns true)
  • The context window is exhausted (n_ctx_used + batch.n_tokens > n_ctx)

Usage Examples

Complete generation loop (from simple-chat):

auto generate = [&](const std::string & prompt) {
    std::string response;

    // Detect if this is the first prompt (for BOS token handling)
    const bool is_first = llama_memory_seq_pos_max(llama_get_memory(ctx), 0) == -1;

    // Tokenize the prompt
    const int n_prompt_tokens = -llama_tokenize(vocab, prompt.c_str(), prompt.size(), NULL, 0, is_first, true);
    std::vector<llama_token> prompt_tokens(n_prompt_tokens);
    if (llama_tokenize(vocab, prompt.c_str(), prompt.size(), prompt_tokens.data(),
                       prompt_tokens.size(), is_first, true) < 0) {
        GGML_ABORT("failed to tokenize the prompt\n");
    }

    // Prepare a batch for the prompt
    llama_batch batch = llama_batch_get_one(prompt_tokens.data(), prompt_tokens.size());
    llama_token new_token_id;

    while (true) {
        // Check context space
        int n_ctx = llama_n_ctx(ctx);
        int n_ctx_used = llama_memory_seq_pos_max(llama_get_memory(ctx), 0) + 1;
        if (n_ctx_used + batch.n_tokens > n_ctx) {
            fprintf(stderr, "context size exceeded\n");
            exit(0);
        }

        // Decode
        int ret = llama_decode(ctx, batch);
        if (ret != 0) {
            GGML_ABORT("failed to decode, ret = %d\n", ret);
        }

        // Sample the next token
        new_token_id = llama_sampler_sample(smpl, ctx, -1);

        // Check for end of generation
        if (llama_vocab_is_eog(vocab, new_token_id)) {
            break;
        }

        // Convert to text and stream output
        char buf[256];
        int n = llama_token_to_piece(vocab, new_token_id, buf, sizeof(buf), 0, true);
        if (n < 0) {
            GGML_ABORT("failed to convert token to piece\n");
        }
        std::string piece(buf, n);
        printf("%s", piece.c_str());
        fflush(stdout);
        response += piece;

        // Prepare next single-token batch
        batch = llama_batch_get_one(&new_token_id, 1);
    }

    return response;
};
