Implementation: ggml-org/llama.cpp Chat Generation Loop
| Aspect | Detail |
|---|---|
| Implementation Name | Chat Generation Loop |
| Doc Type | Pattern Doc |
| Category | Inference |
| Workflow | Interactive_Chat |
| Applies To | llama.cpp |
| Status | Active |
Overview
Description
This pattern documents the generation loop used for producing chat responses in llama.cpp: decode, sample, check for end-of-generation (EOG), emit. The loop first processes the entire formatted prompt as a single batch, then iteratively decodes one token at a time, samples from the resulting logits, checks for an end-of-generation token, converts the sampled token to text for streaming output, and feeds it back as the next single-token batch. Before each decode call, the loop also checks for context-window overflow.
Usage
This pattern is invoked once per assistant turn in the chat conversation. It receives a formatted prompt string (the incremental output from chat template application), tokenizes it, and generates a complete response. The response is returned as a string and subsequently added to the conversation history.
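The surrounding per-turn driver is outside this pattern, but the "incremental delta" input it supplies is easy to get wrong: each turn re-renders the full history and passes only the suffix that was not rendered before, since earlier tokens already sit in the KV cache. Below is a minimal, self-contained sketch of that contract; `render` is a hypothetical stand-in for `llama_chat_apply_template`, and `msg`/`next_prompt` are illustrative names, not llama.cpp API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for llama_chat_apply_template: renders the whole
// conversation into one formatted string (a real template adds role markers).
struct msg { std::string role, content; };

static std::string render(const std::vector<msg> & msgs) {
    std::string out;
    for (const auto & m : msgs) {
        out += "<" + m.role + ">" + m.content + "\n";
    }
    return out;
}

// Per-turn driver sketch: only the delta since the previous render is passed
// to the generation loop, because earlier tokens are already in the KV cache.
static std::string next_prompt(std::vector<msg> & history, size_t & prev_len,
                               const std::string & user_text) {
    history.push_back({"user", user_text});
    const std::string formatted = render(history);
    const std::string prompt = formatted.substr(prev_len);  // incremental delta
    prev_len = formatted.size();
    return prompt;
}
```

After the generation loop returns, the driver appends the assistant response to `history` and updates `prev_len` so the next turn again sends only the new suffix.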
Code Reference
| Attribute | Value |
|---|---|
| Source Location | examples/simple-chat/simple-chat.cpp:98-152 |
| Key Functions | llama_tokenize(), llama_batch_get_one(), llama_decode(), llama_sampler_sample(), llama_vocab_is_eog(), llama_token_to_piece(), llama_memory_seq_pos_max(), llama_n_ctx() |
| Import | #include "llama.h" |
Key function signatures used in the loop:

```c
// Tokenize a text string; returns the number of tokens on success, or the
// negative of the required token count if n_tokens_max is too small
int32_t llama_tokenize(const struct llama_vocab * vocab, const char * text, int32_t text_len,
                       llama_token * tokens, int32_t n_tokens_max, bool add_special, bool parse_special);

// Create a single-sequence batch from a token array
struct llama_batch llama_batch_get_one(llama_token * tokens, int32_t n_tokens);

// Process a batch through the model
int32_t llama_decode(struct llama_context * ctx, struct llama_batch batch);

// Sample a token from the logits using the sampler chain
llama_token llama_sampler_sample(struct llama_sampler * smpl, struct llama_context * ctx, int32_t idx);

// Check if a token is an end-of-generation marker
bool llama_vocab_is_eog(const struct llama_vocab * vocab, llama_token token);

// Convert a token to its text representation
int32_t llama_token_to_piece(const struct llama_vocab * vocab, llama_token token,
                             char * buf, int32_t length, int32_t lstrip, bool special);
```
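Note the sign convention on llama_tokenize: when the output buffer is too small, it returns the negative of the required token count, which the generation loop below exploits for two-pass tokenization (size first, then fill). The sketch here demonstrates that convention with a toy stand-in; `mock_tokenize` and `tokenize_two_pass` are illustrative names, not the real API.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy stand-in mimicking llama_tokenize's sizing convention: if the buffer is
// too small, return the negative of the required count; otherwise write the
// tokens and return how many were written. (One "token" per word here.)
static int32_t mock_tokenize(const std::string & text, int32_t * tokens, int32_t n_tokens_max) {
    std::vector<int32_t> out;
    size_t pos = 0;
    while (pos < text.size()) {
        while (pos < text.size() && text[pos] == ' ') pos++;
        size_t end = pos;
        while (end < text.size() && text[end] != ' ') end++;
        if (end > pos) out.push_back((int32_t)(end - pos));  // token id = word length
        pos = end;
    }
    if ((int32_t)out.size() > n_tokens_max) {
        return -(int32_t)out.size();  // buffer too small: report required size
    }
    for (size_t i = 0; i < out.size(); i++) tokens[i] = out[i];
    return (int32_t)out.size();
}

// Two-pass pattern used by the generation loop: first call sizes, second fills.
static std::vector<int32_t> tokenize_two_pass(const std::string & text) {
    const int32_t n = -mock_tokenize(text, nullptr, 0);  // pass 1: count only
    std::vector<int32_t> tokens(n);
    mock_tokenize(text, tokens.data(), n);               // pass 2: fill buffer
    return tokens;
}
```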
I/O Contract
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | prompt | std::string | The formatted prompt text (incremental delta from template application) |
| Input | vocab | const llama_vocab * | Vocabulary for tokenization and EOG detection |
| Input | ctx | llama_context * | Inference context with KV cache state |
| Input | smpl | llama_sampler * | Configured sampler chain |
| Output | response | std::string | The generated response text |
| Side Effect | stdout | (stream) | Tokens are streamed to stdout as they are generated |
Preconditions:
- Model, context, and sampler must be initialized
- The prompt must be a valid formatted string from llama_chat_apply_template
- Sufficient context-window space must be available for at least the prompt tokens

Postconditions:
- The KV cache contains all prompt and generated token positions
- The response string contains the full assistant output (without the EOG token)
- The sampler state has been updated (sampled tokens accepted)

Termination conditions:
- The model produces an EOG token (llama_vocab_is_eog returns true)
- The context window is exhausted (n_ctx_used + batch.n_tokens > n_ctx)
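The overflow condition above is plain arithmetic on KV cache occupancy: llama_memory_seq_pos_max returns the highest occupied position for a sequence (-1 when empty), so adding 1 gives the number of used slots. A small self-contained sketch of that check; `would_exceed_ctx` is a hypothetical helper for illustration, not a llama.cpp function.

```cpp
#include <cassert>
#include <cstdint>

// pos_max mirrors llama_memory_seq_pos_max(...) for sequence 0 (-1 when the
// cache is empty), so pos_max + 1 is the number of occupied KV cache
// positions. The next decode needs batch_n_tokens more; it must not push the
// total past the context size n_ctx.
static bool would_exceed_ctx(int32_t pos_max, int32_t batch_n_tokens, int32_t n_ctx) {
    const int32_t n_ctx_used = pos_max + 1;
    return n_ctx_used + batch_n_tokens > n_ctx;
}
```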
Usage Examples
Complete generation loop (from simple-chat):

```cpp
auto generate = [&](const std::string & prompt) {
    std::string response;

    // Detect if this is the first prompt (for BOS token handling)
    const bool is_first = llama_memory_seq_pos_max(llama_get_memory(ctx), 0) == -1;

    // Tokenize the prompt: the first call with a NULL buffer returns the
    // negative of the required token count, the second call fills the buffer
    const int n_prompt_tokens = -llama_tokenize(vocab, prompt.c_str(), prompt.size(), NULL, 0, is_first, true);
    std::vector<llama_token> prompt_tokens(n_prompt_tokens);
    if (llama_tokenize(vocab, prompt.c_str(), prompt.size(), prompt_tokens.data(),
                       prompt_tokens.size(), is_first, true) < 0) {
        GGML_ABORT("failed to tokenize the prompt\n");
    }

    // Prepare a batch for the prompt
    llama_batch batch = llama_batch_get_one(prompt_tokens.data(), prompt_tokens.size());
    llama_token new_token_id;

    while (true) {
        // Check context space before decoding
        int n_ctx      = llama_n_ctx(ctx);
        int n_ctx_used = llama_memory_seq_pos_max(llama_get_memory(ctx), 0) + 1;
        if (n_ctx_used + batch.n_tokens > n_ctx) {
            fprintf(stderr, "context size exceeded\n");
            exit(0);
        }

        // Decode the batch
        int ret = llama_decode(ctx, batch);
        if (ret != 0) {
            GGML_ABORT("failed to decode, ret = %d\n", ret);
        }

        // Sample the next token from the last position's logits
        new_token_id = llama_sampler_sample(smpl, ctx, -1);

        // Check for end of generation
        if (llama_vocab_is_eog(vocab, new_token_id)) {
            break;
        }

        // Convert to text and stream output
        char buf[256];
        int n = llama_token_to_piece(vocab, new_token_id, buf, sizeof(buf), 0, true);
        if (n < 0) {
            GGML_ABORT("failed to convert token to piece\n");
        }
        std::string piece(buf, n);
        printf("%s", piece.c_str());
        fflush(stdout);
        response += piece;

        // Feed the sampled token back as the next single-token batch
        batch = llama_batch_get_one(&new_token_id, 1);
    }

    return response;
};
```