Implementation:Ggml org Llama cpp Batched Example
| Knowledge Sources | |
|---|---|
| Domains | Batch_Processing, Example |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Demonstrates batched text generation with multiple parallel sequences from a single prompt using the llama.cpp API.
Description
The example loads a model, tokenizes the prompt, and evaluates the prompt once. Every prompt token is tagged with all N sequence IDs, so the N parallel sequences share the prompt's KV cache entries instead of recomputing them. In the main loop, it samples the next token for each active sequence with that sequence's own sampler chain (top-k, top-p, temperature, dist), adds the new tokens to a single batch, and decodes the batch. The loop ends when every sequence has finished, either by producing an end-of-sequence (EOS) token or by reaching the maximum length.
Usage
Use this example as a reference for implementing parallel inference with KV cache sharing. It demonstrates the core batching API pattern used in production serving scenarios: shared prompt evaluation, per-sequence sampling, and efficient multi-sequence generation.
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: examples/batched/batched.cpp
- Lines: 1-261
Signature
static void print_usage(int, char ** argv);
int main(int argc, char ** argv);
Import
#include "arg.h"
#include "common.h"
#include "log.h"
#include "llama.h"
#include "sampling.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| -m | string (CLI) | Yes | Path to the GGUF model file |
| -p | string (CLI) | No | Prompt text (defaults to "Hello my name is") |
| -n | int (CLI) | No | Total length of each sequence in tokens, including the prompt (defaults to 32) |
| -np | int (CLI) | No | Number of parallel sequences to generate; stored in params.n_parallel (defaults to 1) |
| --top-k, --top-p, --temp, --seed | various (CLI) | No | Sampling parameters applied to every sequence's sampler chain |
Outputs
| Name | Type | Description |
|---|---|---|
| stdout | text | Generated text sequences, one per parallel stream |
| stderr | text | Performance metrics (tokens/second, timing) |
Usage Examples
# Generate 4 parallel sequences, each up to 32 tokens long (prompt included)
./llama-batched -m model.gguf -p "Hello my name is" -n 32 -np 4
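// Key API pattern demonstrated (excerpt, not a complete program):
// 1. Load the model and create a context with a KV cache large enough for all sequences
llama_model * model = llama_model_load_from_file(path, model_params);
llama_context * ctx = llama_init_from_model(model, ctx_params);
// 2. Tokenize the prompt and evaluate it once; seq_ids holds every sequence ID (0 .. n_parallel-1)
std::vector<llama_token> tokens = common_tokenize(vocab, prompt, true);
llama_batch batch = llama_batch_init(max_tokens, 0, n_parallel);
for (size_t i = 0; i < tokens.size(); ++i) {
    common_batch_add(batch, tokens[i], i, seq_ids, false);
}
batch.logits[batch.n_tokens - 1] = true; // request logits only for the last prompt token
llama_decode(ctx, batch);
// 3. In the main loop, sample one token per active sequence, then decode them together
common_batch_clear(batch);
for (int32_t i = 0; i < n_parallel; ++i) {
    if (i_batch[i] < 0) continue; // this sequence has already finished
    // `sampler` is the sampler chain that belongs to sequence i
    llama_token new_token = llama_sampler_sample(sampler, ctx, i_batch[i]);
    i_batch[i] = batch.n_tokens; // index of this sequence's logits after the next decode
    common_batch_add(batch, new_token, n_cur, { i }, true);
}
llama_decode(ctx, batch);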