Implementation:Ggml org Llama cpp Batched Example

Knowledge Sources	Ggml_org_Llama_cpp
Domains	Batch_Processing, Example
Last Updated	2026-02-15 00:00 GMT

Overview

Demonstrates batched text generation with multiple parallel sequences from a single prompt using the llama.cpp API.

Description

The example loads a model, tokenizes the prompt, and evaluates the prompt once. It then generates N parallel sequences that share the prompt's KV cache entries. On each iteration of the main loop, it samples the next token for every active sequence with that sequence's own sampler chain (top-k, top-p, temperature, dist), adds the new tokens to a single batch, and decodes that batch. The loop ends when every sequence has finished, either by producing EOS or by reaching the maximum length. Each sequence has its own sampler chain and sequence ID.
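
To make the "one sampler chain per sequence" idea concrete, here is a minimal, model-free sketch of a sampler that applies top-k, top-p, and temperature before drawing a token. `ToySampler` and its members are hypothetical names invented for illustration; this is not the llama.cpp sampler implementation, only a toy with the same stage order and a private RNG per instance.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical toy sampler, NOT the llama.cpp implementation.
// Stages: top-k -> top-p -> temperature -> draw from the distribution.
struct ToySampler {
    int          top_k;
    float        top_p;
    float        temp;
    std::mt19937 rng; // private RNG state, one per sequence

    ToySampler(int k, float p, float t, std::uint32_t seed)
        : top_k(k), top_p(p), temp(t), rng(seed) {
        assert(k > 0 && p > 0.0f && t > 0.0f);
    }

    // Returns the index of the sampled token in `logits`.
    int sample(const std::vector<float> & logits) {
        assert(!logits.empty());
        // candidate token ids, sorted by descending logit
        std::vector<int> ids(logits.size());
        std::iota(ids.begin(), ids.end(), 0);
        std::sort(ids.begin(), ids.end(),
                  [&](int a, int b) { return logits[a] > logits[b]; });

        // top-k: keep the k highest-logit candidates
        ids.resize(std::min<std::size_t>(ids.size(), static_cast<std::size_t>(top_k)));

        // top-p: keep the smallest prefix whose probability mass reaches top_p
        std::vector<double> probs = softmax(logits, ids, 1.0f);
        double cum = 0.0;
        std::size_t keep = ids.size();
        for (std::size_t i = 0; i < ids.size(); ++i) {
            cum += probs[i];
            if (cum >= top_p) { keep = i + 1; break; }
        }
        ids.resize(keep);

        // temperature, then draw from the final distribution
        probs = softmax(logits, ids, temp);
        std::discrete_distribution<int> dist(probs.begin(), probs.end());
        return ids[dist(rng)];
    }

    // Softmax over the selected ids with temperature t (ids sorted by logit).
    static std::vector<double> softmax(const std::vector<float> & logits,
                                       const std::vector<int> & ids, float t) {
        std::vector<double> out(ids.size());
        const double max_logit = logits[ids.front()]; // front holds the largest logit
        double sum = 0.0;
        for (std::size_t i = 0; i < ids.size(); ++i) {
            out[i] = std::exp((logits[ids[i]] - max_logit) / t);
            sum += out[i];
        }
        for (double & p : out) p /= sum;
        return out;
    }
};
```

In this sketch, holding one `ToySampler` per sequence (for example in a `std::vector<ToySampler>`) keeps each sequence's random state independent, so a draw for one sequence does not change the next draw for another.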

Usage

Use this example as a reference for implementing parallel inference with KV cache sharing. It demonstrates the core batching API pattern used in production serving scenarios: shared prompt evaluation, per-sequence sampling, and efficient multi-sequence generation.
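
The benefit of sharing the prompt's KV cache can be estimated with simple arithmetic. The sketch below assumes that the prompt occupies its cache cells once, that each sequence needs its own cells only for the tokens it generates, and that the target length counts the prompt. Both helper functions are hypothetical names introduced here, not part of the llama.cpp API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a llama.cpp function): KV cache cells needed when
// the prompt is stored once and shared by all sequences.
// Assumption: n_total is the target length of each sequence, prompt included.
std::size_t kv_cells_shared(std::size_t n_prompt, std::size_t n_total, std::size_t n_parallel) {
    assert(n_total >= n_prompt && "target length must cover the prompt");
    const std::size_t n_generated = n_total - n_prompt; // new tokens per sequence
    return n_prompt + n_generated * n_parallel;
}

// For comparison: cells needed if every sequence kept a private copy of the prompt.
std::size_t kv_cells_unshared(std::size_t n_total, std::size_t n_parallel) {
    return n_total * n_parallel;
}
```

Under these assumptions, a 5-token prompt, a target length of 32, and 4 sequences need 5 + 27 * 4 = 113 cells when the prompt is shared, compared with 128 when it is not. The saving grows with prompt length and with the number of sequences.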

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: examples/batched/batched.cpp
Lines: 1-261

Signature

static void print_usage(int, char ** argv);
int main(int argc, char ** argv);

Import

#include "arg.h"
#include "common.h"
#include "log.h"
#include "llama.h"
#include "sampling.h"

I/O Contract

Inputs

Name	Type	Required	Description
-m	string (CLI)	Yes	Path to the GGUF model file
-p	string (CLI)	No	Prompt text (defaults to "Hello my name is")
-n	int (CLI)	No	Target length of each sequence in tokens; the example counts the prompt toward this length (defaults to 32)
-np	int (CLI)	No	Number of parallel sequences to generate (stored in params.n_parallel)
--top-k, --top-p, --temp, --seed	various (CLI)	No	Sampling parameters (top_k, top_p, temp, seed)

Outputs

Name	Type	Description
stdout	text	Generated text sequences, one per parallel stream
stderr	text	Performance metrics (tokens/second, timing)

Usage Examples

# Generate 4 parallel sequences of 32 tokens each
./llama-batched -m model.gguf -p "Hello my name is" -n 32 -np 4

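// Key API pattern demonstrated (abridged; error handling omitted):

// 1. Load model and create context with sufficient KV cache
llama_model * model = llama_model_load_from_file(path, model_params);
llama_context * ctx = llama_init_from_model(model, ctx_params);
const llama_vocab * vocab = llama_model_get_vocab(model);

// 2. Tokenize and evaluate prompt once for all sequences
std::vector<llama_token> tokens = common_tokenize(vocab, prompt, true);
llama_batch batch = llama_batch_init(max_tokens, 0, n_parallel);
// seq_ids lists every sequence ID, so each prompt token belongs to all sequences
for (size_t i = 0; i < tokens.size(); ++i) {
    common_batch_add(batch, tokens[i], i, seq_ids, false);
}
batch.logits[batch.n_tokens - 1] = true; // request logits for the last prompt token only
llama_decode(ctx, batch);

// 3. One generation step: sample per sequence, then decode all new tokens together
//    (EOS and max-length checks omitted)
common_batch_clear(batch);
for (int32_t i = 0; i < n_parallel; ++i) {
    if (i_batch[i] < 0) continue; // sequence i has already finished
    llama_token new_token = llama_sampler_sample(samplers[i], ctx, i_batch[i]);
    i_batch[i] = batch.n_tokens; // batch slot that will hold sequence i's logits
    common_batch_add(batch, new_token, n_cur, { i }, true);
}
llama_decode(ctx, batch);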
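
The bookkeeping around step 3 (which batch slot each sequence reads its logits from, and when a sequence is marked finished) can be exercised without a model. The following is a simplified sketch under stated assumptions: `run_batched_loop`, `MockSampler`, `Token`, and `kEos` are hypothetical names, the sampler is a stand-in callback, and the comment marks where a real program would call `llama_decode`.

```cpp
#include <cassert>
#include <functional>
#include <vector>

using Token = int;
constexpr Token kEos = -1; // stand-in end-of-sequence marker, not a real vocabulary id

// Stand-in for a per-sequence sampler: maps (sequence index, position) to a token.
// A real program would sample from the logits of the given batch slot instead.
using MockSampler = std::function<Token(int seq, int pos)>;

// Model-free sketch of the generation loop's bookkeeping.
// Assumption: n_total is the target sequence length, prompt included.
// Returns the tokens generated for each sequence.
std::vector<std::vector<Token>> run_batched_loop(int n_prompt, int n_total, int n_parallel,
                                                 const MockSampler & sample) {
    assert(n_prompt > 0 && n_parallel > 0);
    std::vector<std::vector<Token>> streams(n_parallel);
    // i_batch[i]: batch slot whose logits sequence i samples from; -1 means finished.
    // After the prompt, every sequence reads the logits of the last prompt token.
    std::vector<int> i_batch(n_parallel, n_prompt - 1);

    int n_cur = n_prompt; // position of the next token to generate
    while (n_cur <= n_total) {
        int n_batch_tokens = 0; // plays the role of clearing the batch
        for (int i = 0; i < n_parallel; ++i) {
            if (i_batch[i] < 0) continue; // skip finished sequences
            const Token tok = sample(i, n_cur);
            if (tok == kEos || n_cur == n_total) {
                i_batch[i] = -1; // finished: EOS or length limit reached
                continue;
            }
            streams[i].push_back(tok);
            i_batch[i] = n_batch_tokens++; // remember this sequence's slot in the batch
        }
        if (n_batch_tokens == 0) break; // all sequences finished
        ++n_cur;
        // A real program would decode the batch here, e.g. llama_decode(ctx, batch).
    }
    return streams;
}
```

Because finished sequences stop adding tokens, the batch shrinks as sequences end, and the loop exits as soon as an iteration adds no tokens at all.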

Related Pages

Principle:Ggml_org_Llama_cpp_Batch_Processing
