Implementation:Ggml org Llama cpp Batched Example

From Leeroopedia
Knowledge Sources
Domains Batch_Processing, Example
Last Updated 2026-02-15 00:00 GMT

Overview

Demonstrates batched text generation with multiple parallel sequences from a single prompt using the llama.cpp API.

Description

Loads a model, tokenizes the prompt, and evaluates it once. It then creates N parallel sequences that share the prompt's KV cache entries. In the main loop, it samples the next token for each active sequence with a sampler chain (top-k, top-p, temperature, dist), adds the new tokens to a single batch, and decodes that batch. The loop repeats until every sequence has finished, either by emitting an end-of-sequence (EOS) token or by reaching the maximum length. Each sequence has its own sampler chain and sequence ID.
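The bookkeeping in that main loop can be sketched without llama.cpp at all. The block below is a toy model, not the example's source: `toy_sampler`, `toy_result`, and `generate_batched` are hypothetical names, and the "decode" step is reduced to a counter. It only illustrates how a per-sequence slot index (here called `i_batch`, as in the listing further down) marks finished sequences with -1 while the remaining sequences keep sharing one batch per step.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for a per-sequence sampler: given the sequence index
// and the current position, return the next token id.
using toy_sampler = std::function<int32_t(int32_t seq, int32_t pos)>;

struct toy_result {
    std::vector<std::vector<int32_t>> streams; // generated tokens per sequence
    int32_t n_decode_calls = 0;                // number of batched "decode" steps
};

// Generate tokens for n_parallel sequences that share a prompt of n_prompt
// tokens, until each sequence samples `eos` or reaches total length n_len.
toy_result generate_batched(int32_t n_prompt, int32_t n_len, int32_t n_parallel,
                            int32_t eos, const toy_sampler & sample) {
    assert(n_prompt > 0 && n_parallel > 0);
    toy_result res;
    res.streams.resize(n_parallel);

    // i_batch[i] is the batch slot that holds sequence i's latest logits;
    // -1 marks a finished sequence. After the shared prompt is evaluated,
    // every sequence reads from the last prompt token.
    std::vector<int32_t> i_batch(n_parallel, n_prompt - 1);

    int32_t n_cur = n_prompt;
    while (n_cur < n_len) {
        int32_t n_tokens = 0; // size of the batch assembled for this step
        for (int32_t i = 0; i < n_parallel; ++i) {
            if (i_batch[i] < 0) {
                continue; // this sequence already finished
            }
            const int32_t tok = sample(i, n_cur);
            if (tok == eos) {
                i_batch[i] = -1;
                continue;
            }
            res.streams[i].push_back(tok);
            i_batch[i] = n_tokens++; // slot of this token in the new batch
        }
        if (n_tokens == 0) {
            break; // every sequence has finished
        }
        ++n_cur;
        ++res.n_decode_calls; // one decode call serves all active sequences
    }
    return res;
}
```

Calling `generate_batched(4, 8, 3, -1, sampler)` with a sampler that returns -1 for sequence 1 at position 6 yields four tokens for sequences 0 and 2 and two tokens for sequence 1, using four batched decode steps in total.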

Usage

Use this example as a reference for implementing parallel inference with KV cache sharing. It demonstrates the core batching API pattern used in production serving scenarios: shared prompt evaluation, per-sequence sampling, and efficient multi-sequence generation.
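To make the benefit of KV cache sharing concrete, the sketch below compares how many KV cells are needed with and without a shared prompt. It assumes the prompt is stored once and that each sequence adds only its own generated tokens; `kv_cells_shared` and `kv_cells_unshared` are hypothetical helper names, and `n_len` is assumed to be the total length of a sequence including the prompt.

```cpp
#include <cassert>
#include <cstdint>

// KV cells needed when the prompt is stored once and shared by all sequences.
// n_len is the total sequence length, prompt included (an assumption of this sketch).
int64_t kv_cells_shared(int64_t n_prompt, int64_t n_len, int64_t n_parallel) {
    assert(n_prompt >= 0 && n_len >= n_prompt && n_parallel >= 1);
    return n_prompt + (n_len - n_prompt) * n_parallel;
}

// KV cells needed when every sequence keeps its own copy of the prompt.
int64_t kv_cells_unshared(int64_t n_len, int64_t n_parallel) {
    assert(n_len >= 0 && n_parallel >= 1);
    return n_len * n_parallel;
}
```

For a 5-token prompt, a total length of 32, and 4 parallel sequences, the shared layout needs 5 + 27 * 4 = 113 cells, while unshared copies would need 32 * 4 = 128. The gap widens as the prompt grows relative to the generated text.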

Code Reference

Source Location

Signature

static void print_usage(int, char ** argv);
int main(int argc, char ** argv);

Import

#include "arg.h"
#include "common.h"
#include "log.h"
#include "llama.h"
#include "sampling.h"

I/O Contract

Inputs

Name Type Required Description
-m string (CLI) Yes Path to the GGUF model file
-p string (CLI) No Prompt text (defaults to "Hello my name is")
-n int (CLI) No Number of tokens to predict (defaults to 32)
-np int (CLI) No Number of parallel sequences to generate (defaults to params.n_parallel)
--sampling params various (CLI) No Sampling parameters (top_k, top_p, temp, seed)
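As a rough illustration of what the sampling parameters control, the following sketch applies top-k, top-p, and temperature to a small vector of logits and then draws from the resulting distribution. It is a simplified model written for this page, not llama.cpp's sampler implementation: `candidate`, `softmax`, and `sample_toy` are hypothetical names, and the random draw is replaced by a caller-supplied value `u` in [0, 1) so the behaviour is reproducible.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct candidate {
    int   id;    // token id (index into the logits vector)
    float logit; // unnormalized score
};

// Numerically stable softmax over the candidates' logits.
static std::vector<float> softmax(const std::vector<candidate> & c) {
    float max_l = c[0].logit;
    for (const auto & x : c) max_l = std::max(max_l, x.logit);
    std::vector<float> p(c.size());
    float sum = 0.0f;
    for (size_t i = 0; i < c.size(); ++i) {
        p[i] = std::exp(c[i].logit - max_l);
        sum += p[i];
    }
    for (auto & v : p) v /= sum;
    return p;
}

// Toy chain: top-k -> top-p -> temperature -> draw.
// u in [0, 1) stands in for the random number a real sampler would generate.
int sample_toy(const std::vector<float> & logits, size_t top_k, float top_p, float temp, float u) {
    assert(!logits.empty() && top_k >= 1 && temp > 0.0f);
    std::vector<candidate> c;
    for (size_t i = 0; i < logits.size(); ++i) c.push_back({(int) i, logits[i]});

    // top-k: sort by logit, descending, and keep the k best candidates
    std::stable_sort(c.begin(), c.end(),
                     [](const candidate & a, const candidate & b) { return a.logit > b.logit; });
    if (c.size() > top_k) c.resize(top_k);

    // top-p: keep the smallest prefix whose cumulative probability reaches top_p
    std::vector<float> p = softmax(c);
    float  cum  = 0.0f;
    size_t keep = c.size();
    for (size_t i = 0; i < c.size(); ++i) {
        cum += p[i];
        if (cum >= top_p) { keep = i + 1; break; }
    }
    c.resize(keep);

    // temperature: values below 1 sharpen the distribution, values above 1 flatten it
    for (auto & x : c) x.logit /= temp;

    // draw: walk the cumulative distribution until it exceeds u
    p   = softmax(c);
    cum = 0.0f;
    for (size_t i = 0; i < c.size(); ++i) {
        cum += p[i];
        if (u < cum) return c[i].id;
    }
    return c.back().id; // guard against floating-point rounding
}
```

With logits `{1, 3, 2, 0}`, setting `top_k` to 1 always returns token 1, and a very low temperature has nearly the same effect, whereas `top_k = 2` at temperature 1 can also return token 2 when `u` falls in the upper part of the range.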

Outputs

Name Type Description
stdout text Generated text sequences, one per parallel stream
stderr text Performance metrics (tokens/second, timing)

Usage Examples

# Generate 4 parallel sequences of 32 tokens each
./llama-batched -m model.gguf -p "Hello my name is" -n 32 -np 4

// Key API pattern demonstrated:

// 1. Load model and create context with sufficient KV cache
llama_model * model = llama_model_load_from_file(path, model_params);
llama_context * ctx = llama_init_from_model(model, ctx_params);

// 2. Tokenize and evaluate prompt once for all sequences
std::vector<llama_token> tokens = common_tokenize(vocab, prompt, true);
llama_batch batch = llama_batch_init(max_tokens, 0, n_parallel);
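for (size_t i = 0; i < tokens.size(); ++i) {
    common_batch_add(batch, tokens[i], i, seq_ids, false);
}
// Request logits only for the last prompt token; sampling reads them
batch.logits[batch.n_tokens - 1] = true;
if (llama_decode(ctx, batch) != 0) {
    return 1; // prompt evaluation failed
}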

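// 3. Sample per-sequence in a loop (one iteration of the main loop shown)
common_batch_clear(batch);
for (int32_t i = 0; i < n_parallel; ++i) {
    llama_token new_token = llama_sampler_sample(sampler, ctx, i_batch[i]);
    i_batch[i] = batch.n_tokens; // slot of this sequence's token in the new batch
    common_batch_add(batch, new_token, n_cur, { i }, true);
}
llama_decode(ctx, batch);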
