Implementation:Ggml org Llama cpp Parallel Decoding
| Knowledge Sources | |
|---|---|
| Domains | Parallel_Inference |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Simulates a server that handles multiple concurrent client requests by batching them through llama.cpp's sequence management API.
Description
Defines a `client` struct that tracks per-client state (prompt, sampled tokens, sequence ID, timing). The example creates multiple parallel sequences that share a single context, with a shared system prompt that is optionally pre-computed once. A built-in set of trivia questions serves as the client prompts. Each decode step processes all active clients in a single batch, manages KV cache allocation per sequence, and records timing statistics, including prompt processing and generation speed.
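The single-batch-per-step idea can be sketched without linking against llama.cpp. The sketch below is only an illustration: `batch_entry`, `sim_client`, `n_past`, and `build_batch` are hypothetical names, not llama.cpp types or functions, and a real implementation would fill a llama.cpp batch and call the decode API instead of returning a vector.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not llama.cpp types.
struct batch_entry {
    int32_t token;   // token to feed to the model
    int32_t pos;     // position of the token within its own sequence
    int32_t seq_id;  // sequence (client) that owns the token
};

struct sim_client {
    int32_t seq_id  = -1; // assumption: -1 marks an idle client
    int32_t sampled = 0;  // last sampled token
    int32_t n_past  = 0;  // tokens already stored for this sequence
    int32_t i_batch = -1; // index of this client's entry in the current batch
};

// Collect one token from every active client into a single batch and record
// where each client's output will be found after the (simulated) decode step.
inline std::vector<batch_entry> build_batch(std::vector<sim_client> & clients) {
    std::vector<batch_entry> batch;
    for (auto & c : clients) {
        if (c.seq_id == -1) {
            c.i_batch = -1; // idle clients contribute nothing to this step
            continue;
        }
        assert(c.n_past >= 0);
        c.i_batch = static_cast<int32_t>(batch.size());
        batch.push_back({c.sampled, c.n_past, c.seq_id});
        c.n_past += 1; // the token occupies the next position of its sequence
    }
    return batch;
}
```

Because every entry carries its own sequence ID and position, tokens from unrelated clients can share one batch while each sequence keeps a separate history.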
Usage
Use this example as a reference for building production serving systems that need to handle multiple concurrent inference requests efficiently using llama.cpp's batch API and sequence management capabilities.
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: examples/parallel/parallel.cpp
- Lines: 1-517
Signature
struct client {
    int32_t id = 0;
    llama_seq_id seq_id = -1;
    llama_token sampled;
    int64_t t_start_prompt;
    int64_t t_start_gen;
    int64_t t_prompt = 0;
    int64_t t_gen = 0;
    int32_t n_prompt = 0;
    int32_t n_decoded = 0;
    int32_t i_batch = -1;
    std::string input;
    std::string prompt;
    std::string response;
    std::vector<llama_token> tokens_prev;
};
int main(int argc, char ** argv);
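As a rough illustration of how the `seq_id = -1` default and the `n_decoded` counter could drive a client's lifecycle, the sketch below assigns requests to idle slots and releases them once a token budget is reached. It assumes that `-1` marks an idle slot; `slot`, `assign_request`, and `finish_if_done` are hypothetical names introduced here, and a real server would also clear the sequence's KV cache entries on release.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, trimmed-down stand-in for the per-client state shown above.
struct slot {
    int32_t     seq_id    = -1; // assumption: -1 means the slot is idle
    int32_t     n_decoded = 0;  // tokens generated so far for this request
    std::string prompt;
    std::string response;
};

// Give a new request to the first idle slot.
// Returns the slot index, or -1 when every slot is busy.
inline int assign_request(std::vector<slot> & slots, int32_t seq_id, const std::string & prompt) {
    assert(seq_id >= 0);
    for (std::size_t i = 0; i < slots.size(); ++i) {
        if (slots[i].seq_id == -1) {
            slots[i] = slot{};        // reset counters and text from the previous request
            slots[i].seq_id = seq_id;
            slots[i].prompt = prompt;
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Release a slot once its request has produced n_predict tokens.
// Returns true when the slot was released by this call.
inline bool finish_if_done(slot & s, int32_t n_predict) {
    if (s.seq_id != -1 && s.n_decoded >= n_predict) {
        s.seq_id = -1; // the slot can now accept the next request
        return true;
    }
    return false;
}
```

Reusing a fixed pool of slots keeps the number of live sequences bounded by the parallelism setting, regardless of how many requests arrive in total.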
Import
#include "arg.h"
#include "common.h"
#include "sampling.h"
#include "log.h"
#include "llama.h"
#include <cmath>
#include <cstdio>
#include <string>
#include <vector>
#include <ctime>
#include <algorithm>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| params.model | std::string | Yes | Path to the GGUF model file |
| params.n_parallel | int | No | Number of parallel client sequences to simulate |
| params.prompt | std::string | No | Shared system prompt (prepended to all client prompts) |
| params.n_predict | int | No | Maximum tokens to generate per client request |
Outputs
| Name | Type | Description |
|---|---|---|
| client_responses | stdout | Generated responses for each client printed to standard output |
| statistics | stdout | Aggregate timing statistics: prompt processing speed, generation speed, total tokens |
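The speed figures in the statistics output are token counts divided by elapsed time. A minimal sketch of that arithmetic follows, assuming timestamps are taken in microseconds; `tokens_per_second` is a hypothetical helper, not a function from the example.

```cpp
#include <cassert>
#include <cstdint>

// Convert a token count and a pair of microsecond timestamps into tokens per second.
// Returns 0.0 when no time has elapsed, to avoid dividing by zero.
inline double tokens_per_second(int64_t n_tokens, int64_t t_start_us, int64_t t_end_us) {
    assert(t_end_us >= t_start_us);
    const int64_t elapsed_us = t_end_us - t_start_us;
    if (elapsed_us == 0) {
        return 0.0;
    }
    const double elapsed_s = static_cast<double>(elapsed_us) / 1e6;
    return static_cast<double>(n_tokens) / elapsed_s;
}
```

For instance, 256 tokens generated over 2,000,000 microseconds gives 128 tokens per second. The same helper applies to prompt processing and to generation, given the matching token counts and time spans.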
Usage Examples
# Run parallel decoding with 8 concurrent clients
./llama-parallel -m model.gguf -n 128 -np 8
# With a custom system prompt
./llama-parallel -m model.gguf -n 128 -np 4 -p "You are a helpful assistant."
# The built-in trivia questions are used as client prompts
# Each client processes independently with its own KV cache sequence