Implementation:Ggml org Llama cpp Parallel Decoding

Knowledge Sources	Ggml_org_Llama_cpp
Domains	Parallel_Inference
Last Updated	2026-02-15 00:00 GMT

Overview

Simulates a server handling multiple concurrent client requests in parallel using batch processing with llama.cpp's sequence management API.

Description

The example defines a `client` struct that tracks per-client state: prompt, sampled tokens, sequence ID, and timing. It creates multiple parallel sequences that share a single context, and the shared system prompt can optionally be pre-computed once rather than once per client. A built-in set of trivia questions serves as the client prompts. Each decode step processes all active clients in a single batch while managing KV cache allocation per sequence, and the program tracks timing statistics, including prompt processing and generation speed.
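
The batching idea described above can be sketched without any llama.cpp calls. The following is a minimal simulation, not the code from parallel.cpp: `sim_client`, `batch_entry`, `build_batch`, and `run_all` are hypothetical names, and the "token" placed in the batch is a dummy value. It only shows how one batch per step can carry one entry per active client, and how `i_batch` links a client back to its entry.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in that mirrors a few fields of the `client` struct.
struct sim_client {
    int32_t id        = 0;
    int32_t seq_id    = -1;  // -1 means the slot is idle
    int32_t n_decoded = 0;   // tokens produced so far
    int32_t i_batch   = -1;  // index of this client's entry in the current batch
};

// One entry of the shared batch: a token tagged with its sequence ID.
struct batch_entry {
    int32_t token;
    int32_t seq_id;
};

// Collect one pending entry per active client and record where it landed.
std::vector<batch_entry> build_batch(std::vector<sim_client> & clients) {
    std::vector<batch_entry> batch;
    for (auto & c : clients) {
        if (c.seq_id == -1) {
            c.i_batch = -1;  // idle clients contribute nothing
            continue;
        }
        c.i_batch = static_cast<int32_t>(batch.size());
        batch.push_back({ /*token=*/ c.n_decoded, /*seq_id=*/ c.seq_id });
    }
    return batch;
}

// Run steps until every client has produced n_predict tokens.
// Returns the number of batches that were processed.
int32_t run_all(std::vector<sim_client> & clients, int32_t n_predict) {
    int32_t n_steps = 0;
    while (true) {
        const std::vector<batch_entry> batch = build_batch(clients);
        if (batch.empty()) {
            break;  // no active clients remain
        }
        ++n_steps;  // a real program would decode the batch at this point
        for (auto & c : clients) {
            if (c.i_batch < 0) {
                continue;
            }
            // i_batch must point at this client's own entry
            assert(batch[c.i_batch].seq_id == c.seq_id);
            if (++c.n_decoded >= n_predict) {
                c.seq_id = -1;  // release the slot for a new request
            }
        }
    }
    return n_steps;
}
```

With three clients where one already has two tokens and `n_predict` is 4, the loop needs four steps in total, because the batch shrinks as clients finish instead of each client being stepped separately.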

Usage

Use this example as a reference for building production serving systems that need to handle multiple concurrent inference requests efficiently using llama.cpp's batch API and sequence management capabilities.

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: examples/parallel/parallel.cpp
Lines: 1-517

Signature

struct client {
    int32_t id = 0;
    llama_seq_id seq_id = -1;
    llama_token sampled;
    int64_t t_start_prompt;
    int64_t t_start_gen;
    int64_t t_prompt = 0;
    int64_t t_gen = 0;
    int32_t n_prompt = 0;
    int32_t n_decoded = 0;
    int32_t i_batch = -1;
    std::string input;
    std::string prompt;
    std::string response;
    std::vector<llama_token> tokens_prev;
};

int main(int argc, char ** argv);

Import

#include "arg.h"
#include "common.h"
#include "sampling.h"
#include "log.h"
#include "llama.h"
#include <cmath>
#include <cstdio>
#include <string>
#include <vector>
#include <ctime>
#include <algorithm>

I/O Contract

Inputs

Name	Type	Required	Description
params.model	std::string	Yes	Path to the GGUF model file
params.n_parallel	int	No	Number of parallel client sequences to simulate
params.prompt	std::string	No	Shared system prompt (prepended to all client prompts)
params.n_predict	int	No	Maximum tokens to generate per client request
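
To illustrate why a shared system prompt only needs to be computed once, here is a toy model of per-sequence cache ownership. It is an assumption-laden sketch: `toy_kv_cache` and its methods are hypothetical and only track which positions each sequence can see, whereas a real KV cache stores tensors. It assumes the prefix is evaluated in one sequence and then shared with each client sequence.

```cpp
#include <cstdint>
#include <map>
#include <set>

// Toy model only: maps each sequence ID to the token positions it holds.
struct toy_kv_cache {
    std::map<int32_t, std::set<int32_t>> cells;

    // Record that `seq_id` holds a cached entry at position `pos`.
    void add(int32_t seq_id, int32_t pos) { cells[seq_id].insert(pos); }

    // Share every position of `src` with `dst`, so the prefix is not recomputed.
    void seq_cp(int32_t src, int32_t dst) {
        const auto it = cells.find(src);
        if (it == cells.end()) {
            return;
        }
        // copy first so that src == dst stays well-defined
        const std::set<int32_t> positions = it->second;
        cells[dst].insert(positions.begin(), positions.end());
    }

    // Drop positions >= p0 from a sequence, keeping the shared prefix.
    void seq_rm_from(int32_t seq_id, int32_t p0) {
        const auto it = cells.find(seq_id);
        if (it == cells.end()) {
            return;
        }
        it->second.erase(it->second.lower_bound(p0), it->second.end());
    }

    // Number of positions currently visible to `seq_id`.
    int32_t size(int32_t seq_id) const {
        const auto it = cells.find(seq_id);
        return it == cells.end() ? 0 : static_cast<int32_t>(it->second.size());
    }
};
```

In this model, a five-token system prompt added to sequence 0 and copied to sequences 1 and 2 lets each client start at position 5. When a client finishes, removing positions from 5 onward resets its sequence to the shared prefix.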

Outputs

Name	Type	Description
client_responses	stdout	Generated responses for each client printed to standard output
statistics	stdout	Aggregate timing statistics: prompt processing speed, generation speed, total tokens
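
The speed figures in the statistics output reduce to tokens divided by elapsed time. The helper below is a small sketch of that arithmetic, assuming timestamps are kept in microseconds as the `int64_t` timing fields of `client` suggest; `tokens_per_second` is a hypothetical name, not a function from the example.

```cpp
#include <cstdint>

// Convert a token count and an elapsed time in microseconds to tokens/second.
// Returns 0 when no time has elapsed, to avoid dividing by zero.
double tokens_per_second(int32_t n_tokens, int64_t t_us) {
    if (t_us <= 0) {
        return 0.0;
    }
    const double seconds = static_cast<double>(t_us) / 1e6;
    return static_cast<double>(n_tokens) / seconds;
}
```

For instance, 100 tokens over 2,000,000 microseconds gives 50 tokens per second. The same helper applies to prompt processing and to generation, using the respective token counts and durations.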

Usage Examples

# Run parallel decoding with 8 concurrent clients
./llama-parallel -m model.gguf -n 128 -np 8

# With a custom system prompt
./llama-parallel -m model.gguf -n 128 -np 4 -p "You are a helpful assistant."

# The built-in trivia questions are used as client prompts.
# Each client is decoded independently under its own sequence ID in the shared KV cache.

Related Pages

Principle:Ggml_org_Llama_cpp_Batch_Processing
