Implementation:Ggml org Llama cpp Parallel Decoding

From Leeroopedia
Revision as of 12:41, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Ggml_org_Llama_cpp_Parallel_Decoding.md)
Knowledge Sources
Domains Parallel_Inference
Last Updated 2026-02-15 00:00 GMT

Overview

This example simulates a server that handles multiple concurrent client requests by decoding them together in shared batches, using llama.cpp's sequence management API.

Description

The example defines a `client` struct that tracks per-client state: prompt, sampled tokens, sequence ID, and timing. All clients run as parallel sequences that share a single context, and the shared system prompt can optionally be computed once rather than per client. A built-in set of trivia questions supplies the client prompts. On each decode step, all active clients are processed in a single batch, with KV cache allocation managed per sequence. The example also tracks timing statistics, including prompt processing speed and generation speed.
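The single-batch-per-step idea can be illustrated without llama.cpp at all. The following is a minimal sketch, assuming simplified stand-in types: `batch_entry`, `sim_client`, `build_batch`, and the `n_past` counter are hypothetical names chosen for illustration, not the library's API or the example's actual code. It shows how each active client contributes one token to a shared batch and remembers its index (`i_batch`) so the matching output can be found after decoding.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for illustration; these are not the llama.cpp types.
using token_t  = int32_t;
using seq_id_t = int32_t;

struct batch_entry {
    token_t  token;   // token to decode
    int32_t  pos;     // position of the token within its own sequence
    seq_id_t seq_id;  // sequence (client) the token belongs to
    bool     logits;  // whether output is requested for this entry
};

struct sim_client {
    seq_id_t seq_id  = -1;  // -1 marks an idle slot
    token_t  sampled = 0;   // most recently sampled token
    int32_t  n_past  = 0;   // tokens already stored for this sequence (assumed counter)
    int32_t  i_batch = -1;  // index of this client's entry in the current batch
};

// Collect one pending token from every active client into a single batch.
std::vector<batch_entry> build_batch(std::vector<sim_client> & clients) {
    std::vector<batch_entry> batch;
    for (auto & c : clients) {
        if (c.seq_id == -1) {
            c.i_batch = -1;  // idle clients contribute nothing this step
            continue;
        }
        assert(c.n_past >= 0);
        c.i_batch = static_cast<int32_t>(batch.size());
        batch.push_back({c.sampled, c.n_past, c.seq_id, true});
        c.n_past += 1;  // the token occupies the next position of its sequence
    }
    return batch;
}
```

Because every entry carries its own sequence ID and position, tokens from unrelated clients can sit side by side in one batch while each sequence keeps its own history.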

Usage

Use this example as a reference for building production serving systems that need to handle multiple concurrent inference requests efficiently using llama.cpp's batch API and sequence management capabilities.
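For a serving system, one piece that typically sits around the batch loop is admitting waiting requests into free slots. The sketch below is a standalone illustration under stated assumptions: `slot`, `admit_requests`, and the request queue are hypothetical names, and the only convention borrowed from the `client` struct is that a sequence ID of -1 marks an idle slot.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <string>
#include <vector>

// Hypothetical slot type for illustration; not part of llama.cpp.
struct slot {
    int32_t     seq_id = -1;  // -1 marks an idle slot
    std::string prompt;       // request text currently assigned to the slot
};

// Move queued requests into idle slots, giving each a fresh sequence ID.
// Returns the number of requests admitted.
int admit_requests(std::vector<slot> & slots,
                   std::deque<std::string> & queue,
                   int32_t & next_seq_id) {
    assert(next_seq_id >= 0);
    int n_admitted = 0;
    for (auto & s : slots) {
        if (queue.empty()) {
            break;       // nothing left to schedule
        }
        if (s.seq_id != -1) {
            continue;    // slot is busy with another request
        }
        s.seq_id = next_seq_id++;
        s.prompt = queue.front();
        queue.pop_front();
        n_admitted++;
    }
    return n_admitted;
}
```

When a request finishes, resetting its slot's `seq_id` to -1 makes the slot available again on the next call, so new work can join while other sequences are still generating.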

Code Reference

Source Location

Signature

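struct client {
    int32_t id = 0;                        // client index
    llama_seq_id seq_id = -1;              // assigned sequence; -1 means the client is idle
    llama_token sampled;                   // most recently sampled token
    int64_t t_start_prompt;                // timestamp when prompt processing started
    int64_t t_start_gen;                   // timestamp when generation started
    int64_t t_prompt = 0;                  // timing bookkeeping for the prompt phase
    int64_t t_gen = 0;                     // timing bookkeeping for the generation phase
    int32_t n_prompt = 0;                  // number of prompt tokens
    int32_t n_decoded = 0;                 // number of tokens generated so far
    int32_t i_batch = -1;                  // index of this client's token in the current batch; -1 if none
    std::string input;                     // client question
    std::string prompt;                    // prompt text built from the input
    std::string response;                  // generated text accumulated so far
    std::vector<llama_token> tokens_prev;  // previously seen tokens for this client
};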

int main(int argc, char ** argv);

Import

#include "arg.h"
#include "common.h"
#include "sampling.h"
#include "log.h"
#include "llama.h"
#include <cmath>
#include <cstdio>
#include <string>
#include <vector>
#include <ctime>
#include <algorithm>

I/O Contract

Inputs

Name | Type | Required | Description
params.model | std::string | Yes | Path to the GGUF model file
params.n_parallel | int | No | Number of parallel client sequences to simulate
params.prompt | std::string | No | Shared system prompt (prepended to all client prompts)
params.n_predict | int | No | Maximum tokens to generate per client request

Outputs

Name | Type | Description
client_responses | stdout | Generated responses for each client, printed to standard output
statistics | stdout | Aggregate timing statistics: prompt processing speed, generation speed, total tokens
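The speed figures in the statistics output are throughput values, which reduce to a token count divided by elapsed time. A minimal sketch of that arithmetic follows, assuming the elapsed time is measured in microseconds; `tokens_per_second` is a hypothetical helper written for illustration, not a function from the example.

```cpp
#include <cassert>
#include <cstdint>

// Tokens per second for n_tokens processed in t_elapsed_us microseconds.
// Returns 0.0 when no time has elapsed, to avoid dividing by zero.
double tokens_per_second(int32_t n_tokens, int64_t t_elapsed_us) {
    assert(n_tokens >= 0);
    if (t_elapsed_us <= 0) {
        return 0.0;
    }
    const double seconds = static_cast<double>(t_elapsed_us) / 1e6;
    return static_cast<double>(n_tokens) / seconds;
}
```

For instance, 128 tokens generated over 2,000,000 microseconds works out to 64 tokens per second. The same helper applies to both the prompt phase and the generation phase, using the respective token counts and durations.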

Usage Examples

# Run parallel decoding with 8 concurrent clients
./llama-parallel -m model.gguf -n 128 -np 8

# With a custom system prompt
./llama-parallel -m model.gguf -n 128 -np 4 -p "You are a helpful assistant."

# The built-in trivia questions are used as client prompts
# Each client processes independently with its own KV cache sequence
