Implementation: ggml-org llama.cpp Batched Bench

From Leeroopedia
Knowledge Sources
Domains: Benchmarking, Performance
Last Updated: 2026-02-15 00:00 GMT

Overview

Benchmarks batched decoding performance across every combination of prompt length, generation length, and number of parallel sequences.

Description

This CLI tool loads a model, performs a warmup decode, and then iterates over every combination of the PP (prompt length), TG (number of generated tokens), and PL (number of parallel sequences) values. For each combination it fills a batch with random prompt tokens, using either one prompt shared by all sequences or a separate prompt per sequence, times the prompt processing and text generation phases separately, and reports the prompt processing speed (S_PP) and text generation speed (S_TG). Results can be printed as a Markdown table or as JSONL.

Usage

Use this tool to measure llama.cpp's batched inference throughput and understand how parallelism, batch sizes, and prompt lengths affect decoding performance on a given model and hardware configuration.

Code Reference

Source Location

Signature

int main(int argc, char ** argv);

Import

#include "arg.h"
#include "common.h"
#include "log.h"
#include "llama.h"

I/O Contract

Inputs

Name | Type | Required | Description
-m | string (CLI arg) | Yes | Path to the GGUF model file
-c | int (CLI arg) | Yes | Context size, in tokens
-b | int (CLI arg) | Yes | Logical batch size (maximum number of tokens per llama_decode call)
-ub | int (CLI arg) | Yes | Physical micro-batch size
-npp | int list (CLI arg) | Yes | Comma-separated prompt lengths PP, in tokens (e.g., 128,256,512)
-ntg | int list (CLI arg) | Yes | Comma-separated generation lengths TG, in tokens (e.g., 128,256)
-npl | int list (CLI arg) | Yes | Comma-separated numbers of parallel sequences PL (e.g., 1,2,4,8,16,32)
-pps | flag (CLI arg) | No | Share one prompt across all parallel sequences (shared prompt mode)

Outputs

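Name | Type | Description
stdout | text | One row per (PP, TG, B) combination, as a Markdown table or JSONL, with PP, TG, B (number of parallel sequences), S_PP (prompt processing speed, tokens/s), and S_TG (text generation speed, tokens/s)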

Usage Examples

# Run batched benchmark with various configurations
./llama-batched-bench -m model.gguf -c 2048 -b 2048 -ub 512 \
    -npp 128,256,512 -ntg 128,256 -npl 1,2,4,8,16,32
