Implementation:Ggml org Llama cpp Batched Bench
| Knowledge Sources | |
|---|---|
| Domains | Benchmarking, Performance |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Benchmarks batched decoding performance across configurable combinations of prompt length, generation length, and parallel batch count.
Description
This CLI tool loads a model, performs a warmup decode, and then iterates over every combination of PP (prompt length in tokens), TG (number of tokens to generate), and PL (number of parallel sequences). For each combination it fills a batch with random tokens, using either one prompt shared by all sequences or a separate prompt per sequence, and times the prompt processing and text generation phases separately. It reports the prompt processing speed (S_PP) and the text generation speed (S_TG), both in tokens per second. Results can be printed as a Markdown table or as JSONL.
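The sweep described above can be sketched in plain C++ without linking against llama.cpp. This is a minimal sketch and not the tool's actual code: the names `BenchCase`, `required_kv`, `enumerate_cases`, and `compute_speeds` are hypothetical, and both the KV-cache accounting (a shared prompt counted once, a separate prompt counted once per sequence) and the throughput definitions are assumptions about how such a benchmark would typically be computed.

```cpp
#include <cstdint>
#include <vector>

// One point of the benchmark sweep (illustrative names, not the tool's own).
struct BenchCase {
    int32_t pp;   // prompt tokens per sequence
    int32_t tg;   // generated tokens per sequence
    int32_t pl;   // number of parallel sequences
    int32_t n_kv; // KV cache cells needed by the end of the run
};

// Assumed accounting: a shared prompt is stored once,
// separate prompts are stored once per sequence.
int32_t required_kv(int32_t pp, int32_t tg, int32_t pl, bool shared_prompt) {
    return shared_prompt ? pp + pl * tg : pl * (pp + tg);
}

// Enumerate every (pp, tg, pl) combination that fits within n_kv_max cells.
std::vector<BenchCase> enumerate_cases(const std::vector<int32_t> & n_pp,
                                       const std::vector<int32_t> & n_tg,
                                       const std::vector<int32_t> & n_pl,
                                       bool shared_prompt, int32_t n_kv_max) {
    std::vector<BenchCase> cases;
    for (int32_t pp : n_pp) {
        for (int32_t tg : n_tg) {
            for (int32_t pl : n_pl) {
                const int32_t n_kv = required_kv(pp, tg, pl, shared_prompt);
                if (n_kv > n_kv_max) {
                    continue; // combination does not fit in the context
                }
                cases.push_back({pp, tg, pl, n_kv});
            }
        }
    }
    return cases;
}

struct Speeds {
    double s_pp; // prompt processing speed, tokens per second
    double s_tg; // text generation speed, tokens per second
};

// Turn measured phase durations (in seconds) into throughput figures.
Speeds compute_speeds(const BenchCase & c, bool shared_prompt, double t_pp, double t_tg) {
    const double prompt_tokens = shared_prompt ? c.pp : double(c.pl) * c.pp;
    const double gen_tokens    = double(c.pl) * c.tg;
    return { prompt_tokens / t_pp, gen_tokens / t_tg };
}
```

Under these assumptions, a run with separate prompts needs `pl * (pp + tg)` KV cells, so the context size bounds which combinations in the sweep can be measured.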
Usage
Use this tool to measure llama.cpp's batched inference throughput and understand how parallelism, batch sizes, and prompt lengths affect decoding performance on a given model and hardware configuration.
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: tools/batched-bench/batched-bench.cpp
- Lines: 1-256
Signature
int main(int argc, char ** argv);
Import
#include "arg.h"
#include "common.h"
#include "log.h"
#include "llama.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| -m | string (CLI arg) | Yes | Path to the GGUF model file |
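| -c | int (CLI arg) | No | Context size (falls back to the llama.cpp default when omitted) |
| -b | int (CLI arg) | No | Batch size (falls back to the llama.cpp default when omitted) |
| -ub | int (CLI arg) | No | Micro-batch size (falls back to the llama.cpp default when omitted) |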
| -npp | int list (CLI arg) | Yes | Comma-separated prompt processing lengths (e.g., 128,256,512) |
| -ntg | int list (CLI arg) | Yes | Comma-separated text generation lengths (e.g., 128,256) |
| -npl | int list (CLI arg) | Yes | Comma-separated parallel batch counts (e.g., 1,2,4,8,16,32) |
| -pps | flag (CLI arg) | No | Share one prompt across all parallel sequences instead of giving each sequence its own prompt |
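The list-valued arguments take comma-separated integers such as `128,256,512`. The following is a minimal sketch of how such a value could be split into integers; `parse_int_list` is a hypothetical helper written for illustration and is not taken from the llama.cpp argument parser.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split a comma-separated string such as "128,256,512" into integers.
// std::stoi throws std::invalid_argument if an item does not start with a number.
std::vector<int> parse_int_list(const std::string & text) {
    std::vector<int> values;
    std::stringstream ss(text);
    std::string item;
    while (std::getline(ss, item, ',')) {
        if (item.empty()) {
            continue; // tolerate stray commas such as "1,,2"
        }
        values.push_back(std::stoi(item));
    }
    return values;
}
```

Each of the three lists contributes one dimension of the sweep, so the example values `128,256,512`, `128,256`, and `1,2,4,8,16,32` describe 3 × 2 × 6 = 36 combinations.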
Outputs
| Name | Type | Description |
|---|---|---|
| stdout | text | One Markdown table row or one JSONL object per combination, containing PP, TG, B (the number of parallel sequences), S_PP (tokens/sec), and S_TG (tokens/sec) |
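When the Markdown output is post-processed, each result row first has to be split into cells. The sketch below shows one way to do that in C++; `trim` and `split_table_row` are hypothetical helpers, and the code makes no assumption about which columns the tool prints or in what order, so mapping cells to PP, TG, B, S_PP, and S_TG is left to the caller.

```cpp
#include <string>
#include <vector>

// Remove leading and trailing spaces and tabs.
static std::string trim(const std::string & s) {
    const auto first = s.find_first_not_of(" \t");
    if (first == std::string::npos) {
        return "";
    }
    const auto last = s.find_last_not_of(" \t");
    return s.substr(first, last - first + 1);
}

// Split a line of the form "| a | b | c |" into trimmed cells.
// Returns an empty vector when the line is not delimited by '|' on both ends.
std::vector<std::string> split_table_row(const std::string & line) {
    std::vector<std::string> cells;
    const std::string row = trim(line);
    if (row.size() < 2 || row.front() != '|' || row.back() != '|') {
        return cells;
    }
    std::string::size_type start = 1; // skip the leading '|'
    while (start < row.size()) {
        // A closing '|' always exists because the row ends with one.
        const std::string::size_type end = row.find('|', start);
        cells.push_back(trim(row.substr(start, end - start)));
        start = end + 1;
    }
    return cells;
}
```

The header and separator rows pass through the same function, so a caller would typically skip any row whose cells are not numeric before converting values with `std::stod`.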
Usage Examples
# Run batched benchmark with various configurations
./llama-batched-bench -m model.gguf -c 2048 -b 2048 -ub 512 \
-npp 128,256,512 -ntg 128,256 -npl 1,2,4,8,16,32