Implementation:Ggml org Llama cpp Llama Bench
| Knowledge Sources | |
|---|---|
| Domains | Benchmarking, Performance |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Full-featured performance benchmarking tool for llama.cpp that measures prompt processing and text generation speeds across configurable parameter combinations.
Description
Llama Bench parses command-line arguments into a `cmd_params` struct whose options can take multiple values (comma-separated lists or ranges), then expands every combination of those values into `cmd_params_instance` objects. For each instance, it loads the model, creates a context, runs a warm-up pass, and executes the prompt processing and/or text generation tests for the configured number of repetitions, recording the timings in `test` objects. Results are written through a polymorphic `printer` hierarchy that supports Markdown, CSV, JSON, JSONL, and SQL output.
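The expansion of multi-value options into individual instances can be illustrated with a minimal sketch. The struct and function names here (`bench_params`, `bench_instance`, `expand_instances`) are hypothetical, reduced stand-ins, not the tool's actual `cmd_params` and `cmd_params_instance` definitions, which carry many more fields and may order the combinations differently.

```cpp
#include <vector>

// Hypothetical, reduced stand-in for cmd_params: each option holds a list of values.
struct bench_params {
    std::vector<int> n_batch;
    std::vector<int> n_threads;
    std::vector<int> n_gpu_layers;
};

// Hypothetical, reduced stand-in for cmd_params_instance: exactly one value per option.
struct bench_instance {
    int n_batch;
    int n_threads;
    int n_gpu_layers;
};

// Build the Cartesian product of all option lists, one instance per combination.
std::vector<bench_instance> expand_instances(const bench_params & params) {
    std::vector<bench_instance> instances;
    instances.reserve(params.n_batch.size() * params.n_threads.size() * params.n_gpu_layers.size());
    for (int nb : params.n_batch) {
        for (int nt : params.n_threads) {
            for (int ngl : params.n_gpu_layers) {
                instances.push_back({nb, nt, ngl});
            }
        }
    }
    return instances;
}
```

With two batch sizes, three thread counts, and two offload settings, this sketch yields 2 × 3 × 2 = 12 instances, which is why adding values to several options at once quickly multiplies the total run time.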
Usage
Use this tool for performance regression testing, hardware comparisons, optimization validation, and benchmarking inference throughput across different model configurations and backend devices.
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: tools/llama-bench/llama-bench.cpp
- Lines: 1-2291
Signature
// Main entry point
int main(int argc, char ** argv);
// Core structures
struct cmd_params { /* multi-value CLI parameter sets */ };
struct cmd_params_instance { /* single parameter combination */ };
struct test { /* benchmark result with timing data */ };
// Output printers
struct printer { /* base polymorphic printer */ };
struct csv_printer : public printer { /* CSV output */ };
struct json_printer : public printer { /* JSON output */ };
struct jsonl_printer : public printer { /* JSON Lines output */ };
struct markdown_printer : public printer { /* Markdown table output */ };
struct sql_printer : public printer { /* SQL INSERT output */ };
Import
#include "common.h"
#include "ggml.h"
#include "llama.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| -m, --model | string | Yes | Path to the GGUF model file to benchmark |
| -p, --n-prompt | int list | No | Number of prompt tokens (comma-separated or range, default: 512) |
| -n, --n-gen | int list | No | Number of tokens to generate (comma-separated or range, default: 128) |
| -b, --batch-size | int list | No | Batch sizes to test |
| -t, --threads | int list | No | Number of threads to use |
| -ngl, --n-gpu-layers | int list | No | Number of layers to offload to GPU |
| -r, --repetitions | int | No | Number of test repetitions (default: 5) |
| -o, --output | string | No | Output format: md, csv, json, jsonl, sql (default: md) |
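The "int list" inputs above accept several values in one argument. A minimal sketch of the comma-separated case follows; `parse_int_list` is a hypothetical helper written for illustration, not the parser used by the tool, and it deliberately leaves out range syntax.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Split a comma-separated argument such as "128,256,512" into integers.
// Throws std::invalid_argument if any item is empty or not a plain integer.
std::vector<int> parse_int_list(const std::string & arg) {
    std::vector<int> values;
    std::istringstream stream(arg);
    std::string item;
    while (std::getline(stream, item, ',')) {
        std::size_t pos = 0;
        int value = std::stoi(item, &pos);  // throws std::invalid_argument on empty/non-numeric input
        if (pos != item.size()) {
            throw std::invalid_argument("invalid integer: " + item);
        }
        values.push_back(value);
    }
    return values;
}
```

For example, `parse_int_list("4,8,16")` produces the three thread counts used in a `-t 4,8,16` style invocation, each of which would then feed into a separate benchmark instance.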
Outputs
| Name | Type | Description |
|---|---|---|
| benchmark results | text (stdout) | Benchmark data in the selected output format, including tokens per second for prompt processing and text generation |
| return code | int | 0 on success, non-zero on failure |
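The tokens-per-second figures in the output can be derived from per-repetition timings. The sketch below assumes each repetition is recorded as a duration in nanoseconds and reports the mean and sample standard deviation across repetitions; the function names are hypothetical and the exact statistics computed by the tool may differ.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Convert each repetition's duration (nanoseconds) into a tokens-per-second rate.
// Assumes every duration is non-zero.
std::vector<double> tokens_per_second(int n_tokens, const std::vector<uint64_t> & samples_ns) {
    std::vector<double> rates;
    rates.reserve(samples_ns.size());
    for (uint64_t t_ns : samples_ns) {
        rates.push_back(1e9 * n_tokens / static_cast<double>(t_ns));
    }
    return rates;
}

// Arithmetic mean; returns 0 for an empty input.
double mean(const std::vector<double> & v) {
    if (v.empty()) {
        return 0.0;
    }
    double sum = 0.0;
    for (double x : v) {
        sum += x;
    }
    return sum / static_cast<double>(v.size());
}

// Sample standard deviation (n - 1 denominator); returns 0 for fewer than two values.
double sample_stdev(const std::vector<double> & v) {
    if (v.size() < 2) {
        return 0.0;
    }
    const double m = mean(v);
    double sq_sum = 0.0;
    for (double x : v) {
        sq_sum += (x - m) * (x - m);
    }
    return std::sqrt(sq_sum / static_cast<double>(v.size() - 1));
}
```

Reporting a spread alongside the mean is what makes a higher repetition count useful: with only one repetition the deviation is zero by construction and says nothing about run-to-run noise.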
Usage Examples
# Basic benchmark with default settings
./llama-bench -m model.gguf
# Test multiple prompt sizes and generation lengths
./llama-bench -m model.gguf -p 128,256,512 -n 64,128 -o csv
# Benchmark with GPU offloading and multiple thread counts
./llama-bench -m model.gguf -ngl 99 -t 4,8,16 -r 3