Implementation:Ggml org Llama cpp Llama Bench

Knowledge Sources	Ggml_org_Llama_cpp
Domains	Benchmarking, Performance
Last Updated	2026-02-15 00:00 GMT

Overview

Full-featured performance benchmarking tool for llama.cpp that measures prompt processing and text generation speeds across configurable parameter combinations.

Description

`llama-bench` parses command-line parameters into a `cmd_params` struct whose fields accept multiple values (comma-separated lists or ranges), then expands them into one `cmd_params_instance` object per parameter combination. For each instance, it loads the model, creates a context, runs a warm-up pass, and then executes the prompt processing and/or text generation tests for the configured number of repetitions, collecting timing data into `test` objects. Results are written through a polymorphic `printer` hierarchy that supports Markdown, CSV, JSON, JSONL, and SQL output.
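
The expansion of multi-value parameters into single-value instances can be sketched roughly as follows. This is an illustration only, not the tool's code: `bench_params`, `bench_instance`, and `expand_instances` are hypothetical, simplified stand-ins for `cmd_params` and `cmd_params_instance`. The sketch also assumes that each prompt size and each generation size yields its own test rather than being crossed with each other.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for cmd_params: every field holds a list.
struct bench_params {
    std::vector<std::string> model;
    std::vector<int>         n_prompt;
    std::vector<int>         n_gen;
    std::vector<int>         n_threads;
};

// Hypothetical stand-in for cmd_params_instance: one value per field.
struct bench_instance {
    std::string model;
    int         n_prompt;
    int         n_gen;
    int         n_threads;
};

// Expand the lists into single-value instances.
// Assumption: prompt-only and generation-only tests are emitted separately,
// so n_prompt and n_gen are not multiplied against each other.
std::vector<bench_instance> expand_instances(const bench_params & p) {
    std::vector<bench_instance> out;
    for (const auto & m : p.model) {
        for (int nt : p.n_threads) {
            for (int np : p.n_prompt) {
                if (np > 0) out.push_back({m, np, 0, nt}); // prompt processing test
            }
            for (int ng : p.n_gen) {
                if (ng > 0) out.push_back({m, 0, ng, nt}); // text generation test
            }
        }
    }
    return out;
}
```

Under these assumptions, one model with two thread counts, three prompt sizes, and two generation sizes produces 2 x (3 + 2) = 10 instances.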

Usage

Use this tool for performance regression testing, hardware comparisons, optimization validation, and benchmarking inference throughput across different model configurations and backend devices.

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: tools/llama-bench/llama-bench.cpp
Lines: 1-2291

Signature

// Main entry point
int main(int argc, char ** argv);

// Core structures
struct cmd_params { /* multi-value CLI parameter sets */ };
struct cmd_params_instance { /* single parameter combination */ };
struct test { /* benchmark result with timing data */ };

// Output printers
struct printer { /* base polymorphic printer */ };
struct csv_printer : public printer { /* CSV output */ };
struct json_printer : public printer { /* JSON output */ };
struct jsonl_printer : public printer { /* JSON Lines output */ };
struct markdown_printer : public printer { /* Markdown table output */ };
struct sql_printer : public printer { /* SQL INSERT output */ };

Import

#include "common.h"
#include "ggml.h"
#include "llama.h"

I/O Contract

Inputs

Name	Type	Required	Description
-m, --model	string	Yes	Path to the GGUF model file to benchmark
-p, --n-prompt	int list	No	Number of prompt tokens (comma-separated or range, default: 512)
-n, --n-gen	int list	No	Number of tokens to generate (comma-separated or range, default: 128)
-b, --batch-size	int list	No	Batch sizes to test
-t, --threads	int list	No	Number of threads to use
-ngl, --n-gpu-layers	int list	No	Number of layers to offload to GPU
-r, --repetitions	int	No	Number of test repetitions (default: 5)
-o, --output	string	No	Output format: md, csv, json, jsonl, sql (default: md)
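
As a rough illustration of how an "int list" argument might be expanded, the sketch below parses comma-separated values and simple ranges. `parse_int_list` is a hypothetical helper, and the `first-last` and `first-last+step` range spellings are assumptions of this example; consult the tool's `--help` output for the exact syntax it accepts.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: expand "128,256", "1-4", or "0-32+16" into a list of ints.
std::vector<int> parse_int_list(const std::string & arg) {
    std::vector<int> values;
    size_t start = 0;
    while (start <= arg.size()) {
        size_t comma = arg.find(',', start);
        if (comma == std::string::npos) comma = arg.size();
        const std::string item = arg.substr(start, comma - start);
        start = comma + 1;
        if (item.empty()) throw std::invalid_argument("empty list item");

        // Search from index 1 so a leading minus sign is not taken as a range.
        const size_t dash = item.find('-', 1);
        if (dash == std::string::npos) {
            values.push_back(std::stoi(item)); // single value
            continue;
        }

        // Range: first-last, optionally followed by +step (assumed syntax).
        const int         first = std::stoi(item.substr(0, dash));
        const std::string rest  = item.substr(dash + 1);
        const size_t      plus  = rest.find('+');
        const int         last  = std::stoi(rest.substr(0, plus));
        const int         step  = plus == std::string::npos ? 1 : std::stoi(rest.substr(plus + 1));
        if (step <= 0 || last < first) throw std::invalid_argument("invalid range: " + item);
        for (int v = first; v <= last; v += step) values.push_back(v);
    }
    return values;
}
```

Malformed input (an empty item, a non-numeric token, or a descending range) raises `std::invalid_argument` in this sketch, either directly or through `std::stoi`.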

Outputs

Name	Type	Description
benchmark results	stdout	Formatted benchmark data including tokens/second for prompt processing and generation
return code	int	0 on success, non-zero on failure
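
The tokens/second figures can be derived from per-repetition timings along the following lines. This is a sketch under assumptions: `throughput` and `tokens_per_second` are hypothetical names, timings are taken to be nanoseconds per repetition, and the sample standard deviation (n - 1 denominator) is used for the spread.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical summary of one test: mean and spread of tokens per second.
struct throughput {
    double mean;
    double stdev;
};

// Convert per-repetition durations (nanoseconds) into tokens/second statistics.
// Every sample is assumed to be non-zero.
throughput tokens_per_second(int n_tokens, const std::vector<uint64_t> & samples_ns) {
    std::vector<double> ts;
    ts.reserve(samples_ns.size());
    for (uint64_t t : samples_ns) {
        ts.push_back(1e9 * n_tokens / static_cast<double>(t));
    }
    if (ts.empty()) return {0.0, 0.0};

    double sum = 0.0;
    for (double v : ts) sum += v;
    const double mean = sum / ts.size();

    double sq = 0.0;
    for (double v : ts) sq += (v - mean) * (v - mean);
    // Sample standard deviation; defined as 0 when there is a single repetition.
    const double stdev = ts.size() > 1 ? std::sqrt(sq / (ts.size() - 1)) : 0.0;
    return {mean, stdev};
}
```

For instance, 100 tokens measured at 0.5 s and 1.0 s give rates of 200 and 100 tokens/second, so the mean is 150.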

Usage Examples

# Basic benchmark with default settings
./llama-bench -m model.gguf

# Test multiple prompt sizes and generation lengths
./llama-bench -m model.gguf -p 128,256,512 -n 64,128 -o csv

# Benchmark with GPU offloading and multiple thread counts
./llama-bench -m model.gguf -ngl 99 -t 4,8,16 -r 3
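
The warm-up and repetition behavior controlled by `-r` can be pictured with a small timing loop. This is a generic sketch, not the tool's implementation: `time_repetitions` is a hypothetical helper, and `work` stands in for one prompt processing or text generation run.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical helper: run `work` once untimed as a warm-up, then time
// `repetitions` further runs and return each duration in nanoseconds.
std::vector<uint64_t> time_repetitions(const std::function<void()> & work, int repetitions) {
    using clock = std::chrono::steady_clock; // monotonic, unaffected by wall-clock changes

    work(); // warm-up run, excluded from the samples

    std::vector<uint64_t> samples_ns;
    if (repetitions > 0) samples_ns.reserve(static_cast<std::size_t>(repetitions));
    for (int i = 0; i < repetitions; ++i) {
        const auto t0 = clock::now();
        work();
        const auto t1 = clock::now();
        const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        samples_ns.push_back(static_cast<uint64_t>(ns));
    }
    return samples_ns;
}
```

With `repetitions` set to 3, `work` is called four times in total and three samples are returned.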

Related Pages

Principle:Ggml_org_Llama_cpp_Benchmarking
