Implementation:Ggml org Llama cpp Idle Benchmark

Knowledge Sources	Ggml_org_Llama_cpp
Domains	Benchmarking, GPU
Last Updated	2026-02-15 00:00 GMT

Overview

Benchmarks whether GPU decode latency remains constant regardless of idle time between invocations.

Description

Loads a model, creates a context, builds a batch containing a single BOS token, and performs one warm-up decode. It then sweeps the pause duration from 0 to 4000 ms in 800 ms steps (0, 800, 1600, 2400, 3200, and 4000 ms). For each pause duration it runs 3 iterations of sleep-then-decode and times only the decode, not the sleep. It then computes the mean and standard deviation of the decode time for each pause duration, so you can check whether decode latency is independent of the preceding idle period.

Usage

Use this tool to diagnose GPU power management issues, such as clocks being throttled while the GPU is idle, that can cause inconsistent inference latency in production serving scenarios.

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: examples/idle/idle.cpp
Lines: 1-110

Signature

static void print_usage(int argc, char ** argv);
int main(int argc, char ** argv);

Import

#include "arg.h"
#include "common.h"
#include "log.h"
#include "llama.h"

#include <cmath>
#include <cstdio>
#include <cstring>
#include <string>
#include <thread>
#include <vector>

I/O Contract

Inputs

Name	Type	Required	Description
-m	string	Yes	Path to the GGUF model file
-ngl	int	No	Number of GPU layers to offload

Outputs

Name	Type	Description
stdout	text	Decode latency mean and standard deviation for each idle pause duration
return	int	Exit code: 0 on success, 1 on failure

Usage Examples

# Run idle benchmark with a model
./build/bin/llama-idle -m model.gguf -ngl 99

Related Pages

Principle:Ggml_org_Llama_cpp_GPU_Inference
