Implementation:Ggml org Llama cpp Idle Benchmark
| Knowledge Sources | |
|---|---|
| Domains | Benchmarking, GPU |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
A benchmark that checks whether GPU decode latency stays constant regardless of how long the GPU sits idle between decode calls.
Description
Loads a model, creates a context and a batch containing a single BOS token, and performs one warm-up decode. It then steps through pause durations from 0 ms to 4000 ms in 800 ms increments (six values). For each pause duration it runs 3 iterations of sleep-then-decode and times only the decode, not the sleep. Finally, it reports the mean and standard deviation of the decode time for each pause duration, so you can check whether decode latency is independent of the preceding idle period.
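The sleep-then-measure loop described above can be sketched in isolation. This is a minimal illustration, not the code in idle.cpp: `work` is a hypothetical stand-in for the decode-and-synchronize step, the names `Stats`, `compute_stats`, `measure_after_pause`, and `pause_schedule_ms` are invented here, and the use of the sample standard deviation (n - 1 denominator) is an assumption.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Mean and sample standard deviation of a set of timings.
struct Stats {
    double mean;
    double stddev;
};

// Assumption: sample standard deviation (n - 1 denominator).
Stats compute_stats(const std::vector<double> & samples) {
    const std::size_t n = samples.size();
    if (n == 0) {
        return {0.0, 0.0};
    }
    double sum = 0.0;
    for (double s : samples) {
        sum += s;
    }
    const double mean = sum / static_cast<double>(n);
    if (n < 2) {
        return {mean, 0.0};
    }
    double sq = 0.0;
    for (double s : samples) {
        sq += (s - mean) * (s - mean);
    }
    return {mean, std::sqrt(sq / static_cast<double>(n - 1))};
}

// Sleep for pause_ms, then time a single call to work(); repeat n_iters times.
// Only the call to work() is timed, never the sleep.
// `work` is a hypothetical stand-in for the decode step.
Stats measure_after_pause(int pause_ms, int n_iters, const std::function<void()> & work) {
    std::vector<double> samples_ms;
    for (int i = 0; i < n_iters; i++) {
        std::this_thread::sleep_for(std::chrono::milliseconds(pause_ms));
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        samples_ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return compute_stats(samples_ms);
}

// Pause schedule from the description: 0 ms to 4000 ms in 800 ms steps.
std::vector<int> pause_schedule_ms() {
    std::vector<int> out;
    for (int p = 0; p <= 4000; p += 800) {
        out.push_back(p);
    }
    return out;
}
```

A caller would loop over `pause_schedule_ms()`, call `measure_after_pause(pause, 3, work)` for each value, and print the resulting mean and standard deviation. The warm-up corresponds to calling `work()` once before that loop, so one-time initialization cost does not land in the first measurement.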
Usage
Use this tool to diagnose GPU power management issues, such as clocks being throttled while the GPU is idle, that can cause inconsistent inference latency in production serving scenarios.
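One possible way to read the results is to compare each pause duration's mean latency against the zero-pause baseline. The sketch below is a hypothetical post-processing helper, not part of the tool: `PauseResult`, `flag_idle_sensitive`, and the `rel_tol` threshold are all assumptions introduced for illustration, and the rule it applies (excess over baseline must exceed both a relative tolerance and the combined standard deviations) is only one reasonable heuristic.

```cpp
#include <cstddef>
#include <vector>

// One row of benchmark output (hypothetical representation).
struct PauseResult {
    int    pause_ms;   // idle time before each decode
    double mean_ms;    // mean decode latency after that pause
    double stddev_ms;  // standard deviation of the decode latency
};

// Returns the pause durations whose mean latency is noticeably higher than
// the baseline (the first entry, expected to be the 0 ms pause).
// A pause is flagged only if the excess over baseline is larger than
// rel_tol * baseline AND larger than the sum of the two standard deviations.
std::vector<int> flag_idle_sensitive(const std::vector<PauseResult> & results, double rel_tol) {
    std::vector<int> flagged;
    if (results.empty()) {
        return flagged;
    }
    const PauseResult & base = results.front();
    for (std::size_t i = 1; i < results.size(); i++) {
        const PauseResult & r = results[i];
        const double excess = r.mean_ms - base.mean_ms;
        const bool beyond_tol   = excess > rel_tol * base.mean_ms;
        const bool beyond_noise = excess > base.stddev_ms + r.stddev_ms;
        if (beyond_tol && beyond_noise) {
            flagged.push_back(r.pause_ms);
        }
    }
    return flagged;
}
```

An empty result suggests latency does not depend on idle time under this heuristic. Flagged pause durations indicate where to look at driver or power-management settings. With only 3 iterations per pause the standard deviations are rough, so a flagged value is a hint and not proof.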
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: examples/idle/idle.cpp
- Lines: 1-110
Signature
static void print_usage(int argc, char ** argv);
int main(int argc, char ** argv);
Import
#include "arg.h"
#include "common.h"
#include "log.h"
#include "llama.h"
#include <cmath>
#include <cstdio>
#include <cstring>
#include <string>
#include <thread>
#include <vector>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| -m | string | Yes | Path to the GGUF model file |
| -ngl | int | No | Number of GPU layers to offload |
Outputs
| Name | Type | Description |
|---|---|---|
| stdout | text | Decode latency mean and standard deviation for each idle pause duration |
| return | int | Exit code: 0 on success, 1 on failure |
Usage Examples
# Run idle benchmark with a model
./build/bin/llama-idle -m model.gguf -ngl 99