Implementation:Ggml org Llama cpp Idle Benchmark

From Leeroopedia
Knowledge Sources
Domains Benchmarking, GPU
Last Updated 2026-02-15 00:00 GMT

Overview

Benchmarks whether GPU decode latency remains constant regardless of idle time between invocations.

Description

The tool loads a model, creates a context, and builds a batch containing a single BOS token. After one untimed warm-up decode, it sweeps the pause duration from 0 to 4000 ms in 800 ms steps (six values in total). For each pause duration it runs 3 iterations of sleep-then-decode, timing only the decode. It then reports the mean and standard deviation of the decode time for each pause, so you can check whether decode latency is independent of the preceding idle period.
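The measurement loop described above can be sketched without any llama.cpp dependency. This is a minimal illustration, not the tool's actual source: `PauseStats`, `summarize`, and `run_idle_sweep` are hypothetical names, the `work` callback stands in for the decode call, and the population standard deviation is an assumption, since this page does not say which estimator the tool uses.

```cpp
#include <algorithm>
#include <chrono>
#include <cmath>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical result record: one entry per pause duration.
struct PauseStats {
    int    pause_ms;   // idle time inserted before each timed call
    double mean_us;    // mean duration of the timed call, in microseconds
    double stddev_us;  // population standard deviation, in microseconds
};

// Mean and population standard deviation computed from running sums.
inline PauseStats summarize(int pause_ms, const std::vector<double> & samples_us) {
    double sum  = 0.0;
    double sum2 = 0.0;
    for (double s : samples_us) {
        sum  += s;
        sum2 += s * s;
    }
    const double n    = static_cast<double>(samples_us.size());
    const double mean = n > 0 ? sum / n : 0.0;
    const double var  = n > 0 ? sum2 / n - mean * mean : 0.0;
    // Clamp tiny negative values caused by floating-point rounding.
    return { pause_ms, mean, std::sqrt(std::max(var, 0.0)) };
}

// Sleep-then-measure sweep. `work` is a stand-in for the decode call.
// Defaults mirror the values on this page: 0..4000 ms, 800 ms steps, 3 iterations.
inline std::vector<PauseStats> run_idle_sweep(const std::function<void()> & work,
                                              int max_pause_ms = 4000,
                                              int step_ms      = 800,
                                              int n_iters      = 3) {
    using clock = std::chrono::steady_clock;

    std::vector<PauseStats> results;
    if (step_ms <= 0 || n_iters <= 0) {
        return results;  // nothing sensible to measure
    }

    work();  // warm-up call, not timed

    for (int pause_ms = 0; pause_ms <= max_pause_ms; pause_ms += step_ms) {
        std::vector<double> samples_us;
        for (int i = 0; i < n_iters; ++i) {
            // The pause simulates an idle device; it is excluded from the timing.
            std::this_thread::sleep_for(std::chrono::milliseconds(pause_ms));

            const auto t0 = clock::now();
            work();
            const auto t1 = clock::now();

            samples_us.push_back(
                std::chrono::duration<double, std::micro>(t1 - t0).count());
        }
        results.push_back(summarize(pause_ms, samples_us));
    }
    return results;
}
```

With the default arguments the sweep sleeps for roughly 36 seconds in total (3 iterations at each of 0, 800, 1600, 2400, 3200, and 4000 ms), so smaller values are convenient when experimenting with the sketch.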

Usage

Use this tool to diagnose GPU power management issues, such as clocks being lowered while the GPU is idle, that could cause inconsistent inference latency in production serving scenarios.
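One way to read the results is to compare the mean latency at each pause against the zero-pause baseline. The helper below is a hedged sketch of that comparison: `worst_slowdown` and `latency_depends_on_idle` are hypothetical names, and the 1.25 tolerance is an arbitrary illustrative threshold, not a value taken from the tool.

```cpp
#include <algorithm>
#include <vector>

// Ratio of the slowest mean latency to the zero-pause baseline.
// Assumes the first entry is the mean measured with a 0 ms pause.
// Returns 0.0 when there is no usable baseline.
inline double worst_slowdown(const std::vector<double> & mean_ms_by_pause) {
    if (mean_ms_by_pause.empty() || mean_ms_by_pause.front() <= 0.0) {
        return 0.0;
    }
    const double worst =
        *std::max_element(mean_ms_by_pause.begin(), mean_ms_by_pause.end());
    return worst / mean_ms_by_pause.front();
}

// True when some pause duration makes decoding noticeably slower than the
// baseline. The default tolerance is an assumption chosen for illustration.
inline bool latency_depends_on_idle(const std::vector<double> & mean_ms_by_pause,
                                    double tolerance = 1.25) {
    return worst_slowdown(mean_ms_by_pause) > tolerance;
}
```

A ratio close to 1.0 across all pauses suggests idle time is not affecting decode latency. A ratio that grows with the pause duration points toward the power management behavior described above. The reported standard deviation should be taken into account before drawing conclusions from only 3 iterations per pause.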

Code Reference

Source Location

Signature

static void print_usage(int argc, char ** argv);
int main(int argc, char ** argv);

Import

#include "arg.h"
#include "common.h"
#include "log.h"
#include "llama.h"

#include <cmath>
#include <cstdio>
#include <cstring>
#include <string>
#include <thread>
#include <vector>

I/O Contract

Inputs

Name Type Required Description
-m string Yes Path to the GGUF model file
-ngl int No Number of GPU layers to offload

Outputs

Name Type Description
stdout text Decode latency mean and standard deviation for each idle pause duration
return int Exit code: 0 on success, 1 on failure

Usage Examples

# Run idle benchmark with a model
./build/bin/llama-idle -m model.gguf -ngl 99
