Implementation:Ggml org Llama cpp AIME25 Benchmark Results

Knowledge Sources	Ggml_org_Llama_cpp
Domains	Benchmarking
Last Updated	2026-02-15 00:00 GMT

Overview

A JSON data file containing per-problem results for the AIME 2025 math benchmark, run with the gpt-oss-120b-high model at temperature 1.0 on DGX Spark hardware.

Description

This file is a benchmark results artifact within the llama.cpp repository's benchmarking infrastructure. It stores complete per-problem results from evaluating the gpt-oss-120b-high model on the AIME 2025 (American Invitational Mathematics Examination) problem set. The data includes individual problem prompts, model-generated responses with step-by-step reasoning, extracted answers, correct answers, scores, and character counts for each response.

Usage

Use this data file to analyze per-problem model performance on the AIME 2025 benchmark. It complements summary benchmark files by providing granular detail on each individual math problem, enabling identification of problem types where the model succeeds or struggles.

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: benches/dgx-spark/aime25_openai__gpt-oss-120b-high_temp1.0_20251109_094547_allresults.json
Lines: 1-2896

Signature

{
  "score": 0.925,
  "metrics": {
    "chars": 2296.19,
    "chars:std": 986.05,
    "score:std": 0.263
  },
  "htmls": [ ... ]
}
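The summary fields above can be sanity-checked against each other. The sketch below assumes, without confirmation from the file itself, that every per-problem score is either 0 or 1 and that score:std is a population standard deviation. Under those assumptions the standard deviation of the scores is sqrt(p * (1 - p)), where p is the overall score, and the result agrees with the reported 0.263 to three decimal places.

```python
import math

# Values copied from the summary fields shown above.
score = 0.925
reported_score_std = 0.263

# Assumption (not confirmed by the source): each per-problem score is
# 0 or 1, so the population standard deviation is sqrt(p * (1 - p)).
expected_std = math.sqrt(score * (1.0 - score))

print(f"expected score:std = {expected_std:.3f}")
print("consistent with reported value:",
      math.isclose(expected_std, reported_score_std, abs_tol=5e-4))
```

If the check failed for another results file, that would suggest partial-credit scoring or a different standard deviation convention rather than an error in the data.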

Import

// This is a data file; load it with any JSON parser
// e.g., in Python: with open("aime25_openai__gpt-oss-120b-high_temp1.0_20251109_094547_allresults.json") as f: results = json.load(f)

I/O Contract

Inputs

Name	Type	Required	Description
N/A	N/A	N/A	This is a static data file with no runtime inputs

Outputs

Name	Type	Description
score	float	Overall accuracy score across all AIME 2025 problems (0.0 to 1.0)
metrics	object	Aggregate statistics: mean response length in characters (chars), its standard deviation (chars:std), and the standard deviation of the scores (score:std)
htmls	array of string	Per-problem HTML reports containing prompts, model responses, correct answers, extracted answers, and individual scores
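The output fields listed above can be checked programmatically before any analysis. The following is a minimal sketch, not part of llama.cpp: the helper name check_results_shape is hypothetical, and the sample document reuses the summary values from the Signature section with an empty htmls list as a stand-in for the real per-problem reports.

```python
from numbers import Real


def check_results_shape(results):
    """Return a list of shape problems; an empty list means none were found."""
    problems = []

    score = results.get("score")
    if (not isinstance(score, Real) or isinstance(score, bool)
            or not 0.0 <= score <= 1.0):
        problems.append("score must be a number between 0.0 and 1.0")

    metrics = results.get("metrics")
    if not isinstance(metrics, dict):
        problems.append("metrics must be an object")
    else:
        for key in ("chars", "chars:std", "score:std"):
            if not isinstance(metrics.get(key), Real):
                problems.append(f"metrics[{key!r}] must be a number")

    htmls = results.get("htmls")
    if not isinstance(htmls, list) or not all(isinstance(h, str) for h in htmls):
        problems.append("htmls must be an array of strings")

    return problems


# Stand-in document: summary values from the Signature section, with the
# per-problem HTML reports left out (hypothetical, for illustration only).
sample = {
    "score": 0.925,
    "metrics": {"chars": 2296.19, "chars:std": 986.05, "score:std": 0.263},
    "htmls": [],
}
print(check_results_shape(sample))
```

Returning a list of messages rather than raising on the first mismatch makes it easier to report every deviation at once when comparing several results files.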

Usage Examples

import json

# Path is relative to the root of the llama.cpp repository.
path = "benches/dgx-spark/aime25_openai__gpt-oss-120b-high_temp1.0_20251109_094547_allresults.json"

with open(path, encoding="utf-8") as f:
    results = json.load(f)

# Summary fields, then the number of per-problem HTML reports.
print(f"Overall score: {results['score']}")
print(f"Mean response length: {results['metrics']['chars']:.0f} chars")
print(f"Number of problems: {len(results['htmls'])}")
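Because each htmls entry is an HTML string, reading a single problem usually starts with stripping the markup. The sketch below uses only the Python standard library and makes no assumption about the tags the reports actually contain; the names TextExtractor and html_to_text are hypothetical helpers, and the example fragment is invented for illustration rather than taken from the file.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_text(fragment):
    """Return the visible text of an HTML fragment, one text node per line."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical fragment; the real reports may use different markup.
example = (
    "<h3>Prompt</h3>"
    "<p>Find n with 0 &lt; n &lt; 10.</p>"
    "<p>Extracted answer: 7</p>"
)
print(html_to_text(example))
```

With the file loaded as in the example above, html_to_text(results["htmls"][0]) would give a plain-text view of the first report, which can then be searched for the prompt, the extracted answer, and the score.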

Related Pages

Principle:Ggml_org_Llama_cpp_Benchmarking
