Implementation:Ggml org Llama cpp AIME25 Benchmark Results
| Knowledge Sources | |
|---|---|
| Domains | Benchmarking |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
JSON data file containing per-problem results for the AIME 2025 math benchmark run on the gpt-oss-120b-high model at temperature 1.0 on DGX Spark hardware.
Description
This file is a benchmark results artifact within the llama.cpp repository's benchmarking infrastructure. It stores complete per-problem results from evaluating the gpt-oss-120b-high model on the AIME 2025 (American Invitational Mathematics Examination) problem set. The data includes individual problem prompts, model-generated responses with step-by-step reasoning, extracted answers, correct answers, scores, and character counts for each response.
Usage
Use this data file to analyze per-problem model performance on the AIME 2025 benchmark. It complements summary benchmark files by providing granular detail on each individual math problem, enabling identification of problem types where the model succeeds or struggles.
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: benches/dgx-spark/aime25_openai__gpt-oss-120b-high_temp1.0_20251109_094547_allresults.json
- Lines: 1-2896
Signature
{
"score": 0.925,
"metrics": {
"chars": 2296.19,
"chars:std": 986.05,
"score:std": 0.263
},
"htmls": [ ... ]
}
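The sample values above can be cross-checked against each other. If every per-problem score is either 0 or 1 (an assumption; the structure shown here does not state it), the population standard deviation of the scores is sqrt(p(1 - p)) for a mean score p. The sketch below applies that formula to the reported score. The helper name `bernoulli_std` is illustrative and is not part of the repository.

```python
import math

def bernoulli_std(p: float) -> float:
    """Population standard deviation of 0/1 outcomes whose mean is p."""
    return math.sqrt(p * (1.0 - p))

# Values copied from the signature above.
score = 0.925
reported_score_std = 0.263

# Assumption: per-problem scores are binary (correct = 1, incorrect = 0).
expected = bernoulli_std(score)
print(f"expected score:std = {expected:.3f}, reported = {reported_score_std}")
```

Under that assumption the formula gives about 0.263, which agrees with the reported `score:std`.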
Import
// This is a data file; load it with any JSON parser
// e.g., in Python: json.load(open("aime25_openai__gpt-oss-120b-high_temp1.0_20251109_094547_allresults.json"))
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| N/A | N/A | N/A | This is a static data file with no runtime inputs |
Outputs
| Name | Type | Description |
|---|---|---|
| score | float | Overall accuracy score across all AIME 2025 problems (0.0 to 1.0) |
| metrics | object | Aggregate statistics: chars (mean response length in characters), chars:std (standard deviation of response length), and score:std (standard deviation of the per-problem scores) |
| htmls | array of strings | Per-problem HTML reports containing prompts, model responses, correct answers, extracted answers, and individual scores |
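Before analyzing the file, it can help to confirm that a loaded object has the top-level shape listed in the table above. The following is a minimal sketch of such a check. The function name `validate_results` and the error messages are hypothetical, and the `sample` dictionary reuses the values from the signature rather than loading the real file.

```python
from typing import Any

def validate_results(data: Any) -> list[str]:
    """Return a list of problems found; an empty list means the shape looks right."""
    if not isinstance(data, dict):
        return ["top level is not a JSON object"]
    problems = []

    # score: a number in [0.0, 1.0] (bool is excluded because it subclasses int)
    score = data.get("score")
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        problems.append("score is missing or not a number")
    elif not 0.0 <= score <= 1.0:
        problems.append("score is outside [0.0, 1.0]")

    # metrics: an object with the three aggregate keys shown in the signature
    metrics = data.get("metrics")
    if not isinstance(metrics, dict):
        problems.append("metrics is missing or not an object")
    else:
        for key in ("chars", "chars:std", "score:std"):
            if not isinstance(metrics.get(key), (int, float)):
                problems.append(f"metrics[{key!r}] is missing or not a number")

    # htmls: an array of strings, one report per entry
    htmls = data.get("htmls")
    if not isinstance(htmls, list) or not all(isinstance(h, str) for h in htmls):
        problems.append("htmls is missing or not an array of strings")
    return problems

# Illustrative object built from the signature values, with a placeholder report.
sample = {
    "score": 0.925,
    "metrics": {"chars": 2296.19, "chars:std": 986.05, "score:std": 0.263},
    "htmls": ["<p>placeholder report</p>"],
}
print(validate_results(sample))
```

Returning a list of messages instead of raising on the first failure makes it easier to see every mismatch at once when comparing result files from different runs.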
Usage Examples
import json
# Path is relative to the repository root
with open("benches/dgx-spark/aime25_openai__gpt-oss-120b-high_temp1.0_20251109_094547_allresults.json") as f:
    results = json.load(f)
# Top-level summary fields
print(f"Overall score: {results['score']}")
print(f"Mean response length: {results['metrics']['chars']:.0f} chars")
print(f"Number of problems: {len(results['htmls'])}")
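Each entry in `htmls` is an HTML string, so reading it as plain text requires stripping the markup. The internal layout of those reports is not documented on this page, so the sketch below makes no assumption about specific tags and simply collects the text nodes using the standard-library `html.parser` module. The names `TextExtractor` and `html_to_text` are hypothetical, and the `example` string is a synthetic stand-in, not content copied from the file.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the non-empty text nodes of an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:
            self.chunks.append(text)

def html_to_text(report: str) -> str:
    """Return the text content of one report, one text node per line."""
    parser = TextExtractor()
    parser.feed(report)
    parser.close()
    return "\n".join(parser.chunks)

# Synthetic stand-in for one entry of results["htmls"].
example = "<h3>Prompt</h3><p>What is 2 + 2?</p><p>Score: 1.0</p>"
print(html_to_text(example))
```

With the real data, the same function could be applied to each element of `results["htmls"]`. Any further parsing (for example, pulling out the extracted answer or the per-problem score) would depend on the actual report markup, which should be inspected first.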