Implementation:Ggml org Llama cpp AIME25 Benchmark Results

From Leeroopedia
Revision as of 12:38, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Ggml_org_Llama_cpp_AIME25_Benchmark_Results.md)
Knowledge Sources
Domains: Benchmarking
Last Updated: 2026-02-15 00:00 GMT

Overview

JSON data file containing per-problem results for the AIME 2025 math benchmark run on the gpt-oss-120b-high model at temperature 1.0 on DGX Spark hardware.

Description

This file is a benchmark results artifact within the llama.cpp repository's benchmarking infrastructure. It stores complete per-problem results from evaluating the gpt-oss-120b-high model on the AIME 2025 (American Invitational Mathematics Examination) problem set. The data includes individual problem prompts, model-generated responses with step-by-step reasoning, extracted answers, correct answers, scores, and character counts for each response.

Usage

Use this data file to analyze per-problem model performance on the AIME 2025 benchmark. It complements summary benchmark files by providing granular detail on each individual math problem, enabling identification of problem types where the model succeeds or struggles.

Code Reference

Source Location

  • Repository: Ggml_org_Llama_cpp
  • File: benches/dgx-spark/aime25_openai__gpt-oss-120b-high_temp1.0_20251109_094547_allresults.json
  • Lines: 1-2896

Signature

{
  "score": 0.925,
  "metrics": {
    "chars": 2296.19,
    "chars:std": 986.05,
    "score:std": 0.263
  },
  "htmls": [ ... ]
}
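
As a rough illustration of the shape above, the following sketch checks that a parsed object has the three top-level keys with the expected types. The function name `check_results_shape` and the inline `sample` are hypothetical; the sample copies the values shown in the Signature, with an empty `htmls` list standing in for the elided entries.

```python
def check_results_shape(results):
    """Return a list of problems found in a parsed results object (empty if none)."""
    problems = []
    if not isinstance(results.get("score"), (int, float)):
        problems.append("score must be a number")
    metrics = results.get("metrics")
    if not isinstance(metrics, dict):
        problems.append("metrics must be an object")
    else:
        for key in ("chars", "chars:std", "score:std"):
            if not isinstance(metrics.get(key), (int, float)):
                problems.append(f"metrics[{key!r}] must be a number")
    htmls = results.get("htmls")
    if not isinstance(htmls, list) or not all(isinstance(h, str) for h in htmls):
        problems.append("htmls must be a list of strings")
    return problems


# Hypothetical sample mirroring the Signature; htmls is left empty here.
sample = {
    "score": 0.925,
    "metrics": {"chars": 2296.19, "chars:std": 986.05, "score:std": 0.263},
    "htmls": [],
}
print(check_results_shape(sample))  # prints []
```

Returning a list of messages rather than raising on the first mismatch makes it easier to see everything that is off when comparing several result files.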

Import

# This is a data file; load it with any JSON parser, e.g. in Python:
import json
with open("aime25_openai__gpt-oss-120b-high_temp1.0_20251109_094547_allresults.json") as f:
    results = json.load(f)

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| N/A | N/A | N/A | This is a static data file with no runtime inputs |

Outputs

| Name | Type | Description |
|------|------|-------------|
| score | float | Overall accuracy score across all AIME 2025 problems (0.0 to 1.0) |
| metrics | object | Aggregate statistics: mean response length in characters (chars), its standard deviation (chars:std), and the standard deviation of the score (score:std) |
| htmls | array of strings | Per-problem HTML reports containing prompts, model responses, correct answers, extracted answers, and individual scores |
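
One way to read `score:std` alongside `score`: if every individual score is either 0 or 1 (an assumption; this page does not state the per-problem scoring scale), the population standard deviation of those scores equals sqrt(p * (1 - p)), where p is the mean score. The sketch below applies that formula to the values from the Signature. The helper name `binary_score_std` is hypothetical.

```python
import math


def binary_score_std(mean_score):
    """Population standard deviation of 0/1 scores with the given mean."""
    return math.sqrt(mean_score * (1.0 - mean_score))


score = 0.925         # "score" from the Signature
reported_std = 0.263  # "score:std" from the Signature

expected_std = binary_score_std(score)
print(f"sqrt(p * (1 - p))  = {expected_std:.3f}")
print(f"reported score:std = {reported_std:.3f}")
```

Under that assumption the two values agree to three decimal places, which is consistent with `score:std` describing the spread of individual correct/incorrect outcomes.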

Usage Examples

import json

with open("benches/dgx-spark/aime25_openai__gpt-oss-120b-high_temp1.0_20251109_094547_allresults.json") as f:
    results = json.load(f)

print(f"Overall score: {results['score']}")
print(f"Mean response length: {results['metrics']['chars']:.0f} chars")
print(f"Number of problems: {len(results['htmls'])}")
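
Because each `htmls` entry is an HTML string, it can help to reduce it to plain text before searching or counting. The sketch below uses the standard library's `html.parser` for that. The `sample_report` string is invented for illustration and does not reflect the actual markup in the file, which this page does not document; `TextExtractor` and `html_to_text` are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_text(fragment):
    """Return the text content of an HTML fragment as one space-joined string."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical report fragment; the real entries may be structured differently.
sample_report = "<h3>Prompt</h3><p>Find x.</p><h3>Score</h3><p>1.0</p>"
print(html_to_text(sample_report))  # prints: Prompt Find x. Score 1.0
```

With the real file loaded as `results`, `[html_to_text(h) for h in results["htmls"]]` would give one plain-text report per entry.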
