
Implementation:Mbzuai oryx Awesome LLM Post training Json Load Corpus

From Leeroopedia


Knowledge Sources
Domains: Curation, Data_Ingestion
Last Updated: 2026-02-08 07:30 GMT

Overview

Concrete tool for loading the collected paper corpus from a JSON data file for manual curation review.

Description

The json.load call reads assets/2000+papers.json, a 37,062-line JSON file containing metadata for 2000+ academic papers collected by the deep paper collection pipeline. The file is structured as an object keyed by Semantic Scholar paper IDs, with each value containing Title, Authors, Abstract, TL;DR, Publication Year, Venue, Link, References, and Citations fields.
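
To make that layout concrete, the sketch below parses a tiny inline stand-in for the file and reports which of the documented fields a record lacks. The paper ID and field values are invented placeholders, and `EXPECTED_FIELDS` and `missing_fields` are hypothetical names introduced for illustration, not part of the repository. The venue key is assumed to be spelled "Venue (Conference/Journal)", as in the metadata table and usage example on this page.

```python
import json

# Field names as documented for each paper record on this page.
EXPECTED_FIELDS = (
    "Title", "Authors", "Abstract", "TL;DR", "Publication Year",
    "Venue (Conference/Journal)", "Link", "References", "Citations",
)

def missing_fields(record):
    """Return the documented fields that are absent from one paper record."""
    return [name for name in EXPECTED_FIELDS if name not in record]

# Invented stand-in for the real file: one object keyed by a placeholder paper ID.
SAMPLE_JSON = """
{
  "placeholder-id-1": {
    "Title": "An Example Paper",
    "Authors": "A. Author, B. Author",
    "Publication Year": 2024,
    "Link": "https://example.org/paper"
  }
}
"""

sample_corpus = json.loads(SAMPLE_JSON)
for paper_id, record in sample_corpus.items():
    print(paper_id, "is missing:", missing_fields(record))
```

Running the same check over the real corpus would be one way to spot incomplete records before manual review; whether any records are actually incomplete is not stated in the source.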

This is a Wrapper Doc for the json.load function from Python's standard-library json module, documenting its specific usage within this repository's curation workflow.

Usage

Load this file at the start of the curation process to access the full collected corpus. The loaded dictionary is browsed manually to identify papers for inclusion in the awesome list.
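
Manual browsing can be narrowed with a simple keyword filter. The following is a minimal sketch, not the repository's own tooling: `find_papers` is a hypothetical helper, and `demo_corpus` holds invented records that stand in for the dictionary returned by json.load.

```python
def find_papers(corpus, keyword):
    """Return (paper_id, title) pairs whose title or abstract mentions keyword."""
    needle = keyword.lower()
    hits = []
    for paper_id, metadata in corpus.items():
        # Assumption: a field may be absent or null, so fall back to "".
        title = metadata.get("Title") or ""
        abstract = metadata.get("Abstract") or ""
        if needle in title.lower() or needle in abstract.lower():
            hits.append((paper_id, title))
    return hits

# Invented records standing in for the loaded corpus.
demo_corpus = {
    "id-a": {"Title": "Reward Modeling Basics", "Abstract": "We study preference data."},
    "id-b": {"Title": "A Survey of Tokenizers", "Abstract": None},
}

for paper_id, title in find_papers(demo_corpus, "reward"):
    print(paper_id, title)
```

The match is a case-insensitive substring test, so it only produces a shortlist to read through; the inclusion decision itself stays manual.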

Code Reference

Source Location

Signature

import json

with open("assets/2000+papers.json", "r", encoding="utf-8") as f:
    corpus = json.load(f)
# corpus: dict keyed by Semantic Scholar paper IDs

Import

import json

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| file_path | str | Yes | Path to the collected papers JSON file ("assets/2000+papers.json") |

Outputs

| Name | Type | Description |
|---|---|---|
| corpus | dict | Dictionary keyed by Semantic Scholar paper IDs, each value containing paper metadata |

Paper Metadata Structure:

| Key | Type | Description |
|---|---|---|
| Title | str | Paper title |
| Authors | str | Comma-separated author names |
| Abstract | str | Paper abstract |
| TL;DR | str | Auto-generated summary |
| Publication Year | int or str | Year of publication |
| Venue (Conference/Journal) | str | Publication venue |
| Link | str | URL to the paper |
| References | list | Nested reference paper details |
| Citations | list | Nested citing paper details |
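
Two of the fields above need light normalization before they can be sorted or compared: Publication Year may be an int or a str, and Authors is a single comma-separated string. The helpers below are a hedged sketch; `parse_year` and `split_authors` are hypothetical names, and the splitting assumes that no individual author name contains a comma.

```python
def parse_year(value):
    """Return the publication year as an int, or None if it cannot be parsed."""
    try:
        return int(str(value).strip())
    except (TypeError, ValueError):
        return None

def split_authors(authors):
    """Split the comma-separated Authors string into a list of names."""
    if not authors:
        return []
    # Assumption: individual names contain no commas.
    return [name.strip() for name in authors.split(",") if name.strip()]

# Invented sample values, not taken from the real corpus.
print(parse_year(2023), parse_year(" 2021 "), parse_year(None))
print(split_authors("A. Author, B. Author"))
```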

Usage Examples

Load and Browse Corpus

import json

# Load the collected paper corpus
with open("assets/2000+papers.json", "r", encoding="utf-8") as f:
    corpus = json.load(f)

print(f"Total papers in corpus: {len(corpus)}")

# Browse papers by venue, guarding against missing or null fields
for paper_id, metadata in corpus.items():
    venue = metadata.get("Venue (Conference/Journal)") or ""
    if "NeurIPS" in venue:
        print(f"  {metadata.get('Title')} ({metadata.get('Publication Year')})")
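
A per-year tally is another quick way to get oriented in the corpus before reading individual entries. This sketch assumes only the documented "Publication Year" field; `papers_per_year` is a hypothetical helper and `demo_corpus` is invented data. Years are converted to strings because the field may be stored as either an int or a str.

```python
from collections import Counter

def papers_per_year(corpus):
    """Count records by the string form of their Publication Year field."""
    return Counter(
        str(metadata.get("Publication Year", "unknown"))
        for metadata in corpus.values()
    )

# Invented records standing in for the loaded corpus.
demo_corpus = {
    "id-a": {"Title": "Paper A", "Publication Year": 2024},
    "id-b": {"Title": "Paper B", "Publication Year": "2024"},
    "id-c": {"Title": "Paper C"},
}

counts = papers_per_year(demo_corpus)
for year, count in sorted(counts.items()):
    print(year, count)
```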

Related Pages

Implements Principle

Requires Environment
