Implementation:Mbzuai oryx Awesome LLM Post training Json Load Corpus
| Knowledge Sources | |
|---|---|
| Domains | Curation, Data_Ingestion |
| Last Updated | 2026-02-08 07:30 GMT |
Overview
Concrete tool for loading the collected paper corpus from a JSON data file for manual curation review.
Description
The json.load call reads assets/2000+papers.json, a 37,062-line JSON file containing metadata for more than 2,000 academic papers collected by the deep paper collection pipeline. The file is a single JSON object keyed by Semantic Scholar paper ID; each value is an object with the fields Title, Authors, Abstract, TL;DR, Publication Year, Venue (Conference/Journal), Link, References, and Citations.
This is a Wrapper Doc for Python's built-in json.load function, documenting its specific usage within this repository's curation workflow.
Usage
Load this file at the start of the curation process to access the full collected corpus. The loaded dictionary is browsed manually to identify papers for inclusion in the awesome list.
Code Reference
Source Location
- Repository: Awesome-LLM-Post-training
- File: assets/2000+papers.json (data file, 37,062 lines)
Signature
import json
with open("assets/2000+papers.json", "r", encoding="utf-8") as f:
    corpus = json.load(f)
# corpus: dict keyed by Semantic Scholar paper IDs
Import
import json
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| file_path | str | Yes | Path to the collected papers JSON file ("assets/2000+papers.json") |
Outputs
| Name | Type | Description |
|---|---|---|
| corpus | dict | Dictionary keyed by Semantic Scholar paper IDs, each value containing paper metadata |
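The contract above can be wrapped in a small helper that fails early if the file does not hold a top-level JSON object. This is a minimal sketch, not code from the repository: the name load_corpus and the TypeError check are assumptions made for illustration.

```python
import json
from pathlib import Path


def load_corpus(file_path: str) -> dict:
    """Load the corpus file and confirm the top-level JSON value is an object.

    load_corpus is a hypothetical helper; the repository itself only
    documents a direct json.load call.
    """
    with Path(file_path).open("r", encoding="utf-8") as f:
        corpus = json.load(f)
    if not isinstance(corpus, dict):
        # The page describes an object keyed by paper ID; anything else
        # (for example a JSON array) is treated as an error here.
        raise TypeError(
            f"expected a JSON object at top level, got {type(corpus).__name__}"
        )
    return corpus
```

Calling load_corpus("assets/2000+papers.json") from the repository root would then return the same dictionary as the Signature snippet, assuming the file matches the description.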
Paper Metadata Structure:
| Key | Type | Description |
|---|---|---|
| Title | str | Paper title |
| Authors | str | Comma-separated author names |
| Abstract | str | Paper abstract |
| TL;DR | str | Auto-generated summary |
| Publication Year | int or str | Year of publication |
| Venue (Conference/Journal) | str | Publication venue |
| Link | str | URL to the paper |
| References | list | Nested reference paper details |
| Citations | list | Nested citing paper details |
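Before manual review, it can help to check which records lack some of the keys listed in the table above. The sketch below assumes the key names are spelled exactly as in the table; missing_fields, audit_corpus, and the two sample records are hypothetical and exist only for illustration.

```python
# Keys copied from the Paper Metadata Structure table.
EXPECTED_KEYS = (
    "Title",
    "Authors",
    "Abstract",
    "TL;DR",
    "Publication Year",
    "Venue (Conference/Journal)",
    "Link",
    "References",
    "Citations",
)


def missing_fields(record: dict) -> list:
    """Return the expected keys that are absent from one paper record."""
    return [key for key in EXPECTED_KEYS if key not in record]


def audit_corpus(corpus: dict) -> dict:
    """Map each paper ID to its missing keys, skipping complete records."""
    report = {}
    for paper_id, record in corpus.items():
        gaps = missing_fields(record)
        if gaps:
            report[paper_id] = gaps
    return report


# Hypothetical two-record corpus, for illustration only (not real data).
sample_corpus = {
    "id-complete": {key: "" for key in EXPECTED_KEYS},
    "id-partial": {"Title": "Example title", "Link": "https://example.org"},
}
print(audit_corpus(sample_corpus))
```

On the real file, passing the loaded corpus to audit_corpus would give a quick list of records that need closer inspection during curation.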
Usage Examples
Load and Browse Corpus
import json
# Load the collected paper corpus
with open("assets/2000+papers.json", "r", encoding="utf-8") as f:
    corpus = json.load(f)
print(f"Total papers in corpus: {len(corpus)}")
# Browse papers by venue
for paper_id, metadata in corpus.items():
    # Treat a missing or null venue as an empty string
    venue = metadata.get("Venue (Conference/Journal)") or ""
    if "NeurIPS" in venue:
        print(f"  {metadata.get('Title')} ({metadata.get('Publication Year')})")
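Because Publication Year may be stored as either an int or a str, filtering by year is easier after normalizing the value. The following is a sketch under that assumption; parse_year, select_papers, and the sample records are hypothetical names and data, not part of the repository.

```python
def parse_year(value):
    """Return the year as an int, or None if the value cannot be parsed."""
    try:
        return int(str(value).strip())
    except (TypeError, ValueError):
        return None


def select_papers(corpus: dict, venue_substring: str, min_year: int) -> list:
    """Collect (paper_id, title) pairs matching a venue and a minimum year."""
    selected = []
    for paper_id, metadata in corpus.items():
        # A missing or null venue is treated as an empty string.
        venue = metadata.get("Venue (Conference/Journal)") or ""
        year = parse_year(metadata.get("Publication Year"))
        if venue_substring in venue and year is not None and year >= min_year:
            selected.append((paper_id, metadata.get("Title", "<untitled>")))
    return selected


# Hypothetical records, for illustration only (not real data).
sample = {
    "a": {"Title": "A", "Venue (Conference/Journal)": "NeurIPS 2023", "Publication Year": "2023"},
    "b": {"Title": "B", "Venue (Conference/Journal)": "NeurIPS", "Publication Year": 2019},
    "c": {"Title": "C", "Venue (Conference/Journal)": None, "Publication Year": 2024},
    "d": {"Title": "D", "Venue (Conference/Journal)": "ICML", "Publication Year": "n/a"},
}
print(select_papers(sample, "NeurIPS", 2020))
```

Records whose year cannot be parsed are skipped rather than raising an error, which suits a browsing pass where incomplete metadata is expected.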