Implementation:Mbzuai oryx Awesome LLM Post training Json Normalize Excel Export
| Knowledge Sources | |
|---|---|
| Domains | Data_Collection, Data_Export |
| Last Updated | 2026-02-08 07:30 GMT |
Overview
Concrete tool for exporting collected paper data to a JSON file (via the standard json module) and to a flattened Excel spreadsheet (via pandas).
Description
The export block in deep_collection_sementic.py performs two operations. First, it writes the complete data list to a JSON file using json.dump with indent=4. Second, it uses pd.json_normalize to flatten the nested paper dictionaries into a tabular DataFrame, then exports that DataFrame to Excel via df.to_excel with index=False. The result is a machine-readable JSON archive and a human-browsable spreadsheet of the collected corpus.
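To illustrate the flattening step, here is a minimal sketch of what pd.json_normalize does to nested dictionaries. The records and field names (paperId, title, journal) are hypothetical stand-ins, not the script's actual schema.

```python
import pandas as pd

# Hypothetical records; the real field names depend on the collection pipeline.
sample = [
    {"paperId": "a1", "title": "Paper A", "journal": {"name": "J1", "volume": "12"}},
    {"paperId": "b2", "title": "Paper B"},  # no "journal" key at all
]

# Nested dict keys are joined to their parent key with "." (the default separator).
df = pd.json_normalize(sample)

print(sorted(df.columns))
print(df.shape)
```

Records that lack a nested key still get a row; the corresponding dotted columns are simply filled with missing values.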
Usage
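Execute this export step after the main collection loop completes. It requires the complete data list of paper detail dictionaries produced by the fetch pipeline. Writing the .xlsx file also requires an Excel writer engine, such as openpyxl, to be installed alongside pandas.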
Code Reference
Source Location
- Repository: Awesome-LLM-Post-training
- File: scripts/deep_collection_sementic.py
- Lines: 130-144
Signature
# Final export block (not a function; inline script logic)
# JSON export
json_filename = "papers.json"
with open(json_filename, "w", encoding="utf-8") as f:
    json.dump(data, f, indent=4)
# Excel export via pandas normalization
df = pd.json_normalize(data)
df.to_excel("papers.xlsx", index=False)
Import
import json
import pandas as pd
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data | list[dict] | Yes | Complete list of paper detail dicts from the collection pipeline |
Outputs
| Name | Type | Description |
|---|---|---|
| papers.json | File | Full JSON export with nested structure preserved (indent=4) |
| papers.xlsx | File | Flattened Excel spreadsheet with one row per paper; nested dict fields become dotted column names |
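One consequence of the flattening described above is that pd.json_normalize, when called without record_path, expands nested dicts but leaves list values as single Python objects in one cell. The sketch below shows one possible way to turn such a column into plain text before export. The authors field and the join_names helper are hypothetical and are not part of the original script.

```python
import pandas as pd

# Hypothetical records with a list-valued field.
records = [
    {"title": "Paper A", "authors": [{"name": "Ada"}, {"name": "Bo"}]},
    {"title": "Paper B", "authors": []},
]

df = pd.json_normalize(records)

# The list is kept whole in the "authors" cell; it is not split into columns.
first_cell = df.loc[0, "authors"]
print(type(first_cell))


def join_names(authors):
    """Hypothetical helper: collapse a list of {"name": ...} dicts into one string."""
    if not isinstance(authors, list):
        return ""
    return "; ".join(a.get("name", "") for a in authors)


# Replace the list objects with readable text so the spreadsheet cell is predictable.
df["authors"] = df["authors"].apply(join_names)
print(df["authors"].tolist())
```

Converting list cells explicitly keeps the spreadsheet output independent of how a given Excel writer engine chooses to render Python objects.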
Usage Examples
Standard Export After Collection
import json
import pandas as pd
# Assume 'data' is the collected list of paper dicts
json_filename = "papers.json"
with open(json_filename, "w", encoding="utf-8") as f:
    json.dump(data, f, indent=4)
print(f"JSON saved: {json_filename}")
# Flatten nested structure and export to Excel
df = pd.json_normalize(data)
df.to_excel("papers.xlsx", index=False)
print(f"Excel saved: papers.xlsx ({len(df)} rows)")
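A detail worth knowing about the JSON step: json.dump defaults to ensure_ascii=True, so non-ASCII characters in titles or author names are written as \uXXXX escapes even when the file is opened with encoding="utf-8". The following sketch, which uses a hypothetical record and is a possible variation rather than the script's behavior, compares the two settings.

```python
import json

# Hypothetical record containing non-ASCII characters.
record = [{"title": "Résumé of méthodes"}]

# Default behavior (as in the export block): non-ASCII characters are escaped.
escaped = json.dumps(record, indent=4)

# Optional variation: keep the characters readable in the UTF-8 output.
readable = json.dumps(record, indent=4, ensure_ascii=False)

print(escaped)
print(readable)
```

Both forms decode to the same data with json.loads, so the choice only affects how the archive looks when opened in a text editor.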