# Heuristic: Checkpoint Every 3 Papers (mbzuai-oryx / Awesome-LLM-Post-training)
| Knowledge Sources | Details |
|---|---|
| Domains | Data_Collection, Fault_Tolerance |
| Last Updated | 2026-02-08 08:00 GMT |
## Overview
Data loss prevention technique that saves intermediate results to a JSON checkpoint file every 3 papers during deep paper collection.
## Description
During the deep paper collection pipeline, the script processes papers one at a time through recursive API calls that can take significant time. To prevent data loss from crashes, network failures, or rate-limit exhaustion, the script saves all collected data to a temporary JSON file (papers_temp.json) every 3 papers. This provides a recovery point that limits maximum data loss to 2 papers.
## Usage
Apply this heuristic when building long-running data collection pipelines that process items incrementally. A checkpoint interval of 3 balances I/O overhead (from too-frequent saves) against data-loss risk (from too-infrequent saves).
## The Insight (Rule of Thumb)
- Action: Save intermediate results to a temporary JSON file at regular intervals during iterative data collection.
- Value: Checkpoint every 3 items processed (configurable based on processing time per item).
- Trade-off: Frequent checkpointing adds minor I/O overhead but limits data loss to at most N-1 items (where N is the checkpoint interval). For expensive API-based collection, this overhead is negligible compared to the cost of re-fetching data.
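A minimal, self-contained sketch of the pattern. The names (`collect`, `fetch`, `items_temp.json`, `CHECKPOINT_INTERVAL`) are illustrative, not from the source script; the sketch also includes the resume step that makes the checkpoint useful after a crash, which the source snippet below does not show:

```python
import json
import os

CHECKPOINT_PATH = "items_temp.json"  # hypothetical path, not from the source script
CHECKPOINT_INTERVAL = 3              # items processed between saves


def load_checkpoint():
    """Resume from a previous run if a checkpoint file exists."""
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH, "r", encoding="utf-8") as f:
            return json.load(f)
    return []


def collect(items, fetch):
    """Process items with `fetch`, checkpointing every CHECKPOINT_INTERVAL results."""
    data = load_checkpoint()
    done = {d["id"] for d in data}
    for item in items:
        if item in done:
            continue  # already collected in a previous run; skip the expensive fetch
        data.append({"id": item, "result": fetch(item)})
        if len(data) % CHECKPOINT_INTERVAL == 0:
            # Rewrite the full checkpoint; at most CHECKPOINT_INTERVAL - 1
            # items of work can be lost on a crash.
            with open(CHECKPOINT_PATH, "w", encoding="utf-8") as f:
                json.dump(data, f, indent=4)
            print(f"Saved {len(data)} items (Temporary)")
    return data
```

On restart after a crash, `collect` reloads the checkpoint and skips already-fetched items, so only the uncheckpointed tail is re-fetched.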
## Reasoning
Each paper in the deep collection pipeline triggers recursive reference and citation fetching, which can involve hundreds of API calls and take several minutes per paper. Losing all progress due to a crash or rate-limit lockout would be costly. The choice of 3 as the checkpoint interval balances:
- I/O cost: Writing a growing JSON file is fast (milliseconds) compared to per-paper API fetching (minutes).
- Data loss risk: At most 2 papers worth of API calls are lost on crash.
- File size: The checkpoint holds everything collected so far, and each save rewrites the entire file (it is not appended), so write time grows slowly as the collection grows.
Contrast with future_research_data.py: The trend analysis script uses per-keyword checkpointing instead (saving after each keyword is fully processed), because each keyword involves multiple year queries that complete quickly.
Code evidence from `scripts/deep_collection_sementic.py:123-128`:
```python
# Save every 3 papers to avoid data loss
if len(data) % 3 == 0:
    with open("papers_temp.json", "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)
    print(f"Saved {len(data)} papers (Temporary)")
```
Contrast: per-keyword progressive save from `scripts/future_research_data.py:63-65`:
```python
# Progressively save results to JSON
with open(json_path, "w") as json_file:
    json.dump(results_dict, json_file, indent=4)
```
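One caveat with both snippets: opening the checkpoint with `"w"` truncates it before the dump begins, so a crash mid-write can corrupt the only recovery point. A common hardening, sketched here as a suggestion rather than something either source script does, is to dump to a side file and atomically rename it over the checkpoint:

```python
import json
import os


def save_checkpoint(data, path="papers_temp.json"):
    """Write the checkpoint atomically: dump to a side file, then rename.

    os.replace is atomic on both POSIX and Windows, so a crash mid-write
    leaves the previous checkpoint intact instead of a truncated file.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)
    os.replace(tmp_path, path)
```

The extra rename costs microseconds, which is negligible next to minutes of API fetching per paper.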