Heuristic:ThreeSR Awesome Inference Time Scaling Date Parsing Fallback Tip
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Chronological_Sorting |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Graceful degradation strategy for paper entries with missing or malformed dates: unparseable dates default to datetime.min, so those entries sort to the end of the list instead of crashing the script.
Description
The parse_date_from_block() function extracts dates from paper entry blocks using the regex pattern -\s*🗓️\s*\*\*Date:\*\*\s*([\d]{4}-[\d]{2}-[\d]{2}). When the regex does not match (e.g., missing date field, non-standard format) or when datetime.strptime() fails to parse the extracted string, the function returns None.
The calling function (write_to_readme_in_sorted_order()) handles None by substituting datetime.min, which ensures that entries with unparseable dates are placed at the end of the sorted list (since sorting is descending by date, datetime.min is the lowest possible value).
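The effect of the fallback on sort order can be sketched as follows. This is a minimal illustration rather than the script's actual code: the sample entries and the sort_key helper are hypothetical, the calendar emoji is written as an escape sequence for robustness, and the sort is assumed to be descending by date as described above.

```python
import re
from datetime import datetime

# The calendar emoji used on date lines, spelled out as code points.
CALENDAR = "\U0001F5D3\uFE0F"
DATE_RE = re.compile(r'-\s*' + CALENDAR + r'\s*\*\*Date:\*\*\s*(\d{4}-\d{2}-\d{2})')

def sort_key(block):
    """Hypothetical helper: the parsed date, or datetime.min if parsing fails."""
    match = DATE_RE.search(block)
    if match:
        try:
            return datetime.strptime(match.group(1), '%Y-%m-%d')
        except ValueError:
            pass  # well-formed digits but not a real calendar date
    return datetime.min

# Hypothetical sample entries: two valid dates, one missing, one impossible.
entries = [
    f"Paper A\n- {CALENDAR} **Date:** 2024-03-01",
    f"Paper B\n- {CALENDAR} **Date:** unknown",
    f"Paper C\n- {CALENDAR} **Date:** 2025-01-15",
    f"Paper D\n- {CALENDAR} **Date:** 2024-13-45",
]

# Newest first; entries keyed with datetime.min fall to the end.
ordered = sorted(entries, key=sort_key, reverse=True)
titles = [entry.split("\n")[0] for entry in ordered]
print(titles)
```

Papers B and D share the same fallback key, so their relative order is simply the order in which they appeared in the input, because Python's sort is stable.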
Usage
Use this heuristic when:
- Manually adding paper entries where the publication date is unknown or not in YYYY-MM-DD format.
- Debugging sort order issues where a paper appears at the bottom of the list unexpectedly.
- Understanding the script's fault tolerance -- the script will not crash on malformed date fields.
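When debugging an entry that unexpectedly sinks to the bottom, one option is to list the blocks whose date line does not parse. The sketch below is a standalone check and not part of the script: find_undated and the sample blocks are hypothetical, and the pattern mirrors the one given in the Description.

```python
import re
from datetime import datetime

# The calendar emoji used on date lines, spelled out as code points.
CALENDAR = "\U0001F5D3\uFE0F"
DATE_RE = re.compile(r'-\s*' + CALENDAR + r'\s*\*\*Date:\*\*\s*(\d{4}-\d{2}-\d{2})')

def find_undated(blocks):
    """Return (index, first line) for every block without a parseable date."""
    problems = []
    for index, block in enumerate(blocks):
        match = DATE_RE.search(block)
        try:
            if match is None:
                raise ValueError("no date line matched")
            datetime.strptime(match.group(1), '%Y-%m-%d')
        except ValueError:
            first_line = block.splitlines()[0] if block else ""
            problems.append((index, first_line))
    return problems

# Hypothetical sample blocks: one valid, one with a non-standard date line,
# one with a date that matches the pattern but does not exist.
blocks = [
    f"Paper A\n- {CALENDAR} **Date:** 2024-03-01",
    "Paper B\n- Date: March 2024",
    f"Paper C\n- {CALENDAR} **Date:** 2024-02-30",
]
undated = find_undated(blocks)
print(undated)
```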
The Insight (Rule of Thumb)
- Action: Entries with missing or malformed dates are assigned datetime.min (year 1, January 1) as their sort key.
- Value: These entries will always appear at the end of the chronologically sorted list (newest-first order).
- Trade-off: No crash or data loss, but entries with bad dates may be "hidden" at the bottom of a long list. No warning is logged when the fallback occurs; the only output is a print statement when strptime itself raises an exception, so a date line that fails the regex produces no message at all.
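If the silent fallback is a concern, the sort step could report each entry that receives it. The following is a sketch of one possible variant, not the script's current behavior: date_or_fallback is a hypothetical wrapper, and its parse argument stands in for parse_date_from_block().

```python
import logging
from datetime import datetime

logger = logging.getLogger("date_fallback")

def date_or_fallback(entry, parse):
    """Hypothetical wrapper: return parse(entry), or datetime.min with a warning."""
    dt = parse(entry)
    if dt is None:
        stripped = entry.strip()
        first_line = stripped.splitlines()[0] if stripped else "<empty entry>"
        # Make the fallback visible so the entry is not silently buried.
        logger.warning("No parseable date, sorting to the end: %s", first_line)
        return datetime.min
    return dt
```

The sort loop would then call date_or_fallback(entry, parse_date_from_block) in place of the explicit None check, keeping the ordering identical while leaving a trace for each affected entry.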
Reasoning
Robustness over strictness: when curating a list of hundreds of papers, some entries may have incomplete metadata from the Semantic Scholar API (e.g., preprints without an official publication date). Rather than failing the entire merge operation for one bad entry, the script silently degrades by sorting the problematic entry to the end.
Date extraction logic (fetch_semantic_info.py:77-89):
def parse_date_from_block(block):
    """
    Extract the date from the markdown block of a paper entry.
    Expected date line format: - 🗓️ **Date:** YYYY-MM-DD
    """
    match = re.search(r'-\s*🗓️\s*\*\*Date:\*\*\s*([\d]{4}-[\d]{2}-[\d]{2})', block)
    if match:
        date_str = match.group(1)
        try:
            return datetime.strptime(date_str, '%Y-%m-%d')
        except Exception as e:
            print(f"Error parsing date format: {e}")
    return None
Fallback handling in the sort step (fetch_semantic_info.py:170-175):
for entry in all_entries:
    dt = parse_date_from_block(entry)
    # If the date cannot be parsed, set it to a very early date so that it appears at the end
    if dt is None:
        dt = datetime.min
    merged_entries.append((dt, entry))
The inline comment explicitly documents the design intent: "set it to a very early date so that it appears at the end".