Implementation:Datajuicer Data juicer ArxivDownloader
| Knowledge Sources | |
|---|---|
| Domains | Data Acquisition, LaTeX Processing, Arxiv |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Downloads arxiv source tar files from S3, iterates through their LaTeX content, cleans and extracts plain text from TeX files, and produces a HuggingFace Dataset of the extracted text.
Description
This module provides a complete pipeline for acquiring and cleaning arxiv LaTeX source data, a key training data source for scientific and mathematical LLM applications. The iterator and extractor code is adapted from the RedPajama repository.
Key Classes:
- ArxivDownloader -- Uses `s5cmd` (with `--request-payer=requester`) to download tar files from the `s3://arxiv/src` bucket. Skips files that have already been downloaded.
- ArxivIterator -- Extracts nested tar/gzip archives into temporary directories, loads `.tex` files, decodes their content as UTF-8, and formats arxiv IDs according to the official specification (the old scheme used up to March 2007 and the new scheme used afterwards).
- ArxivExtractor -- Cleans LaTeX content by: (1) removing everything before the first section header, (2) removing full-line and inline comments, (3) removing content after `\appendix` or `\bibliography`, (4) inline-expanding `\newcommand` and `\def` macros that take no arguments.
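The four cleaning steps listed for ArxivExtractor can be illustrated with a small regex-based sketch. This is not the module's code: the function name `clean_tex`, the set of header commands treated as "section headers", and the restriction to macro bodies without nested braces are all assumptions made to keep the example short.

```python
import re

# Header commands that mark the start of the main body (assumed set).
_SECTION_RE = re.compile(r"\\(?:chapter|part|section|subsection)\*?\{")
# Markers after which all remaining content is dropped.
_TAIL_RE = re.compile(r"\\appendix\b|\\begin\{thebibliography\}|\\bibliography\{")
# Argument-free definitions: \newcommand{\foo}{body} or \def\foo{body}.
# Simplification: the body must not contain nested braces.
_MACRO_RE = re.compile(
    r"\\(?:newcommand\*?\s*\{(\\[a-zA-Z]+)\}|def\s*(\\[a-zA-Z]+))\s*\{([^{}]*)\}"
)


def clean_tex(source: str) -> str:
    """Hypothetical helper mirroring the four cleaning steps."""
    # Collect argument-free macros first, since they usually sit in the preamble.
    macros = {}
    for m in _MACRO_RE.finditer(source):
        macros[m.group(1) or m.group(2)] = m.group(3)

    # (1) Drop everything before the first section header.
    start = _SECTION_RE.search(source)
    if start is None:
        return ""
    text = source[start.start():]

    # (2) Remove full-line comments, then inline comments (an unescaped %).
    text = re.sub(r"(?m)^[ \t]*%.*\n?", "", text)
    text = re.sub(r"(?<!\\)%.*", "", text)

    # (3) Drop everything from \appendix or the bibliography onwards.
    tail = _TAIL_RE.search(text)
    if tail is not None:
        text = text[: tail.start()]

    # (4) Remove any remaining definitions, then expand each macro in place.
    text = _MACRO_RE.sub("", text)
    for name, body in macros.items():
        pattern = re.escape(name) + r"(?![a-zA-Z])"
        text = re.sub(pattern, lambda _m, b=body: b, text)
    return text.strip()
```

A call such as `clean_tex(open("main.tex", encoding="utf-8").read())` would return only the body text, or an empty string when no section header is found. Escaped percent signs (`\%`) are left alone by the lookbehind in step (2).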
Top-level Function:
- download_arxiv() -- Orchestrates the three components via the `download_and_extract` pipeline from the downloader module, producing a HuggingFace Dataset with text, id, source_id, and filename columns.
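To make the first stage of this pipeline concrete, the following sketch shows how a requester-pays download with skip-if-present behaviour might be wired around `s5cmd`. The helper names `build_s5cmd_command` and `download_tar`, the injectable `run` parameter, and the error handling are assumptions for illustration; only the `s5cmd` tool, the `--request-payer=requester` flag, and the `s3://arxiv/src` prefix come from the description above.

```python
import os
import subprocess
from typing import List

ARXIV_SRC_PREFIX = "s3://arxiv/src"  # bucket path named in the description


def build_s5cmd_command(tar_name: str, dest_path: str) -> List[str]:
    """Return the s5cmd invocation that copies one tar file (hypothetical helper)."""
    return [
        "s5cmd",
        "--request-payer=requester",  # the requester pays for the transfer
        "cp",
        f"{ARXIV_SRC_PREFIX}/{tar_name}",
        dest_path,
    ]


def download_tar(tar_name: str, download_dir: str, run=subprocess.run) -> str:
    """Download one tar file unless it already exists; return its local path.

    `run` is injectable so the logic can be exercised without s5cmd installed.
    """
    os.makedirs(download_dir, exist_ok=True)
    dest_path = os.path.join(download_dir, tar_name)
    if os.path.exists(dest_path):
        return dest_path  # already downloaded, so skip the transfer
    result = run(build_s5cmd_command(tar_name, dest_path), capture_output=True)
    if result.returncode != 0:
        raise RuntimeError(f"s5cmd failed for {tar_name}")
    return dest_path
```

Checking for the destination file before invoking `s5cmd` keeps reruns cheap, which matters for a requester-pays bucket where every transfer is billed to the caller.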
Usage
Use this module to build arxiv text corpora for LLM training. The pipeline handles downloading, extraction, and cleaning in a single call.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/download/arxiv.py
- Lines: 1-398
Signature
class ArxivDownloader(DocumentDownloader):
    def __init__(self, download_dir, verbose=False): ...
    def download(self, tarfile) -> str: ...

class ArxivIterator(DocumentIterator):
    def __init__(self, log_frequency=1000): ...
    def iterate(self, file_path): ...  # Generator yielding (meta_dict, tex_files)

class ArxivExtractor(DocumentExtractor):
    def extract(self, content) -> Tuple[dict, str]: ...

def download_arxiv(
    output_path: str, output_type: str = "jsonl",
    raw_download_dir=None, keep_raw_download=False,
    force_download=False, url_limit=None,
) -> Dataset: ...
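The `iterate` generator in the signature yields a metadata dict together with a list of TeX sources. A rough sketch of walking the nested archives is given below. It is an approximation under stated assumptions: it reads members in memory instead of extracting to a temporary directory, it assumes each paper is stored as a `.gz` member that is either a gzipped tar or a single gzipped TeX file, and the names `iterate_tex` and `_decode_utf8` are hypothetical.

```python
import gzip
import io
import tarfile
from pathlib import PurePosixPath
from typing import Dict, Iterator, List, Tuple


def _decode_utf8(data: bytes) -> List[str]:
    """Return [text] if data is valid UTF-8, otherwise an empty list."""
    try:
        return [data.decode("utf-8")]
    except UnicodeDecodeError:
        return []


def iterate_tex(outer_tar_path: str) -> Iterator[Tuple[Dict[str, str], List[str]]]:
    """Yield (meta, tex_files) for every .gz member of the outer tar file."""
    with tarfile.open(outer_tar_path) as outer:
        for member in outer.getmembers():
            if not (member.isfile() and member.name.endswith(".gz")):
                continue  # skip directories and members that are not gzipped
            raw = gzip.decompress(outer.extractfile(member).read())
            tex_files: List[str] = []
            try:
                # Common case (assumed): the payload is itself a tar archive.
                with tarfile.open(fileobj=io.BytesIO(raw)) as inner:
                    for item in inner.getmembers():
                        if item.isfile() and item.name.endswith(".tex"):
                            tex_files += _decode_utf8(inner.extractfile(item).read())
            except tarfile.ReadError:
                # Not a tar archive: treat the payload as one gzipped TeX file.
                tex_files += _decode_utf8(raw)
            # The raw id is taken from the member name, minus the .gz suffix.
            meta = {"id": PurePosixPath(member.name).name[: -len(".gz")]}
            yield meta, tex_files
```

Files that fail UTF-8 decoding are dropped silently here; whether the real iterator skips, logs, or replaces such content is not stated in this page.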
Import
from data_juicer.download.arxiv import (
    ArxivDownloader,
    ArxivIterator,
    ArxivExtractor,
    download_arxiv,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| output_path | str | Yes | Root directory for output files |
| output_type | str | No | File type for output ("jsonl" by default) |
| raw_download_dir | str | No | Directory for raw downloads; defaults to output_path/downloads |
| keep_raw_download | bool | No | Whether to keep compressed tar files after extraction |
| force_download | bool | No | If False, skip already-processed files |
| url_limit | int | No | Maximum number of tar files to download |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | datasets.Dataset | HuggingFace Dataset with columns: text, id, source_id, filename |
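The `id` column holds the arxiv ID after it has been brought into the official format. A minimal sketch of such a normalisation is shown below, assuming that raw names inside the source archives omit the slash of old-style identifiers (for example `hep-th9901001`) while new-style identifiers (for example `0704.0001`) are already in final form. The function name `format_arxiv_id` is hypothetical.

```python
import re

# Optional archive prefix (letters and hyphens) followed by digits and dots.
_ID_RE = re.compile(r"^([a-zA-Z-]*)([\d.]+)$")


def format_arxiv_id(raw_id: str) -> str:
    """Insert the slash for old-style ids; return new-style ids unchanged."""
    match = _ID_RE.match(raw_id)
    if match is None:
        raise ValueError(f"Unrecognised arxiv id: {raw_id!r}")
    archive, number = match.groups()
    if archive == "":
        return number  # new scheme, e.g. 0704.0001
    return f"{archive}/{number}"  # old scheme, e.g. hep-th/9901001
```

With this sketch, `format_arxiv_id("hep-th9901001")` gives `hep-th/9901001`, and `format_arxiv_id("0704.0001")` is returned as is.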
Usage Examples
from data_juicer.download.arxiv import download_arxiv

# Download and extract arxiv papers
dataset = download_arxiv(
    output_path="./arxiv_data",
    output_type="jsonl",
    url_limit=10,            # Only download the first 10 tar files
    force_download=False,    # Skip files that were already processed
    keep_raw_download=True,  # Keep the raw tar files after extraction
)
print(f"Extracted {len(dataset)} papers")
print(dataset.column_names)  # Expected columns: text, id, source_id, filename