

Implementation:Datajuicer Data juicer ArxivDownloader

From Leeroopedia
Knowledge Sources
Domains: Data Acquisition, LaTeX Processing, Arxiv
Last Updated: 2026-02-14 16:00 GMT

Overview

Downloads arxiv source tar files from S3, iterates through their LaTeX content, cleans and extracts plain text from TeX files, and produces a HuggingFace Dataset of the extracted text.

Description

This module provides a complete pipeline for acquiring and cleaning arxiv LaTeX source data, which is a key training data source for scientific and mathematical LLM applications. The iterator and extractor code is adapted from the Red-Pajama repository.

Key Classes:

  • ArxivDownloader -- Uses `s5cmd` (with --request-payer=requester) to download tar files from the `s3://arxiv/src` bucket. Skips already-downloaded files.
  • ArxivIterator -- Extracts nested tar/gzip archives into temporary directories, loads `.tex` files, decodes UTF-8 content, and formats arxiv IDs according to the official specification (pre/post March 2007 formats).
  • ArxivExtractor -- Cleans LaTeX content by: (1) removing everything before the first section header, (2) removing line and inline comments, (3) removing content after \appendix or \bibliography, (4) inline-expanding \newcommand and \def macros without arguments.
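
The four cleaning steps listed for ArxivExtractor can be illustrated with a small standalone sketch. This is not the module's code: the function name `clean_tex`, the regular expressions, and the sample document are all assumptions made for illustration, and the sketch only handles macro bodies without nested braces.

```python
import re

# Illustrative patterns (assumptions, not taken from data_juicer).
SECTION_RE = re.compile(r"\\section\*?\s*\{")
END_RE = re.compile(r"\\appendix\b|\\bibliography\b")
# Argument-free definitions only; bodies with nested braces are not handled.
NEWCOMMAND_RE = re.compile(r"\\newcommand\*?\s*\{(\\[a-zA-Z]+)\}\s*\{([^{}]*)\}")
DEF_RE = re.compile(r"\\def\s*(\\[a-zA-Z]+)\s*\{([^{}]*)\}")


def clean_tex(tex: str) -> str:
    # Collect macros first: definitions usually live in the preamble,
    # which step 1 is about to discard.
    macros = dict(NEWCOMMAND_RE.findall(tex) + DEF_RE.findall(tex))

    # (1) Drop everything before the first section header.
    start = SECTION_RE.search(tex)
    if start is None:
        return ""
    tex = tex[start.start():]

    # (2) Remove full-line comments, then inline comments (unescaped %).
    tex = re.sub(r"(?m)^[ \t]*%.*\n?", "", tex)
    tex = re.sub(r"(?<!\\)%.*", "", tex)

    # (3) Cut the text at \appendix or \bibliography.
    end = END_RE.search(tex)
    if end is not None:
        tex = tex[:end.start()]

    # (4) Expand each macro where it appears as a whole command name.
    for name, body in macros.items():
        pattern = re.escape(name) + r"(?![a-zA-Z])"
        tex = re.sub(pattern, lambda m, body=body: body, tex)
    return tex.strip()


sample = r"""\documentclass{article}
\newcommand{\eps}{\varepsilon}
\def\half{1/2}
\begin{document}
\title{Demo}
\section{Intro}
% a full-line comment
Let $\eps > \half$. % trailing comment
Cost is 5\% higher.
\appendix
\section{Extra}
Dropped.
"""

cleaned = clean_tex(sample)
print(cleaned)
```

In this sketch the macros are collected before the preamble is removed, because that is where `\newcommand` and `\def` usually appear. A document with no section header yields an empty string.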

Top-level Function:

  • download_arxiv() -- Orchestrates the three components via the `download_and_extract` pipeline from the downloader module, producing a HuggingFace Dataset with text, id, source_id, and filename columns.
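
The `id` column holds arxiv identifiers in one of two official formats. Identifiers issued through March 2007 look like `archive/YYMMNNN`, and later ones look like `YYMM.NNNN` (five digits after the dot from 2015 onward). The sketch below shows one way to normalize a file stem into such an identifier. The function name `format_arxiv_id` and the assumption that old-style stems arrive without the slash (for example `hep-th9901001`) are illustrative and not confirmed by this page.

```python
import re

# Old scheme: archive name followed by seven digits (YYMMNNN), slash omitted in the stem.
OLD_STYLE = re.compile(r"([a-zA-Z-]+)(\d{7})")
# New scheme: YYMM.NNNN or YYMM.NNNNN.
NEW_STYLE = re.compile(r"\d{4}\.\d{4,5}")


def format_arxiv_id(stem: str) -> str:
    """Return a canonical arxiv identifier for a file stem (hypothetical helper)."""
    old = OLD_STYLE.fullmatch(stem)
    if old is not None:
        # Re-insert the slash between archive name and number.
        return f"{old.group(1)}/{old.group(2)}"
    if NEW_STYLE.fullmatch(stem) is not None:
        # New-style identifiers are already in canonical form.
        return stem
    raise ValueError(f"not a recognizable arxiv identifier: {stem!r}")


print(format_arxiv_id("hep-th9901001"))  # hep-th/9901001
print(format_arxiv_id("0704.0001"))      # 0704.0001
print(format_arxiv_id("1501.00001"))     # 1501.00001
```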

Usage

Use this module to build arxiv text corpora for LLM training. The pipeline handles downloading, extraction, and cleaning in a single call.
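
The download stage of that call can be pictured with the following sketch of a skip-if-present download built around `s5cmd`. It is a minimal illustration under stated assumptions, not the ArxivDownloader implementation: the helper names `build_s5cmd_command` and `download_tar`, the injectable `run` parameter, and the example tar name are hypothetical. The demo swaps in a fake runner so nothing is fetched from S3.

```python
import os
import subprocess
import tempfile


def build_s5cmd_command(tarfile: str, dest: str) -> list:
    # Requester-pays copy from the arxiv source bucket (hypothetical helper).
    return ["s5cmd", "--request-payer=requester", "cp",
            f"s3://arxiv/src/{tarfile}", dest]


def download_tar(tarfile: str, download_dir: str, run=subprocess.run) -> str:
    """Download one tar file unless it is already present; return its local path."""
    dest = os.path.join(download_dir, tarfile)
    if os.path.exists(dest):
        return dest  # already downloaded, skip the transfer
    os.makedirs(download_dir, exist_ok=True)
    run(build_s5cmd_command(tarfile, dest), check=True)
    return dest


# Demo with a fake runner that records the command and creates an empty file.
calls = []


def fake_run(cmd, check=True):
    calls.append(cmd)
    open(cmd[-1], "wb").close()


with tempfile.TemporaryDirectory() as tmp:
    first = download_tar("arXiv_src_0001_001.tar", tmp, run=fake_run)
    second = download_tar("arXiv_src_0001_001.tar", tmp, run=fake_run)

print(len(calls))  # 1: the second call found the file and skipped the transfer
```

Passing the runner as a parameter keeps the skip logic testable without network access or an installed `s5cmd` binary.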

Code Reference

Source Location

Signature

class ArxivDownloader(DocumentDownloader):
    def __init__(self, download_dir, verbose=False): ...
    def download(self, tarfile) -> str: ...

class ArxivIterator(DocumentIterator):
    def __init__(self, log_frequency=1000): ...
    def iterate(self, file_path): ...  # Generator yielding (meta_dict, tex_files)

class ArxivExtractor(DocumentExtractor):
    def extract(self, content) -> Tuple[dict, str]: ...

def download_arxiv(
    output_path: str, output_type: str = "jsonl",
    raw_download_dir=None, keep_raw_download=False,
    force_download=False, url_limit=None,
) -> Dataset: ...

Import

from data_juicer.download.arxiv import (
    ArxivDownloader,
    ArxivIterator,
    ArxivExtractor,
    download_arxiv,
)

I/O Contract

Inputs

Name | Type | Required | Description
output_path | str | Yes | Root directory for output files
output_type | str | No | File type for output ("jsonl" by default)
raw_download_dir | str | No | Directory for raw downloads; defaults to output_path/downloads
keep_raw_download | bool | No | Whether to keep compressed tar files after extraction
force_download | bool | No | If False, skip already-processed files
url_limit | int | No | Maximum number of tar files to download

Outputs

Name | Type | Description
dataset | datasets.Dataset | HuggingFace Dataset with columns: text, id, source_id, filename

Usage Examples

from data_juicer.download.arxiv import download_arxiv

# Download and extract arxiv papers
dataset = download_arxiv(
    output_path="./arxiv_data",
    output_type="jsonl",
    url_limit=10,           # Only download first 10 tar files
    force_download=False,   # Skip already processed
    keep_raw_download=True, # Keep raw tar files
)

print(f"Extracted {len(dataset)} papers")
print(dataset.column_names)  # expected columns: text, id, source_id, filename
