Implementation:Datajuicer Data juicer ArxivDownloader

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data Acquisition, LaTeX Processing, Arxiv
Last Updated	2026-02-14 16:00 GMT

Overview

Downloads arXiv source tar files from S3, iterates through their LaTeX content, cleans the TeX files and extracts plain text from them, and produces a HuggingFace Dataset of the extracted text.

Description

This module provides a complete pipeline for acquiring and cleaning arXiv LaTeX source data, a key training data source for scientific and mathematical LLM applications. The iterator and extractor code is adapted from the RedPajama repository.

Key Classes:

ArxivDownloader -- Uses `s5cmd` with `--request-payer=requester` to download tar files from the `s3://arxiv/src` bucket, skipping files that have already been downloaded.
ArxivIterator -- Extracts nested tar/gzip archives into temporary directories, loads the `.tex` files, decodes their content as UTF-8, and formats arXiv IDs according to the official specification (the formats used before and after March 2007).
ArxivExtractor -- Cleans LaTeX content by (1) removing everything before the first section header, (2) removing full-line and inline comments, (3) removing everything after `\appendix` or `\bibliography`, and (4) expanding argument-free `\newcommand` and `\def` macros inline.
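
The four cleaning steps listed for ArxivExtractor can be sketched roughly as follows. This is a simplified illustration, not the module's actual code: the function names `collect_macros` and `clean_tex`, the regular expressions, the restriction to single-line macro definitions, and the choice to return an empty string when no section header is found are all assumptions made for the example.

```python
import re

# Simplified patterns; real LaTeX has many more edge cases than these handle.
SECTION_RE = re.compile(r"^\\(?:chapter|section|subsection)\*?\{", re.MULTILINE)
END_RE = re.compile(r"\\appendix\b|\\bibliography\{")
NEWCOMMAND_RE = re.compile(
    r"^\\newcommand\*?\{(\\[A-Za-z]+)\}\{(.*)\}[ \t]*$", re.MULTILINE
)
DEF_RE = re.compile(r"^\\def[ \t]*(\\[A-Za-z]+)[ \t]*\{(.*)\}[ \t]*$", re.MULTILINE)


def collect_macros(tex):
    """Map argument-free macro names to bodies (single-line definitions only)."""
    macros = {}
    for pattern in (NEWCOMMAND_RE, DEF_RE):
        for name, body in pattern.findall(tex):
            macros[name] = body
    return macros


def clean_tex(tex):
    # Macros usually live in the preamble, so collect them before it is dropped.
    macros = collect_macros(tex)

    # (1) Keep only the text from the first section header onward.
    start = SECTION_RE.search(tex)
    if start is None:
        return ""  # assumption: a file without a section header yields no text
    tex = tex[start.start():]

    # (2) Remove full-line comments, then unescaped inline comments.
    tex = re.sub(r"^[ \t]*%.*\n?", "", tex, flags=re.MULTILINE)
    tex = re.sub(r"(?<!\\)%.*$", "", tex, flags=re.MULTILINE)

    # (3) Cut everything from \appendix or \bibliography{...} onward.
    end = END_RE.search(tex)
    if end is not None:
        tex = tex[:end.start()]

    # (4) Expand argument-free macros; the lookahead keeps \R from
    #     matching inside longer names such as \Rightarrow.
    for name, body in macros.items():
        tex = re.sub(re.escape(name) + r"(?![A-Za-z])", lambda _m, b=body: b, tex)
    return tex


sample = "\n".join([
    r"\documentclass{article}",
    r"\newcommand{\R}{\mathbb{R}}",
    r"\begin{document}",
    r"\section{Intro}",
    r"% a full-line comment",
    r"Let $x \in \R$. % an inline comment",
    r"Costs rose 5\% last year.",
    r"\appendix",
    r"\section{Extra}",
])
cleaned = clean_tex(sample)
print(cleaned)
```

Macros are collected before the preamble is removed because that is where definitions normally appear; a cleaner that dropped the preamble first would have nothing left to expand.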

Top-level Function:

download_arxiv() -- Orchestrates the three components via the `download_and_extract` pipeline from the downloader module, producing a HuggingFace Dataset with text, id, source_id, and filename columns.
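
To picture how a downloader, an iterator, and an extractor fit together, the sketch below chains three stand-in callables into one generator. It is not the `download_and_extract` implementation: `run_pipeline`, the `fake_*` helpers, the simplified extractor return type, and the choice to store the tar file name in `filename` are hypothetical and serve only to show the shape of the data flow.

```python
import os
from typing import Callable, Dict, Iterable, Iterator, Optional, Tuple

Record = Dict[str, str]


def run_pipeline(
    sources: Iterable[str],
    download: Callable[[str], str],
    iterate: Callable[[str], Iterator[Tuple[Dict[str, str], str]]],
    extract: Callable[[str], Optional[str]],
) -> Iterator[Record]:
    """Download each source, walk its documents, and yield cleaned records."""
    for source in sources:
        local_path = download(source)
        for meta, content in iterate(local_path):
            text = extract(content)
            if text is None:
                continue  # nothing usable was extracted from this document
            # Assumption: `filename` holds the tar file name in this sketch.
            yield {**meta, "text": text, "filename": source}


# Hypothetical stand-ins so the sketch runs without network access.
def fake_download(name: str) -> str:
    return os.path.join("downloads", name)


def fake_iterate(path: str) -> Iterator[Tuple[Dict[str, str], str]]:
    source_id = os.path.basename(path)
    yield {"id": "0704.0001", "source_id": source_id}, "\\section{A} hello"
    yield {"id": "0704.0002", "source_id": source_id}, ""


def fake_extract(content: str) -> Optional[str]:
    return content or None


records = list(
    run_pipeline(["arXiv_src_0704_001.tar"], fake_download, fake_iterate, fake_extract)
)
print(records)
```

Keeping the stages as separate callables means each one can be replaced or tested in isolation, which is presumably why the module exposes three classes rather than one.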

Usage

Use this module to build arXiv text corpora for LLM training. The pipeline handles downloading, extraction, and cleaning in a single call.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/download/arxiv.py
Lines: 1-398

Signature

class ArxivDownloader(DocumentDownloader):
    def __init__(self, download_dir, verbose=False): ...
    def download(self, tarfile) -> str: ...

class ArxivIterator(DocumentIterator):
    def __init__(self, log_frequency=1000): ...
    def iterate(self, file_path): ...  # Generator yielding (meta_dict, tex_files)

class ArxivExtractor(DocumentExtractor):
    def extract(self, content) -> Tuple[dict, str]: ...

def download_arxiv(
    output_path: str, output_type: str = "jsonl",
    raw_download_dir=None, keep_raw_download=False,
    force_download=False, url_limit=None,
) -> Dataset: ...
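
As a rough illustration of two behaviours described for `ArxivDownloader` and `ArxivIterator`, the sketch below builds an `s5cmd` command line for one tar file (skipping files already on disk) and normalises an arXiv identifier taken from a file name. The helper names `build_download_command` and `format_arxiv_id`, the use of the `cp` subcommand, the ID regular expression, and the example file names are assumptions for this sketch, not the module's confirmed internals.

```python
import os
import re
from typing import List, Optional


def build_download_command(tarfile: str, download_dir: str) -> Optional[List[str]]:
    """Return the s5cmd argv for one tar file, or None if it is already on disk."""
    output_file = os.path.join(download_dir, tarfile)
    if os.path.exists(output_file):
        return None  # already downloaded, nothing to do
    s3_path = "s3://arxiv/src/" + tarfile
    # The bucket is requester-pays, hence the --request-payer flag.
    return ["s5cmd", "--request-payer=requester", "cp", s3_path, output_file]


def format_arxiv_id(raw_id: str) -> str:
    """Normalise an ID as it appears in a source file name.

    Old-style IDs (up to March 2007) carry an archive prefix and are written
    with a slash, e.g. "hep-th9901001" -> "hep-th/9901001".
    New-style IDs such as "0704.0001" have no prefix and pass through unchanged.
    """
    match = re.fullmatch(r"([a-zA-Z-]*)([\d.]+)", raw_id)
    if match is None:
        raise ValueError(f"unrecognised arXiv id: {raw_id!r}")
    archive, number = match.groups()
    return number if archive == "" else f"{archive}/{number}"


# The command is only built here, not executed.
cmd = build_download_command("arXiv_src_0704_001.tar", "no_such_download_dir")
print(cmd)
print(format_arxiv_id("hep-th9901001"))
print(format_arxiv_id("0704.0001"))
```

Running the returned list with `subprocess.run(cmd, check=True)` would perform the actual download, provided `s5cmd` is installed and AWS credentials are configured for requester-pays access.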

Import

from data_juicer.download.arxiv import (
    ArxivDownloader,
    ArxivIterator,
    ArxivExtractor,
    download_arxiv,
)

I/O Contract

Inputs

Name	Type	Required	Description
output_path	str	Yes	Root directory for output files
output_type	str	No	File type for output ("jsonl" by default)
raw_download_dir	str	No	Directory for raw downloads; defaults to output_path/downloads
keep_raw_download	bool	No	Whether to keep compressed tar files after extraction
force_download	bool	No	If False, skip already-processed files
url_limit	int	No	Maximum number of tar files to download

Outputs

Name	Type	Description
dataset	datasets.Dataset	HuggingFace Dataset with columns: text, id, source_id, filename

Usage Examples

from data_juicer.download.arxiv import download_arxiv

# Download and extract arxiv papers
dataset = download_arxiv(
    output_path="./arxiv_data",
    output_type="jsonl",
    url_limit=10,           # Only download first 10 tar files
    force_download=False,   # Skip already processed
    keep_raw_download=True, # Keep raw tar files
)

print(f"Extracted {len(dataset)} papers")
print(dataset.column_names)  # expected columns: text, id, source_id, filename
