Implementation:Datajuicer Data juicer WikipediaDownloader

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data Acquisition, Wikipedia, Text Extraction
Last Updated	2026-02-14 16:00 GMT

Overview

Downloads Wikipedia XML dumps (multistream bz2 archives), stream-parses them, and extracts plain text from the wiki markup with mwparserfromhell, producing a HuggingFace Dataset of Wikipedia articles.

Description

This module acquires clean Wikipedia text in any supported language. Wikipedia is one of the most commonly used high-quality training data sources for LLMs. Most of the code is adapted from the HuggingFace datasets library's Wikipedia implementation.

Key Classes:

WikipediaDownloader -- Uses `wget` to fetch bz2-compressed XML dumps from Wikipedia's dump servers. Skips already-downloaded files.
WikipediaIterator -- Uses `xml.etree.cElementTree` to stream-parse the XML dump. Filters to main-namespace (ns=0) non-redirect pages, yielding metadata (title, id, url, language, source_id) and raw wikicode content.
WikipediaExtractor -- Parses wikicode with `mwparserfromhell` and cleans it in several stages: removes file/image/media links using the language-specific aliases in MEDIA_ALIASES (covering 200+ languages), removes ref and table tags, cleans category links using CAT_ALIASES, strips MediaWiki magic words (double-underscore switches such as `__NOTOC__`), and extracts clean text from each section.
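The filtering that WikipediaIterator is described as doing can be illustrated with a small standard-library sketch. This is not the module's code: `iter_articles`, `local_name`, the sample XML, and the URL format are assumptions made for illustration, and `source_id` is left out. It stream-parses a bz2-compressed export and keeps only pages with `ns` equal to 0 that carry no `redirect` element.

```python
import bz2
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a MediaWiki export (illustrative data, not a real dump).
SAMPLE_XML = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Apple</title><ns>0</ns><id>1</id>
    <revision><text>'''Apple''' is a fruit.</text></revision>
  </page>
  <page>
    <title>Talk:Apple</title><ns>1</ns><id>2</id>
    <revision><text>Discussion.</text></revision>
  </page>
  <page>
    <title>Apples</title><ns>0</ns><id>3</id>
    <redirect title="Apple" />
    <revision><text>#REDIRECT [[Apple]]</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(stream, language="en"):
    """Yield (meta, raw_wikicode) for main-namespace, non-redirect pages."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        fields = {local_name(child.tag): child for child in elem}
        if fields["ns"].text == "0" and "redirect" not in fields:
            title = fields["title"].text
            text_elem = next(
                c for c in fields["revision"] if local_name(c.tag) == "text"
            )
            meta = {
                "title": title,
                "id": fields["id"].text,
                # Assumed URL layout, for illustration only.
                "url": f"https://{language}.wikipedia.org/wiki/{title.replace(' ', '_')}",
                "language": language,
            }
            yield meta, text_elem.text or ""
        elem.clear()  # release the finished page subtree to bound memory


compressed = io.BytesIO(bz2.compress(SAMPLE_XML))
with bz2.open(compressed, "rb") as stream:
    articles = list(iter_articles(stream))
```

Here `articles` holds only the "Apple" page; the talk page and the redirect are skipped. Python's `bz2` module reads concatenated streams transparently, so the same loop shape should carry over to multistream archives.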

Language Support: The module includes comprehensive dictionaries for media and category namespace aliases in 200+ languages, enabling proper cleanup of wiki markup regardless of the target Wikipedia language edition.
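To show why per-language alias tables matter, here is a minimal sketch that uses plain regular expressions instead of `mwparserfromhell`. The two-language tables, `link_pattern`, and `strip_namespace_links` are hypothetical and are not excerpts from MEDIA_ALIASES or CAT_ALIASES. The sketch also simply drops category links, which may differ from how the module cleans them.

```python
import re

# Hypothetical two-language excerpt; the module's real tables are far larger.
MEDIA_ALIASES_SAMPLE = {
    "en": ["File", "Image", "Media"],
    "de": ["Datei", "Bild", "Medium"],
}
CAT_ALIASES_SAMPLE = {"en": ["Category"], "de": ["Kategorie"]}

# Double-underscore switches such as __NOTOC__.
MAGIC_WORD = re.compile(r"__[A-Z]+__")


def link_pattern(aliases):
    """Match [[Alias:...]] links without nested brackets for the given aliases."""
    names = "|".join(re.escape(a) for a in aliases)
    return re.compile(rf"\[\[\s*(?:{names})\s*:[^\[\]]*\]\]", flags=re.IGNORECASE)


def strip_namespace_links(wikitext, language):
    """Remove media links, category links, and magic words for one language."""
    # Canonical English namespace names work on every edition, so include them.
    media = MEDIA_ALIASES_SAMPLE["en"] + MEDIA_ALIASES_SAMPLE.get(language, [])
    cats = CAT_ALIASES_SAMPLE["en"] + CAT_ALIASES_SAMPLE.get(language, [])
    for pattern in (link_pattern(media), link_pattern(cats), MAGIC_WORD):
        wikitext = pattern.sub("", wikitext)
    return wikitext.strip()


sample = (
    "__NOTOC__[[Datei:Apfel.jpg|thumb|Ein Apfel]]"
    "Der Apfel ist eine Frucht.[[Kategorie:Obst]]"
)
cleaned = strip_namespace_links(sample, "de")
english_only = strip_namespace_links(sample, "en")
```

With the German aliases, `cleaned` is just the sentence. With only English aliases, `english_only` still contains the `Datei:` and `Kategorie:` links, which is the failure mode the alias dictionaries are meant to prevent.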

Top-level Function:

download_wikipedia() -- Orchestrates downloading, iteration, and extraction with configurable language, dump date, URL limits, and item limits.
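The orchestration described above can be pictured as three chained stages with two caps. The following is a sketch under assumptions, not the body of `download_wikipedia()`: `run_pipeline` and the `fake_*` stand-ins are hypothetical, and the real function also handles output files and caching.

```python
from itertools import islice


def run_pipeline(urls, download, iterate, extract, url_limit=None, item_limit=None):
    """Chain download -> iterate -> extract, honoring both limits."""

    def records():
        for url in islice(urls, url_limit):  # cap the number of dump files
            path = download(url)
            for meta, raw in iterate(path):
                _extra_meta, text = extract(raw)
                yield {**meta, "text": text}

    # Cap the number of articles; islice(..., None) means "no limit".
    return list(islice(records(), item_limit))


# Hypothetical stand-ins so the sketch runs without network access.
def fake_download(url):
    return url.rsplit("/", 1)[-1]


def fake_iterate(path):
    for i in range(3):
        yield {"title": f"{path}#{i}", "filename": path}, f"raw text {i}"


def fake_extract(raw):
    return {}, raw.upper()


urls = [f"https://example.invalid/dump{i}.bz2" for i in range(4)]
rows = run_pipeline(
    urls, fake_download, fake_iterate, fake_extract, url_limit=2, item_limit=5
)
```

Because the stages are generators, `item_limit` stops the work early instead of extracting everything and truncating afterwards.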

Usage

Use this module to build Wikipedia text corpora in any supported language for LLM training data preparation.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/download/wikipedia.py
Lines: 1-790

Signature

class WikipediaDownloader(DocumentDownloader):
    def __init__(self, download_dir, verbose=False): ...
    def download(self, url) -> str: ...

class WikipediaIterator(DocumentIterator):
    def __init__(self, language="en", log_frequency=1000): ...
    def iterate(self, file_path): ...  # Generator yielding (meta_dict, raw_content)

class WikipediaExtractor(DocumentExtractor):
    def __init__(self, language="en", parser=mwparserfromhell): ...
    def extract(self, content) -> Tuple[dict, str]: ...

def download_wikipedia(
    output_path: str, language: str = "en", dump_date=None,
    output_type: str = "jsonl", raw_download_dir=None,
    keep_raw_download=False, force_download=False,
    url_limit=None, item_limit=None,
) -> Dataset: ...
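The skip-if-already-downloaded behavior attributed to `WikipediaDownloader.download` might look roughly like the sketch below. `download_once`, the injected `fetch` callable, and the `.part` temporary suffix are assumptions for illustration. The module itself is described as calling `wget`, which is replaced here by a stub so the example runs offline.

```python
import os
import tempfile


def download_once(url, download_dir, fetch):
    """Return the local path for url, calling fetch only when the file is absent."""
    os.makedirs(download_dir, exist_ok=True)
    output_path = os.path.join(download_dir, url.rsplit("/", 1)[-1])
    if os.path.exists(output_path):
        return output_path  # already downloaded: skip the transfer
    tmp_path = output_path + ".part"
    fetch(url, tmp_path)  # e.g. a wrapper around `wget <url> -O <tmp_path>`
    os.replace(tmp_path, output_path)  # publish only complete files
    return output_path


calls = []


def fake_fetch(url, dest):
    """Stub transfer that records the call and writes placeholder bytes."""
    calls.append(url)
    with open(dest, "wb") as fh:
        fh.write(b"stub dump bytes")


with tempfile.TemporaryDirectory() as tmp:
    url = "https://example.invalid/enwiki-pages-articles1.xml.bz2"
    first = download_once(url, tmp, fake_fetch)
    second = download_once(url, tmp, fake_fetch)
    same_path = first == second
    file_exists = os.path.exists(first)
```

The second call returns the existing path without fetching again. Writing to a temporary name first is a design choice of this sketch, so that an interrupted transfer is not mistaken for a finished file.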

Import

from data_juicer.download.wikipedia import (
    WikipediaDownloader,
    WikipediaIterator,
    WikipediaExtractor,
    download_wikipedia,
)

I/O Contract

Inputs

Name	Type	Required	Description
output_path	str	Yes	Root directory for output files
language	str	No	Wikipedia language code (default: "en")
dump_date	str	No	Dump date in "YYYYMMDD" format; if None, uses latest
output_type	str	No	File type for output ("jsonl" by default)
raw_download_dir	str	No	Directory for raw downloads; defaults to output_path/downloads
keep_raw_download	bool	No	Whether to keep the raw dump files after extraction (default: False)
force_download	bool	No	Whether to force downloading even when files already exist (default: False)
url_limit	int	No	Maximum number of dump files to download
item_limit	int	No	Maximum number of articles to extract

Outputs

Name	Type	Description
dataset	datasets.Dataset	HuggingFace Dataset with columns: text, title, id, url, language, source_id, filename

Usage Examples

from data_juicer.download.wikipedia import download_wikipedia

# Download English Wikipedia
dataset = download_wikipedia(
    output_path="./wiki_data",
    language="en",
    output_type="jsonl",
    url_limit=5,           # Only download first 5 dump files
    force_download=False,
)

# Download Chinese Wikipedia
dataset_zh = download_wikipedia(
    output_path="./wiki_data_zh",
    language="zh",
    dump_date="20240101",
    item_limit=10000,      # Limit to 10k articles
)

print(f"Extracted {len(dataset)} articles")
print(dataset.column_names)
# ['text', 'title', 'id', 'url', 'language', 'source_id', 'filename']
