Implementation:Datajuicer Data juicer WikipediaDownloader

From Leeroopedia
Knowledge Sources
Domains Data Acquisition, Wikipedia, Text Extraction
Last Updated 2026-02-14 16:00 GMT

Overview

Downloads Wikipedia XML dump files, parses their multistream bz2 archives, and extracts clean plain text from wiki markup using mwparserfromhell, producing a HuggingFace Dataset of Wikipedia articles.

Description

This module acquires clean Wikipedia text in any supported language. Wikipedia is one of the most commonly used high-quality sources of LLM training data. Most of the code is adapted from the Wikipedia implementation in the HuggingFace datasets library.

Key Classes:

  • WikipediaDownloader -- Uses `wget` to fetch bz2-compressed XML dumps from Wikipedia's dump servers. Skips already-downloaded files.
  • WikipediaIterator -- Uses `xml.etree.cElementTree` to stream-parse the XML dump. Filters to main-namespace (ns=0) non-redirect pages, yielding metadata (title, id, url, language, source_id) and raw wikicode content.
  • WikipediaExtractor -- Parses wikicode with `mwparserfromhell` and cleans it in several stages: removes file/image/media links using the language-specific aliases in MEDIA_ALIASES (covering 200+ languages), removes ref and table tags, cleans category links using CAT_ALIASES, strips magic words, and extracts clean text from each section.
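The stream-parsing and filtering step described for WikipediaIterator can be sketched with the standard library alone. This is a minimal illustration, not the module's code: `SAMPLE`, `local`, and `iter_articles` are hypothetical names, the URL pattern is an assumption, and `source_id` is left out. It relies on Python's `bz2` module reading concatenated (multistream) archives transparently.

```python
import bz2
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a Wikipedia dump; real dumps use a versioned XML namespace.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Apple</title><ns>0</ns><id>1</id>
    <revision><text>'''Apple''' is a fruit.</text></revision></page>
  <page><title>Talk:Apple</title><ns>1</ns><id>2</id>
    <revision><text>Discussion.</text></revision></page>
  <page><title>Apples</title><ns>0</ns><id>3</id><redirect title="Apple" />
    <revision><text>#REDIRECT [[Apple]]</text></revision></page>
</mediawiki>"""


def local(tag):
    """Drop the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(fileobj, language="en"):
    """Yield (meta, raw_wikicode) for main-namespace, non-redirect pages."""
    for _event, elem in ET.iterparse(fileobj, events=("end",)):
        if local(elem.tag) != "page":
            continue
        fields = {local(child.tag): child for child in elem}
        if fields["ns"].text == "0" and "redirect" not in fields:
            title = fields["title"].text
            text_elem = next(e for e in fields["revision"] if local(e.tag) == "text")
            meta = {
                "title": title,
                "id": fields["id"].text,
                "language": language,
                # Assumed URL scheme, for illustration only.
                "url": f"https://{language}.wikipedia.org/wiki/{title.replace(' ', '_')}",
            }
            yield meta, text_elem.text or ""
        elem.clear()  # release the finished page so memory use stays flat


compressed = io.BytesIO(bz2.compress(SAMPLE))
with bz2.open(compressed, "rb") as stream:
    articles = list(iter_articles(stream))
```

Only the first sample page survives the filter: the talk page is outside namespace 0 and the third page is a redirect.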

Language Support: The module includes comprehensive dictionaries for media and category namespace aliases in 200+ languages, enabling proper cleanup of wiki markup regardless of the target Wikipedia language edition.
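To illustrate why per-language aliases matter, here is a simplified regex-based sketch. It is a stand-in for the `mwparserfromhell`-based cleaning, not a copy of it: the two alias tables below are hypothetical, heavily trimmed versions of MEDIA_ALIASES and CAT_ALIASES (whose real structure may differ), and `build_patterns` and `clean` are illustrative names. Falling back to the English aliases in every language is also an assumption of this sketch.

```python
import re

# Hypothetical, trimmed alias tables; the real dictionaries cover 200+ languages.
MEDIA_ALIASES = {"en": ["File", "Image", "Media"], "de": ["Datei", "Bild", "Medium"]}
CAT_ALIASES = {"en": ["Category"], "de": ["Kategorie"]}


def build_patterns(language):
    """Compile per-language regexes for media links and category links."""
    media = "|".join(re.escape(a) for a in MEDIA_ALIASES[language] + MEDIA_ALIASES["en"])
    cats = "|".join(re.escape(a) for a in CAT_ALIASES[language] + CAT_ALIASES["en"])
    # Whole [[File:...]] link; nested brackets are not handled (simplification).
    media_link = re.compile(rf"\[\[\s*(?:{media})\s*:[^\[\]]*\]\]", re.IGNORECASE)
    # Captures the category name so the link can be reduced to plain text.
    cat_link = re.compile(
        rf"\[\[\s*(?:{cats})\s*:([^\[\]|]*)(?:\|[^\[\]]*)?\]\]", re.IGNORECASE
    )
    return media_link, cat_link


def clean(wikicode, language="en"):
    """Remove media links entirely and keep only the name of category links."""
    media_link, cat_link = build_patterns(language)
    text = media_link.sub("", wikicode)
    text = cat_link.sub(lambda m: m.group(1).strip(), text)
    return text.strip()


cleaned = clean(
    "[[Datei:Apfel.jpg|thumb|Ein Apfel]]Der Apfel ist eine Frucht. [[Kategorie:Obst]]",
    language="de",
)
```

Without the German aliases, the `Datei:` and `Kategorie:` links would pass through untouched, which is the failure mode the alias dictionaries are meant to prevent.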

Top-level Function:

  • download_wikipedia() -- Orchestrates downloading, iteration, and extraction with configurable language, dump date, URL limits, and item limits.
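The orchestration can be pictured as three chained stages with two caps applied lazily. The sketch below is an assumption about the general shape, not the actual body of download_wikipedia(): `run_pipeline` and the stand-in stages are hypothetical, and whether the real item limit is global or per file, and whether empty extractions are dropped, is not confirmed by the source.

```python
from itertools import islice


def run_pipeline(urls, download, iterate, extract, url_limit=None, item_limit=None):
    """Chain download -> iterate -> extract, honouring both limits."""

    def records():
        for url in islice(urls, url_limit):      # a limit of None means "no limit"
            path = download(url)
            for meta, raw in iterate(path):
                text = extract(raw)
                if text:                         # assumed: skip pages that clean to nothing
                    yield {**meta, "text": text}

    # Applied here as a global cap across all dump files (an assumption).
    return list(islice(records(), item_limit))


# Stand-in stages so the sketch runs without network access.
fake_dumps = {"dump1": ["a", ""], "dump2": ["b", "c"], "dump3": ["d"]}
rows = run_pipeline(
    urls=list(fake_dumps),
    download=lambda url: url,
    iterate=lambda path: (({"source_id": path}, raw) for raw in fake_dumps[path]),
    extract=str.upper,
    url_limit=2,
    item_limit=2,
)
```

Because every stage is a generator, hitting the item limit stops the loop before later dump files are downloaded.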

Usage

Use this module to build Wikipedia text corpora in any supported language for LLM training data preparation.

Code Reference

Source Location

Signature

class WikipediaDownloader(DocumentDownloader):
    def __init__(self, download_dir, verbose=False): ...
    def download(self, url) -> str: ...

class WikipediaIterator(DocumentIterator):
    def __init__(self, language="en", log_frequency=1000): ...
    def iterate(self, file_path): ...  # Generator yielding (meta_dict, raw_content)

class WikipediaExtractor(DocumentExtractor):
    def __init__(self, language="en", parser=mwparserfromhell): ...
    def extract(self, content) -> Tuple[dict, str]: ...

def download_wikipedia(
    output_path: str, language: str = "en", dump_date=None,
    output_type: str = "jsonl", raw_download_dir=None,
    keep_raw_download=False, force_download=False,
    url_limit=None, item_limit=None,
) -> Dataset: ...

Import

from data_juicer.download.wikipedia import (
    WikipediaDownloader,
    WikipediaIterator,
    WikipediaExtractor,
    download_wikipedia,
)

I/O Contract

Inputs

Name Type Required Description
output_path str Yes Root directory for output files
language str No Wikipedia language code (default: "en")
dump_date str No Dump date in "YYYYMMDD" format; if None, uses latest
output_type str No File type for output ("jsonl" by default)
raw_download_dir str No Directory for raw downloads; defaults to output_path/downloads
url_limit int No Maximum number of dump files to download
item_limit int No Maximum number of articles to extract
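Since dump_date is a plain string, it can be worth validating it before calling download_wikipedia(). The helper below is a hypothetical caller-side check, not part of the module. It assumes only the "YYYYMMDD" format stated in the table.

```python
from datetime import datetime


def normalize_dump_date(dump_date=None):
    """Return a validated 'YYYYMMDD' string, or None to mean 'use the latest dump'."""
    if dump_date is None:
        return None
    # strptime alone tolerates short inputs such as "2024011", so check the shape first.
    if len(dump_date) != 8 or not dump_date.isdigit():
        raise ValueError(f"dump_date must be 8 digits (YYYYMMDD), got {dump_date!r}")
    datetime.strptime(dump_date, "%Y%m%d")  # raises ValueError for dates like 20240230
    return dump_date


checked = normalize_dump_date("20240101")
```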

Outputs

Name Type Description
dataset datasets.Dataset HuggingFace Dataset with columns: text, title, id, url, language, source_id, filename
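As a rough picture of what one row looks like when written as JSONL, the following sketch round-trips a single record through a file. The record values are invented for illustration, and `write_jsonl` and `read_jsonl` are hypothetical helpers. Only the column names come from the table above.

```python
import json
import os
import tempfile

COLUMNS = ["text", "title", "id", "url", "language", "source_id", "filename"]

# Illustrative record; the values are made up, not taken from a real dump.
record = {
    "text": "Apple is a fruit.",
    "title": "Apple",
    "id": "1",
    "url": "https://en.wikipedia.org/wiki/Apple",
    "language": "en",
    "source_id": "example-dump.xml.bz2",
    "filename": "example.jsonl",
}


def write_jsonl(records, path):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back into a list of dicts, ignoring blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.jsonl")
    write_jsonl([record], path)
    loaded = read_jsonl(path)

# Columns from the contract that the loaded row lacks (empty if all are present).
missing = [c for c in COLUMNS if c not in loaded[0]]
```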

Usage Examples

from data_juicer.download.wikipedia import download_wikipedia

# Download English Wikipedia
dataset = download_wikipedia(
    output_path="./wiki_data",
    language="en",
    output_type="jsonl",
    url_limit=5,           # Only download first 5 dump files
    force_download=False,
)

# Download Chinese Wikipedia
dataset_zh = download_wikipedia(
    output_path="./wiki_data_zh",
    language="zh",
    dump_date="20240101",
    item_limit=10000,      # Limit to 10k articles
)

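print(f"Extracted {len(dataset)} articles")
# Convert the keys view to a list so the printed form matches the columns below
print(list(dataset[0].keys()))
# Expected columns (order may differ):
# ['text', 'title', 'id', 'url', 'language', 'source_id', 'filename']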
