Implementation:Datajuicer Data juicer WikipediaDownloader
| Knowledge Sources | |
|---|---|
| Domains | Data Acquisition, Wikipedia, Text Extraction |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Downloads Wikipedia XML dump files, parses their multistream bz2 archives, and extracts clean plain text from wiki markup using mwparserfromhell, producing a HuggingFace Dataset of Wikipedia articles.
Description
This module acquires clean Wikipedia text in any supported language. Wikipedia is one of the most commonly used high-quality training data sources for LLMs. Most of the code is adapted from the Wikipedia implementation in the HuggingFace datasets library.
Key Classes:
- WikipediaDownloader -- Uses `wget` to fetch bz2-compressed XML dumps from Wikipedia's dump servers. Skips already-downloaded files.
- WikipediaIterator -- Uses `xml.etree.cElementTree` to stream-parse the XML dump. Filters to main-namespace (ns=0) non-redirect pages, yielding metadata (title, id, url, language, source_id) and raw wikicode content.
- WikipediaExtractor -- Parses wikicode with `mwparserfromhell` and performs multi-stage cleaning: removes file/image/media links using language-specific aliases from MEDIA_ALIASES (covering 200+ languages), removes ref/table tags, cleans category links using CAT_ALIASES, strips magic words (double-underscore behavior switches, e.g., `__NOTOC__`), and extracts clean text from each section.
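The iteration step described above can be sketched with the standard library alone. This is a minimal illustration, not the module's actual code: `iter_articles`, `local_name`, and `SAMPLE_DUMP` are hypothetical names, the URL format is an assumption, and the `source_id` field is omitted.

```python
import io
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Tiny in-memory stand-in for a MediaWiki XML export (illustrative data only).
SAMPLE_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Alan Turing</title>
    <ns>0</ns>
    <id>1208</id>
    <revision><text>'''Alan Turing''' was a mathematician.</text></revision>
  </page>
  <page>
    <title>Turing</title>
    <ns>0</ns>
    <id>1209</id>
    <redirect title="Alan Turing" />
    <revision><text>#REDIRECT [[Alan Turing]]</text></revision>
  </page>
  <page>
    <title>Talk:Alan Turing</title>
    <ns>1</ns>
    <id>1210</id>
    <revision><text>Discussion page.</text></revision>
  </page>
</mediawiki>
"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(stream, language="en"):
    """Yield (meta, raw_wikitext) for main-namespace, non-redirect pages."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        fields = {}
        is_redirect = False
        for node in elem.iter():
            name = local_name(node.tag)
            if name == "redirect":
                is_redirect = True
            elif name in ("title", "ns", "id", "text") and name not in fields:
                # First match wins, so the page id is kept over any revision id.
                fields[name] = node.text or ""
        elem.clear()  # release the page subtree once it has been read
        if fields.get("ns") != "0" or is_redirect:
            continue
        title = fields["title"]
        meta = {
            "title": title,
            "id": fields["id"],
            # Assumed URL scheme; the real module may build it differently.
            "url": f"https://{language}.wikipedia.org/wiki/{quote(title.replace(' ', '_'))}",
            "language": language,
        }
        yield meta, fields["text"]


articles = list(iter_articles(io.BytesIO(SAMPLE_DUMP)))
```

For a real dump, the stream could come from `bz2.open(path, "rb")`, which reads multistream archives transparently.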
Language Support: The module includes comprehensive dictionaries for media and category namespace aliases in 200+ languages, enabling proper cleanup of wiki markup regardless of the target Wikipedia language edition.
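To illustrate why per-language aliases matter, here is a small sketch that swaps in plain regular expressions for the `mwparserfromhell`-based cleaning. The two sample dictionaries are hypothetical excerpts (the real MEDIA_ALIASES and CAT_ALIASES tables are far larger and may be shaped differently), nested links inside captions are not handled, and keeping the bare category name is an assumption about the desired output.

```python
import re

# Hypothetical excerpts for two languages; not the module's actual tables.
MEDIA_ALIASES_SAMPLE = {
    "en": ["File", "Image", "Media"],
    "de": ["Datei", "Bild", "File", "Image", "Media"],
}
CAT_ALIASES_SAMPLE = {
    "en": ["Category"],
    "de": ["Kategorie", "Category"],
}


def build_patterns(language):
    """Compile link patterns from the alias lists of one language."""
    media = "|".join(re.escape(a) for a in MEDIA_ALIASES_SAMPLE[language])
    cats = "|".join(re.escape(a) for a in CAT_ALIASES_SAMPLE[language])
    # [[Datei:Name.jpg|thumb|Caption]] -> matched as a whole (no nested links).
    media_re = re.compile(rf"\[\[\s*(?:{media})\s*:[^\[\]]*\]\]", re.IGNORECASE)
    # [[Kategorie:Name|sortkey]] -> group 1 captures "Name".
    cat_re = re.compile(
        rf"\[\[\s*(?:{cats})\s*:\s*([^\[\]|]*)(?:\|[^\[\]]*)?\]\]", re.IGNORECASE
    )
    return media_re, cat_re


def clean_namespace_links(wikitext, language):
    """Remove media links and reduce category links to the category name."""
    media_re, cat_re = build_patterns(language)
    text = media_re.sub("", wikitext)
    return cat_re.sub(lambda m: m.group(1).strip(), text)


raw_de = "Berlin ist eine Stadt. [[Datei:Berlin.jpg|thumb|Skyline]] [[Kategorie:Hauptstadt|Berlin]]"
cleaned_de = clean_namespace_links(raw_de, "de")
```

With only the English aliases, the `Datei:` and `Kategorie:` links in the German sample would pass through untouched, which is the failure mode the alias dictionaries are meant to prevent.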
Top-level Function:
- download_wikipedia() -- Orchestrates downloading, iteration, and extraction with configurable language, dump date, URL limits, and item limits.
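The orchestration can be pictured as three stages chained together, with the two limits applied at different points. The sketch below is a simplified model under stated assumptions: `run_pipeline` and the stand-in stage functions are hypothetical, and `item_limit` is applied globally here, which may differ from how the real function scopes it.

```python
from itertools import islice


def run_pipeline(urls, download, iterate, extract, url_limit=None, item_limit=None):
    """Chain download -> iterate -> extract, honoring both limits."""
    selected = urls[:url_limit]  # url_limit=None keeps every URL

    def records():
        for url in selected:
            path = download(url)
            for meta, raw in iterate(path):
                extracted = extract(raw)
                if extracted is None:  # extractor rejected the page
                    continue
                yield {**meta, **extracted}

    # islice with None as the stop value consumes the whole generator.
    return list(islice(records(), item_limit))


# Stand-in stages so the sketch runs without network access.
FAKE_FILES = {
    "u1": [({"id": "1"}, "first"), ({"id": "2"}, "")],
    "u2": [({"id": "3"}, "third")],
    "u3": [({"id": "4"}, "fourth")],
}


def fake_download(url):
    return url  # pretend the local path equals the URL


def fake_iterate(path):
    return iter(FAKE_FILES[path])


def fake_extract(raw):
    return {"text": raw.upper()} if raw else None


two_files = run_pipeline(["u1", "u2", "u3"], fake_download, fake_iterate, fake_extract, url_limit=2)
one_item = run_pipeline(["u1", "u2", "u3"], fake_download, fake_iterate, fake_extract, item_limit=1)
```

Because the records are produced lazily, a small `item_limit` stops the pipeline before later files are downloaded.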
Usage
Use this module to build Wikipedia text corpora in any supported language for LLM training data preparation.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/download/wikipedia.py
- Lines: 1-790
Signature
class WikipediaDownloader(DocumentDownloader):
    def __init__(self, download_dir, verbose=False): ...
    def download(self, url) -> str: ...

class WikipediaIterator(DocumentIterator):
    def __init__(self, language="en", log_frequency=1000): ...
    def iterate(self, file_path): ...  # Generator yielding (meta_dict, raw_content)

class WikipediaExtractor(DocumentExtractor):
    def __init__(self, language="en", parser=mwparserfromhell): ...
    def extract(self, content) -> Tuple[dict, str]: ...

def download_wikipedia(
    output_path: str, language: str = "en", dump_date=None,
    output_type: str = "jsonl", raw_download_dir=None,
    keep_raw_download=False, force_download=False,
    url_limit=None, item_limit=None,
) -> Dataset: ...
Import
from data_juicer.download.wikipedia import (
    WikipediaDownloader,
    WikipediaIterator,
    WikipediaExtractor,
    download_wikipedia,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| output_path | str | Yes | Root directory for output files |
| language | str | No | Wikipedia language code (default: "en") |
| dump_date | str | No | Dump date in "YYYYMMDD" format; if None, uses latest |
| output_type | str | No | File type for output ("jsonl" by default) |
| raw_download_dir | str | No | Directory for raw downloads; defaults to output_path/downloads |
| keep_raw_download | bool | No | Whether to keep the raw downloaded dump files after extraction (default: False) |
| force_download | bool | No | Whether to download and process again even when outputs already exist (default: False) |
| url_limit | int | No | Maximum number of dump files to download |
| item_limit | int | No | Maximum number of articles to extract |
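The default for raw_download_dir and the downloader's skip-if-present behavior can be sketched as follows. Both helper names are hypothetical, and deriving the local file name from the last URL segment is an assumption; the real downloader may name files differently. The URL is only an illustrative example of a dump file name.

```python
import os
import tempfile


def resolve_download_dir(output_path, raw_download_dir=None):
    """Fall back to <output_path>/downloads when no directory is given."""
    return raw_download_dir or os.path.join(output_path, "downloads")


def needs_download(url, download_dir):
    """Return (target_path, True) if the file is missing, else (target_path, False)."""
    # Assumption: the local name is the last path segment of the URL.
    target = os.path.join(download_dir, os.path.basename(url))
    return target, not os.path.exists(target)


with tempfile.TemporaryDirectory() as root:
    download_dir = resolve_download_dir(root)
    os.makedirs(download_dir)
    url = (
        "https://dumps.wikimedia.org/enwiki/20240101/"
        "enwiki-20240101-pages-articles-multistream1.xml-p1p41242.bz2"
    )
    target, missing_before = needs_download(url, download_dir)
    open(target, "wb").close()  # simulate a finished download
    _, missing_after = needs_download(url, download_dir)
    default_dir_ok = download_dir == os.path.join(root, "downloads")
```

Checking for the target file before fetching is what lets an interrupted run resume without downloading large dump files again.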
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | datasets.Dataset | HuggingFace Dataset with columns: text, title, id, url, language, source_id, filename |
Usage Examples
from data_juicer.download.wikipedia import download_wikipedia

# Download English Wikipedia
dataset = download_wikipedia(
    output_path="./wiki_data",
    language="en",
    output_type="jsonl",
    url_limit=5,  # Only download the first 5 dump files
    force_download=False,
)

# Download Chinese Wikipedia from a specific dump
dataset_zh = download_wikipedia(
    output_path="./wiki_data_zh",
    language="zh",
    dump_date="20240101",
    item_limit=10000,  # Limit to 10k articles
)

# Inspect the English dataset
print(f"Extracted {len(dataset)} articles")
print(list(dataset[0].keys()))
# ['text', 'title', 'id', 'url', 'language', 'source_id', 'filename']