Implementation:NVIDIA NeMo Curator Wikipedia Iterator
| Knowledge Sources | |
|---|---|
| Domains | Data Parsing, Wikipedia, XML Processing |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
WikipediaIterator processes downloaded Wikipedia .bz2 dump files and yields individual article records containing metadata and raw wikitext content.
Description
The WikipediaIterator class extends DocumentIterator and provides streaming XML parsing of Wikipedia dump files. It sits between the raw downloaded dump archives and the wikitext extraction step.
The iteration process works as follows:
- Decompression: Opens the .bz2 compressed file using bz2.BZ2File and wraps it with a UTF-8 codec reader.
- XML parsing: Uses xml.etree.ElementTree.iterparse in streaming mode (listening for "end" events) to process the XML incrementally without loading the entire file into memory.
- Page detection: For each element whose tag ends with "page", the iterator extracts the XML namespace prefix and processes the page.
- Metadata extraction: Extracts the article's title, namespace (ns), and ID from child elements. Skips articles with missing metadata.
- Content extraction: Navigates to the <revision>/<text> element to extract the raw wikitext content.
- Filtering: Skips articles that are:
  - Not in the main namespace (ns != "0")
  - Redirects (have a <redirect> element)
  - Empty (no text content)
- URL construction: Constructs a Wikipedia URL using the language code and URL-encoded title (e.g., https://en.wikipedia.org/wiki/Article_Title).
- Memory management: Calls elem.clear() after processing each page element to free memory.
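The steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the NeMo Curator source: the names iter_wiki_pages and _page_to_record are hypothetical, and percent-encoding the title with urllib.parse.quote (so that spaces become %20) is an assumption about how the URL is built.

```python
import bz2
import codecs
import os
import xml.etree.ElementTree as ET
from collections.abc import Iterator
from typing import Any, Optional
from urllib.parse import quote


def _page_to_record(
    elem: ET.Element, language: str, source_id: str
) -> Optional[dict[str, Any]]:
    """Hypothetical helper: build a record from a <page> element, or None to skip."""
    # Tags look like "{namespace-uri}page"; keep the "{...}" prefix for lookups.
    ns_prefix = elem.tag[: -len("page")]

    title = elem.findtext(f"{ns_prefix}title")
    ns = elem.findtext(f"{ns_prefix}ns")
    page_id = elem.findtext(f"{ns_prefix}id")
    if title is None or ns is None or page_id is None:
        return None  # missing metadata

    if ns != "0":
        return None  # not in the main namespace
    if elem.find(f"{ns_prefix}redirect") is not None:
        return None  # redirect page

    text = elem.findtext(f"{ns_prefix}revision/{ns_prefix}text")
    if not text:
        return None  # empty article

    return {
        "title": title,
        "id": page_id,
        # Assumption: the title is percent-encoded as-is.
        "url": f"https://{language}.wikipedia.org/wiki/{quote(title)}",
        "language": language,
        "source_id": source_id,
        "raw_content": text,
    }


def iter_wiki_pages(file_path: str, language: str = "en") -> Iterator[dict[str, Any]]:
    """Hypothetical stand-in that streams article records from a .bz2 dump."""
    source_id = os.path.basename(file_path)
    with bz2.BZ2File(file_path, "rb") as raw:
        reader = codecs.getreader("utf-8")(raw)
        # "end" events fire once an element and all its children are parsed.
        for _event, elem in ET.iterparse(reader, events=("end",)):
            if not elem.tag.endswith("page"):
                continue
            record = _page_to_record(elem, language, source_id)
            if record is not None:
                yield record
            # Drop the page's children and text so memory stays bounded.
            elem.clear()
```

Clearing every page element, including skipped ones, is what keeps memory use roughly constant regardless of dump size.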
Progress logging is controlled by the log_frequency parameter, defaulting to every 1000 articles.
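As a rough illustration of interval-based progress logging, the generator below emits a log line every log_frequency records. The wrapper name with_progress and the message text are hypothetical, and counting only yielded records (rather than every page seen) is an assumption, since the source does not say which count the class uses.

```python
import logging
from collections.abc import Iterable, Iterator
from typing import TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def with_progress(records: Iterable[T], log_frequency: int = 1000) -> Iterator[T]:
    """Hypothetical wrapper: pass records through, logging every log_frequency items."""
    for count, record in enumerate(records, start=1):
        if count % log_frequency == 0:
            logger.info("Processed %d articles", count)
        yield record
```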
Usage
Use this class to iterate over downloaded Wikipedia dump files and produce per-article record dictionaries. It is typically used after WikipediaDownloader has fetched the dump files and before a wikitext extractor processes the raw content.
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/text/download/wikipedia/iterator.py
- Lines: 1-148
Signature
class WikipediaIterator(DocumentIterator):
    def __init__(
        self,
        language: str = "en",
        log_frequency: int = 1000,
    ): ...

    def iterate(self, file_path: str) -> Iterator[dict[str, Any]]: ...

    def output_columns(self) -> list[str]: ...
Import
from nemo_curator.stages.text.download.wikipedia.iterator import WikipediaIterator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| language | str | No | Language code for the Wikipedia dump (e.g., "en", "de", "fr"). Defaults to "en" |
| log_frequency | int | No | How often to log progress, measured in number of articles processed. Defaults to 1000 |
The iterate method accepts:
| Name | Type | Required | Description |
|---|---|---|---|
| file_path | str or Path | Yes | Path to a downloaded .bz2 Wikipedia dump file |
Outputs
Each yielded record is a dictionary with the following fields:
| Name | Type | Description |
|---|---|---|
| title | str | The article title |
| id | str | The Wikipedia article ID |
| url | str | Constructed Wikipedia URL (e.g., "https://en.wikipedia.org/wiki/Article_Title") |
| language | str | The language code passed during initialization |
| source_id | str | The filename of the dump file being processed |
| raw_content | str | The raw wikitext content from the article's revision |
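When wiring the iterator into a larger pipeline, a quick shape check on each record can catch schema drift early. The sketch below is only an illustration: EXPECTED_COLUMNS is copied from the table above (its order is not confirmed to match output_columns), and the function name invalid_fields is hypothetical.

```python
from typing import Any

# Field names taken from the output table; every field is documented as str.
EXPECTED_COLUMNS = ["title", "id", "url", "language", "source_id", "raw_content"]


def invalid_fields(record: dict[str, Any]) -> list[str]:
    """Hypothetical check: return fields that are missing or not strings."""
    return [name for name in EXPECTED_COLUMNS if not isinstance(record.get(name), str)]
```

An empty list means the record matches the documented contract.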
Usage Examples
Basic Iteration
from nemo_curator.stages.text.download.wikipedia.iterator import WikipediaIterator
iterator = WikipediaIterator(language="en")
for article in iterator.iterate("/data/wikipedia/raw/enwiki-20240101-pages-articles1.xml.bz2"):
    print(f"Title: {article['title']}")
    print(f"URL: {article['url']}")
    print(f"Content length: {len(article['raw_content'])} chars")
Non-English Wikipedia
from nemo_curator.stages.text.download.wikipedia.iterator import WikipediaIterator
# Process German Wikipedia dumps
iterator = WikipediaIterator(language="de", log_frequency=5000)
for article in iterator.iterate("/data/wikipedia/raw/dewiki-dump.xml.bz2"):
    print(f"{article['url']}: {article['title']}")
Counting Articles
from nemo_curator.stages.text.download.wikipedia.iterator import WikipediaIterator
iterator = WikipediaIterator(language="en")
count = sum(1 for _ in iterator.iterate("/data/wikipedia/raw/enwiki-dump.xml.bz2"))
print(f"Total articles: {count}")
Related Pages
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
- NVIDIA_NeMo_Curator_Wikipedia_Downloader - Downloads the dump files that this iterator processes
- NVIDIA_NeMo_Curator_Wikipedia_URLGenerator - Generates URLs for the dump files
- NVIDIA_NeMo_Curator_Wikipedia_Extractor - Extracts clean text from the raw wikitext content