Implementation:NVIDIA NeMo Curator Wikipedia Iterator

Knowledge Sources	NVIDIA NeMo Curator
Domains	Data Parsing, Wikipedia, XML Processing
Last Updated	2026-02-14 00:00 GMT

Overview

WikipediaIterator processes downloaded Wikipedia .bz2 dump files and yields individual article records containing metadata and raw wikitext content.

Description

The WikipediaIterator class extends DocumentIterator and provides streaming XML parsing of Wikipedia dump files. It sits between the download step, which produces the compressed dump archives, and the wikitext extraction step, which consumes the raw article content.

The iteration process works as follows:

1. Decompression: Opens the .bz2 compressed file using bz2.BZ2File and wraps it with a UTF-8 codec reader.
2. XML parsing: Uses xml.etree.ElementTree.iterparse in streaming mode (listening for "end" events) to process the XML incrementally without loading the entire file into memory.
3. Page detection: For each element whose tag ends with "page", the iterator extracts the XML namespace prefix and processes the page.
4. Metadata extraction: Extracts the article's title, namespace (ns), and ID from child elements. Skips articles with missing metadata.
5. Content extraction: Navigates to the <revision>/<text> element to extract the raw wikitext content.
6. Filtering: Skips articles that are:
   - Not in the main namespace (ns != "0")
   - Redirects (have a <redirect> element)
   - Empty (no text content)
7. URL construction: Constructs a Wikipedia URL using the language code and URL-encoded title (e.g., https://en.wikipedia.org/wiki/Article_Title).
8. Memory management: Calls elem.clear() after processing each page element to free memory.
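
The steps above can be sketched with the standard library alone. This is a minimal illustration, not the NeMo Curator source: the function name iter_wiki_pages is hypothetical, the order of the checks is a guess, and the use of urllib.parse.quote for the "URL-encoded title" is an assumption (the description does not say whether spaces are also turned into underscores).

```python
import bz2
import codecs
import os
import xml.etree.ElementTree as ET
from collections.abc import Iterator
from typing import Any
from urllib.parse import quote


def iter_wiki_pages(file_path: str, language: str = "en") -> Iterator[dict[str, Any]]:
    """Hypothetical stand-in for the iteration steps described above."""
    source_id = os.path.basename(file_path)
    # Decompression: open the .bz2 file and wrap it in a UTF-8 reader.
    with bz2.BZ2File(file_path, "rb") as raw:
        reader = codecs.getreader("utf-8")(raw)
        # Streaming parse: "end" fires once an element is fully read.
        for _event, elem in ET.iterparse(reader, events=("end",)):
            if not elem.tag.endswith("page"):
                continue
            # Tags look like "{namespace-uri}page"; keep the "{...}" prefix.
            ns = elem.tag[: -len("page")]
            title = elem.findtext(f"{ns}title")
            page_ns = elem.findtext(f"{ns}ns")
            page_id = elem.findtext(f"{ns}id")
            text = elem.findtext(f"{ns}revision/{ns}text")
            is_redirect = elem.find(f"{ns}redirect") is not None
            # Memory management: drop the page's children once values are read.
            elem.clear()
            # Skip pages with missing metadata.
            if title is None or page_ns is None or page_id is None:
                continue
            # Skip non-main namespaces, redirects, and empty pages.
            if page_ns != "0" or is_redirect or not text:
                continue
            yield {
                "title": title,
                "id": page_id,
                # Assumption: plain percent-encoding of the title.
                "url": f"https://{language}.wikipedia.org/wiki/{quote(title)}",
                "language": language,
                "source_id": source_id,
                "raw_content": text,
            }
```

Because the function is a generator, only one page element is held in memory at a time, which is the point of combining iterparse with elem.clear().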

Progress logging is controlled by the log_frequency parameter, defaulting to every 1000 articles.
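
As a rough illustration of frequency-based progress logging, the generator below logs once every log_frequency items. The helper name with_progress, the logger name, and the message text are all hypothetical, and whether the real counter includes skipped pages is not stated in this page.

```python
import logging
from collections.abc import Iterable, Iterator
from typing import TypeVar

T = TypeVar("T")
logger = logging.getLogger("wiki_iter_sketch")  # hypothetical logger name


def with_progress(items: Iterable[T], log_frequency: int = 1000) -> Iterator[T]:
    """Yield items unchanged, logging each time the count hits a multiple of log_frequency."""
    for count, item in enumerate(items, start=1):
        if count % log_frequency == 0:
            logger.info("Processed %d articles", count)
        yield item
```

A larger log_frequency keeps log volume manageable on dumps that contain millions of pages.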

Usage

Use this class to iterate over downloaded Wikipedia dump files and produce per-article record dictionaries. It is typically used after WikipediaDownloader has fetched the dump files and before a wikitext extractor processes the raw content.

Code Reference

Source Location

Repository: NeMo-Curator
File: nemo_curator/stages/text/download/wikipedia/iterator.py
Lines: 1-148

Signature

class WikipediaIterator(DocumentIterator):
    def __init__(
        self,
        language: str = "en",
        log_frequency: int = 1000,
    ): ...

    def iterate(self, file_path: str) -> Iterator[dict[str, Any]]: ...

    def output_columns(self) -> list[str]: ...

Import

from nemo_curator.stages.text.download.wikipedia.iterator import WikipediaIterator

I/O Contract

Inputs

Name	Type	Required	Description
language	str	No	Language code for the Wikipedia dump (e.g., "en", "de", "fr"). Defaults to "en"
log_frequency	int	No	How often to log progress, measured in number of articles processed. Defaults to 1000

The iterate method accepts:

Name	Type	Required	Description
file_path	str or Path	Yes	Path to a downloaded .bz2 Wikipedia dump file

Outputs

Each yielded record is a dictionary with the following fields:

Name	Type	Description
title	str	The article title
id	str	The Wikipedia article ID
url	str	Constructed Wikipedia URL (e.g., "https://en.wikipedia.org/wiki/Article_Title")
language	str	The language code passed during initialization
source_id	str	The filename of the dump file being processed
raw_content	str	The raw wikitext content from the article's revision
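
Downstream code may want to check that a record carries all six fields before using it. The sketch below restates the table as a TypedDict and adds a small checker. Both ArticleRecord and missing_fields are hypothetical names introduced for illustration and are not part of NeMo Curator.

```python
from typing import Any, TypedDict, get_type_hints


class ArticleRecord(TypedDict):
    """Shape of one yielded record, restated from the table above."""

    title: str
    id: str
    url: str
    language: str
    source_id: str
    raw_content: str


EXPECTED_FIELDS = frozenset(get_type_hints(ArticleRecord))


def missing_fields(record: dict[str, Any]) -> set[str]:
    """Return the expected field names that are absent from record."""
    return set(EXPECTED_FIELDS.difference(record))
```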

Usage Examples

Basic Iteration

from nemo_curator.stages.text.download.wikipedia.iterator import WikipediaIterator

iterator = WikipediaIterator(language="en")

for article in iterator.iterate("/data/wikipedia/raw/enwiki-20240101-pages-articles1.xml.bz2"):
    print(f"Title: {article['title']}")
    print(f"URL: {article['url']}")
    print(f"Content length: {len(article['raw_content'])} chars")

Non-English Wikipedia

from nemo_curator.stages.text.download.wikipedia.iterator import WikipediaIterator

# Process German Wikipedia dumps
iterator = WikipediaIterator(language="de", log_frequency=5000)

for article in iterator.iterate("/data/wikipedia/raw/dewiki-dump.xml.bz2"):
    print(f"{article['url']}: {article['title']}")

Counting Articles

from nemo_curator.stages.text.download.wikipedia.iterator import WikipediaIterator

iterator = WikipediaIterator(language="en")
count = sum(1 for _ in iterator.iterate("/data/wikipedia/raw/enwiki-dump.xml.bz2"))
print(f"Total articles: {count}")

Related Pages

Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
NVIDIA_NeMo_Curator_Wikipedia_Downloader - Downloads the dump files that this iterator processes
NVIDIA_NeMo_Curator_Wikipedia_URLGenerator - Generates URLs for the dump files
NVIDIA_NeMo_Curator_Wikipedia_Extractor - Extracts clean text from the raw wikitext content
