

Implementation:NVIDIA NeMo Curator Wikipedia Iterator

From Leeroopedia
Revision as of 13:22, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/NVIDIA_NeMo_Curator_Wikipedia_Iterator.md)
Knowledge Sources
Domains: Data Parsing, Wikipedia, XML Processing
Last Updated: 2026-02-14 00:00 GMT

Overview

WikipediaIterator processes downloaded Wikipedia .bz2 dump files and yields individual article records containing metadata and raw wikitext content.

Description

The WikipediaIterator class extends DocumentIterator and provides streaming XML parsing of Wikipedia dump files. It sits between the download step, which produces the raw dump archives, and the wikitext extraction step, which consumes the records it yields.

The iteration process works as follows:

  1. Decompression: Opens the .bz2 compressed file using bz2.BZ2File and wraps it with a UTF-8 codec reader.
  2. XML parsing: Uses xml.etree.ElementTree.iterparse in streaming mode (listening for "end" events) to process the XML incrementally without loading the entire file into memory.
  3. Page detection: For each element whose tag ends with "page", the iterator extracts the XML namespace prefix from the tag and processes the page.
  4. Metadata extraction: Extracts the article's title, wiki namespace (ns), and ID from child elements. Pages missing any of these fields are skipped.
  5. Content extraction: Navigates to the <revision>/<text> element to extract the raw wikitext content.
  6. Filtering: Skips articles that are:
    • Not in the main namespace (ns != "0")
    • Redirects (have a <redirect> element)
    • Empty (no text content)
  7. URL construction: Constructs a Wikipedia URL using the language code and URL-encoded title (e.g., https://en.wikipedia.org/wiki/Article_Title).
  8. Memory management: Calls elem.clear() after processing each page element to free memory.
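
The decompression, parsing, filtering, and memory-management steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the NeMo Curator source: the function name iter_pages and the SAMPLE_DUMP fixture are hypothetical, and the sketch yields plain tuples rather than the record dictionaries described under Outputs.

```python
import bz2
import codecs
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical four-page fixture: one regular article, one talk page,
# one redirect, and one page with empty text.
SAMPLE_DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Alpha</title><ns>0</ns><id>1</id>
    <revision><text>'''Alpha''' is the first letter.</text></revision>
  </page>
  <page>
    <title>Talk:Alpha</title><ns>1</ns><id>2</id>
    <revision><text>Discussion page.</text></revision>
  </page>
  <page>
    <title>Alfa</title><ns>0</ns><id>3</id>
    <redirect title="Alpha" />
    <revision><text>#REDIRECT [[Alpha]]</text></revision>
  </page>
  <page>
    <title>Empty</title><ns>0</ns><id>4</id>
    <revision><text></text></revision>
  </page>
</mediawiki>
"""


def iter_pages(file_path):
    """Yield (title, id, wikitext) for main-namespace, non-redirect, non-empty pages."""
    # Step 1: open the .bz2 file and wrap it in a UTF-8 reader.
    with bz2.BZ2File(file_path) as raw:
        reader = codecs.getreader("utf-8")(raw)
        # Step 2: stream "end" events instead of building the whole tree.
        for _event, elem in ET.iterparse(reader, events=("end",)):
            # Step 3: only <page> elements matter; keep the "{uri}" prefix.
            if not elem.tag.endswith("page"):
                continue
            prefix = elem.tag[: -len("page")]
            # Steps 4 and 5: metadata and <revision>/<text> content.
            title = elem.findtext(f"{prefix}title")
            ns = elem.findtext(f"{prefix}ns")
            page_id = elem.findtext(f"{prefix}id")
            text = elem.findtext(f"{prefix}revision/{prefix}text")
            is_redirect = elem.find(f"{prefix}redirect") is not None
            # Step 6: drop incomplete, non-main, redirect, and empty pages.
            keep = (
                None not in (title, ns, page_id)
                and ns == "0"
                and not is_redirect
                and bool(text)
            )
            if keep:
                yield title, page_id, text
            # Step 8: release the page's children once it has been handled.
            elem.clear()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml.bz2")
    with bz2.open(path, "wb") as f:
        f.write(SAMPLE_DUMP.encode("utf-8"))
    pages = list(iter_pages(path))

print(pages)
```

Only the first page of the fixture survives the filters; the talk page, the redirect, and the empty page are dropped.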

Progress logging is controlled by the log_frequency parameter, defaulting to every 1000 articles.
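
One way to picture the effect of log_frequency is a wrapper that emits a log line every N records. The sketch below is an assumption about the general pattern, not the library's logging code: with_progress, the logger name, and the message text are all hypothetical.

```python
import logging
from typing import Iterable, Iterator

logger = logging.getLogger("wiki_progress_sketch")  # hypothetical logger name


def with_progress(records: Iterable[dict], log_frequency: int = 1000) -> Iterator[dict]:
    """Pass records through unchanged, logging once every log_frequency items."""
    for count, record in enumerate(records, start=1):
        if count % log_frequency == 0:
            logger.info("Processed %d articles", count)
        yield record


class ListHandler(logging.Handler):
    """Collects formatted messages in a list so the behaviour is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Seven dummy records with log_frequency=3 log at counts 3 and 6.
consumed = list(with_progress(({"id": str(i)} for i in range(7)), log_frequency=3))
print(handler.messages)
```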

Usage

Use this class to iterate over downloaded Wikipedia dump files and produce per-article record dictionaries. It is typically used after WikipediaDownloader has fetched the dump files and before a wikitext extractor processes the raw content.

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/text/download/wikipedia/iterator.py
  • Lines: 1-148

Signature

class WikipediaIterator(DocumentIterator):
    def __init__(
        self,
        language: str = "en",
        log_frequency: int = 1000,
    ): ...

    def iterate(self, file_path: str) -> Iterator[dict[str, Any]]: ...

    def output_columns(self) -> list[str]: ...

Import

from nemo_curator.stages.text.download.wikipedia.iterator import WikipediaIterator

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| language | str | No | Language code for the Wikipedia dump (e.g., "en", "de", "fr"). Defaults to "en" |
| log_frequency | int | No | How often to log progress, measured in number of articles processed. Defaults to 1000 |

The iterate method accepts:

| Name | Type | Required | Description |
|------|------|----------|-------------|
| file_path | str or Path | Yes | Path to a downloaded .bz2 Wikipedia dump file |

Outputs

Each yielded record is a dictionary with the following fields:

| Name | Type | Description |
|------|------|-------------|
| title | str | The article title |
| id | str | The Wikipedia article ID |
| url | str | Constructed Wikipedia URL (e.g., "https://en.wikipedia.org/wiki/Article_Title") |
| language | str | The language code passed during initialization |
| source_id | str | The filename of the dump file being processed |
| raw_content | str | The raw wikitext content from the article's revision |
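
To make the record shape concrete, the sketch below assembles one dictionary with the six fields listed above. It is illustrative only: build_record and OUTPUT_FIELDS are hypothetical names, and the URL is built with urllib.parse.quote as a plausible reading of "URL-encoded title". Whether the library also rewrites spaces as underscores is not stated here, so the sketch does not.

```python
import os
from urllib.parse import quote

# Field order mirrors the Outputs table above (hypothetical constant).
OUTPUT_FIELDS = ["title", "id", "url", "language", "source_id", "raw_content"]


def build_record(title, page_id, raw_content, language, file_path):
    """Assemble one article record from already-extracted page fields."""
    # Percent-encode the title so non-ASCII characters are URL-safe.
    url = f"https://{language}.wikipedia.org/wiki/{quote(title)}"
    return {
        "title": title,
        "id": page_id,
        "url": url,
        "language": language,
        # source_id is the dump's filename, not its full path.
        "source_id": os.path.basename(file_path),
        "raw_content": raw_content,
    }


record = build_record(
    title="Köln",
    page_id="12345",
    raw_content="'''Köln''' ist eine Stadt.",
    language="de",
    file_path="/data/wikipedia/raw/dewiki-dump.xml.bz2",
)
print(record["url"])
print(record["source_id"])
```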

Usage Examples

Basic Iteration

from nemo_curator.stages.text.download.wikipedia.iterator import WikipediaIterator

iterator = WikipediaIterator(language="en")

for article in iterator.iterate("/data/wikipedia/raw/enwiki-20240101-pages-articles1.xml.bz2"):
    print(f"Title: {article['title']}")
    print(f"URL: {article['url']}")
    print(f"Content length: {len(article['raw_content'])} chars")

Non-English Wikipedia

from nemo_curator.stages.text.download.wikipedia.iterator import WikipediaIterator

# Process German Wikipedia dumps
iterator = WikipediaIterator(language="de", log_frequency=5000)

for article in iterator.iterate("/data/wikipedia/raw/dewiki-dump.xml.bz2"):
    print(f"{article['url']}: {article['title']}")

Counting Articles

from nemo_curator.stages.text.download.wikipedia.iterator import WikipediaIterator

iterator = WikipediaIterator(language="en")
count = sum(1 for _ in iterator.iterate("/data/wikipedia/raw/enwiki-dump.xml.bz2"))
print(f"Total articles: {count}")
