Implementation:NVIDIA NeMo Curator Base URLGenerator

Knowledge Sources	NVIDIA NeMo Curator
Domains	Data Acquisition, Abstract Base Class, Pipeline Infrastructure
Last Updated	2026-02-14 00:00 GMT

Overview

URLGenerator is the abstract base class for URL discovery in NeMo Curator download pipelines. URLGenerationStage is the companion processing stage: it wraps any generator and fans out its result, producing one FileGroupTask per URL so that downloads can run in parallel.

Description

The URLGenerator ABC defines a single abstract method generate_urls() that returns a list of URLs to download. Subclasses implement source-specific URL discovery logic (e.g., listing S3 buckets, parsing index files, querying APIs).

The URLGenerationStage is a dataclass extending ProcessingStage[_EmptyTask, FileGroupTask]. It provides the following behavior:

Takes an _EmptyTask as input (the pipeline entry point).
Calls url_generator.generate_urls() to obtain the URL list.
Optionally caps the list at limit URLs.
Creates one FileGroupTask per URL for maximum download parallelism. Each task gets a unique ID ({parent_task_id}_{index}) and metadata recording the source URL.
Declares itself a fan-out stage in the Ray executor spec (is_fanout_stage: True) and requests 1 worker per node in the Xenna executor spec (num_workers_per_node: 1) for the URL generation step itself.

Resource allocation is set to 0.5 CPUs since URL generation is a lightweight operation.
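
The behavior listed above can be approximated with a minimal, self-contained sketch. It does not import NeMo Curator: EmptyTaskStandIn, FileGroupTaskStandIn, URLGeneratorStandIn, FixedUrls, and fan_out are hypothetical stand-ins whose fields mirror the names used on this page. The sketch assumes the cap is a simple prefix slice; how the real stage treats edge cases such as limit=0 is not covered here.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any


@dataclass
class EmptyTaskStandIn:
    """Stand-in for _EmptyTask (field names are assumptions)."""

    task_id: str = "empty"
    dataset_name: str = "demo"


@dataclass
class FileGroupTaskStandIn:
    """Stand-in for FileGroupTask (field names are assumptions)."""

    task_id: str
    dataset_name: str
    data: list[str]
    _metadata: dict[str, Any] = field(default_factory=dict)


class URLGeneratorStandIn(ABC):
    """Stand-in for the URLGenerator ABC."""

    @abstractmethod
    def generate_urls(self) -> list[str]: ...


class FixedUrls(URLGeneratorStandIn):
    """Hypothetical generator that returns a predefined list."""

    def __init__(self, urls: list[str]) -> None:
        self._urls = urls

    def generate_urls(self) -> list[str]:
        return list(self._urls)


def fan_out(
    task: EmptyTaskStandIn,
    generator: URLGeneratorStandIn,
    limit: int | None = None,
) -> list[FileGroupTaskStandIn]:
    urls = generator.generate_urls()
    if limit is not None:
        urls = urls[:limit]  # assumption: keep only the first `limit` URLs
    # One task per URL, with ID "{parent_task_id}_{index}"
    return [
        FileGroupTaskStandIn(
            task_id=f"{task.task_id}_{i}",
            dataset_name=task.dataset_name,
            data=[url],
            _metadata={"source_url": url},
        )
        for i, url in enumerate(urls)
    ]


tasks = fan_out(
    EmptyTaskStandIn(),
    FixedUrls(
        [
            "https://example.com/a.gz",
            "https://example.com/b.gz",
            "https://example.com/c.gz",
        ]
    ),
    limit=2,
)
```

With three URLs and limit=2, the sketch yields two tasks, each holding exactly one URL in data.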

Usage

Subclass URLGenerator to implement URL discovery for specific data sources. Use URLGenerationStage to integrate URL generation as the first step in a download pipeline, typically within a DocumentDownloadExtractStage composite.

Code Reference

Source Location

Repository: NeMo-Curator
File: nemo_curator/stages/text/download/base/url_generation.py
Lines: 1-89

Signature

class URLGenerator(ABC):
    """Abstract base class for URL generators."""

    @abstractmethod
    def generate_urls(self) -> list[str]:
        """Generate a list of URLs to download."""
        ...


@dataclass
class URLGenerationStage(ProcessingStage[_EmptyTask, FileGroupTask]):
    """Stage that generates URLs from minimal input parameters."""

    url_generator: URLGenerator
    limit: int | None = None
    resources = Resources(cpus=0.5)

    def inputs(self) -> tuple[list[str], list[str]]:
        ...

    def outputs(self) -> tuple[list[str], list[str]]:
        ...

    def process(self, task: _EmptyTask) -> list[FileGroupTask]:
        ...

    def ray_stage_spec(self) -> dict[str, Any]:
        ...

    def xenna_stage_spec(self) -> dict[str, Any]:
        ...

Import

from nemo_curator.stages.text.download.base.url_generation import URLGenerator, URLGenerationStage
# Or via the package shortcut:
from nemo_curator.stages.text.download import URLGenerator

I/O Contract

URLGenerator Inputs

Name	Type	Required	Description
(none)	-	-	No parameters; subclasses may define constructor arguments for configuration

URLGenerator Outputs

Name	Type	Description
return value	`list[str]`	List of URLs to download

URLGenerationStage Inputs

Name	Type	Required	Description
url_generator	`URLGenerator`	Yes	The URL generator implementation
limit	`int | None`	No	Maximum number of URLs to produce (default: None, all URLs)

URLGenerationStage I/O

Direction	Type	Description
Input	`_EmptyTask`	Empty task serving as the pipeline entry point
Output	`list[FileGroupTask]`	One task per URL, each containing a single URL in `task.data`

Key Behaviors

Fan-Out Pattern

The stage creates one FileGroupTask per URL rather than batching all URLs into a single task. This enables maximum parallelism in the downstream download stage, as each URL can be processed by a separate worker:

# Each URL becomes its own task
return [
    FileGroupTask(
        task_id=f"{task.task_id}_{i}",
        dataset_name=task.dataset_name,
        data=[url],
        _metadata={"source_url": url},
    )
    for i, url in enumerate(urls)
]

Executor Configuration

Executor	Configuration
Ray	`is_fanout_stage: True`
Xenna	`num_workers_per_node: 1`
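
As a rough illustration of the table above, the two spec methods could return dictionaries like the following. StageSpecSketch is a hypothetical stand-in class, not the library's stage. The keys and values are taken from the table only; whether the real methods return additional keys is not stated here.

```python
from __future__ import annotations

from typing import Any


class StageSpecSketch:
    """Hypothetical stand-in showing only the executor hints from the table."""

    def ray_stage_spec(self) -> dict[str, Any]:
        # Ray: mark the stage as fan-out (one input task, many output tasks)
        return {"is_fanout_stage": True}

    def xenna_stage_spec(self) -> dict[str, Any]:
        # Xenna: a single worker per node for URL generation
        return {"num_workers_per_node": 1}


spec = StageSpecSketch()
ray_spec = spec.ray_stage_spec()
xenna_spec = spec.xenna_stage_spec()
```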

Usage Examples

Implementing a Custom URL Generator

from nemo_curator.stages.text.download import URLGenerator


class StaticUrlGenerator(URLGenerator):
    """Generates URLs from a predefined list."""

    def __init__(self, urls: list[str]):
        self._urls = urls

    def generate_urls(self) -> list[str]:
        return self._urls
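
For a source that publishes an index of files, a generator might look like the sketch below. IndexFileUrlGenerator, its base_url and index_text parameters, and the index format (one relative path per line, with # comment lines) are all assumptions made for illustration. A local URLGenerator stand-in is defined so the block runs without NeMo Curator installed; in a real pipeline the class would subclass the imported URLGenerator instead.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class URLGenerator(ABC):
    """Local stand-in for the NeMo Curator ABC so the sketch is self-contained."""

    @abstractmethod
    def generate_urls(self) -> list[str]: ...


class IndexFileUrlGenerator(URLGenerator):
    """Hypothetical generator: joins relative paths from an index onto a base URL."""

    def __init__(self, base_url: str, index_text: str) -> None:
        self._base_url = base_url.rstrip("/")
        self._index_text = index_text

    def generate_urls(self) -> list[str]:
        urls: list[str] = []
        for line in self._index_text.splitlines():
            path = line.strip()
            # Skip blank lines and comment lines in the assumed index format
            if not path or path.startswith("#"):
                continue
            urls.append(f"{self._base_url}/{path.lstrip('/')}")
        return urls


index = """
# shard listing
shards/part-000.gz
/shards/part-001.gz
"""

urls = IndexFileUrlGenerator("https://example.com/data/", index).generate_urls()
```

Keeping the parsing inside generate_urls() means the stage only ever sees a flat list[str], which matches the contract described in the I/O tables.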

Using URLGenerationStage

from nemo_curator.stages.text.download.base.url_generation import URLGenerationStage

url_stage = URLGenerationStage(
    url_generator=StaticUrlGenerator(urls=["https://example.com/data1.gz", "https://example.com/data2.gz"]),
    limit=1,  # Only process the first URL
)

Known Implementations

ArxivUrlGenerator -- Lists arXiv tar files from S3
