Implementation:NVIDIA NeMo Curator Base URLGenerator

From Leeroopedia
Knowledge Sources
Domains: Data Acquisition, Abstract Base Class, Pipeline Infrastructure
Last Updated: 2026-02-14 00:00 GMT

Overview

URLGenerator is the abstract base class for URL discovery in NeMo Curator download pipelines. URLGenerationStage is the companion processing stage: it wraps any generator in a fan-out stage that produces one FileGroupTask per URL, so downloads can run with maximum parallelism.

Description

The URLGenerator ABC defines a single abstract method generate_urls() that returns a list of URLs to download. Subclasses implement source-specific URL discovery logic (e.g., listing S3 buckets, parsing index files, querying APIs).
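As an illustration of source-specific discovery, the sketch below parses a plain-text index file into full URLs. It is not taken from NeMo Curator: IndexFileUrlGenerator, index_path, and base_url are hypothetical names, and a local stand-in for the URLGenerator ABC is defined so the snippet runs without the library installed.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class URLGenerator(ABC):
    """Local stand-in mirroring the ABC described above (not the library class)."""

    @abstractmethod
    def generate_urls(self) -> list[str]:
        ...


class IndexFileUrlGenerator(URLGenerator):
    """Hypothetical generator: one relative path per line in a local index file."""

    def __init__(self, index_path: str, base_url: str) -> None:
        self.index_path = Path(index_path)
        self.base_url = base_url.rstrip("/")

    def generate_urls(self) -> list[str]:
        urls = []
        for line in self.index_path.read_text(encoding="utf-8").splitlines():
            path = line.strip()
            # Skip blank lines and comment lines in the index.
            if not path or path.startswith("#"):
                continue
            urls.append(f"{self.base_url}/{path.lstrip('/')}")
        return urls
```

In a real pipeline the class would presumably inherit from the library's URLGenerator instead of the stand-in; the parsing logic would stay the same.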

The URLGenerationStage is a dataclass extending ProcessingStage[_EmptyTask, FileGroupTask]. It provides the following behavior:

  • Takes an _EmptyTask as input (the pipeline entry point).
  • Calls url_generator.generate_urls() to obtain the URL list.
  • Optionally caps the list at limit URLs.
  • Creates one FileGroupTask per URL for maximum download parallelism. Each task gets a unique ID ({parent_task_id}_{index}) and metadata recording the source URL.
  • Declares itself a fan-out stage in the Ray executor spec (is_fanout_stage: True) and requests 1 worker per node in the Xenna executor spec (num_workers_per_node: 1) for the URL generation step itself.

Resource allocation is set to 0.5 CPUs since URL generation is a lightweight operation.
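The behavior listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the library's implementation: FakeTask and fan_out are hypothetical stand-ins for the real task types and for process(), and the sketch treats only limit=None as "no cap" (how the real stage handles limit=0 is not stated here).

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeTask:
    """Hypothetical stand-in for _EmptyTask / FileGroupTask; not the library types."""

    task_id: str
    dataset_name: str
    data: list[str] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)


def fan_out(urls: list[str], parent: FakeTask, limit: int | None = None) -> list[FakeTask]:
    """Cap the URL list at `limit` (if set), then emit one task per URL."""
    if limit is not None:
        urls = urls[:limit]
    return [
        FakeTask(
            task_id=f"{parent.task_id}_{i}",  # {parent_task_id}_{index}
            dataset_name=parent.dataset_name,
            data=[url],  # exactly one URL per task
            metadata={"source_url": url},
        )
        for i, url in enumerate(urls)
    ]


parent = FakeTask(task_id="empty", dataset_name="demo")
tasks = fan_out(["https://example.com/a.gz", "https://example.com/b.gz"], parent, limit=1)
```

With limit=1, only the first URL yields a task; with the default limit=None, every URL yields one.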

Usage

Subclass URLGenerator to implement URL discovery for specific data sources. Use URLGenerationStage to integrate URL generation as the first step in a download pipeline, typically within a DocumentDownloadExtractStage composite.

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/text/download/base/url_generation.py
  • Lines: 1-89

Signature

class URLGenerator(ABC):
    """Abstract base class for URL generators."""

    @abstractmethod
    def generate_urls(self) -> list[str]:
        """Generate a list of URLs to download."""
        ...


@dataclass
class URLGenerationStage(ProcessingStage[_EmptyTask, FileGroupTask]):
    """Stage that generates URLs from minimal input parameters."""

    url_generator: URLGenerator
    limit: int | None = None
    resources = Resources(cpus=0.5)

    def inputs(self) -> tuple[list[str], list[str]]:
        ...

    def outputs(self) -> tuple[list[str], list[str]]:
        ...

    def process(self, task: _EmptyTask) -> list[FileGroupTask]:
        ...

    def ray_stage_spec(self) -> dict[str, Any]:
        ...

    def xenna_stage_spec(self) -> dict[str, Any]:
        ...

Import

from nemo_curator.stages.text.download.base.url_generation import URLGenerator, URLGenerationStage
# Or via the package shortcut:
from nemo_curator.stages.text.download import URLGenerator

I/O Contract

URLGenerator Inputs

Name    Type  Required  Description
(none)  -     -         No parameters; subclasses may define constructor arguments for configuration

URLGenerator Outputs

Name          Type       Description
return value  list[str]  List of URLs to download

URLGenerationStage Inputs

Name           Type          Required  Description
url_generator  URLGenerator  Yes       The URL generator implementation
limit          int | None    No        Maximum number of URLs to produce (default: None, meaning all URLs)

URLGenerationStage I/O

Direction  Type                 Description
Input      _EmptyTask           Empty task serving as the pipeline entry point
Output     list[FileGroupTask]  One task per URL, each containing a single URL in task.data

Key Behaviors

Fan-Out Pattern

The stage creates one FileGroupTask per URL rather than batching all URLs into a single task. This enables maximum parallelism in the downstream download stage, as each URL can be processed by a separate worker:

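# Excerpt from URLGenerationStage.process(): `urls` is the generated
# (and optionally limit-capped) list, `task` is the incoming _EmptyTask.
# Each URL becomes its own task.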
return [
    FileGroupTask(
        task_id=f"{task.task_id}_{i}",
        dataset_name=task.dataset_name,
        data=[url],
        _metadata={"source_url": url},
    )
    for i, url in enumerate(urls)
]
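To see why single-URL tasks help, the sketch below hands each task payload to a separate worker. A thread pool stands in for the executor purely for illustration; it is not how Ray or Xenna schedule work, and fake_download is a hypothetical placeholder that performs no network access.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_download(task_data: list[str]) -> str:
    """Placeholder for a real download; just derives a local filename."""
    (url,) = task_data  # each fanned-out task carries exactly one URL
    return url.rsplit("/", 1)[-1]


# One single-URL payload per task, as the fan-out stage would emit.
task_payloads = [[f"https://example.com/part-{i}.gz"] for i in range(4)]

# Because the tasks are independent, a pool can work on them concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    filenames = list(pool.map(fake_download, task_payloads))
```

Had all four URLs been packed into one task, a single worker would have had to process them in sequence.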

Executor Configuration

Executor  Configuration
Ray       is_fanout_stage: True
Xenna     num_workers_per_node: 1
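Read against the method signatures in the Code Reference, the table above suggests that the two spec methods return small dictionaries. The sketch below writes them as free functions with literal keys copied from the table; this is an assumption, and the real methods may build the keys from library constants or include further entries.

```python
from typing import Any


def ray_stage_spec() -> dict[str, Any]:
    # Marks the stage as fan-out: one input task may yield many output tasks.
    return {"is_fanout_stage": True}


def xenna_stage_spec() -> dict[str, Any]:
    # Requests a single worker per node for the URL generation step.
    return {"num_workers_per_node": 1}
```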

Usage Examples

Implementing a Custom URL Generator

from nemo_curator.stages.text.download import URLGenerator


class StaticUrlGenerator(URLGenerator):
    """Generates URLs from a predefined list."""

    def __init__(self, urls: list[str]):
        self._urls = urls

    def generate_urls(self) -> list[str]:
        return self._urls

Using URLGenerationStage

from nemo_curator.stages.text.download.base.url_generation import URLGenerationStage

# StaticUrlGenerator is the class defined in the previous example.
url_stage = URLGenerationStage(
    url_generator=StaticUrlGenerator(urls=["https://example.com/data1.gz", "https://example.com/data2.gz"]),
    limit=1,  # Only the first URL produces a FileGroupTask
)
