Implementation:NVIDIA NeMo Curator Base URLGenerator
| Knowledge Sources | |
|---|---|
| Domains | Data Acquisition, Abstract Base Class, Pipeline Infrastructure |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
URLGenerator is the abstract base class for URL discovery in NeMo Curator download pipelines. URLGenerationStage is the companion processing stage: it wraps any generator in a fan-out stage that produces one FileGroupTask per URL, so that downloads can run with maximum parallelism.
Description
The URLGenerator ABC defines a single abstract method generate_urls() that returns a list of URLs to download. Subclasses implement source-specific URL discovery logic (e.g., listing S3 buckets, parsing index files, querying APIs).
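As an illustration of source-specific discovery, the sketch below parses a local index file into absolute URLs. It is a minimal example under stated assumptions, not code from the repository: the URLGenerator class here is a local stand-in that mirrors the documented interface so the snippet runs without NeMo Curator installed, and IndexFileUrlGenerator, index_path, and base_url are hypothetical names.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class URLGenerator(ABC):
    """Local stand-in mirroring the documented interface (not the library class)."""

    @abstractmethod
    def generate_urls(self) -> list[str]:
        """Generate a list of URLs to download."""


class IndexFileUrlGenerator(URLGenerator):
    """Hypothetical generator: reads one relative path per line from an index file."""

    def __init__(self, index_path: str, base_url: str):
        self._index_path = Path(index_path)
        self._base_url = base_url.rstrip("/")

    def generate_urls(self) -> list[str]:
        urls = []
        for line in self._index_path.read_text(encoding="utf-8").splitlines():
            entry = line.strip()
            if not entry or entry.startswith("#"):
                continue  # skip blank lines and comment lines
            # Join base URL and relative path with exactly one slash.
            urls.append(f"{self._base_url}/{entry.lstrip('/')}")
        return urls
```

In a real pipeline the subclass would inherit from the library's URLGenerator instead of the stand-in. Configuration goes through the constructor, since generate_urls() itself takes no parameters.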
The URLGenerationStage is a dataclass extending ProcessingStage[_EmptyTask, FileGroupTask]. It provides the following behavior:
- Takes an _EmptyTask as input (the pipeline entry point).
- Calls url_generator.generate_urls() to obtain the URL list.
- Optionally caps the list at limit URLs.
- Creates one FileGroupTask per URL for maximum download parallelism. Each task gets a unique ID ({parent_task_id}_{index}) and metadata recording the source URL.
- Configured as a fan-out stage in both Ray and Xenna executor specs, with 1 worker per node for the URL generation step itself.
Resource allocation is set to 0.5 CPUs since URL generation is a lightweight operation.
Usage
Subclass URLGenerator to implement URL discovery for specific data sources. Use URLGenerationStage to integrate URL generation as the first step in a download pipeline, typically within a DocumentDownloadExtractStage composite.
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/text/download/base/url_generation.py
- Lines: 1-89
Signature
class URLGenerator(ABC):
    """Abstract base class for URL generators."""

    @abstractmethod
    def generate_urls(self) -> list[str]:
        """Generate a list of URLs to download."""
        ...


@dataclass
class URLGenerationStage(ProcessingStage[_EmptyTask, FileGroupTask]):
    """Stage that generates URLs from minimal input parameters."""

    url_generator: URLGenerator
    limit: int | None = None
    resources = Resources(cpus=0.5)

    def inputs(self) -> tuple[list[str], list[str]]:
        ...

    def outputs(self) -> tuple[list[str], list[str]]:
        ...

    def process(self, task: _EmptyTask) -> list[FileGroupTask]:
        ...

    def ray_stage_spec(self) -> dict[str, Any]:
        ...

    def xenna_stage_spec(self) -> dict[str, Any]:
        ...
Import
from nemo_curator.stages.text.download.base.url_generation import URLGenerator, URLGenerationStage
# Or via the package shortcut:
from nemo_curator.stages.text.download import URLGenerator
I/O Contract
URLGenerator Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| (none) | - | - | No parameters; subclasses may define constructor arguments for configuration |
URLGenerator Outputs
| Name | Type | Description |
|---|---|---|
| return value | list[str] | List of URLs to download |
URLGenerationStage Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| url_generator | URLGenerator | Yes | The URL generator implementation |
| limit | int \| None | No | Maximum number of URLs to produce (default: None, all URLs) |
URLGenerationStage I/O
| Direction | Type | Description |
|---|---|---|
| Input | _EmptyTask | Empty task serving as the pipeline entry point |
| Output | list[FileGroupTask] | One task per URL, each containing a single URL in task.data |
Key Behaviors
Fan-Out Pattern
The stage creates one FileGroupTask per URL rather than batching all URLs into a single task. This enables maximum parallelism in the downstream download stage, as each URL can be processed by a separate worker:
# Each URL becomes its own task
return [
    FileGroupTask(
        task_id=f"{task.task_id}_{i}",
        dataset_name=task.dataset_name,
        data=[url],
        _metadata={"source_url": url},
    )
    for i, url in enumerate(urls)
]
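To make the resulting task IDs and the effect of limit concrete, the following self-contained sketch reproduces the fan-out with stand-in classes. EmptyTask, FileGroupTask, and fan_out here are simplified, hypothetical substitutes for the library types and for URLGenerationStage.process(). The sketch also assumes that limit is applied by keeping the first limit URLs, which is consistent with the usage example on this page but is not confirmed by the source.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class EmptyTask:
    """Stand-in for _EmptyTask (only the fields the fan-out reads)."""
    task_id: str = "empty"
    dataset_name: str = "demo"


@dataclass
class FileGroupTask:
    """Stand-in for the library's FileGroupTask."""
    task_id: str
    dataset_name: str
    data: list[str]
    _metadata: dict[str, Any] = field(default_factory=dict)


def fan_out(task: EmptyTask, urls: list[str], limit: Optional[int] = None) -> list[FileGroupTask]:
    """Create one task per URL, optionally capped at `limit` URLs."""
    if limit is not None:
        urls = urls[:limit]  # assumption: truncate, keeping the first URLs
    return [
        FileGroupTask(
            task_id=f"{task.task_id}_{i}",
            dataset_name=task.dataset_name,
            data=[url],
            _metadata={"source_url": url},
        )
        for i, url in enumerate(urls)
    ]


tasks = fan_out(
    EmptyTask(task_id="root"),
    ["https://example.com/a.gz", "https://example.com/b.gz", "https://example.com/c.gz"],
    limit=2,
)
print([t.task_id for t in tasks])  # ['root_0', 'root_1']
```

Because each task carries exactly one URL in data, a downstream download stage can schedule every URL independently.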
Executor Configuration
| Executor | Configuration |
|---|---|
| Ray | is_fanout_stage: True |
| Xenna | num_workers_per_node: 1 |
Usage Examples
Implementing a Custom URL Generator
from nemo_curator.stages.text.download import URLGenerator


class StaticUrlGenerator(URLGenerator):
    """Generates URLs from a predefined list."""

    def __init__(self, urls: list[str]):
        self._urls = urls

    def generate_urls(self) -> list[str]:
        return self._urls
Using URLGenerationStage
from nemo_curator.stages.text.download.base.url_generation import URLGenerationStage

url_stage = URLGenerationStage(
    url_generator=StaticUrlGenerator(urls=["https://example.com/data1.gz", "https://example.com/data2.gz"]),
    limit=1,  # Only process the first URL
)
Known Implementations
- ArxivUrlGenerator -- Lists ArXiv tar files from S3