Implementation:NVIDIA NeMo Curator FLEURS Manifest Stage

Knowledge Sources	NVIDIA NeMo Curator
Domains	Audio Processing, Dataset Ingestion, Data Curation
Last Updated	2026-02-14 00:00 GMT

Overview

Implements CreateInitialManifestFleursStage, a processing stage that downloads the Google FLEURS multilingual speech dataset and creates initial NeMo-style audio manifests from it.

Description

This module provides a dataset ingestion stage for the FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) dataset. It contains:

get_fleurs_url_list() -- A helper function that constructs HuggingFace download URLs for a given language code and dataset split. It produces two URLs: a TSV metadata file and a tar.gz audio archive.

CreateInitialManifestFleursStage -- A dataclass-based ProcessingStage[_EmptyTask, AudioBatch] that automates the full acquisition workflow. The process() method calls download_extract_files() to download and extract the archive, then process_transcript() to parse the TSV file. The TSV parser reads tab-delimited lines, extracts the filename (zero-based column index 1) and transcript text (zero-based column index 2), constructs absolute audio file paths, and groups the entries into AudioBatch objects of at most batch_size entries each.
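The parsing and batching step described above can be sketched in plain Python. This is an illustrative stand-in, not the stage's actual code: parse_fleurs_tsv and its audio_dir parameter are hypothetical names, the directory holding the extracted audio is an assumption, and plain dicts are used in place of AudioBatch objects.

```python
import os
import tempfile


def parse_fleurs_tsv(tsv_path, audio_dir, batch_size=1,
                     filepath_key="audio_filepath", text_key="text"):
    """Hypothetical stand-in for process_transcript().

    Returns a list of batches, where each batch is a list of dicts
    (the real stage returns AudioBatch objects instead).
    """
    entries = []
    with open(tsv_path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 3:
                continue  # skip blank or malformed lines
            file_name, transcript = cols[1], cols[2]
            entries.append({
                # audio_dir is an assumption about where the archive was extracted
                filepath_key: os.path.abspath(os.path.join(audio_dir, file_name)),
                text_key: transcript,
            })
    # Group consecutive entries into chunks of at most batch_size
    return [entries[i:i + batch_size] for i in range(0, len(entries), batch_size)]


# Small demonstration with a made-up three-line TSV file
with tempfile.TemporaryDirectory() as tmp:
    tsv_path = os.path.join(tmp, "dev.tsv")
    with open(tsv_path, "w", encoding="utf-8") as f:
        f.write("1\ta.wav\thello world\n")
        f.write("2\tb.wav\tgood morning\n")
        f.write("3\tc.wav\tthank you\n")
    batches = parse_fleurs_tsv(tsv_path, audio_dir=os.path.join(tmp, "dev"), batch_size=2)
```

With three rows and batch_size=2, the sketch yields two batches: the first holds two entries and the last holds the remaining one.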

Usage

Use this stage as the entry point for FLEURS-based audio curation pipelines. It accepts an _EmptyTask (no input data required) and produces AudioBatch objects containing audio file paths and transcriptions that can be fed to downstream ASR inference, metrics, and filtering stages.

Code Reference

Source Location

Repository: NeMo-Curator
File: nemo_curator/stages/audio/datasets/fleurs/create_initial_manifest.py
Lines: 1-144

Signature

def get_fleurs_url_list(lang: str, split: str) -> list[str]: ...


@dataclass
class CreateInitialManifestFleursStage(ProcessingStage[_EmptyTask, AudioBatch]):
    lang: str
    split: str
    raw_data_dir: str
    filepath_key: str = "audio_filepath"
    text_key: str = "text"
    name: str = "CreateInitialManifestFleurs"
    batch_size: int = 1

    def process_transcript(self, file_path: str) -> list[AudioBatch]: ...
    def download_extract_files(self, dst_folder: str) -> None: ...
    def process(self, _: _EmptyTask) -> list[AudioBatch]: ...

Import

from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import (
    CreateInitialManifestFleursStage,
    get_fleurs_url_list,
)

I/O Contract

Inputs

Name	Type	Required	Description
lang	str	Yes	FLEURS language code, typically a lowercase ISO 639-1 language code and an ISO 3166-1 alpha-2 region code joined by an underscore (e.g., "hy_am" for Armenian, "ko_kr" for Korean)
split	str	Yes	Dataset split: "test", "train", or "dev"
raw_data_dir	str	Yes	Path to the folder where the data archive will be downloaded and extracted
filepath_key	str	No	Key name for audio file paths in the output (default: "audio_filepath")
text_key	str	No	Key name for transcription text in the output (default: "text")
batch_size	int	No	Number of entries per AudioBatch (default: 1)

Outputs

Name	Type	Description
result	list[AudioBatch]	List of AudioBatch objects, each containing up to `batch_size` entries; each entry stores the audio path under `filepath_key` and the transcript under `text_key` ("audio_filepath" and "text" by default)
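To show how entries of this shape map onto a line-delimited JSON manifest, the sketch below writes one JSON object per line. It is not part of the stage: write_manifest is a hypothetical helper, the batches are plain lists of dicts rather than AudioBatch objects, and the file paths are made up.

```python
import json
import os
import tempfile


def write_manifest(batches, manifest_path):
    """Hypothetical helper: write every entry of every batch as one JSON line."""
    count = 0
    with open(manifest_path, "w", encoding="utf-8") as out:
        for batch in batches:
            for entry in batch:
                # ensure_ascii=False keeps non-Latin transcripts readable
                out.write(json.dumps(entry, ensure_ascii=False) + "\n")
                count += 1
    return count


# Made-up entries using the default key names
example_batches = [
    [
        {"audio_filepath": "/data/fleurs/korean/test/a.wav", "text": "안녕하세요"},
        {"audio_filepath": "/data/fleurs/korean/test/b.wav", "text": "감사합니다"},
    ],
    [
        {"audio_filepath": "/data/fleurs/korean/test/c.wav", "text": "좋은 아침"},
    ],
]

with tempfile.TemporaryDirectory() as tmp:
    manifest_path = os.path.join(tmp, "manifest.jsonl")
    written = write_manifest(example_batches, manifest_path)
    with open(manifest_path, encoding="utf-8") as f:
        manifest_lines = [json.loads(line) for line in f]
```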

Usage Examples

Basic Usage

from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import (
    CreateInitialManifestFleursStage,
)

# Create a stage for Korean test split
fleurs_stage = CreateInitialManifestFleursStage(
    lang="ko_kr",
    split="test",
    raw_data_dir="/data/fleurs/korean",
    batch_size=32,
)

Generating Download URLs

from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import (
    get_fleurs_url_list,
)

urls = get_fleurs_url_list("hy_am", "dev")
# Returns:
# [
#   "https://huggingface.co/datasets/google/fleurs/resolve/main/data/hy_am/dev.tsv",
#   "https://huggingface.co/datasets/google/fleurs/resolve/main/data/hy_am/audio/dev.tar.gz",
# ]
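The URL pattern shown above can be reproduced with two format strings. The following is a minimal sketch inferred from that output, not the library's source: build_fleurs_urls and FLEURS_BASE_URL are hypothetical names.

```python
# Base path taken from the example output above
FLEURS_BASE_URL = "https://huggingface.co/datasets/google/fleurs/resolve/main/data"


def build_fleurs_urls(lang, split):
    """Hypothetical re-creation of the URL pattern: TSV metadata first, audio archive second."""
    return [
        f"{FLEURS_BASE_URL}/{lang}/{split}.tsv",
        f"{FLEURS_BASE_URL}/{lang}/audio/{split}.tar.gz",
    ]


urls = build_fleurs_urls("ko_kr", "test")
```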

Related Pages

Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
