Implementation:NVIDIA NeMo Curator FLEURS Manifest Stage

From Leeroopedia
Knowledge Sources
Domains Audio Processing, Dataset Ingestion, Data Curation
Last Updated 2026-02-14 00:00 GMT

Overview

Implements CreateInitialManifestFleursStage, a processing stage that downloads the Google FLEURS multilingual speech dataset and creates initial NeMo-style audio manifests from it.

Description

This module provides a dataset ingestion stage for the FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) dataset. It contains:

  • get_fleurs_url_list() -- A helper function that constructs Hugging Face download URLs for a given language code and dataset split. It returns two URLs: one for the TSV metadata file and one for the tar.gz audio archive.
  • CreateInitialManifestFleursStage -- A dataclass-based ProcessingStage[_EmptyTask, AudioBatch] that automates the full acquisition workflow. The process() method calls download_extract_files() to download and extract the archive, then process_transcript() to parse the TSV file. The TSV parser reads tab-delimited lines, extracts the filename (column index 1) and transcript text (column index 2), constructs absolute audio file paths, and groups entries into AudioBatch objects according to batch_size.
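
The TSV parsing and batching described above can be sketched without NeMo Curator installed. This is a minimal illustration, not the library's implementation: parse_fleurs_tsv and its audio_dir parameter are hypothetical names, plain dicts stand in for AudioBatch, and the directory that holds the extracted audio files is an assumption supplied by the caller.

```python
import os


def parse_fleurs_tsv(
    tsv_path: str,
    audio_dir: str,  # assumption: directory holding the extracted audio files
    batch_size: int = 1,
    filepath_key: str = "audio_filepath",
    text_key: str = "text",
) -> list[list[dict[str, str]]]:
    """Group TSV rows into lists of at most batch_size manifest entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    batches: list[list[dict[str, str]]] = []
    current: list[dict[str, str]] = []
    with open(tsv_path, encoding="utf-8") as fin:
        for line in fin:
            columns = line.rstrip("\n").split("\t")
            if len(columns) < 3:
                continue  # skip blank or malformed rows
            # Column index 1 is the audio filename, column index 2 the transcript.
            file_name, transcript = columns[1], columns[2]
            current.append(
                {
                    filepath_key: os.path.abspath(os.path.join(audio_dir, file_name)),
                    text_key: transcript,
                }
            )
            if len(current) == batch_size:
                batches.append(current)
                current = []
    if current:
        batches.append(current)  # final, possibly partial, batch
    return batches
```

Each inner list plays the role of one AudioBatch; only the last one may hold fewer than batch_size entries.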

Usage

Use this stage as the entry point for FLEURS-based audio curation pipelines. It accepts an _EmptyTask (no input data required) and produces AudioBatch objects containing audio file paths and transcriptions that can be fed to downstream ASR inference, metrics, and filtering stages.

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/audio/datasets/fleurs/create_initial_manifest.py
  • Lines: 1-144

Signature

def get_fleurs_url_list(lang: str, split: str) -> list[str]: ...


@dataclass
class CreateInitialManifestFleursStage(ProcessingStage[_EmptyTask, AudioBatch]):
    lang: str
    split: str
    raw_data_dir: str
    filepath_key: str = "audio_filepath"
    text_key: str = "text"
    name: str = "CreateInitialManifestFleurs"
    batch_size: int = 1

    def process_transcript(self, file_path: str) -> list[AudioBatch]: ...
    def download_extract_files(self, dst_folder: str) -> None: ...
    def process(self, _: _EmptyTask) -> list[AudioBatch]: ...

Import

from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import (
    CreateInitialManifestFleursStage,
    get_fleurs_url_list,
)

I/O Contract

Inputs

Name Type Required Description
lang str Yes FLEURS language code, typically an ISO 639-1 language code and an ISO 3166-1 alpha-2 country code joined by an underscore (e.g., "hy_am" for Armenian, "ko_kr" for Korean)
split str Yes Dataset split: "test", "train", or "dev"
raw_data_dir str Yes Path to the folder where the data archive will be downloaded and extracted
filepath_key str No Key name for audio file paths in the output (default: "audio_filepath")
text_key str No Key name for transcription text in the output (default: "text")
batch_size int No Number of entries per AudioBatch (default: 1)

Outputs

Name Type Description
result list[AudioBatch] List of AudioBatch objects, each containing up to batch_size entries. Every entry holds the audio file path under filepath_key (default: "audio_filepath") and the transcript under text_key (default: "text")
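
Since each AudioBatch holds up to batch_size entries, the length of the returned list can be estimated from the number of transcript rows. The helper below is a hypothetical sketch that assumes entries are grouped in order and that only the final batch may be partial; it is not part of NeMo Curator.

```python
import math


def expected_batch_count(num_entries: int, batch_size: int) -> int:
    """Number of batches when only the last batch may be partial (assumed)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return math.ceil(num_entries / batch_size)
```

For example, under this assumption 100 transcript rows with batch_size=32 would yield four batches: three full ones and one with the remaining four entries.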

Usage Examples

Basic Usage

from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import (
    CreateInitialManifestFleursStage,
)

# Create a stage for Korean test split
fleurs_stage = CreateInitialManifestFleursStage(
    lang="ko_kr",
    split="test",
    raw_data_dir="/data/fleurs/korean",
    batch_size=32,
)

Generating Download URLs

from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import (
    get_fleurs_url_list,
)

urls = get_fleurs_url_list("hy_am", "dev")
# Returns:
# [
#   "https://huggingface.co/datasets/google/fleurs/resolve/main/data/hy_am/dev.tsv",
#   "https://huggingface.co/datasets/google/fleurs/resolve/main/data/hy_am/audio/dev.tar.gz",
# ]
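
The URL pattern visible in the output above can be reproduced with a few lines of string formatting. The sketch below is an illustration inferred from that example output only: build_fleurs_urls and FLEURS_BASE_URL are hypothetical names, and the actual get_fleurs_url_list() may be implemented differently.

```python
# Base path inferred from the example output above (assumption).
FLEURS_BASE_URL = "https://huggingface.co/datasets/google/fleurs/resolve/main/data"


def build_fleurs_urls(lang: str, split: str) -> list[str]:
    """Return the TSV metadata URL and the audio archive URL for one split."""
    prefix = f"{FLEURS_BASE_URL}/{lang}"
    return [
        f"{prefix}/{split}.tsv",  # tab-separated metadata (filenames, transcripts)
        f"{prefix}/audio/{split}.tar.gz",  # compressed audio archive
    ]
```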
