Implementation:NVIDIA NeMo Curator FLEURS Manifest Stage
| Knowledge Sources | |
|---|---|
| Domains | Audio Processing, Dataset Ingestion, Data Curation |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Implements CreateInitialManifestFleursStage, a processing stage that downloads the Google FLEURS multilingual speech dataset and creates initial NeMo-style audio manifests from it.
Description
This module provides a dataset ingestion stage for the FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) dataset. It contains:
- get_fleurs_url_list() -- A helper function that constructs HuggingFace download URLs for a given language code and dataset split. It produces two URLs: a TSV metadata file and a tar.gz audio archive.
- CreateInitialManifestFleursStage -- A dataclass-based ProcessingStage[_EmptyTask, AudioBatch] that automates the full acquisition workflow. The process() method calls download_extract_files() to download and extract the archive, then process_transcript() to parse the TSV file. The TSV parser reads tab-delimited lines, extracts the filename (column index 1) and transcript text (column index 2), constructs absolute audio file paths, and groups entries into AudioBatch objects according to batch_size.
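The TSV parsing and batching steps described above can be sketched in plain Python. This is a minimal illustration, not the module's actual code: the function name parse_fleurs_tsv and its audio_dir parameter are hypothetical, plain dictionaries stand in for AudioBatch, and skipping blank lines is an assumption.

```python
import os


def parse_fleurs_tsv(tsv_path, audio_dir, batch_size=1,
                     filepath_key="audio_filepath", text_key="text"):
    """Hypothetical stand-in for process_transcript(); returns lists of dicts."""
    entries = []
    with open(tsv_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # assumption: blank lines are ignored
            cols = line.split("\t")
            file_name, transcript = cols[1], cols[2]  # column indices 1 and 2
            entries.append({
                # Build an absolute path to the extracted audio file.
                filepath_key: os.path.abspath(os.path.join(audio_dir, file_name)),
                text_key: transcript,
            })
    # Group consecutive entries into chunks of at most batch_size.
    return [entries[i:i + batch_size] for i in range(0, len(entries), batch_size)]
```

With three TSV rows and batch_size=2, this sketch yields two groups: one with two entries and one with the remaining entry.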
Usage
Use this stage as the entry point for FLEURS-based audio curation pipelines. It accepts an _EmptyTask (no input data required) and produces AudioBatch objects containing audio file paths and transcriptions that can be fed to downstream ASR inference, metrics, and filtering stages.
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/audio/datasets/fleurs/create_initial_manifest.py
- Lines: 1-144
Signature
def get_fleurs_url_list(lang: str, split: str) -> list[str]: ...

@dataclass
class CreateInitialManifestFleursStage(ProcessingStage[_EmptyTask, AudioBatch]):
    lang: str
    split: str
    raw_data_dir: str
    filepath_key: str = "audio_filepath"
    text_key: str = "text"
    name: str = "CreateInitialManifestFleurs"
    batch_size: int = 1

    def process_transcript(self, file_path: str) -> list[AudioBatch]: ...
    def download_extract_files(self, dst_folder: str) -> None: ...
    def process(self, _: _EmptyTask) -> list[AudioBatch]: ...
Import
from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import (
    CreateInitialManifestFleursStage,
    get_fleurs_url_list,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| lang | str | Yes | Language code in lowercase language_region form, combining ISO 639-1 and ISO 3166-1 alpha-2 codes (e.g., "hy_am" for Armenian, "ko_kr" for Korean) |
| split | str | Yes | Dataset split: "test", "train", or "dev" |
| raw_data_dir | str | Yes | Path to the folder where the data archive will be downloaded and extracted |
| filepath_key | str | No | Key name for audio file paths in the output (default: "audio_filepath") |
| text_key | str | No | Key name for transcription text in the output (default: "text") |
| batch_size | int | No | Number of entries per AudioBatch (default: 1) |
Outputs
| Name | Type | Description |
|---|---|---|
| result | list[AudioBatch] | List of AudioBatch objects, each containing up to batch_size entries with fields named by filepath_key and text_key (by default "audio_filepath" and "text") |
Usage Examples
Basic Usage
from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import (
    CreateInitialManifestFleursStage,
)

# Create a stage for the Korean test split
fleurs_stage = CreateInitialManifestFleursStage(
    lang="ko_kr",
    split="test",
    raw_data_dir="/data/fleurs/korean",
    batch_size=32,
)
Generating Download URLs
from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import (
    get_fleurs_url_list,
)

urls = get_fleurs_url_list("hy_am", "dev")
# Returns:
# [
#     "https://huggingface.co/datasets/google/fleurs/resolve/main/data/hy_am/dev.tsv",
#     "https://huggingface.co/datasets/google/fleurs/resolve/main/data/hy_am/audio/dev.tar.gz",
# ]
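The URL pattern shown in the example above can be reproduced with a small standalone helper. This is a sketch inferred only from that example output: the name build_fleurs_urls and the base parameter are hypothetical, and the real get_fleurs_url_list() may be implemented differently.

```python
# Assumption: base URL taken from the example output above.
FLEURS_BASE_URL = "https://huggingface.co/datasets/google/fleurs/resolve/main/data"


def build_fleurs_urls(lang: str, split: str, base: str = FLEURS_BASE_URL) -> list[str]:
    """Hypothetical re-creation of the two-URL pattern: TSV metadata, then audio archive."""
    return [
        f"{base}/{lang}/{split}.tsv",           # tab-separated metadata file
        f"{base}/{lang}/audio/{split}.tar.gz",  # compressed audio archive
    ]
```

Keeping the base URL as a parameter is a design choice of this sketch; it makes the pattern easy to point at a mirror or a local test server.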