Implementation:NVIDIA NeMo Curator ASR Nemo Stage

From Leeroopedia
Knowledge Sources
Domains: Audio Processing, ASR Inference, Data Curation
Last Updated: 2026-02-14 00:00 GMT

Overview

Implements InferenceAsrNemoStage, a processing stage that performs automatic speech recognition (ASR) inference using NeMo pretrained models.

Description

InferenceAsrNemoStage extends ProcessingStage[FileGroupTask | DocumentBatch | AudioBatch, AudioBatch] and is a core inference stage for audio curation pipelines. It operates as follows:

  • Setup: In the setup() method, it loads a pretrained NeMo ASR model via nemo_asr.models.ASRModel.from_pretrained(). The model is mapped to GPU or CPU based on the configured Resources (checked via check_cuda()). The setup_on_node() method delegates to setup() for distributed execution.
  • Processing: The process() method accepts a FileGroupTask, DocumentBatch, or AudioBatch and extracts the audio file paths from it. It validates the input, passes the file paths to transcribe(), and builds an output AudioBatch in which each entry holds the audio file path (under filepath_key) and the predicted text (under pred_text_key).
  • Transcription: The transcribe() method calls the model's transcribe() and handles various output formats: tuples (taking the first element), nested lists of Hypothesis objects (extracting .text), and flat lists of outputs.
  • I/O Declaration: The inputs() and outputs() methods declare the stage contract, requiring data as a top-level attribute and producing filepath_key and pred_text_key as data attributes.
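The output handling described in the Transcription step can be illustrated with a minimal, self-contained sketch. It is not the stage's actual code: the Hypothesis class below is a hypothetical stand-in for NeMo's hypothesis objects, normalize_transcripts is an illustrative name, and the choice to keep the first element of each inner list is an assumption about how nested results are interpreted.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Hypothesis:
    """Hypothetical stand-in for a NeMo hypothesis object; only .text is used."""

    text: str


def normalize_transcripts(outputs: Any) -> list[str]:
    """Reduce the possible transcribe() return shapes to a flat list of strings."""
    # Tuple result: keep only the first element.
    if isinstance(outputs, tuple):
        outputs = outputs[0]

    texts: list[str] = []
    for item in outputs:
        # Nested list: assumed here to hold candidates for one file; keep the first.
        if isinstance(item, list):
            item = item[0] if item else ""
        # Hypothesis-like objects expose .text; plain strings pass through unchanged.
        texts.append(item.text if hasattr(item, "text") else str(item))
    return texts
```

Normalizing to plain strings at this point lets process() pair each file path with its transcription without caring which model family produced the result.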

Usage

Use this stage in audio curation pipelines to generate ASR transcriptions from audio files. The transcriptions enable downstream quality metrics such as WER/CER computation and text-based filtering. Configure the stage with a NeMo model name and optionally specify GPU resources.
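To make the downstream use concrete, the following sketch computes a word error rate (WER) between a reference transcript and a predicted one using word-level edit distance. It is an illustration only: word_error_rate is a hypothetical helper, not a NeMo Curator function, and it assumes whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

A filter could then, for example, drop entries whose WER against a reference transcript exceeds a chosen threshold.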

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/audio/inference/asr_nemo.py
  • Lines: 1-153

Signature

@dataclass
class InferenceAsrNemoStage(ProcessingStage[FileGroupTask | DocumentBatch | AudioBatch, AudioBatch]):
    model_name: str
    asr_model: Any | None = None
    filepath_key: str = "audio_filepath"
    pred_text_key: str = "pred_text"
    name: str = "ASR_inference"
    batch_size: int = 16
    resources: Resources = field(default_factory=lambda: Resources(cpus=1.0))

    def check_cuda(self) -> torch.device: ...
    def setup_on_node(self, _node_info=None, _worker_metadata=None) -> None: ...
    def setup(self, _worker_metadata=None) -> None: ...
    def inputs(self) -> tuple[list[str], list[str]]: ...
    def outputs(self) -> tuple[list[str], list[str]]: ...
    def transcribe(self, files: list[str]) -> list[str]: ...
    def process(self, task: FileGroupTask | DocumentBatch | AudioBatch) -> AudioBatch: ...

Import

from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| model_name | str | Yes | Name of the NeMo ASR model (see NeMo ASR checkpoints) |
| asr_model | Any | No | Pre-loaded ASR model object (default: None, loaded during setup) |
| filepath_key | str | No | Key for audio file paths in data entries (default: "audio_filepath") |
| pred_text_key | str | No | Key for storing predicted transcriptions (default: "pred_text") |
| batch_size | int | No | Batch size for processing (default: 16) |
| resources | Resources | No | Compute resources declaration (default: Resources(cpus=1.0)) |

Process Input

| Name | Type | Required | Description |
|------|------|----------|-------------|
| task | FileGroupTask, DocumentBatch, or AudioBatch | Yes | Input task containing audio file paths for transcription |

Outputs

| Name | Type | Description |
|------|------|-------------|
| result | AudioBatch | AudioBatch with entries containing filepath_key and pred_text_key fields |
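The shape of the output entries can be sketched with plain dictionaries. This is a simplified illustration, assuming each entry is a dict keyed by filepath_key and pred_text_key; build_entries is a hypothetical helper, and the real stage wraps its entries in an AudioBatch rather than returning a bare list.

```python
def build_entries(
    files: list[str],
    texts: list[str],
    filepath_key: str = "audio_filepath",
    pred_text_key: str = "pred_text",
) -> list[dict[str, str]]:
    """Pair each audio file path with its predicted transcription."""
    # One transcription is expected per input file.
    if len(files) != len(texts):
        raise ValueError("files and texts must have the same length")
    return [
        {filepath_key: path, pred_text_key: text}
        for path, text in zip(files, texts)
    ]


entries = build_entries(["a.wav", "b.wav"], ["hello world", "good morning"])
```

Because the keys are configurable on the stage, downstream stages should read the same filepath_key and pred_text_key values rather than hard-coding the defaults.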

Usage Examples

Basic Usage

from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.resources import Resources

asr_stage = InferenceAsrNemoStage(
    model_name="stt_en_conformer_ctc_large",
    filepath_key="audio_filepath",
    pred_text_key="pred_text",
    batch_size=16,
    resources=Resources(cpus=1.0, gpus=1.0),
)

Using in a Pipeline

from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage

# Create ASR stage for English speech recognition
asr_stage = InferenceAsrNemoStage(
    model_name="stt_en_conformer_ctc_large",
)

# The stage accepts FileGroupTask, DocumentBatch, or AudioBatch inputs
# and produces AudioBatch outputs with predicted text
