Implementation:NVIDIA NeMo Curator ASR Nemo Stage

Knowledge Sources	NVIDIA NeMo Curator
Domains	Audio Processing, ASR Inference, Data Curation
Last Updated	2026-02-14 00:00 GMT

Overview

Implements InferenceAsrNemoStage, a processing stage that performs automatic speech recognition (ASR) inference using NeMo pretrained models.

Description

InferenceAsrNemoStage extends ProcessingStage[FileGroupTask | DocumentBatch | AudioBatch, AudioBatch] and is a core inference stage for audio curation pipelines. It operates as follows:

Setup: In the setup() method, it loads a pretrained NeMo ASR model via nemo_asr.models.ASRModel.from_pretrained(). The model is mapped to GPU or CPU based on the configured Resources (checked via check_cuda()). The setup_on_node() method delegates to setup() for distributed execution.
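
The device choice in the setup step can be sketched without importing torch or NeMo. The Resources class below is a hypothetical stand-in for the NeMo Curator class, and the rule "use CUDA whenever a GPU share is requested" is an assumption about what check_cuda() does, not a copy of the source.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    """Hypothetical stand-in for NeMo Curator's Resources (assumption)."""
    cpus: float = 1.0
    gpus: float = 0.0


def pick_device_name(resources: Resources) -> str:
    """Return "cuda" when any GPU share is requested, else "cpu" (assumed rule)."""
    return "cuda" if resources.gpus > 0 else "cpu"


# In the real stage the result would be a torch.device used as the map
# location when the pretrained model is loaded; here we only print the name.
print(pick_device_name(Resources(cpus=1.0)))            # cpu
print(pick_device_name(Resources(cpus=1.0, gpus=1.0)))  # cuda
```

Under this assumption, a stage built with the default Resources(cpus=1.0) would load the model on the CPU.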

Processing: The process() method accepts multiple input types -- FileGroupTask, DocumentBatch, or AudioBatch -- and extracts audio file paths from each. It validates the input, calls transcribe() with the file paths, and constructs an output AudioBatch with each entry containing the audio filepath and predicted text.
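
The last part of the processing step, pairing each file path with its predicted text, might look like the sketch below. The helper names build_entries and fake_transcribe are hypothetical and do not appear in the source. The real stage wraps the resulting entries in an AudioBatch, which is omitted here.

```python
def build_entries(files, texts, filepath_key="audio_filepath", pred_text_key="pred_text"):
    """Pair each audio path with its transcription, one dict per file."""
    if len(files) != len(texts):
        raise ValueError(f"{len(files)} files but {len(texts)} transcriptions")
    return [{filepath_key: f, pred_text_key: t} for f, t in zip(files, texts)]


def fake_transcribe(files):
    """Hypothetical stand-in for the stage's transcribe(); returns dummy text."""
    return [f"transcript of {f}" for f in files]


files = ["a.wav", "b.wav"]
entries = build_entries(files, fake_transcribe(files))
print(entries[0])
```

The length check is an illustrative guard, not something the source is known to perform.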

Transcription: The transcribe() method calls the model's transcribe() and normalizes the result to a list of strings. It handles three output shapes: tuples (taking the first element), nested lists of Hypothesis objects (extracting .text), and flat lists of outputs.
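
A minimal sketch of that normalization follows. The Hypothesis class is a hypothetical stand-in that models only the .text attribute. Treating each inner list as per-file candidates and keeping the first one is an assumption, since the page does not say how nested lists are flattened.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """Hypothetical stand-in for NeMo's Hypothesis; only .text is modeled."""
    text: str


def normalize_outputs(outputs):
    """Reduce tuple, nested-list, and flat-list outputs to a flat list of str."""
    if isinstance(outputs, tuple):
        outputs = outputs[0]  # keep the first element of the tuple
    texts = []
    for item in outputs:
        if isinstance(item, list):
            # Assumption: the inner list holds candidates for one file, best first.
            item = item[0] if item else ""
        # Objects with .text yield that text; plain strings pass through unchanged.
        texts.append(getattr(item, "text", item))
    return texts


print(normalize_outputs(([Hypothesis("hello"), Hypothesis("world")], None)))
print(normalize_outputs([[Hypothesis("best"), Hypothesis("second")]]))
print(normalize_outputs(["already", "text"]))
```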

I/O Declaration: The inputs() and outputs() methods declare the stage contract, requiring data as a top-level attribute and producing filepath_key and pred_text_key as data attributes.
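
The declaration can be pictured as two (top-level attributes, data attributes) tuples. StageContract and missing_data_attrs below are hypothetical names used only for illustration, and listing data as the top-level attribute on the output side is an assumption.

```python
from dataclasses import dataclass


@dataclass
class StageContract:
    """Hypothetical stand-in showing only the I/O declaration of the stage."""
    filepath_key: str = "audio_filepath"
    pred_text_key: str = "pred_text"

    def inputs(self) -> tuple[list[str], list[str]]:
        # (top-level attributes, data attributes): only `data` is required.
        return ["data"], []

    def outputs(self) -> tuple[list[str], list[str]]:
        # Assumption: `data` is again the top-level attribute on the output side.
        return ["data"], [self.filepath_key, self.pred_text_key]


def missing_data_attrs(entry: dict, contract: StageContract) -> list[str]:
    """List the declared output data attributes that an entry lacks."""
    _, data_attrs = contract.outputs()
    return [key for key in data_attrs if key not in entry]


contract = StageContract()
print(contract.outputs())
print(missing_data_attrs({"audio_filepath": "a.wav"}, contract))
```

A check like missing_data_attrs shows how a downstream consumer could verify that each entry carries both declared keys.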

Usage

Use this stage in audio curation pipelines to generate ASR transcriptions from audio files. The transcriptions feed downstream quality metrics such as WER/CER computation and text-based filtering. Only model_name is required. The default resources value, Resources(cpus=1.0), declares a single CPU; pass Resources(cpus=1.0, gpus=1.0), as in the Basic Usage example, to request a GPU.
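
To illustrate the kind of downstream metric mentioned here, the sketch below computes a word error rate from a reference transcript and a predicted one using word-level edit distance. It is a generic illustration, not the WER stage shipped with NeMo Curator, and the "text" key holding the reference transcript is a hypothetical name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# "text" is a hypothetical key for the reference transcript.
entry = {"audio_filepath": "a.wav", "text": "the cat sat", "pred_text": "the bat sat"}
print(word_error_rate(entry["text"], entry["pred_text"]))
```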

Code Reference

Source Location

Repository: NeMo-Curator
File: nemo_curator/stages/audio/inference/asr_nemo.py
Lines: 1-153

Signature

@dataclass
class InferenceAsrNemoStage(ProcessingStage[FileGroupTask | DocumentBatch | AudioBatch, AudioBatch]):
    model_name: str
    asr_model: Any | None = None
    filepath_key: str = "audio_filepath"
    pred_text_key: str = "pred_text"
    name: str = "ASR_inference"
    batch_size: int = 16
    resources: Resources = field(default_factory=lambda: Resources(cpus=1.0))

    def check_cuda(self) -> torch.device: ...
    def setup_on_node(self, _node_info=None, _worker_metadata=None) -> None: ...
    def setup(self, _worker_metadata=None) -> None: ...
    def inputs(self) -> tuple[list[str], list[str]]: ...
    def outputs(self) -> tuple[list[str], list[str]]: ...
    def transcribe(self, files: list[str]) -> list[str]: ...
    def process(self, task: FileGroupTask | DocumentBatch | AudioBatch) -> AudioBatch: ...

Import

from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage

I/O Contract

Inputs

Name	Type	Required	Description
model_name	str	Yes	Name of the NeMo ASR model (see NeMo ASR checkpoints)
asr_model	Any | None	No	Pre-loaded ASR model object (default: None, loaded during setup)
filepath_key	str	No	Key for audio file paths in data entries (default: "audio_filepath")
pred_text_key	str	No	Key for storing predicted transcriptions (default: "pred_text")
batch_size	int	No	Batch size for processing (default: 16)
resources	Resources	No	Compute resources declaration (default: Resources(cpus=1.0))

Process Input

Name	Type	Required	Description
task	FileGroupTask, DocumentBatch, or AudioBatch	Yes	Input task containing audio file paths for transcription

Outputs

Name	Type	Description
result	AudioBatch	AudioBatch with entries containing filepath_key and pred_text_key fields

Usage Examples

Basic Usage

from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.resources import Resources

asr_stage = InferenceAsrNemoStage(
    model_name="stt_en_conformer_ctc_large",
    filepath_key="audio_filepath",
    pred_text_key="pred_text",
    batch_size=16,
    resources=Resources(cpus=1.0, gpus=1.0),
)

Using in a Pipeline

from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage

# Create ASR stage for English speech recognition
asr_stage = InferenceAsrNemoStage(
    model_name="stt_en_conformer_ctc_large",
)

# The stage accepts FileGroupTask, DocumentBatch, or AudioBatch inputs
# and produces AudioBatch outputs with predicted text

Related Pages

Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
