Overview
Implements InferenceAsrNemoStage, a processing stage that performs automatic speech recognition (ASR) inference using NeMo pretrained models.
Description
InferenceAsrNemoStage extends ProcessingStage[FileGroupTask | DocumentBatch | AudioBatch, AudioBatch] and is a core inference stage for audio curation pipelines. It operates as follows:
- Setup: In the setup() method, it loads a pretrained NeMo ASR model via nemo_asr.models.ASRModel.from_pretrained(). The model is mapped to GPU or CPU based on the configured Resources (checked via check_cuda()). The setup_on_node() method delegates to setup() for distributed execution.
- Processing: The process() method accepts multiple input types (FileGroupTask, DocumentBatch, or AudioBatch) and extracts audio file paths from each. It validates the input, calls transcribe() with the file paths, and constructs an output AudioBatch with each entry containing the audio filepath and predicted text.
- Transcription: The transcribe() method calls the model's transcribe() and handles various output formats: tuples (taking the first element), nested lists of Hypothesis objects (extracting .text), and flat lists of outputs.
- I/O Declaration: The inputs() and outputs() methods declare the stage contract, requiring data as a top-level attribute and producing filepath_key and pred_text_key as data attributes.
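The output handling described under Transcription can be sketched in plain Python. This is an illustrative sketch, not the stage's actual code: the Hypothesis class below is a hypothetical stand-in that models only the .text attribute, normalize_asr_outputs is a hypothetical helper name, and taking the first hypothesis from each nested list is an assumption.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Hypothesis:
    """Hypothetical stand-in for a NeMo hypothesis; only .text is modeled."""

    text: str


def normalize_asr_outputs(outputs: Any) -> list[str]:
    """Reduce the output shapes named above to a flat list of strings."""
    # Tuple output: keep only the first element.
    if isinstance(outputs, tuple):
        outputs = outputs[0]

    texts: list[str] = []
    for item in outputs:
        # Nested list: assume the first hypothesis is the one to keep.
        if isinstance(item, list):
            if not item:
                texts.append("")
                continue
            item = item[0]
        # Hypothesis-like objects expose .text; anything else is stringified.
        texts.append(item.text if hasattr(item, "text") else str(item))
    return texts


flat = normalize_asr_outputs(["hello world", "good morning"])
nested = normalize_asr_outputs([[Hypothesis("hello world")], [Hypothesis("good morning")]])
from_tuple = normalize_asr_outputs(([Hypothesis("hello world")], None))
```

Collapsing every shape to list[str] matches the declared return type of transcribe(), so the rest of process() only has to pair strings with file paths.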
Usage
Use this stage in audio curation pipelines to generate ASR transcriptions from audio files. The transcriptions enable downstream quality metrics such as WER/CER computation and text-based filtering. Configure the stage with a NeMo model name and optionally specify GPU resources.
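To illustrate the downstream WER computation mentioned above, here is a minimal sketch of word error rate as word-level edit distance divided by the reference length. It is not the implementation used by NeMo-Curator; the function name, the "text" reference key, and the 0.5 threshold are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical entries: "text" is an assumed reference-transcript key,
# "pred_text" is the stage's default pred_text_key.
entries = [
    {"text": "the cat sat", "pred_text": "the cat sat"},
    {"text": "the cat sat", "pred_text": "a dog ran away"},
]
# Keep entries whose WER is at or below an assumed threshold of 0.5.
kept = [e for e in entries if word_error_rate(e["text"], e["pred_text"]) <= 0.5]
```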
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/audio/inference/asr_nemo.py
- Lines: 1-153
Signature
@dataclass
class InferenceAsrNemoStage(ProcessingStage[FileGroupTask | DocumentBatch | AudioBatch, AudioBatch]):
    model_name: str
    asr_model: Any | None = None
    filepath_key: str = "audio_filepath"
    pred_text_key: str = "pred_text"
    name: str = "ASR_inference"
    batch_size: int = 16
    resources: Resources = field(default_factory=lambda: Resources(cpus=1.0))

    def check_cuda(self) -> torch.device: ...
    def setup_on_node(self, _node_info=None, _worker_metadata=None) -> None: ...
    def setup(self, _worker_metadata=None) -> None: ...
    def inputs(self) -> tuple[list[str], list[str]]: ...
    def outputs(self) -> tuple[list[str], list[str]]: ...
    def transcribe(self, files: list[str]) -> list[str]: ...
    def process(self, task: FileGroupTask | DocumentBatch | AudioBatch) -> AudioBatch: ...
Import
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_name | str | Yes | Name of the NeMo ASR model (see NeMo ASR checkpoints) |
| asr_model | Any | No | Pre-loaded ASR model object (default: None, loaded during setup) |
| filepath_key | str | No | Key for audio file paths in data entries (default: "audio_filepath") |
| pred_text_key | str | No | Key for storing predicted transcriptions (default: "pred_text") |
| batch_size | int | No | Batch size for processing (default: 16) |
| resources | Resources | No | Compute resources declaration (default: Resources(cpus=1.0)) |
Process Input
| Name | Type | Required | Description |
|---|---|---|---|
| task | FileGroupTask, DocumentBatch, or AudioBatch | Yes | Input task containing audio file paths for transcription |
Outputs
| Name | Type | Description |
|---|---|---|
| result | AudioBatch | AudioBatch with entries containing filepath_key and pred_text_key fields |
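As a rough illustration of the output entries described in this contract, the sketch below pairs file paths with predicted text under the two configurable keys. It uses plain dictionaries rather than the real AudioBatch class, and build_entries and the file paths are hypothetical names chosen for the example.

```python
def build_entries(
    files: list[str],
    texts: list[str],
    filepath_key: str = "audio_filepath",
    pred_text_key: str = "pred_text",
) -> list[dict[str, str]]:
    """Pair each audio path with its predicted transcription."""
    # A length mismatch would silently drop items in zip(), so fail loudly.
    if len(files) != len(texts):
        raise ValueError("files and texts must have the same length")
    return [{filepath_key: f, pred_text_key: t} for f, t in zip(files, texts)]


# Hypothetical paths and transcriptions, for illustration only.
sample = build_entries(
    ["/data/audio/a.wav", "/data/audio/b.wav"],
    ["hello world", "good morning"],
)
```

With the default keys, each entry has the shape {"audio_filepath": ..., "pred_text": ...}; passing different filepath_key or pred_text_key values renames the fields accordingly.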
Usage Examples
Basic Usage
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.resources import Resources

asr_stage = InferenceAsrNemoStage(
    model_name="stt_en_conformer_ctc_large",
    filepath_key="audio_filepath",
    pred_text_key="pred_text",
    batch_size=16,
    resources=Resources(cpus=1.0, gpus=1.0),
)
Using in a Pipeline
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage

# Create ASR stage for English speech recognition
asr_stage = InferenceAsrNemoStage(
    model_name="stt_en_conformer_ctc_large",
)

# The stage accepts FileGroupTask, DocumentBatch, or AudioBatch inputs
# and produces AudioBatch outputs with predicted text