Implementation:Pytorch Serve ASR Emformer Handler

Overview

ASR_Emformer_Handler is a TorchServe handler for the Emformer RNN-T automatic speech recognition (ASR) model. The ModelHandler class loads a TorchScript JIT model and processes audio input through the torchaudio pipeline, extracting features via the EMFORMER_RNNT_BASE_LIBRISPEECH bundle and decoding tokens into text. Unlike most TorchServe handlers, this class does not inherit from BaseHandler; it implements a standalone initialize/handle interface directly.

Field | Value
Implementation Name | ASR_Emformer_Handler
Type | Custom Handler
Workflow | Speech_Recognition_Inference
Domains | Model_Serving, Speech_Recognition
Knowledge Sources | Pytorch_Serve
Last Updated | 2026-02-13 18:52 GMT

Description

The ModelHandler class implements a minimal, self-contained handler for serving the Emformer RNN-T ASR model. It bypasses the BaseHandler abstraction to directly manage model loading and the audio-to-text inference pipeline using torchaudio's pre-built bundles.

Key Responsibilities

  • Model Loading: Loads a TorchScript (.pt) model via torch.jit.load() during initialization
  • Audio Preprocessing: Accepts raw audio bytes, converts them to waveform tensors, and extracts acoustic features using the EMFORMER_RNNT_BASE_LIBRISPEECH pipeline from torchaudio
  • Token Decoding: Decodes the model's output token IDs into human-readable transcript text
  • Standalone Design: Does not inherit from BaseHandler, implementing only initialize() and handle()
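
The model-loading responsibility can be sketched as follows. This is a minimal illustration rather than the handler's actual code: the helper name resolve_model_location, the cuda_available flag, and the stand-in context are assumptions made for the example. The context fields it reads (system_properties["model_dir"], system_properties["gpu_id"], and manifest["model"]["serializedFile"]) are the ones TorchServe passes to initialize().

```python
import os
from types import SimpleNamespace


def resolve_model_location(context, cuda_available=False):
    """Return (model_path, device_string) derived from a TorchServe context."""
    properties = context.system_properties
    model_dir = properties.get("model_dir")
    serialized_file = context.manifest["model"]["serializedFile"]
    model_path = os.path.join(model_dir, serialized_file)

    # Use the GPU assigned to this worker only when CUDA is usable.
    gpu_id = properties.get("gpu_id")
    if cuda_available and gpu_id is not None:
        device = f"cuda:{gpu_id}"
    else:
        device = "cpu"
    return model_path, device


# Stand-in context for illustration; TorchServe supplies the real one.
fake_context = SimpleNamespace(
    system_properties={"model_dir": "/tmp/models/emformer_asr", "gpu_id": None},
    manifest={"model": {"serializedFile": "emformer_rnnt.pt"}},
)
model_path, device = resolve_model_location(fake_context)
print(model_path, device)

# Inside initialize(), the result would typically feed torch.jit.load:
#     self.model = torch.jit.load(model_path, map_location=device)
#     self.model.eval()
```

Keeping the path and device logic in a plain function makes it easy to check without a running TorchServe instance or a GPU.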

Dependencies

Dependency | Purpose
torch | JIT model loading and tensor operations
torchaudio | Audio processing and EMFORMER_RNNT_BASE_LIBRISPEECH feature extraction pipeline

Code Reference

Source Location

File | Lines | Repository
examples/asr_rnnt_emformer/handler.py | L10-77 | pytorch/serve

Signature

import torch
import torchaudio


class ModelHandler:
    """
    TorchServe handler for Emformer RNN-T ASR model.

    This handler loads a JIT-compiled Emformer model and uses the
    torchaudio EMFORMER_RNNT_BASE_LIBRISPEECH pipeline for feature
    extraction and token decoding.

    Note: Does NOT inherit from BaseHandler.
    """

    def __init__(self):
        self.model = None
        self.device = None
        self.initialized = False

    def initialize(self, context):
        """
        Load the JIT model from the model archive.

        Reads the serialized_file path from the manifest, loads it
        via torch.jit.load(), and moves it to the appropriate device.

        Args:
            context: TorchServe context with system_properties and manifest.
        """
        ...

    def handle(self, data, context):
        """
        Process audio input and return transcribed text.

        Pipeline:
            1. Read raw audio bytes from request body
            2. Convert to waveform tensor
            3. Extract features via EMFORMER_RNNT_BASE_LIBRISPEECH pipeline
            4. Run model inference
            5. Decode output tokens to text

        Args:
            data (list): List of request dicts containing audio bytes.
            context: TorchServe context object.

        Returns:
            list: List of transcription strings.
        """
        ...

I/O Contract

Method | Input | Output | Notes
initialize(context) | TorchServe context with model manifest | None (sets self.model, self.device) | Loads JIT model via torch.jit.load()
handle(data, context) | List of request dicts with audio bytes in body | List of transcription strings | Full pipeline: audio -> features -> inference -> decode
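
The handle() contract can be illustrated with a small sketch. TorchServe passes a list of request dicts and expects a list with one result per request, in the same order. The helper names extract_audio_bytes, handle_batch, and fake_transcribe are hypothetical, and the fallback from the "data" key to the "body" key follows the convention common in TorchServe handlers; it is assumed here rather than taken from this handler's source.

```python
def extract_audio_bytes(request):
    """Return the raw payload of one TorchServe request dict."""
    payload = request.get("data")
    if payload is None:
        payload = request.get("body")
    if isinstance(payload, (bytes, bytearray)):
        return bytes(payload)
    raise ValueError("request carries no audio bytes")


def handle_batch(data, transcribe):
    """Apply `transcribe` to every request and keep one output per input."""
    return [transcribe(extract_audio_bytes(request)) for request in data]


# Hypothetical stand-in for the real audio -> text pipeline.
def fake_transcribe(audio_bytes):
    return f"<{len(audio_bytes)} bytes of audio>"


batch = [{"body": b"\x00\x01"}, {"data": bytearray(b"\x00\x01\x02")}]
results = handle_batch(batch, fake_transcribe)
print(results)
```

Passing the transcription step in as a function keeps the request handling testable without loading the model.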

Input Data Format

Field | Type | Description
body | bytes | Raw audio data (WAV format recommended, 16 kHz sample rate)
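
A client can check that its payload matches the recommended format before sending it. The sketch below uses only the Python standard library; the helper names make_test_wav and describe_wav and the 440 Hz test tone are assumptions for illustration, not part of the handler.

```python
import io
import math
import struct
import wave

EXPECTED_SAMPLE_RATE = 16000  # matches the 16 kHz recommendation above


def make_test_wav(duration_s=0.1, sample_rate=EXPECTED_SAMPLE_RATE):
    """Build a mono 16-bit PCM WAV (a 440 Hz tone) entirely in memory."""
    n_frames = int(duration_s * sample_rate)
    samples = (
        int(0.3 * 32767 * math.sin(2 * math.pi * 440 * i / sample_rate))
        for i in range(n_frames)
    )
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))
    return buffer.getvalue()


def describe_wav(audio_bytes):
    """Read channel count, sample rate, and frame count from WAV bytes."""
    with wave.open(io.BytesIO(audio_bytes), "rb") as wav:
        return wav.getnchannels(), wav.getframerate(), wav.getnframes()


payload = make_test_wav()
channels, rate, frames = describe_wav(payload)
if rate != EXPECTED_SAMPLE_RATE:
    raise ValueError(f"expected {EXPECTED_SAMPLE_RATE} Hz audio, got {rate} Hz")
```

The resulting payload bytes are what a client would place in the request body.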

Output Data Format

Field | Type | Description
transcription | string | Decoded text transcript of the audio input

Usage Examples

Example 1: Package and serve the Emformer model

# Step 1: Archive the model
torch-model-archiver --model-name emformer_asr \
  --version 1.0 \
  --serialized-file emformer_rnnt.pt \
  --handler examples/asr_rnnt_emformer/handler.py \
  --export-path model_store

# Step 2: Start TorchServe
torchserve --start --model-store model_store \
  --models emformer_asr=emformer_asr.mar

Example 2: Send audio for transcription

import requests

# Send a WAV file for speech-to-text
with open("speech.wav", "rb") as audio_file:
    response = requests.post(
        "http://localhost:8080/predictions/emformer_asr",
        data=audio_file.read(),
        headers={"Content-Type": "application/octet-stream"},
    )

# Fail loudly on HTTP errors instead of printing an error payload
response.raise_for_status()

# The response body carries the transcript for this request
print(response.text)

Example 3: Handler pipeline internals

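import io

import torch
import torchaudio

# Illustrative sketch of the steps listed in handle(), written against the
# stock torchaudio bundle. The call signature of the handler's own TorchScript
# model is not shown on this page, so the bundle's decoder stands in for the
# inference step here.

# 1. Load audio bytes into a waveform (audio_bytes: the raw request body)
waveform, sample_rate = torchaudio.load(io.BytesIO(audio_bytes))

# 2. Get the Emformer pipeline bundle
bundle = torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH

# 3. Extract features; the extractor expects a 1-D waveform at bundle.sample_rate
features, length = bundle.get_feature_extractor()(waveform[0])

# 4. Run inference (beam search over the RNN-T model, beam width 10)
with torch.no_grad():
    hypotheses = bundle.get_decoder()(features, length, 10)

# 5. Decode the best hypothesis's token IDs to text
transcript = bundle.get_token_processor()(hypotheses[0][0])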
