Implementation:Pytorch Serve ASR Emformer Handler

Overview

ASR_Emformer_Handler is a TorchServe handler for the Emformer RNN-T automatic speech recognition (ASR) model. The ModelHandler class loads a TorchScript JIT model and processes audio input through the torchaudio pipeline, extracting features via the EMFORMER_RNNT_BASE_LIBRISPEECH bundle and decoding tokens into text. Unlike most TorchServe handlers, this class does not inherit from BaseHandler; it implements a standalone initialize/handle interface directly.

Field	Value
Implementation Name	ASR_Emformer_Handler
Type	Custom Handler
Workflow	Speech_Recognition_Inference
Domains	Model_Serving, Speech_Recognition
Knowledge Sources	Pytorch_Serve
Last Updated	2026-02-13 18:52 GMT

Description

The ModelHandler class implements a minimal, self-contained handler for serving the Emformer RNN-T ASR model. It bypasses the BaseHandler abstraction to directly manage model loading and the audio-to-text inference pipeline using torchaudio's pre-built bundles.

Key Responsibilities

Model Loading: Loads a TorchScript (.pt) model via torch.jit.load() during initialization
Audio Preprocessing: Accepts raw audio bytes, converts them to waveform tensors, and extracts acoustic features using the EMFORMER_RNNT_BASE_LIBRISPEECH pipeline from torchaudio
Token Decoding: Decodes the model's output token IDs into human-readable transcript text
Standalone Design: Does not inherit from BaseHandler, implementing only initialize() and handle()
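The model-loading responsibility can be sketched as follows. This is a minimal illustration, not the handler's actual code. The helper names `resolve_model_path` and `select_device` are hypothetical, and the context layout (`system_properties["model_dir"]`, `system_properties["gpu_id"]`, `manifest["model"]["serializedFile"]`) is assumed to follow the convention used by TorchServe's built-in handlers. The sketch avoids importing torch so that it runs anywhere; in the handler, the results would feed `torch.jit.load()`.

```python
import os
from types import SimpleNamespace


def resolve_model_path(context):
    """Return the path of the serialized model named in the manifest."""
    model_dir = context.system_properties.get("model_dir")
    serialized_file = context.manifest["model"]["serializedFile"]
    return os.path.join(model_dir, serialized_file)


def select_device(context, cuda_available):
    """Pick a device string; `cuda_available` stands in for torch.cuda.is_available()."""
    gpu_id = context.system_properties.get("gpu_id")
    if cuda_available and gpu_id is not None:
        return f"cuda:{gpu_id}"
    return "cpu"


# Stand-in for the context object passed to initialize() (assumed layout).
fake_context = SimpleNamespace(
    system_properties={"model_dir": "/tmp/models", "gpu_id": 0},
    manifest={"model": {"serializedFile": "emformer_rnnt.pt"}},
)

model_path = resolve_model_path(fake_context)
device = select_device(fake_context, cuda_available=False)
# In the handler, these would feed: torch.jit.load(model_path, map_location=device)
```

Passing `map_location` when loading is one way to make a model exported on a GPU machine load on a CPU-only worker.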

Dependencies

Dependency	Purpose
`torch`	JIT model loading and tensor operations
`torchaudio`	Audio processing and `EMFORMER_RNNT_BASE_LIBRISPEECH` feature extraction pipeline

Code Reference

Source Location

File	Lines	Repository
`examples/asr_rnnt_emformer/handler.py`	L10-77	pytorch/serve

Signature

import torch
import torchaudio


class ModelHandler:
    """
    TorchServe handler for Emformer RNN-T ASR model.

    This handler loads a JIT-compiled Emformer model and uses the
    torchaudio EMFORMER_RNNT_BASE_LIBRISPEECH pipeline for feature
    extraction and token decoding.

    Note: Does NOT inherit from BaseHandler.
    """

    def __init__(self):
        self.model = None
        self.device = None
        self.initialized = False

    def initialize(self, context):
        """
        Load the JIT model from the model archive.

        Reads the serialized_file path from the manifest, loads it
        via torch.jit.load(), and moves it to the appropriate device.

        Args:
            context: TorchServe context with system_properties and manifest.
        """
        ...

    def handle(self, data, context):
        """
        Process audio input and return transcribed text.

        Pipeline:
            1. Read raw audio bytes from request body
            2. Convert to waveform tensor
            3. Extract features via EMFORMER_RNNT_BASE_LIBRISPEECH pipeline
            4. Run model inference
            5. Decode output tokens to text

        Args:
            data (list): List of request dicts containing audio bytes.
            context: TorchServe context object.

        Returns:
            list: List of transcription strings.
        """
        ...

I/O Contract

Method	Input	Output	Notes
`initialize(context)`	TorchServe context with model manifest	None (sets `self.model`, `self.device`)	Loads JIT model via `torch.jit.load()`
`handle(data, context)`	List of request dicts with audio bytes in `body`	List of transcription strings	Full pipeline: audio -> features -> inference -> decode
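The list-in, list-out contract of `handle()` can be illustrated with a small sketch. The names `extract_audio_bytes`, `handle_batch`, and `fake_transcribe` are hypothetical, and the fallback to a `data` key is an assumption borrowed from TorchServe's built-in handlers. The real feature-extraction, inference, and decoding steps are replaced by a stub so that the example stays self-contained.

```python
from typing import Callable, Dict, List


def extract_audio_bytes(request: Dict) -> bytes:
    """Pull the raw payload from one request dict ("data" or "body" key)."""
    payload = request.get("data") or request.get("body")
    if payload is None:
        raise ValueError("request contains no audio payload")
    return bytes(payload)


def handle_batch(data: List[Dict], transcribe: Callable[[bytes], str]) -> List[str]:
    """Return one transcript per request, in request order."""
    return [transcribe(extract_audio_bytes(request)) for request in data]


# Stub in place of the real feature-extraction/inference/decoding pipeline.
def fake_transcribe(audio: bytes) -> str:
    return f"<{len(audio)} bytes of audio>"


results = handle_batch(
    [{"body": b"\x00\x01"}, {"body": bytearray(b"\x02")}],
    fake_transcribe,
)
```

The returned list has one entry per incoming request, in the same order, so that each response can be matched to the request that produced it.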

Input Data Format

Field	Type	Description
`body`	bytes	Raw audio data (WAV format recommended, 16 kHz sample rate)
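A client or handler may want to confirm the sample rate before running inference. The sketch below does this with the standard-library `wave` module only. It is an optional pre-check under the assumption that the payload is a PCM WAV file; it is not part of the handler, and the helper names `wav_sample_rate` and `make_silent_wav` are hypothetical.

```python
import io
import wave


def wav_sample_rate(audio_bytes: bytes) -> int:
    """Read the sample rate from a WAV header without decoding the audio."""
    with wave.open(io.BytesIO(audio_bytes), "rb") as wav_file:
        return wav_file.getframerate()


def make_silent_wav(sample_rate: int, num_frames: int) -> bytes:
    """Build a mono, 16-bit PCM WAV of silence in memory (test fixture)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * num_frames)
    return buffer.getvalue()


EXPECTED_SAMPLE_RATE = 16_000  # recommended rate from the table above

payload = make_silent_wav(EXPECTED_SAMPLE_RATE, num_frames=160)
is_expected_rate = wav_sample_rate(payload) == EXPECTED_SAMPLE_RATE
```

Audio recorded at a different rate would need resampling before feature extraction; how the handler itself treats such input is not stated here.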

Output Data Format

Field	Type	Description
transcription	string	Decoded text transcript of the audio input

Usage Examples

Example 1: Package and serve the Emformer model

# Step 1: Archive the model (the export directory must exist beforehand)
mkdir -p model_store
torch-model-archiver --model-name emformer_asr \
  --version 1.0 \
  --serialized-file emformer_rnnt.pt \
  --handler examples/asr_rnnt_emformer/handler.py \
  --export-path model_store

# Step 2: Start TorchServe
torchserve --start --model-store model_store \
  --models emformer_asr=emformer_asr.mar

Example 2: Send audio for transcription

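import requests

# Send a WAV file for speech-to-text
with open("speech.wav", "rb") as audio_file:
    audio_bytes = audio_file.read()

response = requests.post(
    "http://localhost:8080/predictions/emformer_asr",
    data=audio_bytes,
    headers={"Content-Type": "application/octet-stream"},
    timeout=60,
)
response.raise_for_status()  # surface HTTP errors instead of printing an error body

# The handler returns one transcription string per request, so read the body as text
print(response.text)
# Illustrative output: the quick brown fox jumps over the lazy dog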

Example 3: Handler pipeline internals

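import io

import torch
import torchaudio

# Conceptual sketch of the steps handle() performs. `audio_bytes` (the request
# body) and `model` (the module loaded in initialize()) are assumed to exist.

# 1. Load audio bytes into a waveform tensor of shape (channels, frames)
waveform, sample_rate = torchaudio.load(io.BytesIO(audio_bytes))

# 2. Get the Emformer pipeline bundle
bundle = torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH

# 3. Extract features using the bundle's feature extractor, which expects a
#    1-D waveform; for mono audio, squeeze (1, frames) down to (frames,)
feature_extractor = bundle.get_streaming_feature_extractor()
features, feature_lengths = feature_extractor(waveform.squeeze(0))

# 4. Run inference without tracking gradients
with torch.no_grad():
    output = model(features, feature_lengths)

# 5. Decode tokens to text. This call is schematic: torchaudio's stock decoder
#    is a beam search invoked as decoder(features, lengths, beam_width), and
#    bundle.get_token_processor() converts the resulting token IDs to text.
transcript = bundle.get_decoder()(output)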

Related Pages

Principle:Pytorch_Serve_Speech_Recognition_Inference - Speech recognition inference principle this handler implements
Implementation:Pytorch_Serve_BaseHandler - Standard handler base class (not used by this handler)
Implementation:Pytorch_Serve_Generate_Model_Archive - Packages this handler into a .mar archive
Implementation:Pytorch_Serve_Service_Predict - Service layer that invokes handle() on this handler
