Overview
ASR_Emformer_Handler is a TorchServe handler for the Emformer RNN-T automatic speech recognition (ASR) model. The ModelHandler class loads a TorchScript JIT model and processes audio input through the torchaudio pipeline, extracting features via the EMFORMER_RNNT_BASE_LIBRISPEECH bundle and decoding tokens into text. Unlike most TorchServe handlers, this class does not inherit from BaseHandler; it implements a standalone initialize/handle interface directly.
Description
The ModelHandler class implements a minimal, self-contained handler for serving the Emformer RNN-T ASR model. It bypasses the BaseHandler abstraction to directly manage model loading and the audio-to-text inference pipeline using torchaudio's pre-built bundles.
Key Responsibilities
- Model Loading: Loads a TorchScript (.pt) model via torch.jit.load() during initialization
- Audio Preprocessing: Accepts raw audio bytes, converts them to waveform tensors, and extracts acoustic features using the EMFORMER_RNNT_BASE_LIBRISPEECH pipeline from torchaudio
- Token Decoding: Decodes the model's output token IDs into human-readable transcript text
- Standalone Design: Does not inherit from BaseHandler, implementing only initialize() and handle()
Dependencies
| Dependency | Purpose |
|---|---|
| torch | JIT model loading and tensor operations |
| torchaudio | Audio processing and EMFORMER_RNNT_BASE_LIBRISPEECH feature extraction pipeline |
Code Reference
Source Location
| File | Lines | Repository |
|---|---|---|
| examples/asr_rnnt_emformer/handler.py | L10-77 | pytorch/serve |
Signature
import torch
import torchaudio


class ModelHandler:
    """
    TorchServe handler for Emformer RNN-T ASR model.

    This handler loads a JIT-compiled Emformer model and uses the
    torchaudio EMFORMER_RNNT_BASE_LIBRISPEECH pipeline for feature
    extraction and token decoding.

    Note: Does NOT inherit from BaseHandler.
    """

    def __init__(self):
        self.model = None
        self.device = None
        self.initialized = False

    def initialize(self, context):
        """
        Load the JIT model from the model archive.

        Reads the serialized_file path from the manifest, loads it
        via torch.jit.load(), and moves it to the appropriate device.

        Args:
            context: TorchServe context with system_properties and manifest.
        """
        ...

    def handle(self, data, context):
        """
        Process audio input and return transcribed text.

        Pipeline:
            1. Read raw audio bytes from request body
            2. Convert to waveform tensor
            3. Extract features via EMFORMER_RNNT_BASE_LIBRISPEECH pipeline
            4. Run model inference
            5. Decode output tokens to text

        Args:
            data (list): List of request dicts containing audio bytes.
            context: TorchServe context object.

        Returns:
            list: List of transcription strings.
        """
        ...
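The initialize() docstring describes reading the serialized file path from the manifest. The sketch below illustrates only that path-resolution step, assuming the usual TorchServe context layout (system_properties["model_dir"] and manifest["model"]["serializedFile"]). The helper name resolve_model_path, the mock context, and the directory used are hypothetical, and the torch.jit.load() call is left as a comment so the snippet runs without PyTorch installed.

```python
import os
from types import SimpleNamespace


def resolve_model_path(context):
    """Return the path of the serialized model inside the extracted archive."""
    # Directory where TorchServe unpacked the .mar file (assumed key name).
    model_dir = context.system_properties.get("model_dir")
    # File name recorded by torch-model-archiver via --serialized-file.
    serialized_file = context.manifest["model"]["serializedFile"]
    return os.path.join(model_dir, serialized_file)


# Hypothetical stand-in for the context object passed to initialize().
mock_context = SimpleNamespace(
    system_properties={"model_dir": "/tmp/models/emformer_asr"},
    manifest={"model": {"serializedFile": "emformer_rnnt.pt"}},
)

model_path = resolve_model_path(mock_context)
print(model_path)

# In a real handler, the next step would be roughly:
#   model = torch.jit.load(model_path, map_location=device)
```

Keeping the path logic in a small function makes it easy to check against a mock context before packaging the archive.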
I/O Contract
| Method | Input | Output | Notes |
|---|---|---|---|
| initialize(context) | TorchServe context with model manifest | None (sets self.model, self.device) | Loads JIT model via torch.jit.load() |
| handle(data, context) | List of request dicts with audio bytes in body | List of transcription strings | Full pipeline: audio -> features -> inference -> decode |
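To make the handle() input side of this contract concrete, here is a minimal sketch of pulling the audio bytes out of the request list. TorchServe requests commonly carry the payload under a "data" or "body" key; whether this handler checks both is an assumption, and extract_audio_bytes is a hypothetical helper name rather than part of the handler.

```python
def extract_audio_bytes(data):
    """Pull raw audio bytes out of a TorchServe request batch (first item only)."""
    if not data:
        raise ValueError("empty request batch")
    request = data[0]
    # Try "data" first, then fall back to "body" (assumed key names).
    payload = request.get("data")
    if payload is None:
        payload = request.get("body")
    if payload is None:
        raise ValueError("request has neither 'data' nor 'body'")
    # Some clients deliver a bytearray; normalize to immutable bytes.
    return bytes(payload)


example_batch = [{"body": bytearray(b"RIFF....WAVE")}]
audio = extract_audio_bytes(example_batch)
print(type(audio).__name__, len(audio))
```

Failing early on an empty or malformed request keeps errors out of the audio decoding step, where they are harder to diagnose.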
Input Data Format
| Field | Type | Description |
|---|---|---|
| body | bytes | Raw audio data (WAV format recommended, 16 kHz sample rate) |
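For checking the request plumbing without a real recording, a payload in the recommended format can be generated with the standard library. The sketch below builds a mono, 16-bit PCM, 16 kHz WAV in memory; the function names and the sine-tone content are illustrative assumptions, and a pure tone is not expected to produce a meaningful transcript.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 16_000  # Hz, matching the recommended input sample rate


def make_test_wav(duration_s=0.5, freq_hz=440.0):
    """Build an in-memory mono 16-bit PCM WAV containing a sine tone."""
    n_samples = int(SAMPLE_RATE * duration_s)
    frames = b"".join(
        struct.pack(
            "<h",
            int(0.3 * 32767 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)),
        )
        for i in range(n_samples)
    )
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(frames)
    return buffer.getvalue()


def describe_wav(payload):
    """Return (channels, sample_rate, n_frames) read back from WAV bytes."""
    with wave.open(io.BytesIO(payload), "rb") as wav:
        return wav.getnchannels(), wav.getframerate(), wav.getnframes()


wav_bytes = make_test_wav()
print(describe_wav(wav_bytes))
```

The resulting bytes can be sent as the request body in the same way as the contents of a WAV file on disk.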
Output Data Format
| Field | Type | Description |
|---|---|---|
| transcription | string | Decoded text transcript of the audio input |
Usage Examples
Example 1: Package and serve the Emformer model
# Step 1: Archive the model
# torch-model-archiver --model-name emformer_asr \
# --version 1.0 \
# --serialized-file emformer_rnnt.pt \
# --handler examples/asr_rnnt_emformer/handler.py \
# --export-path model_store
# Step 2: Start TorchServe
# torchserve --start --model-store model_store \
# --models emformer_asr=emformer_asr.mar
Example 2: Send audio for transcription
import requests

# Send a WAV file for speech-to-text
with open("speech.wav", "rb") as audio_file:
    response = requests.post(
        "http://localhost:8080/predictions/emformer_asr",
        data=audio_file.read(),
        headers={"Content-Type": "application/octet-stream"},
    )

print(response.json())
# Output: ["the quick brown fox jumps over the lazy dog"]
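Depending on how the handler wraps its return value, the response body may arrive as a JSON list, a JSON object, or plain text. The sketch below is a defensive client-side parser under that assumption; parse_transcript is a hypothetical helper, and the "transcription" key is taken from the Output Data Format table rather than from a confirmed response schema.

```python
import json


def parse_transcript(body_text):
    """Return a transcript string from a response body that may be JSON or plain text."""
    try:
        parsed = json.loads(body_text)
    except json.JSONDecodeError:
        # Not JSON: treat the body itself as the transcript.
        return body_text.strip()
    if isinstance(parsed, list):
        return " ".join(str(item) for item in parsed).strip()
    if isinstance(parsed, dict) and "transcription" in parsed:
        return str(parsed["transcription"]).strip()
    return str(parsed).strip()


print(parse_transcript('["the quick brown fox"]'))
print(parse_transcript("the quick brown fox\n"))
```

With the requests example above, this would be called as parse_transcript(response.text) instead of response.json().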
Example 3: Handler pipeline internals
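import io

import torch
import torchaudio

# The handle() method performs roughly these steps. `audio_bytes` (the request
# body) and `model` (the JIT module loaded in initialize()) are placeholders.

# 1. Load audio bytes into a waveform tensor
waveform, sample_rate = torchaudio.load(io.BytesIO(audio_bytes))

# 2. Get the Emformer pipeline bundle
bundle = torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH

# 3. Extract features; the extractor expects a 1-D (mono) waveform
features, feature_lengths = bundle.get_streaming_feature_extractor()(waveform.squeeze())

# 4. Run inference. This assumes the JIT model is the scripted beam-search
#    decoder from bundle.get_decoder(), which also takes a beam width (10 here)
with torch.no_grad():
    hypotheses = model(features, feature_lengths, 10)

# 5. Convert the top hypothesis' token IDs to text
transcript = bundle.get_token_processor()(hypotheses[0][0])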