Overview
MMFHandler is a TorchServe custom handler for MMF (Multimodal Framework) activity recognition models. It extends BaseHandler to support multimodal inference over video, audio, and text inputs. On startup it loads an MMF model together with its configuration and processors, and reads the activity label mappings from a CSV file. The handler uses omegaconf for configuration management, pandas for label loading, and the MMF framework for model and data processing.
Description
The MMFHandler class demonstrates how to serve a multimodal model (specifically the MMFTransformer for activity recognition) through TorchServe. It overrides all four stages of the BaseHandler pipeline (initialize, preprocess, inference, and postprocess) so that each stage can work with multimodal data.
Key Responsibilities
- Model Loading: Loads an MMF model using the MMF framework's model registry, along with OmegaConf-based configuration for processors and data pipeline
- Label Mapping: Reads activity labels from a CSV file using pandas, mapping numeric predictions to human-readable activity names
- Multimodal Preprocessing: Processes video frames, audio waveforms, and text descriptions through MMF's processor pipeline, assembling them into a SampleList
- Inference: Runs the MMFTransformer model on the assembled SampleList
- Postprocessing: Maps model output indices to activity label strings
Dependencies
| Dependency | Purpose |
|---|---|
| mmf | MMF framework for multimodal model loading and processing |
| omegaconf | Configuration management for MMF model and processor configs |
| pandas | Loading activity labels from CSV |
| torch | PyTorch tensor operations |
| ts.torch_handler.base_handler | Parent class providing the handler lifecycle |
Code Reference
Source Location
| File | Lines | Repository |
|---|---|---|
| examples/MMF-activity-recognition/handler.py | L34-147 | pytorch/serve |
Signature
from ts.torch_handler.base_handler import BaseHandler


class MMFHandler(BaseHandler):
    """
    TorchServe handler for MMF multimodal activity recognition.
    Processes video, audio, and text inputs through MMF processors,
    runs inference with MMFTransformer, and maps outputs to activity labels.
    """

    def initialize(self, context):
        """
        Load MMF model, configuration, processors, and activity labels.

        Sets up:
        - self.model: MMFTransformer loaded from checkpoint
        - self.config: OmegaConf config for processors
        - self.processors: Dict of MMF data processors (video, audio, text)
        - self.activity_labels: List of activity label strings from CSV

        Args:
            context: TorchServe context with model_dir, manifest, etc.
        """
        ...

    def preprocess(self, data):
        """
        Process multimodal input data into an MMF SampleList.

        Accepts video, audio, and text data from the request body.
        Each modality is processed through its corresponding MMF processor.
        Results are assembled into a SampleList for model consumption.

        Args:
            data (list): List of request dicts containing multimodal input.

        Returns:
            SampleList: MMF SampleList with processed video, audio, text tensors.
        """
        ...

    def inference(self, data, *args, **kwargs):
        """
        Run MMFTransformer forward pass on the SampleList.

        Args:
            data (SampleList): Preprocessed multimodal data.

        Returns:
            dict: Model output dict containing logits and predictions.
        """
        ...

    def postprocess(self, data):
        """
        Map model predictions to activity label strings.

        Args:
            data (dict): Model output dict with prediction indices.

        Returns:
            list: List of predicted activity label strings.
        """
        ...
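The `initialize` docstring notes that the TorchServe context carries `model_dir`. The following is a minimal sketch of how the packaged artifacts could be located there. It assumes the file names used in the packaging example under Usage Examples. The helper `resolve_artifacts`, the `ARTIFACT_NAMES` mapping, and the stand-in context object are hypothetical and are not taken from handler.py.

```python
import os
from types import SimpleNamespace

# Assumed file names, copied from the packaging example; adjust to your archive.
ARTIFACT_NAMES = {
    "config": "config.yaml",
    "labels": "activity_labels.csv",
    "checkpoint": "checkpoint.pth",
}


def resolve_artifacts(context):
    """Return the expected path of each artifact inside the model directory."""
    # TorchServe exposes the unpacked archive directory via system_properties.
    model_dir = context.system_properties.get("model_dir")
    return {key: os.path.join(model_dir, name) for key, name in ARTIFACT_NAMES.items()}


# Stand-in for the real TorchServe context, for illustration only.
fake_context = SimpleNamespace(system_properties={"model_dir": "/tmp/mmf_model"})
paths = resolve_artifacts(fake_context)
print(paths["labels"])
```

Keeping the file names in one mapping makes it easy to check that everything listed in `--extra-files` is present before the model is loaded.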
I/O Contract
| Method | Input | Output | Notes |
|---|---|---|---|
| initialize(context) | TorchServe context with model artifacts | None (sets self.model, self.config, self.processors, self.activity_labels) | Loads MMF checkpoint, OmegaConf config, CSV labels |
| preprocess(data) | List of request dicts with video/audio/text data | SampleList with processed tensors | Uses MMF processors for each modality |
| inference(data) | SampleList | Model output dict with logits | Runs MMFTransformer forward pass |
| postprocess(data) | Model output dict | List of activity label strings | Maps indices to CSV-loaded labels |
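The way the stages in this contract chain together can be illustrated with a toy stand-in. This is only a sketch: `ToyHandler`, its word-overlap "model", and its hard-coded labels are hypothetical and replace the real MMF processors and MMFTransformer. The `handle` method mirrors the preprocess, inference, postprocess order that BaseHandler applies to each request batch.

```python
class ToyHandler:
    """Stand-in with the same four stages; not the real MMFHandler."""

    def initialize(self, context):
        # The real handler reads these from a CSV file.
        self.activity_labels = ["cooking", "dancing", "playing basketball"]

    def preprocess(self, data):
        # The real handler builds an MMF SampleList; here only the text is kept.
        return [row["text"] for row in data]

    def inference(self, data):
        # Placeholder model: score each label by word overlap with the text.
        scores = []
        for text in data:
            words = set(text.lower().split())
            scores.append(
                [len(words & set(label.split())) for label in self.activity_labels]
            )
        return {"scores": scores}

    def postprocess(self, data):
        # Pick the highest-scoring index per sample and map it to its label.
        labels = []
        for row in data["scores"]:
            best = max(range(len(row)), key=row.__getitem__)
            labels.append(self.activity_labels[best])
        return labels

    def handle(self, data, context=None):
        # Same stage order as the I/O contract above.
        batch = self.preprocess(data)
        output = self.inference(batch)
        return self.postprocess(output)


handler = ToyHandler()
handler.initialize(context=None)
result = handler.handle([{"text": "two people dancing in a kitchen"}])
print(result)  # ['dancing']
```

Each stage consumes exactly what the previous one returns, which is why the output type of one row in the table matches the input type of the next.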
Input Data Format
| Field | Type | Description |
|---|---|---|
| video | bytes | Raw video data (frames) |
| audio | bytes | Raw audio waveform data |
| text | string | Text description or transcript |
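Before the MMF processors run, each request dict has to be split into its modalities. The sketch below shows one way to do that with the standard library, assuming the field names from the table above (`video`, `audio`, `text`). The helper `decode_request` is hypothetical. The bytes check is there because form fields typically reach a TorchServe handler as bytes or bytearray rather than str.

```python
import io


def decode_request(row):
    """Split one request dict into per-modality raw inputs."""
    text = row.get("text", "")
    # Text sent as a form field usually arrives as raw bytes.
    if isinstance(text, (bytes, bytearray)):
        text = text.decode("utf-8")
    return {
        # File-like wrappers let decoders read the payload without a temp file.
        "video": io.BytesIO(row["video"]),
        "audio": io.BytesIO(row["audio"]),
        "text": text.strip(),
    }


sample = decode_request(
    {"video": b"\x00\x01", "audio": b"\x02", "text": b" a person cooking \n"}
)
print(sample["text"])  # a person cooking
```

The actual decoding of video frames and audio waveforms is left to the MMF processors and is not shown here.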
Usage Examples
Example 1: Packaging the handler into a MAR
# Package the MMF handler with model artifacts
# torch-model-archiver --model-name mmf_activity \
# --version 1.0 \
# --handler examples/MMF-activity-recognition/handler.py \
# --extra-files "config.yaml,activity_labels.csv,checkpoint.pth" \
# --export-path model_store
Example 2: Sending a multimodal inference request
import requests

# Send a video file for activity recognition
# (only the video is attached in this request)
with open("video.mp4", "rb") as video_file:
    response = requests.post(
        "http://localhost:8080/predictions/mmf_activity",
        files={"data": video_file},
    )
print(response.json())
# Example output: ["playing basketball"]
Example 3: Handler initialization flow
# During initialize(), the handler:
# 1. Calls super().initialize(context) for base setup
# 2. Loads OmegaConf config from extra files
# 3. Instantiates MMF processors for video, audio, text
# 4. Reads activity_labels.csv with pandas
# 5. Loads MMFTransformer model from checkpoint
import pandas as pd
from omegaconf import OmegaConf
# Activity labels loaded as:
labels_df = pd.read_csv("activity_labels.csv")
activity_labels = labels_df["label"].tolist()
# e.g., ["playing basketball", "cooking", "dancing", ...]
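Once the labels are loaded, postprocessing comes down to turning per-class scores into label strings. The following is a minimal sketch in plain Python, assuming the model output can be reduced to one list of logits per sample. The function `top_k_labels` and its `k` parameter are hypothetical; the source does not say whether the handler returns one label or several, and the real handler would more likely operate on torch tensors.

```python
import math


def top_k_labels(logits, activity_labels, k=1):
    """Return the k most likely (label, probability) pairs for one sample."""
    # Numerically stable softmax: subtract the maximum before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank class indices by probability, highest first.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(activity_labels[i], probs[i]) for i in ranked[:k]]


activity_labels = ["playing basketball", "cooking", "dancing"]
best = top_k_labels([2.0, 0.5, -1.0], activity_labels, k=2)
print([label for label, _ in best])  # ['playing basketball', 'cooking']
```

With `k=1` this reduces to an argmax over the logits followed by a lookup in the label list.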
Related Pages