Workflow:Pytorch Serve HuggingFace Transformer Serving

Knowledge Sources	TorchServe HuggingFace Transformers TorchServe Custom Handlers
Domains	NLP, Model_Serving, Transformers
Last Updated	2026-02-13 18:00 GMT

Overview

End-to-end process for serving HuggingFace Transformer models (BERT, RoBERTa, GPT-2, DistilBERT) on TorchServe for sequence classification, token classification, question answering, and text generation tasks.

Description

This workflow demonstrates how to deploy pre-trained or fine-tuned HuggingFace Transformer models using TorchServe's generalized Transformer handler. The handler supports multiple NLP tasks through a unified interface, with optional torch.compile acceleration, BetterTransformer kernel optimizations (Flash Attention and xFormers memory-efficient attention), Captum explanations, and batch inference. Models can be served either in eager mode (from safetensors weights) or in TorchScript format.

Usage

Execute this workflow when you have a HuggingFace Transformer model (pre-trained or fine-tuned) that you need to serve for NLP tasks such as sentiment analysis, named entity recognition, question answering, or text generation. This is the recommended path for any HuggingFace model deployment on TorchServe.

Execution Steps

Step 1: Prepare the Transformer Model

Either fine-tune a model and save it with save_pretrained(), or download a pre-trained model using the provided helper script. The script reads configuration from model-config.yaml and saves model weights, vocabulary, and config files to a local directory.

Key considerations:

Fine-tuned models must be saved with save_pretrained(), called on both the model and the tokenizer, to produce model.safetensors and config.json (from the model) and vocab.txt (from the tokenizer)
The Download_Transformer_models.py script automates downloading for pre-trained models
Supported models include bert-base-uncased, roberta-base, DistilBERT variants, gpt2, and other HuggingFace Hub models
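As a minimal sketch of the first consideration, the standard-library check below reports which of the expected artifacts are missing from a saved model directory. The helper name missing_artifacts is hypothetical, and the file list simply mirrors the bullets above; tokenizers that do not use vocab.txt would need a different list.

```python
from pathlib import Path

# Files named in this step; vocab.txt applies to tokenizers that ship one (e.g. BERT).
REQUIRED_ARTIFACTS = ("model.safetensors", "config.json", "vocab.txt")


def missing_artifacts(model_dir, required=REQUIRED_ARTIFACTS):
    """Return the expected files that are absent from model_dir, in order."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    # "Transformer_model" is the directory name used by the archiver command in Step 4.
    missing = missing_artifacts("Transformer_model")
    print("missing files:", missing or "none")
```

Running this before packaging gives an early signal that save_pretrained() or the download script did not write everything the archive needs.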

Step 2: Configure the Handler

Set up model-config.yaml to specify the model name, task mode, tokenizer settings, and optimization flags. The configuration controls how the generalized Transformer handler loads and processes the model.

Key considerations:

mode selects the NLP task: sequence_classification, token_classification, question_answering, or text_generation
save_mode controls model format: "pretrained" for eager mode, "torchscript" for JIT-traced models
Set BetterTransformer: true to enable Flash Attention kernel optimizations (reported speedups of up to 4.5x with batched, padded inputs)
Enable torch.compile via the pt2 section for inductor-based compilation
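The settings above can be sanity-checked before packaging. The sketch below validates a handler configuration that has already been parsed into a dictionary. The mode and save_mode values and the BetterTransformer key come from the bullets above; the function name, the model_name key, and the example values are assumptions made for illustration, not the handler's own validation logic.

```python
# Allowed values taken from the Step 2 considerations.
ALLOWED_MODES = {
    "sequence_classification",
    "token_classification",
    "question_answering",
    "text_generation",
}
ALLOWED_SAVE_MODES = {"pretrained", "torchscript"}


def validate_handler_config(handler):
    """Return a list of human-readable problems found in a handler config dict."""
    problems = []
    if handler.get("mode") not in ALLOWED_MODES:
        problems.append(f"mode must be one of {sorted(ALLOWED_MODES)}")
    if handler.get("save_mode") not in ALLOWED_SAVE_MODES:
        problems.append(f"save_mode must be one of {sorted(ALLOWED_SAVE_MODES)}")
    if not isinstance(handler.get("BetterTransformer", False), bool):
        problems.append("BetterTransformer must be true or false")
    return problems


# Hypothetical handler settings, as they might look after parsing model-config.yaml.
example = {
    "model_name": "bert-base-uncased",  # key name is an assumption
    "mode": "sequence_classification",
    "save_mode": "pretrained",
    "BetterTransformer": False,
}

if __name__ == "__main__":
    print(validate_handler_config(example) or "config looks consistent")
```

Catching a misspelled mode here is cheaper than discovering it when a worker fails to load the model.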

Step 3: Prepare Extra Files

Gather task-specific extra files required for inference: index_to_name.json for mapping prediction indices to labels, sample input files for testing, and any custom tokenizer configuration files.

Key considerations:

Sequence classification requires index_to_name.json with class label mappings
Token classification requires index_to_name.json with NER tag mappings
Question answering does not require an index_to_name.json file
Custom vocabularies require tokenizer_config.json, special_tokens_map.json, and merges.txt
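For the classification tasks, the label mapping can be generated rather than written by hand. This sketch assumes index_to_name.json maps string indices to label names; the function name and the example labels are hypothetical.

```python
import json
from pathlib import Path


def write_index_to_name(labels, path="index_to_name.json"):
    """Write a {"0": label0, "1": label1, ...} mapping to path and return it."""
    mapping = {str(index): label for index, label in enumerate(labels)}
    Path(path).write_text(json.dumps(mapping, indent=2), encoding="utf-8")
    return mapping


if __name__ == "__main__":
    # Hypothetical labels for a two-class sentiment model.
    print(write_index_to_name(["negative", "positive"]))
```

The label order must match the class indices the model was trained with; otherwise predictions will be mapped to the wrong names.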

Step 4: Create Model Archive

Package the model weights, handler, configuration, and extra files into a .mar archive using torch-model-archiver. The generalized Transformer handler (Transformer_handler_generalized.py) serves as the handler for all supported NLP tasks.

Command template (replace <ModelName> with the archive name):

torch-model-archiver \
  --model-name <ModelName> \
  --version 1.0 \
  --serialized-file Transformer_model/model.safetensors \
  --handler Transformer_handler_generalized.py \
  --config-file model-config.yaml \
  --extra-files "Transformer_model/config.json,index_to_name.json"
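When the archive is built from a script, the same command can be assembled as an argument list, which avoids shell quoting problems with the comma-separated --extra-files value. This is a sketch: the function name and the example model name are hypothetical, while the flags and file paths are those of the command above.

```python
import shlex


def build_archiver_command(model_name, model_dir="Transformer_model",
                           extra_files=("index_to_name.json",), version="1.0"):
    """Return the torch-model-archiver argument list for the command above."""
    extras = [f"{model_dir}/config.json", *extra_files]
    return [
        "torch-model-archiver",
        "--model-name", model_name,
        "--version", version,
        "--serialized-file", f"{model_dir}/model.safetensors",
        "--handler", "Transformer_handler_generalized.py",
        "--config-file", "model-config.yaml",
        "--extra-files", ",".join(extras),
    ]


if __name__ == "__main__":
    # "BERTSeqClassification" is a placeholder archive name.
    command = build_archiver_command("BERTSeqClassification")
    print(shlex.join(command))
    # To actually build the archive: subprocess.run(command, check=True)
```

Question answering would pass extra_files=() since it needs no index_to_name.json.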

Step 5: Start Server and Register Model

Move the archive to the model store, start TorchServe, and register the model. For batch inference, configure batch_size and max_batch_delay either through the Management API or via config.properties.

Key considerations:

Register with initial_workers to control parallelism
Set batch_size and max_batch_delay for automatic request batching
Pass the --ncs (no config snapshots) flag when starting TorchServe so the server starts cleanly instead of restoring a previous snapshot
For model parallelism with GPT-2 on multi-GPU, set model_parallel: true in config and register with a single worker
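Registration through the Management API is a POST to /models with the settings passed as query parameters. The sketch below only builds that URL. It assumes the default management address http://localhost:8081; the function name and the archive file name are placeholders.

```python
from urllib.parse import urlencode


def registration_url(mar_file, initial_workers=1, batch_size=1,
                     max_batch_delay=100, host="http://localhost:8081"):
    """Build the Management API URL that registers an archive via POST /models."""
    query = urlencode({
        "url": mar_file,
        "initial_workers": initial_workers,
        "batch_size": batch_size,
        "max_batch_delay": max_batch_delay,  # milliseconds
    })
    return f"{host}/models?{query}"


if __name__ == "__main__":
    # Placeholder archive name; batch up to 4 requests, waiting at most 5 seconds.
    print(registration_url("BERTSeqClassification.mar", batch_size=4, max_batch_delay=5000))
```

The resulting URL can be sent with any HTTP client, for example curl -X POST "<url>", once the server is running.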

Step 6: Run Inference and Explanations

Send prediction requests to the inference endpoint and optionally request Captum-based explanations. The handler automatically tokenizes input text, runs inference, and maps outputs to human-readable labels.

Key considerations:

Predictions endpoint: POST /predictions/{model_name}
Explanations endpoint: POST /explanations/{model_name} (requires captum_explanation: true in config)
Captum explanations use LayerIntegratedGradients for interpretability
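For batch inference, clients still send individual requests; TorchServe groups concurrent requests into a batch on the server side, up to batch_size or until max_batch_delay elapses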
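A client for the two endpoints can be sketched with the standard library alone. This example assumes the default inference address http://localhost:8080 and a raw-text request body, as would suit a classification model; other tasks may expect a different payload. The function names, model name, and sample sentence are hypothetical.

```python
from urllib.request import Request, urlopen


def build_request(model_name, text, kind="predictions",
                  host="http://localhost:8080"):
    """Build (but do not send) a POST request for /predictions or /explanations."""
    if kind not in ("predictions", "explanations"):
        raise ValueError("kind must be 'predictions' or 'explanations'")
    return Request(
        f"{host}/{kind}/{model_name}",
        data=text.encode("utf-8"),
        method="POST",
    )


def send(request, timeout=30):
    """Send the request to a running server and return the response body as text."""
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Placeholder model name and input; only printing, so no live server is needed.
    request = build_request("my_model", "The movie was great.")
    print(request.get_method(), request.full_url)
```

To exercise server-side batching, several such requests can be submitted at once, for instance from a concurrent.futures.ThreadPoolExecutor that maps send over a list of requests.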
