Workflow:Pytorch Serve HuggingFace Transformer Serving
| Knowledge Sources | |
|---|---|
| Domains | NLP, Model_Serving, Transformers |
| Last Updated | 2026-02-13 18:00 GMT |
Overview
End-to-end process for serving HuggingFace Transformer models (BERT, RoBERTa, GPT-2, DistilBERT) on TorchServe for sequence classification, token classification, question answering, and text generation tasks.
Description
This workflow demonstrates how to deploy pre-trained or fine-tuned HuggingFace Transformer models using TorchServe's generalized Transformer handler. The handler supports multiple NLP tasks through a unified interface, with optional torch.compile acceleration, BetterTransformer kernel optimizations (Flash Attention and xFormers memory-efficient kernels), Captum explanations, and batch inference. Models can be served either in eager mode (loaded from safetensors weights) or in TorchScript format.
Usage
Execute this workflow when you have a HuggingFace Transformer model (pre-trained or fine-tuned) that you need to serve for NLP tasks such as sentiment analysis, named entity recognition, question answering, or text generation. This is the recommended path for any HuggingFace model deployment on TorchServe.
Execution Steps
Step 1: Prepare the Transformer Model
Either fine-tune a model and save it with save_pretrained(), or download a pre-trained model using the provided helper script. The script reads configuration from model-config.yaml and saves model weights, vocabulary, and config files to a local directory.
Key considerations:
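- Fine-tuned models must be saved with save_pretrained(): calling it on the model produces model.safetensors and config.json, and calling it on the tokenizer produces the vocabulary files (vocab.txt for BERT-style tokenizers)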
- The Download_Transformer_models.py script automates downloading for pre-trained models
- Supported model types include bert-base-uncased, roberta-base, distilbert, gpt2, and other HuggingFace Hub models
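Before moving on, it can help to confirm that the model directory actually contains the files listed above. The following is a minimal sketch of such a check, not part of the TorchServe example itself; the helper name `missing_artifacts` and the directory name in the usage note are hypothetical, and the default file list simply mirrors the BERT-style names mentioned in this step.

```python
from pathlib import Path

# File names taken from the considerations above. The vocabulary file name
# depends on the tokenizer (vocab.txt is the BERT-style WordPiece name), so
# pass a different `required` tuple for other tokenizer families.
DEFAULT_ARTIFACTS = ("model.safetensors", "config.json", "vocab.txt")


def missing_artifacts(model_dir, required=DEFAULT_ARTIFACTS):
    """Return the required file names that are absent from model_dir, sorted."""
    root = Path(model_dir)
    return sorted(name for name in required if not (root / name).is_file())
```

For example, `missing_artifacts("Transformer_model")` returns an empty list once every expected file is present in that directory.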
Step 2: Configure the Handler
Set up model-config.yaml to specify the model name, task mode, tokenizer settings, and optimization flags. The configuration controls how the generalized Transformer handler loads and processes the model.
Key considerations:
- mode selects the NLP task: sequence_classification, token_classification, question_answering, or text_generation
- save_mode controls model format: "pretrained" for eager mode, "torchscript" for JIT-traced models
- Set BetterTransformer: true to enable Flash Attention kernel optimizations (up to 4.5x speedup with batched padded inputs)
- Enable torch.compile via the pt2 section for inductor-based compilation
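A mistyped mode or save_mode value is easy to miss until the handler fails to load, so a small pre-flight check can be worthwhile. The sketch below validates a dictionary that stands in for the parsed handler settings; it only knows the values listed above, and the function name `check_handler_config`, the dictionary layout, and the `model_name` key are assumptions made for illustration.

```python
# Allowed values as listed in the considerations above.
ALLOWED_MODES = {
    "sequence_classification",
    "token_classification",
    "question_answering",
    "text_generation",
}
ALLOWED_SAVE_MODES = {"pretrained", "torchscript"}


def check_handler_config(handler_cfg):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    mode = handler_cfg.get("mode")
    if mode not in ALLOWED_MODES:
        problems.append(f"mode must be one of {sorted(ALLOWED_MODES)}, got {mode!r}")
    save_mode = handler_cfg.get("save_mode")
    if save_mode not in ALLOWED_SAVE_MODES:
        problems.append(
            f"save_mode must be one of {sorted(ALLOWED_SAVE_MODES)}, got {save_mode!r}"
        )
    # The flag is optional here; when present it must be a real boolean.
    if not isinstance(handler_cfg.get("BetterTransformer", False), bool):
        problems.append("BetterTransformer must be true or false")
    return problems


# Hypothetical settings, as they might look after parsing model-config.yaml.
example_cfg = {
    "model_name": "bert-base-uncased",  # key name is an assumption
    "mode": "sequence_classification",
    "save_mode": "pretrained",
    "BetterTransformer": False,
}
```

Calling `check_handler_config(example_cfg)` yields an empty list, while an unsupported mode or save_mode yields one message per problem.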
Step 3: Prepare Extra Files
Gather task-specific extra files required for inference: index_to_name.json for mapping prediction indices to labels, sample input files for testing, and any custom tokenizer configuration files.
Key considerations:
- Sequence classification requires index_to_name.json with class label mappings
- Token classification requires index_to_name.json with NER tag mappings
- Question answering does not require an index_to_name.json file
- Custom vocabularies require tokenizer_config.json, special_tokens_map.json, and merges.txt
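The label mapping can be generated from an ordered list of class names. This is a minimal sketch that assumes index_to_name.json maps string indices to label strings; the helper names are hypothetical, and treating text_generation as not needing the file is an assumption, since the list above only states the requirement for the other three modes.

```python
import json
from pathlib import Path

# Modes that the considerations above say require a label mapping.
MODES_NEEDING_LABELS = {"sequence_classification", "token_classification"}


def write_index_to_name(labels, path="index_to_name.json"):
    """Write {"0": labels[0], "1": labels[1], ...} to path and return the mapping."""
    mapping = {str(i): name for i, name in enumerate(labels)}
    Path(path).write_text(json.dumps(mapping, indent=2), encoding="utf-8")
    return mapping


def needs_index_to_name(mode):
    """True when the given task mode requires index_to_name.json."""
    return mode in MODES_NEEDING_LABELS
```

For a two-class sentiment model with hypothetical labels, `write_index_to_name(["negative", "positive"])` writes a file that maps "0" to "negative" and "1" to "positive".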
Step 4: Create Model Archive
Package the model weights, handler, configuration, and extra files into a .mar archive using torch-model-archiver. The generalized Transformer handler (Transformer_handler_generalized.py) serves as the handler for all supported NLP tasks.
Pseudocode:
torch-model-archiver \
    --model-name <ModelName> \
    --version 1.0 \
    --serialized-file Transformer_model/model.safetensors \
    --handler Transformer_handler_generalized.py \
    --config-file model-config.yaml \
    --extra-files "Transformer_model/config.json,index_to_name.json"
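When archives are built from a script, assembling the command as an argument list avoids quoting mistakes in the comma-separated --extra-files value. The sketch below only builds and prints the command using the flags shown above; the function name `build_archiver_command` and the model name `BERTSeqClassification` are placeholders chosen for illustration.

```python
import shlex


def build_archiver_command(model_name, serialized_file, handler, config_file,
                           extra_files, version="1.0"):
    """Return the torch-model-archiver call as an argument list."""
    return [
        "torch-model-archiver",
        "--model-name", model_name,
        "--version", version,
        "--serialized-file", serialized_file,
        "--handler", handler,
        "--config-file", config_file,
        # The archiver expects extra files as a single comma-separated value.
        "--extra-files", ",".join(extra_files),
    ]


cmd = build_archiver_command(
    model_name="BERTSeqClassification",  # hypothetical model name
    serialized_file="Transformer_model/model.safetensors",
    handler="Transformer_handler_generalized.py",
    config_file="model-config.yaml",
    extra_files=["Transformer_model/config.json", "index_to_name.json"],
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

Once torch-model-archiver is installed, the list can be passed to `subprocess.run(cmd, check=True)` to create the archive.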
Step 5: Start Server and Register Model
Move the archive to the model store, start TorchServe, and register the model. For batch inference, configure batch_size and max_batch_delay either through the Management API or via config.properties.
Key considerations:
- Register with initial_workers to control parallelism
- Set batch_size and max_batch_delay for automatic request batching
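- Start TorchServe with the --ncs (no config snapshot) flag so that it starts cleanly instead of restoring state from an earlier snapshot
- For model parallelism with GPT-2 across multiple GPUs, set model_parallel: true in the config and register the model with a single worker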
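Registration through the Management API is a single POST to the /models endpoint with the settings passed as query parameters. The sketch below builds that request without sending it; it assumes the Management API listens on TorchServe's default address http://localhost:8081, and the function name and the archive name in the usage note are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_register_request(mar_name, initial_workers=1, batch_size=1,
                           max_batch_delay=100,
                           management_url="http://localhost:8081"):
    """Build (but do not send) the POST request that registers a model archive."""
    query = urlencode({
        "url": mar_name,                     # archive file name in the model store
        "initial_workers": initial_workers,  # workers to start for this model
        "batch_size": batch_size,            # maximum requests per batch
        "max_batch_delay": max_batch_delay,  # milliseconds to wait for a full batch
    })
    return Request(f"{management_url}/models?{query}", method="POST")
```

With a server running, `urllib.request.urlopen(build_register_request("bert.mar", batch_size=4, max_batch_delay=50))` would submit the registration.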
Step 6: Run Inference and Explanations
Send prediction requests to the inference endpoint and optionally request Captum-based explanations. The handler automatically tokenizes input text, runs inference, and maps outputs to human-readable labels.
Key considerations:
- Predictions endpoint: POST /predictions/{model_name}
- Explanations endpoint: POST /explanations/{model_name} (requires captum_explanation: true in config)
- Captum explanations use LayerIntegratedGradients for interpretability
- For batch inference, send multiple requests concurrently; TorchServe groups them into batches automatically according to batch_size and max_batch_delay
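The two endpoints differ only in their path prefix, so one helper can build either request. This is a minimal client sketch that assumes the Inference API listens on TorchServe's default address http://localhost:8080 and that the input is sent as a plain UTF-8 text body; both the helper names and the payload shape are assumptions, and question answering inputs in particular may need a different structure.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen


def build_inference_request(model_name, text, explain=False,
                            inference_url="http://localhost:8080"):
    """Build a POST request for /predictions or /explanations."""
    endpoint = "explanations" if explain else "predictions"
    return Request(
        f"{inference_url}/{endpoint}/{model_name}",
        data=text.encode("utf-8"),
        method="POST",
    )


def _send(request):
    """Send one request and return the response body as text."""
    with urlopen(request) as response:
        return response.read().decode("utf-8")


def send_concurrently(requests, max_workers=4):
    """Send requests in parallel so the server has a chance to batch them."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_send, requests))
```

With a registered model, `send_concurrently([build_inference_request("my_model", t) for t in texts])` issues the requests in parallel and returns the responses in input order; `my_model` and `texts` are placeholders.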