Workflow:Pytorch Serve HuggingFace Transformer Serving

From Leeroopedia
Knowledge Sources
Domains NLP, Model_Serving, Transformers
Last Updated 2026-02-13 18:00 GMT

Overview

End-to-end process for serving HuggingFace Transformer models (BERT, RoBERTa, GPT-2, DistilBERT) on TorchServe for sequence classification, token classification, question answering, and text generation tasks.

Description

This workflow demonstrates how to deploy pre-trained or fine-tuned HuggingFace Transformer models using TorchServe's generalized Transformer handler. The handler supports multiple NLP tasks through a unified interface, with optional torch.compile acceleration, BetterTransformer (Flash Attention/Xformer) kernel optimizations, Captum explanations, and batch inference. Models can be served in either eager mode (with safetensors) or TorchScript format.

Usage

Execute this workflow when you have a HuggingFace Transformer model (pre-trained or fine-tuned) that you need to serve for NLP tasks such as sentiment analysis, named entity recognition, question answering, or text generation. This is the recommended path for any HuggingFace model deployment on TorchServe.

Execution Steps

Step 1: Prepare the Transformer Model

Either fine-tune a model and save it with save_pretrained(), or download a pre-trained model using the provided helper script. The script reads configuration from model-config.yaml and saves model weights, vocabulary, and config files to a local directory.

Key considerations:

  • Save fine-tuned models with save_pretrained() on both the model and its tokenizer so that the output directory contains model.safetensors, config.json, and vocab.txt
  • The Download_Transformer_models.py script automates downloading for pre-trained models
  • Supported model types include bert-base-uncased, roberta-base, distilbert, gpt2, and other HuggingFace Hub models
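The preparation step can be sketched in Python as follows. This is a minimal illustration, not the contents of Download_Transformer_models.py: the function names save_for_serving and missing_artifacts are hypothetical, the choice of AutoModelForSequenceClassification assumes a sequence classification task, and vocab.txt is only produced by WordPiece-style tokenizers such as BERT's.

```python
from pathlib import Path

# Files named in this step; vocab.txt applies to WordPiece tokenizers such as BERT.
EXPECTED_FILES = ("model.safetensors", "config.json", "vocab.txt")


def save_for_serving(model_name: str, out_dir: str) -> None:
    """Download a sequence-classification model and save it locally (needs network)."""
    # Imported lazily so the check below also works without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Writes the weights and config.json; recent transformers releases
    # store the weights as model.safetensors by default.
    model.save_pretrained(out_dir)
    # Writes the tokenizer files (vocab.txt for BERT-style tokenizers).
    tokenizer.save_pretrained(out_dir)


def missing_artifacts(out_dir: str) -> list[str]:
    """Return the expected files that are absent from out_dir."""
    root = Path(out_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling save_for_serving("bert-base-uncased", "Transformer_model") and then missing_artifacts("Transformer_model") gives a quick check that the directory is ready for archiving; an empty list means nothing is missing.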

Step 2: Configure the Handler

Set up model-config.yaml to specify the model name, task mode, tokenizer settings, and optimization flags. The configuration controls how the generalized Transformer handler loads and processes the model.

Key considerations:

  • mode selects the NLP task: sequence_classification, token_classification, question_answering, or text_generation
  • save_mode controls model format: "pretrained" for eager mode, "torchscript" for JIT-traced models
  • Set BetterTransformer: true to enable Flash Attention kernel optimizations (up to 4.5x speedup with batched padded inputs)
  • Enable torch.compile via the pt2 section for inductor-based compilation
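Because a typo in mode or save_mode typically only surfaces when the handler loads, it can help to check the settings before archiving. The sketch below validates a dictionary of handler settings against the values listed above. The function validate_handler_config and the flat dictionary layout are assumptions made for illustration; the actual nesting inside model-config.yaml may differ.

```python
# Allowed values taken from the considerations above.
VALID_MODES = {
    "sequence_classification",
    "token_classification",
    "question_answering",
    "text_generation",
}
VALID_SAVE_MODES = {"pretrained", "torchscript"}


def validate_handler_config(handler: dict) -> list[str]:
    """Return a list of problems found in the handler settings (empty if none)."""
    problems = []
    if not handler.get("model_name"):
        problems.append("model_name is missing")
    if handler.get("mode") not in VALID_MODES:
        problems.append(f"unsupported mode: {handler.get('mode')!r}")
    if handler.get("save_mode") not in VALID_SAVE_MODES:
        problems.append(f"unsupported save_mode: {handler.get('save_mode')!r}")
    if not isinstance(handler.get("BetterTransformer", False), bool):
        problems.append("BetterTransformer must be true or false")
    return problems


# Hypothetical settings; the nesting used in a real model-config.yaml may differ.
example = {
    "model_name": "bert-base-uncased",
    "mode": "sequence_classification",
    "save_mode": "pretrained",
    "BetterTransformer": True,
}
```

Here validate_handler_config(example) returns an empty list, while a misspelled mode yields a single message naming the offending value.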

Step 3: Prepare Extra Files

Gather task-specific extra files required for inference: index_to_name.json for mapping prediction indices to labels, sample input files for testing, and any custom tokenizer configuration files.

Key considerations:

  • Sequence classification requires index_to_name.json with class label mappings
  • Token classification requires index_to_name.json with NER tag mappings
  • Question answering does not require an index_to_name.json file
  • Custom vocabularies require tokenizer_config.json, special_tokens_map.json, and merges.txt
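As an illustration of the label mapping file, the following sketch writes an index_to_name.json from an ordered list of labels. It assumes the file is a flat JSON object whose keys are class indices (as strings) and whose values are label names; the helper write_index_to_name and the example labels are hypothetical.

```python
import json
from pathlib import Path


def write_index_to_name(labels: list[str], path: str) -> dict[str, str]:
    """Write a mapping from class index (as a string key) to label name."""
    # Position in the list is assumed to match the model's output index.
    mapping = {str(index): label for index, label in enumerate(labels)}
    Path(path).write_text(json.dumps(mapping, indent=2), encoding="utf-8")
    return mapping
```

For a binary sentiment model, write_index_to_name(["negative", "positive"], "index_to_name.json") would map index 0 to "negative" and index 1 to "positive". The label order must match the order used during fine-tuning.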

Step 4: Create Model Archive

Package the model weights, handler, configuration, and extra files into a .mar archive using torch-model-archiver. The generalized Transformer handler (Transformer_handler_generalized.py) serves as the handler for all supported NLP tasks.

Command template (replace <ModelName> with the name to serve the model under):

torch-model-archiver \
  --model-name <ModelName> \
  --version 1.0 \
  --serialized-file Transformer_model/model.safetensors \
  --handler Transformer_handler_generalized.py \
  --config-file model-config.yaml \
  --extra-files "Transformer_model/config.json,index_to_name.json"
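When the archive is built from a script, the same command can be assembled programmatically. The sketch below only uses the flags shown in the command above; the helper names archiver_command and build_archive are hypothetical, and the paths assume the directory layout from the earlier steps.

```python
import subprocess


def archiver_command(model_name: str, model_dir: str, extra_files: list[str],
                     version: str = "1.0") -> list[str]:
    """Build the torch-model-archiver argument list for a safetensors model."""
    return [
        "torch-model-archiver",
        "--model-name", model_name,
        "--version", version,
        "--serialized-file", f"{model_dir}/model.safetensors",
        "--handler", "Transformer_handler_generalized.py",
        "--config-file", "model-config.yaml",
        # The archiver expects extra files as one comma-separated string.
        "--extra-files", ",".join(extra_files),
    ]


def build_archive(cmd: list[str]) -> None:
    """Run the archiver; raises CalledProcessError if it exits non-zero."""
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues with the comma-separated --extra-files value.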

Step 5: Start Server and Register Model

Move the archive to the model store, start TorchServe, and register the model. For batch inference, configure batch_size and max_batch_delay either through the Management API or via config.properties.

Key considerations:

  • Register with initial_workers to control parallelism
  • Set batch_size and max_batch_delay for automatic request batching
  • Start TorchServe with the --ncs (no config snapshots) flag so that it starts cleanly instead of restoring a previous snapshot
  • For model parallelism with GPT-2 on multi-GPU, set model_parallel: true in config and register with a single worker
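Registration through the Management API can be sketched as below. The example assumes the Management API listens on its default address, http://localhost:8081, and that the .mar file is already in the model store; the helper names and the default values are illustrative only. Note that max_batch_delay is expressed in milliseconds.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def registration_url(mar_file: str, initial_workers: int = 1, batch_size: int = 1,
                     max_batch_delay: int = 100,
                     base: str = "http://localhost:8081") -> str:
    """Build the Management API URL that registers a model archive."""
    query = urlencode({
        "url": mar_file,
        "initial_workers": initial_workers,
        "batch_size": batch_size,
        "max_batch_delay": max_batch_delay,  # milliseconds
    })
    return f"{base}/models?{query}"


def register(mar_file: str, **options) -> str:
    """POST the registration request and return the server's response body."""
    request = Request(registration_url(mar_file, **options), method="POST")
    with urlopen(request) as response:
        return response.read().decode("utf-8")
```

For the multi-GPU model-parallel case described above, the same call would be made with initial_workers=1.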

Step 6: Run Inference and Explanations

Send prediction requests to the inference endpoint and optionally request Captum-based explanations. The handler automatically tokenizes input text, runs inference, and maps outputs to human-readable labels.

Key considerations:

  • Predictions endpoint: POST /predictions/{model_name}
  • Explanations endpoint: POST /explanations/{model_name} (requires captum_explanation: true in config)
  • Captum explanations use LayerIntegratedGradients for interpretability
  • For batch inference, send multiple concurrent requests; TorchServe groups them into batches automatically according to batch_size and max_batch_delay
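A client that exercises batching could look like the sketch below. It assumes the inference API is on its default address, http://localhost:8080, and that the handler accepts raw text in the request body (question answering inputs are structured differently). The function names are hypothetical, and the post parameter exists only so the sender can be swapped out when testing without a running server.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen


def endpoint(kind: str, model_name: str, base: str = "http://localhost:8080") -> str:
    """Return the predictions or explanations URL for a model."""
    if kind not in ("predictions", "explanations"):
        raise ValueError(f"unknown endpoint kind: {kind}")
    return f"{base}/{kind}/{model_name}"


def post_text(url: str, text: str) -> str:
    """POST raw text to the server and return the response body."""
    request = Request(url, data=text.encode("utf-8"), method="POST")
    with urlopen(request) as response:
        return response.read().decode("utf-8")


def predict_many(model_name: str, texts: list[str], post=post_text) -> list[str]:
    """Send all texts concurrently so the server can group them into batches."""
    url = endpoint("predictions", model_name)
    with ThreadPoolExecutor(max_workers=max(1, len(texts))) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(lambda text: post(url, text), texts))
```

Whether the requests actually land in the same batch depends on the batch_size and max_batch_delay values set at registration.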
