Implementation:Pytorch Serve Transformer Handler Config

Field	Value
Page Type	Implementation
Title	Transformer Handler Config
Type	Pattern Doc
Short Description	YAML configuration file that controls the HuggingFace Transformer handler behavior, including model name, NLP task mode, tokenization settings, optimizations, and PyTorch 2.x compilation
Domains	NLP, Configuration
Source	examples/Huggingface_Transformers/model-config.yaml:L1-18
Knowledge Sources	TorchServe
Workflow	HuggingFace_Transformer_Serving
Last Updated	2026-02-13 00:00 GMT

Overview

The model-config.yaml file is the central configuration artifact for the HuggingFace Transformer serving workflow. It defines the NLP task mode, model identity, tokenization parameters, serialization format, explainability settings, and PyTorch 2.x compilation options. This single file is consumed by both the model download script and the serving handler, making it the authoritative specification for how a model is prepared and served.

Description

The configuration file has three groups of top-level settings: the worker scaling parameters (minWorkers and maxWorkers), the handler block for NLP-specific settings, and the pt2 block for torch.compile settings.
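To make the three-part layout concrete, the following is a minimal sketch that separates an already-parsed configuration into its worker, handler, and pt2 parts. The helper name split_config is hypothetical and not part of TorchServe, and the mapping is written inline as a Python dict (the shape yaml.safe_load would return) so the snippet runs without extra dependencies.

```python
from typing import Any, Dict, Tuple


def split_config(
    config: Dict[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Any], Dict[str, Any]]:
    """Split a parsed model-config mapping into (workers, handler, pt2).

    Hypothetical helper for illustration; TorchServe does not ship it.
    """
    # Worker scaling parameters are plain top-level keys
    workers = {k: config[k] for k in ("minWorkers", "maxWorkers") if k in config}
    # The handler and pt2 blocks are nested mappings; treat a missing block as empty
    handler = config.get("handler") or {}
    pt2 = config.get("pt2") or {}
    return workers, handler, pt2


# Parsed form of a trimmed model-config.yaml, written inline as a dict
example = {
    "minWorkers": 1,
    "maxWorkers": 1,
    "handler": {"model_name": "bert-base-uncased", "mode": "sequence_classification"},
    "pt2": {"compile": {"enable": True, "backend": "inductor", "mode": "reduce-overhead"}},
}

workers, handler, pt2 = split_config(example)
```

Treating a missing block as an empty mapping mirrors the handler's own use of .get("handler", {}), so downstream lookups do not need to special-case absent sections.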

Usage

This file is used in two contexts:

As input to Download_Transformer_models.py (read via yaml.safe_load)
As the model configuration bundled into the .mar archive and accessed via ctx.model_yaml_config in the handler

Code Reference

Source Location

Field	Value
Repository	pytorch/serve
File	`examples/Huggingface_Transformers/model-config.yaml`
Lines	L1-18

Full Configuration

minWorkers: 1
maxWorkers: 1
handler:
  model_name: bert-base-uncased
  mode: sequence_classification
  do_lower_case: true
  num_labels: 2
  save_mode: pretrained
  max_length: 150
  captum_explanation: true
  embedding_name: bert
  BetterTransformer: false
  model_parallel: false
pt2:
  compile:
    enable: True
    backend: inductor
    mode: reduce-overhead

Parameter Reference

Parameter	Type	Default	Description
minWorkers	integer	1	Minimum number of TorchServe worker processes
maxWorkers	integer	1	Maximum number of TorchServe worker processes
handler.model_name	string	bert-base-uncased	HuggingFace model identifier used for downloading and tokenizer loading
handler.mode	string	sequence_classification	NLP task mode; one of `sequence_classification`, `token_classification`, `question_answering`, `text_generation`
handler.do_lower_case	boolean	true	Whether the tokenizer lowercases input text
handler.num_labels	integer	2	Number of output labels for classification heads
handler.save_mode	string	pretrained	Serialization format: `pretrained` (HuggingFace native) or `torchscript` (traced)
handler.max_length	integer	150	Maximum token sequence length for padding and truncation
handler.captum_explanation	boolean	true	Enable Captum Layer Integrated Gradients for model explanations
handler.embedding_name	string	bert	Name of the model's embedding attribute accessed via `getattr(model, embedding_name)`
handler.BetterTransformer	boolean	false	Apply HuggingFace Optimum BetterTransformer fused attention optimization
handler.model_parallel	boolean	false	Enable GPT-2 model parallelism across multiple GPUs
pt2.compile.enable	boolean	True	Whether to apply `torch.compile()` to the model
pt2.compile.backend	string	inductor	The `torch.compile` backend (e.g., `inductor`, `cudagraphs`)
pt2.compile.mode	string	reduce-overhead	The `torch.compile` mode (e.g., `reduce-overhead`, `max-autotune`)
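The table above can be turned into a simple pre-flight check. The sketch below is illustrative only: validate_handler is a hypothetical helper, not something TorchServe provides, and it assumes that every handler key in the table is required and that num_labels and max_length must be positive, which the source does not state.

```python
from typing import Any, Dict, List

# Allowed values taken from the parameter table
ALLOWED_MODES = {
    "sequence_classification",
    "token_classification",
    "question_answering",
    "text_generation",
}
ALLOWED_SAVE_MODES = {"pretrained", "torchscript"}
BOOLEAN_KEYS = ("do_lower_case", "captum_explanation", "BetterTransformer", "model_parallel")


def validate_handler(settings: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the settings look valid.

    Hypothetical helper; treating every key as required is an assumption.
    """
    problems = []
    if not isinstance(settings.get("model_name"), str):
        problems.append("model_name must be a string")
    if settings.get("mode") not in ALLOWED_MODES:
        problems.append("mode must be one of " + ", ".join(sorted(ALLOWED_MODES)))
    if settings.get("save_mode") not in ALLOWED_SAVE_MODES:
        problems.append("save_mode must be 'pretrained' or 'torchscript'")
    for key in ("num_labels", "max_length"):
        value = settings.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            problems.append(key + " must be a positive integer")
    for key in BOOLEAN_KEYS:
        if not isinstance(settings.get(key), bool):
            problems.append(key + " must be a boolean")
    return problems


# Handler block from the full configuration, as a parsed dict
sample = {
    "model_name": "bert-base-uncased",
    "mode": "sequence_classification",
    "do_lower_case": True,
    "num_labels": 2,
    "save_mode": "pretrained",
    "max_length": 150,
    "captum_explanation": True,
    "embedding_name": "bert",
    "BetterTransformer": False,
    "model_parallel": False,
}
problems = validate_handler(sample)
```

Running a check like this before packaging the .mar archive could surface a misspelled mode or a quoted number early, rather than at worker start-up.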

Import

The YAML file is loaded using:

import yaml

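filename = "model-config.yaml"  # path to the configuration file
# Use a context manager so the file handle is closed after parsing
with open(filename) as f:
    model_yaml_config = yaml.safe_load(f)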
settings = model_yaml_config["handler"]

In the handler, it is accessed via the TorchServe context:

self.model_yaml_config = ctx.model_yaml_config
self.setup_config = self.model_yaml_config.get("handler", {})
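As a rough illustration of how the pt2 block could be consumed alongside the handler block, the sketch below derives keyword arguments for torch.compile from the parsed configuration. It is not TorchServe's actual logic: compile_kwargs and FakeContext are hypothetical names, FakeContext only mimics the model_yaml_config attribute of the real context, and torch itself is deliberately not imported so the snippet runs anywhere.

```python
from typing import Any, Dict, Optional


def compile_kwargs(model_yaml_config: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return keyword arguments for torch.compile, or None when compilation is off.

    Hypothetical helper for illustration only.
    """
    pt2 = model_yaml_config.get("pt2") or {}
    compile_cfg = pt2.get("compile") or {}
    if not compile_cfg.get("enable", False):
        return None
    # Forward every option except the on/off switch itself
    return {k: v for k, v in compile_cfg.items() if k != "enable"}


class FakeContext:
    """Hypothetical stand-in for the TorchServe context object."""

    def __init__(self, model_yaml_config: Dict[str, Any]) -> None:
        self.model_yaml_config = model_yaml_config


ctx = FakeContext(
    {
        "handler": {"model_name": "bert-base-uncased", "max_length": 150},
        "pt2": {"compile": {"enable": True, "backend": "inductor", "mode": "reduce-overhead"}},
    }
)

setup_config = ctx.model_yaml_config.get("handler", {})
kwargs = compile_kwargs(ctx.model_yaml_config)

# With PyTorch 2.x installed, the result could then be applied as:
#   if kwargs is not None:
#       model = torch.compile(model, **kwargs)
```

Returning None for the disabled case keeps the caller's decision explicit: the model is left uncompiled unless the configuration asks for it.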

I/O Contract

Input

Input	Format	Description
YAML file	YAML	A valid YAML file with `handler` and optional `pt2` sections

Output

Consumer	How Used
Download_Transformer_models.py	Reads `handler` section to determine model, mode, and serialization settings
TransformersSeqClassifierHandler.initialize()	Reads `handler` section for model loading, tokenization, and optimization settings
TransformersSeqClassifierHandler.initialize()	Reads `pt2` section for `torch.compile` configuration
TransformersSeqClassifierHandler.preprocess()	Reads `handler.mode` and `handler.max_length` for tokenization
TransformersSeqClassifierHandler.inference()	Reads `handler.mode` for task-specific inference branching
TransformersSeqClassifierHandler.get_insights()	Reads `handler.captum_explanation` and `handler.embedding_name`

Usage Examples

Example 1: Sequence Classification with BERT

handler:
  model_name: bert-base-uncased
  mode: sequence_classification
  do_lower_case: true
  num_labels: 2
  save_mode: pretrained
  max_length: 150
  captum_explanation: true
  embedding_name: bert
  BetterTransformer: false
  model_parallel: false

Example 2: Question Answering with BERT-Large

handler:
  model_name: bert-large-uncased-whole-word-masking-finetuned-squad
  mode: question_answering
  do_lower_case: true
  num_labels: 2
  save_mode: pretrained
  max_length: 128
  captum_explanation: false
  embedding_name: bert
  BetterTransformer: false
  model_parallel: false

Example 3: Text Generation with GPT-2 (Model Parallel)

handler:
  model_name: gpt2-large
  mode: text_generation
  do_lower_case: false
  num_labels: 1
  save_mode: pretrained
  max_length: 50
  captum_explanation: false
  embedding_name: gpt2
  BetterTransformer: false
  model_parallel: true

Example 4: With torch.compile Disabled

handler:
  model_name: bert-base-uncased
  mode: sequence_classification
  do_lower_case: true
  num_labels: 2
  save_mode: pretrained
  max_length: 150
  captum_explanation: false
  embedding_name: bert
  BetterTransformer: true
  model_parallel: false
pt2:
  compile:
    enable: False

Related Pages

Principle:Pytorch_Serve_Transformer_Configuration - The principle of externalizing handler behavior into declarative configuration
Implementation:Pytorch_Serve_Transformers_Model_Dowloader - The download script that reads this configuration
Implementation:Pytorch_Serve_TransformersSeqClassifierHandler - The handler that uses this configuration at runtime
