Implementation:Pytorch Serve Transformer Handler Config
| Field | Value |
|---|---|
| Page Type | Implementation |
| Title | Transformer Handler Config |
| Type | Pattern Doc |
| Short Description | YAML configuration file that controls the HuggingFace Transformer handler behavior, including model name, NLP task mode, tokenization settings, optimizations, and PyTorch 2.x compilation |
| Domains | NLP, Configuration |
| Source | examples/Huggingface_Transformers/model-config.yaml:L1-18 |
| Knowledge Sources | TorchServe |
| Workflow | HuggingFace_Transformer_Serving |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
The model-config.yaml file is the central configuration artifact for the HuggingFace Transformer serving workflow. It defines the NLP task mode, model identity, tokenization parameters, serialization format, explainability settings, and PyTorch 2.x compilation options. This single file is consumed by both the model download script and the serving handler, making it the authoritative specification for how a model is prepared and served.
Description
The configuration file is structured into three top-level sections: worker scaling parameters, the handler block for NLP-specific settings, and the pt2 block for torch.compile settings.
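As a rough illustration of that three-part layout, the sketch below splits an already-parsed configuration into its sections. The dict literal stands in for the output of yaml.safe_load, and the helper name split_sections is hypothetical; it is not part of TorchServe.

```python
# Hypothetical helper for illustration; not part of TorchServe.
def split_sections(config):
    """Split a parsed model-config.yaml mapping into its three parts."""
    handler = config.get("handler") or {}
    pt2 = config.get("pt2") or {}
    # Every key outside the two nested blocks is treated as a worker setting
    workers = {k: v for k, v in config.items() if k not in ("handler", "pt2")}
    return workers, handler, pt2


# Dict literal mirroring what parsing the YAML file would produce
parsed = {
    "minWorkers": 1,
    "maxWorkers": 1,
    "handler": {"model_name": "bert-base-uncased", "mode": "sequence_classification"},
    "pt2": {"compile": {"enable": True, "backend": "inductor", "mode": "reduce-overhead"}},
}

workers, handler, pt2 = split_sections(parsed)
print(workers)  # {'minWorkers': 1, 'maxWorkers': 1}
```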
Usage
This file is used in two contexts:
- As input to Download_Transformer_models.py (read via yaml.safe_load)
- As the model configuration bundled into the .mar archive and accessed via ctx.model_yaml_config in the handler
Code Reference
Source Location
| Field | Value |
|---|---|
| Repository | pytorch/serve |
| File | examples/Huggingface_Transformers/model-config.yaml |
| Lines | L1-18 |
Full Configuration
minWorkers: 1
maxWorkers: 1
handler:
    model_name: bert-base-uncased
    mode: sequence_classification
    do_lower_case: true
    num_labels: 2
    save_mode: pretrained
    max_length: 150
    captum_explanation: true
    embedding_name: bert
    BetterTransformer: false
    model_parallel: false
pt2:
    compile:
        enable: True
        backend: inductor
        mode: reduce-overhead
Parameter Reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| minWorkers | integer | 1 | Minimum number of TorchServe worker processes |
| maxWorkers | integer | 1 | Maximum number of TorchServe worker processes |
| handler.model_name | string | bert-base-uncased | HuggingFace model identifier used for downloading and tokenizer loading |
| handler.mode | string | sequence_classification | NLP task mode; one of sequence_classification, token_classification, question_answering, text_generation |
| handler.do_lower_case | boolean | true | Whether the tokenizer lowercases input text |
| handler.num_labels | integer | 2 | Number of output labels for classification heads |
| handler.save_mode | string | pretrained | Serialization format: pretrained (HuggingFace native) or torchscript (traced) |
| handler.max_length | integer | 150 | Maximum token sequence length for padding and truncation |
| handler.captum_explanation | boolean | true | Enable Captum Layer Integrated Gradients for model explanations |
| handler.embedding_name | string | bert | Name of the model's embedding attribute accessed via getattr(model, embedding_name) |
| handler.BetterTransformer | boolean | false | Apply HuggingFace Optimum BetterTransformer fused attention optimization |
| handler.model_parallel | boolean | false | Enable GPT-2 model parallelism across multiple GPUs |
| pt2.compile.enable | boolean | True | Whether to apply torch.compile() to the model |
| pt2.compile.backend | string | inductor | The torch.compile backend (e.g., inductor, cudagraphs) |
| pt2.compile.mode | string | reduce-overhead | The torch.compile mode (e.g., reduce-overhead, max-autotune) |
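The constraints in the parameter table can be checked before packaging the archive. The following is a minimal sketch, assuming the handler block has already been parsed into a dict; check_handler_settings is a hypothetical helper, and treating every field as required is an assumption of the sketch rather than documented TorchServe behavior.

```python
# Allowed values copied from the parameter table; treating every field as
# required is an assumption of this sketch.
VALID_MODES = {
    "sequence_classification",
    "token_classification",
    "question_answering",
    "text_generation",
}
VALID_SAVE_MODES = {"pretrained", "torchscript"}
BOOLEAN_KEYS = ("do_lower_case", "captum_explanation", "BetterTransformer", "model_parallel")


def check_handler_settings(handler):
    """Return a list of problems found in the handler block (empty if none)."""
    problems = []
    if handler.get("mode") not in VALID_MODES:
        problems.append(f"mode: unsupported value {handler.get('mode')!r}")
    if handler.get("save_mode") not in VALID_SAVE_MODES:
        problems.append(f"save_mode: unsupported value {handler.get('save_mode')!r}")
    for key in ("num_labels", "max_length"):
        value = handler.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            problems.append(f"{key}: expected a positive integer")
    for key in BOOLEAN_KEYS:
        if not isinstance(handler.get(key), bool):
            problems.append(f"{key}: expected a boolean")
    return problems


# Dict literal mirroring the handler block of the full configuration
example = {
    "model_name": "bert-base-uncased",
    "mode": "sequence_classification",
    "do_lower_case": True,
    "num_labels": 2,
    "save_mode": "pretrained",
    "max_length": 150,
    "captum_explanation": True,
    "embedding_name": "bert",
    "BetterTransformer": False,
    "model_parallel": False,
}
print(check_handler_settings(example))  # []
```

Returning a list of problems instead of raising on the first one lets a caller report every misconfigured field at once.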
Import
The YAML file is loaded using:
import yaml

# filename is the path to model-config.yaml
with open(filename) as f:
    model_yaml_config = yaml.safe_load(f)
settings = model_yaml_config["handler"]
In the handler, it is accessed via the TorchServe context:
self.model_yaml_config = ctx.model_yaml_config
self.setup_config = self.model_yaml_config.get("handler", {})
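The `.get("handler", {})` fallback means a configuration without a handler block yields an empty settings dict rather than a KeyError. The sketch below illustrates this with a SimpleNamespace standing in for the TorchServe context; the stand-in object and the read_handler_settings function are assumptions for illustration, not TorchServe APIs.

```python
from types import SimpleNamespace


def read_handler_settings(ctx):
    """Mirror the access pattern above, tolerating a missing config."""
    model_yaml_config = getattr(ctx, "model_yaml_config", None) or {}
    return model_yaml_config.get("handler", {})


# Stand-in for the context object that TorchServe passes to initialize()
ctx = SimpleNamespace(
    model_yaml_config={"handler": {"mode": "question_answering", "max_length": 128}}
)
settings = read_handler_settings(ctx)
print(settings["max_length"])  # 128

# A config without a handler block falls back to an empty dict
empty = read_handler_settings(SimpleNamespace(model_yaml_config={}))
print(empty)  # {}
```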
I/O Contract
Input
| Input | Format | Description |
|---|---|---|
| YAML file | YAML | A valid YAML file with handler and optional pt2 sections |
Output
| Consumer | How Used |
|---|---|
| Download_Transformer_models.py | Reads handler section to determine model, mode, and serialization settings |
| TransformersSeqClassifierHandler.initialize() | Reads handler section for model loading, tokenization, and optimization settings |
| TransformersSeqClassifierHandler.initialize() | Reads pt2 section for torch.compile configuration |
| TransformersSeqClassifierHandler.preprocess() | Reads handler.mode and handler.max_length for tokenization |
| TransformersSeqClassifierHandler.inference() | Reads handler.mode for task-specific inference branching |
| TransformersSeqClassifierHandler.get_insights() | Reads handler.captum_explanation and handler.embedding_name |
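One way the pt2 section could be turned into torch.compile arguments is sketched below. It is not the handler's actual code: compile_kwargs is a hypothetical helper that works on the parsed configuration dict and deliberately avoids importing torch, so it only prepares the keyword arguments.

```python
# Hypothetical helper; the real handler's logic may differ.
def compile_kwargs(model_yaml_config):
    """Return keyword arguments for torch.compile, or None when it is disabled."""
    compile_cfg = (model_yaml_config.get("pt2") or {}).get("compile") or {}
    if not compile_cfg.get("enable", False):
        return None
    # "enable" is a switch for the handler, not a torch.compile argument
    return {k: v for k, v in compile_cfg.items() if k != "enable"}


enabled = compile_kwargs(
    {"pt2": {"compile": {"enable": True, "backend": "inductor", "mode": "reduce-overhead"}}}
)
print(enabled)  # {'backend': 'inductor', 'mode': 'reduce-overhead'}

disabled = compile_kwargs({"pt2": {"compile": {"enable": False}}})
print(disabled)  # None
```

With a non-None result, a caller holding a loaded model could then apply `torch.compile(model, **enabled)`; an absent pt2 section is treated the same as compilation being disabled.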
Usage Examples
Example 1: Sequence Classification with BERT
handler:
    model_name: bert-base-uncased
    mode: sequence_classification
    do_lower_case: true
    num_labels: 2
    save_mode: pretrained
    max_length: 150
    captum_explanation: true
    embedding_name: bert
    BetterTransformer: false
    model_parallel: false
Example 2: Question Answering with BERT-Large
handler:
    model_name: bert-large-uncased-whole-word-masking-finetuned-squad
    mode: question_answering
    do_lower_case: true
    num_labels: 2
    save_mode: pretrained
    max_length: 128
    captum_explanation: false
    embedding_name: bert
    BetterTransformer: false
    model_parallel: false
Example 3: Text Generation with GPT-2 (Model Parallel)
handler:
    model_name: gpt2-large
    mode: text_generation
    do_lower_case: false
    num_labels: 1
    save_mode: pretrained
    max_length: 50
    captum_explanation: false
    embedding_name: gpt2
    BetterTransformer: false
    model_parallel: true
Example 4: With torch.compile Disabled
handler:
    model_name: bert-base-uncased
    mode: sequence_classification
    do_lower_case: true
    num_labels: 2
    save_mode: pretrained
    max_length: 150
    captum_explanation: false
    embedding_name: bert
    BetterTransformer: true
    model_parallel: false
pt2:
    compile:
        enable: False
Related Pages
- Principle:Pytorch_Serve_Transformer_Configuration - The principle of externalizing handler behavior into declarative configuration
- Implementation:Pytorch_Serve_Transformers_Model_Dowloader - The download script that reads this configuration
- Implementation:Pytorch_Serve_TransformersSeqClassifierHandler - The handler that uses this configuration at runtime