Implementation:Pytorch Serve Transformer Handler Config

From Leeroopedia
| Field | Value |
| --- | --- |
| Page Type | Implementation |
| Title | Transformer Handler Config |
| Type | Pattern Doc |
| Short Description | YAML configuration file that controls the HuggingFace Transformer handler behavior, including model name, NLP task mode, tokenization settings, optimizations, and PyTorch 2.x compilation |
| Domains | NLP, Configuration |
| Source | examples/Huggingface_Transformers/model-config.yaml:L1-18 |
| Knowledge Sources | TorchServe |
| Workflow | HuggingFace_Transformer_Serving |
| Last Updated | 2026-02-13 00:00 GMT |

Overview

The model-config.yaml file is the central configuration artifact for the HuggingFace Transformer serving workflow. It defines the NLP task mode, model identity, tokenization parameters, serialization format, explainability settings, and PyTorch 2.x compilation options. This single file is consumed by both the model download script and the serving handler, making it the authoritative specification for how a model is prepared and served.

Description

The configuration file is structured into three top-level sections: worker scaling parameters, the handler block for NLP-specific settings, and the pt2 block for torch.compile settings.
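As a minimal sketch of that layout, the snippet below splits an already-parsed configuration into its three parts. The plain dict stands in for the result of yaml.safe_load, and the helper name split_sections is hypothetical; it is not part of TorchServe.

```python
# Hypothetical helper, not part of TorchServe: separates a parsed
# model-config.yaml mapping into its three top-level parts.
WORKER_KEYS = ("minWorkers", "maxWorkers")


def split_sections(config):
    """Return (worker_settings, handler_settings, pt2_settings)."""
    workers = {key: config[key] for key in WORKER_KEYS if key in config}
    handler = config.get("handler") or {}  # tolerate a missing or empty block
    pt2 = config.get("pt2") or {}
    return workers, handler, pt2


# A dict mirroring what yaml.safe_load would return for a trimmed config.
example_config = {
    "minWorkers": 1,
    "maxWorkers": 1,
    "handler": {"model_name": "bert-base-uncased", "mode": "sequence_classification"},
    "pt2": {"compile": {"enable": True, "backend": "inductor"}},
}

workers, handler, pt2 = split_sections(example_config)
print(workers)
print(handler["mode"])
print(pt2["compile"]["backend"])
```

Treating a missing handler or pt2 block as an empty mapping keeps downstream lookups from failing when a section is omitted.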

Usage

This file is used in two contexts:

  1. As input to Download_Transformer_models.py (read via yaml.safe_load)
  2. As the model configuration bundled into the .mar archive and accessed via ctx.model_yaml_config in the handler

Code Reference

Source Location

| Field | Value |
| --- | --- |
| Repository | pytorch/serve |
| File | examples/Huggingface_Transformers/model-config.yaml |
| Lines | L1-18 |

Full Configuration

minWorkers: 1
maxWorkers: 1
handler:
  model_name: bert-base-uncased
  mode: sequence_classification
  do_lower_case: true
  num_labels: 2
  save_mode: pretrained
  max_length: 150
  captum_explanation: true
  embedding_name: bert
  BetterTransformer: false
  model_parallel: false
pt2:
  compile:
    enable: True
    backend: inductor
    mode: reduce-overhead

Parameter Reference

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| minWorkers | integer | 1 | Minimum number of TorchServe worker processes |
| maxWorkers | integer | 1 | Maximum number of TorchServe worker processes |
| handler.model_name | string | bert-base-uncased | HuggingFace model identifier used for downloading and tokenizer loading |
| handler.mode | string | sequence_classification | NLP task mode; one of sequence_classification, token_classification, question_answering, text_generation |
| handler.do_lower_case | boolean | true | Whether the tokenizer lowercases input text |
| handler.num_labels | integer | 2 | Number of output labels for classification heads |
| handler.save_mode | string | pretrained | Serialization format: pretrained (HuggingFace native) or torchscript (traced) |
| handler.max_length | integer | 150 | Maximum token sequence length for padding and truncation |
| handler.captum_explanation | boolean | true | Enable Captum Layer Integrated Gradients for model explanations |
| handler.embedding_name | string | bert | Name of the model's embedding attribute accessed via getattr(model, embedding_name) |
| handler.BetterTransformer | boolean | false | Apply HuggingFace Optimum BetterTransformer fused attention optimization |
| handler.model_parallel | boolean | false | Enable GPT-2 model parallelism across multiple GPUs |
| pt2.compile.enable | boolean | True | Whether to apply torch.compile() to the model |
| pt2.compile.backend | string | inductor | The torch.compile backend (e.g., inductor, cudagraphs) |
| pt2.compile.mode | string | reduce-overhead | The torch.compile mode (e.g., reduce-overhead, max-autotune) |
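The types and allowed values listed above can be checked before a model is packaged. The following is an illustrative sketch only: the function validate_handler and its checks are assumptions derived from the table, not validation that TorchServe itself performs.

```python
# Illustrative validator for the handler block. The name and the checks are
# assumptions based on the parameter table, not TorchServe behavior.
ALLOWED_MODES = {
    "sequence_classification",
    "token_classification",
    "question_answering",
    "text_generation",
}
ALLOWED_SAVE_MODES = {"pretrained", "torchscript"}

EXPECTED_TYPES = {
    "model_name": str,
    "mode": str,
    "do_lower_case": bool,
    "num_labels": int,
    "save_mode": str,
    "max_length": int,
    "captum_explanation": bool,
    "embedding_name": str,
    "BetterTransformer": bool,
    "model_parallel": bool,
}


def validate_handler(handler):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in handler:
            continue  # absent keys are not flagged in this sketch
        value = handler[key]
        # bool is a subclass of int in Python, so reject booleans for integer fields.
        wrong_type = not isinstance(value, expected) or (
            expected is int and isinstance(value, bool)
        )
        if wrong_type:
            problems.append(f"{key}: expected {expected.__name__}, got {type(value).__name__}")

    mode = handler.get("mode")
    if isinstance(mode, str) and mode not in ALLOWED_MODES:
        problems.append(f"mode: unsupported value {mode!r}")

    save_mode = handler.get("save_mode")
    if isinstance(save_mode, str) and save_mode not in ALLOWED_SAVE_MODES:
        problems.append(f"save_mode: unsupported value {save_mode!r}")
    return problems


good = {"model_name": "bert-base-uncased", "mode": "sequence_classification", "max_length": 150}
bad = {"mode": "summarization", "num_labels": True}
print(validate_handler(good))
print(validate_handler(bad))
```

Returning a list of problems instead of raising on the first one lets a caller report every misconfigured key in a single pass.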

Import

The YAML file is loaded using:

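import yaml

# Path to the configuration file; adjust to your layout
filename = "model-config.yaml"

# Use a context manager so the file handle is closed after parsing
with open(filename) as f:
    model_yaml_config = yaml.safe_load(f)
settings = model_yaml_config["handler"]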

In the handler, it is accessed via the TorchServe context:

self.model_yaml_config = ctx.model_yaml_config
self.setup_config = self.model_yaml_config.get("handler", {})
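The pt2 block can be read from the same mapping. As a hedged sketch, the helper below (compile_kwargs, a hypothetical name) turns the pt2.compile section into keyword arguments, treating a missing or disabled section as "do not compile". It works on a plain dict standing in for ctx.model_yaml_config and does not import torch.

```python
# Hypothetical helper: derive keyword arguments for torch.compile from the
# pt2 section of a parsed config. Returns None when compilation is off.
def compile_kwargs(model_yaml_config):
    pt2 = model_yaml_config.get("pt2") or {}
    compile_section = pt2.get("compile") or {}
    if not compile_section.get("enable", False):
        return None
    # Every key other than "enable" is passed through unchanged.
    return {key: value for key, value in compile_section.items() if key != "enable"}


enabled = {"pt2": {"compile": {"enable": True, "backend": "inductor", "mode": "reduce-overhead"}}}
disabled = {"pt2": {"compile": {"enable": False}}}

print(compile_kwargs(enabled))
print(compile_kwargs(disabled))
print(compile_kwargs({}))
```

When the result is not None, a handler could apply it as torch.compile(model, **kwargs), since backend and mode are both keyword arguments of torch.compile.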

I/O Contract

Input

| Input | Format | Description |
| --- | --- | --- |
| YAML file | YAML | A valid YAML file with handler and optional pt2 sections |

Output

| Consumer | How Used |
| --- | --- |
| Download_Transformer_models.py | Reads handler section to determine model, mode, and serialization settings |
| TransformersSeqClassifierHandler.initialize() | Reads handler section for model loading, tokenization, and optimization settings |
| TransformersSeqClassifierHandler.initialize() | Reads pt2 section for torch.compile configuration |
| TransformersSeqClassifierHandler.preprocess() | Reads handler.mode and handler.max_length for tokenization |
| TransformersSeqClassifierHandler.inference() | Reads handler.mode for task-specific inference branching |
| TransformersSeqClassifierHandler.get_insights() | Reads handler.captum_explanation and handler.embedding_name |
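To illustrate how handler.captum_explanation and handler.embedding_name interact, the sketch below resolves the embedding attribute with getattr only when explanations are enabled. FakeBertModel and resolve_embedding are hypothetical stand-ins for illustration; the real handler works with a loaded HuggingFace model.

```python
class FakeBertModel:
    """Stand-in for a loaded model; the attribute name matches embedding_name: bert."""

    def __init__(self):
        self.bert = "embedding-module-placeholder"


def resolve_embedding(model, handler):
    """Return the attribute named by embedding_name, or None if explanations are off."""
    if not handler.get("captum_explanation", False):
        return None
    name = handler["embedding_name"]
    if not hasattr(model, name):
        # A mismatch between the config and the model architecture ends up here.
        raise AttributeError(f"model has no attribute {name!r}; check handler.embedding_name")
    return getattr(model, name)


model = FakeBertModel()
print(resolve_embedding(model, {"captum_explanation": True, "embedding_name": "bert"}))
print(resolve_embedding(model, {"captum_explanation": False, "embedding_name": "bert"}))
```

This is why embedding_name must match the model family: a value of gpt2 against a BERT model would fail at lookup time in this sketch.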

Usage Examples

Example 1: Sequence Classification with BERT

handler:
  model_name: bert-base-uncased
  mode: sequence_classification
  do_lower_case: true
  num_labels: 2
  save_mode: pretrained
  max_length: 150
  captum_explanation: true
  embedding_name: bert
  BetterTransformer: false
  model_parallel: false

Example 2: Question Answering with BERT-Large

handler:
  model_name: bert-large-uncased-whole-word-masking-finetuned-squad
  mode: question_answering
  do_lower_case: true
  num_labels: 2
  save_mode: pretrained
  max_length: 128
  captum_explanation: false
  embedding_name: bert
  BetterTransformer: false
  model_parallel: false

Example 3: Text Generation with GPT-2 (Model Parallel)

handler:
  model_name: gpt2-large
  mode: text_generation
  do_lower_case: false
  num_labels: 1
  save_mode: pretrained
  max_length: 50
  captum_explanation: false
  embedding_name: gpt2
  BetterTransformer: false
  model_parallel: true

Example 4: With torch.compile Disabled

handler:
  model_name: bert-base-uncased
  mode: sequence_classification
  do_lower_case: true
  num_labels: 2
  save_mode: pretrained
  max_length: 150
  captum_explanation: false
  embedding_name: bert
  BetterTransformer: true
  model_parallel: false
pt2:
  compile:
    enable: False
