Implementation:Pytorch Serve Accelerate Handler
| Field | Value |
|---|---|
| Page Type | Implementation (Wrapper Doc) |
| Title | Accelerate Handler |
| Implements | Principle:Pytorch_Serve_Accelerate_Device_Mapping |
| Source | examples/large_models/Huggingface_accelerate/custom_handler.py |
| Repository | TorchServe |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
The Accelerate handler demonstrates how to serve large HuggingFace models in TorchServe using automatic device mapping via the accelerate library. The handler is a BaseHandler subclass that loads a model with device_map="auto" and low_cpu_mem_usage=True, allowing the model to be automatically distributed across available GPUs, CPU, and optionally disk. This approach runs in a single process and does not require torchrun or any distributed handler base class.
Description
The Accelerate handler pattern consists of a single handler class that:
1. Reads configuration from a setup_config.json file in the model directory, which specifies device mapping options, memory limits, offloading settings, and data types.
2. Loads the model using HuggingFace's from_pretrained() with Accelerate-specific parameters:
- device_map: Set to "auto" for automatic layer-to-device assignment.
- low_cpu_mem_usage=True: Minimizes CPU memory usage during loading.
- max_memory: Per-device memory limits (e.g., {"0": "10GiB", "cpu": "30GiB"}).
- offload_folder: Directory for weight offloading to disk.
- offload_state_dict: Whether to offload the state dict during loading.
- torch_dtype: Data type for model weights (float16, float32, etc.).
3. Runs inference in a single process with the model distributed across devices. Accelerate handles tensor movement between devices transparently during the forward pass.
The example uses BloomForCausalLM and BloomTokenizerFast, but the pattern works with any HuggingFace model that supports device_map.
Usage
Code Reference
Source Location: examples/large_models/Huggingface_accelerate/custom_handler.py (lines 24-164)
Signature:
class TransformersSeqClassifierHandler(BaseHandler, ABC):
    """
    Transformers handler class for sequence, token classification
    and question answering.
    """

    def __init__(self):
        super(TransformersSeqClassifierHandler, self).__init__()
        self.initialized = False

    def initialize(self, ctx):
        self.manifest = ctx.manifest
        properties = ctx.system_properties
        model_dir = properties.get("model_dir")
        self.device = torch.device(
            "cuda:" + str(properties.get("gpu_id"))
            if torch.cuda.is_available() and properties.get("gpu_id") is not None
            else "cpu"
        )
        # Load setup_config.json for device_map and memory settings
        setup_config_path = os.path.join(model_dir, "setup_config.json")
        if os.path.isfile(setup_config_path):
            with open(setup_config_path) as setup_config_file:
                self.setup_config = json.load(setup_config_file)
        self.model = BloomForCausalLM.from_pretrained(
            model_dir + "/model",
            revision=self.setup_config["revision"],
            max_memory={
                int(key) if key.isnumeric() else key: value
                for key, value in self.setup_config["max_memory"].items()
            },
            low_cpu_mem_usage=self.setup_config["low_cpu_mem_usage"],
            device_map=self.setup_config["device_map"],
            offload_folder=self.setup_config["offload_folder"],
            offload_state_dict=self.setup_config["offload_state_dict"],
            torch_dtype=TORCH_DTYPES[self.setup_config["torch_dtype"]],
        )
        self.tokenizer = BloomTokenizerFast.from_pretrained(
            model_dir + "/model", return_tensors="pt"
        )
        self.model.eval()
        self.initialized = True
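The max_memory dict comprehension in initialize() is needed because JSON object keys are always strings, while Accelerate expects integer GPU indices (0, 1, ...) alongside string keys like "cpu". A standalone sketch of that normalization (the helper name normalize_max_memory is illustrative, not part of the handler):

```python
def normalize_max_memory(max_memory):
    """Convert JSON string keys like "0" to int GPU indices;
    leave non-numeric keys such as "cpu" unchanged."""
    return {
        int(key) if key.isnumeric() else key: value
        for key, value in max_memory.items()
    }

# JSON gives {"0": ..., "cpu": ...}; Accelerate wants {0: ..., "cpu": ...}
print(normalize_max_memory({"0": "10GiB", "1": "10GiB", "cpu": "30GiB"}))
# → {0: '10GiB', 1: '10GiB', 'cpu': '30GiB'}
```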
Import:
import json
import os
import torch
from abc import ABC
from transformers import BloomForCausalLM, BloomTokenizerFast
from ts.torch_handler.base_handler import BaseHandler
External Dependencies:
- accelerate (used implicitly by HuggingFace from_pretrained when device_map is specified)
- transformers (HuggingFace model and tokenizer classes)
I/O Contract
Inputs to initialize():
- ctx (Context): TorchServe context object containing:
  - ctx.system_properties["model_dir"] (str): Path to the extracted model archive
  - ctx.system_properties["gpu_id"] (int or None): Primary GPU ID
setup_config.json parameters:
| Parameter | Type | Description |
|---|---|---|
| device_map | str | Device mapping strategy; "auto" for automatic assignment. |
| low_cpu_mem_usage | bool | Minimize CPU memory during model loading. |
| max_memory | dict | Per-device memory limits, e.g., {"0": "10GiB", "cpu": "30GiB"}. |
| offload_folder | str | Directory for disk offloading of weights. |
| offload_state_dict | bool | Whether to offload the state dict during loading. |
| torch_dtype | str | Data type for model weights: "float16", "float32", "float64". |
| revision | str | Model revision/commit hash. |
| max_length | int | Maximum token length for the tokenizer. |
Inputs to preprocess():
- requests (list[dict]): List of request dictionaries whose "data" or "body" field contains the input text.
Output of preprocess():
- Tuple of (input_ids_batch, attention_mask_batch) tensors on self.device.
Inputs to inference():
- input_batch (tuple): Tuple of (input_ids_batch, attention_mask_batch) tensors.
Output of inference():
- list[str]: List of decoded generated text strings.
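The request-unwrapping step of preprocess() described in this contract can be sketched without the tokenizer. The helper name unwrap_request is hypothetical, but the "data"-then-"body" lookup and bytes decoding mirror the pattern common to TorchServe handlers:

```python
def unwrap_request(requests):
    """Extract one text string per request dict, checking the
    "data" field first and falling back to "body" (TorchServe
    populates one or the other depending on the client)."""
    texts = []
    for req in requests:
        input_text = req.get("data")
        if input_text is None:
            input_text = req.get("body")
        # REST clients typically send raw bytes; decode to str
        if isinstance(input_text, (bytes, bytearray)):
            input_text = input_text.decode("utf-8")
        texts.append(input_text)
    return texts

print(unwrap_request([{"data": b"Hello world"}, {"body": "Already a string"}]))
# → ['Hello world', 'Already a string']
```

In the real handler, the resulting strings would then be passed to the tokenizer to produce the (input_ids_batch, attention_mask_batch) tuple.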
Usage Examples
setup_config.json:
{
"revision": "main",
"max_memory": {
"0": "10GiB",
"1": "10GiB",
"cpu": "30GiB"
},
"low_cpu_mem_usage": true,
"device_map": "auto",
"offload_folder": "offload",
"offload_state_dict": true,
"torch_dtype": "float16",
"max_length": 50
}
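A setup_config.json like the one above can be parsed and sanity-checked before the handler loads the model. This check_setup_config helper is an illustrative sketch, not part of the shipped handler:

```python
import json

# Keys the handler reads from setup_config.json (per the table above)
REQUIRED_KEYS = {
    "revision", "max_memory", "low_cpu_mem_usage", "device_map",
    "offload_folder", "offload_state_dict", "torch_dtype", "max_length",
}
SUPPORTED_DTYPES = {"float16", "float32", "float64"}

def check_setup_config(raw_json):
    """Parse setup_config.json text and verify required keys and dtype."""
    config = json.loads(raw_json)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"setup_config.json missing keys: {sorted(missing)}")
    if config["torch_dtype"] not in SUPPORTED_DTYPES:
        raise ValueError(f"unsupported torch_dtype: {config['torch_dtype']}")
    return config

raw = """{
  "revision": "main",
  "max_memory": {"0": "10GiB", "1": "10GiB", "cpu": "30GiB"},
  "low_cpu_mem_usage": true,
  "device_map": "auto",
  "offload_folder": "offload",
  "offload_state_dict": true,
  "torch_dtype": "float16",
  "max_length": 50
}"""
config = check_setup_config(raw)
print(config["device_map"])  # → auto
```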
Packaging the model:
torch-model-archiver --model-name bloom \
--version 1.0 \
--handler custom_handler.py \
--extra-files model.zip,setup_config.json \
-r requirements.txt \
--config-file model-config.yaml \
--archive-format tgz
Minimal model-config.yaml (no torchrun needed):
minWorkers: 1
maxWorkers: 1
responseTimeout: 300
deviceType: "gpu"
parallelType: "custom"
Inference call:
curl http://localhost:8080/predictions/bloom -T input.txt
Related Pages
- Principle:Pytorch_Serve_Accelerate_Device_Mapping - Theory of automatic device mapping
- Pytorch_Serve_ParallelType_Config - ParallelType "custom" configuration
- Pytorch_Serve_BasePippyHandler - Alternative: PiPPy pipeline parallelism
- Pytorch_Serve_BaseDeepSpeedHandler - Alternative: DeepSpeed tensor parallelism
- Environment:Pytorch_Serve_CUDA_GPU_Environment - GPU environment for device mapping
- Distributed_Computing
- Inference