
Implementation:Pytorch Serve Accelerate Handler

From Leeroopedia
Page Type: Implementation (Wrapper Doc)
Title: Accelerate Handler
Implements: Principle:Pytorch_Serve_Accelerate_Device_Mapping
Source: examples/large_models/Huggingface_accelerate/custom_handler.py
Repository: TorchServe
Last Updated: 2026-02-13 00:00 GMT

Overview

The Accelerate handler demonstrates how to serve large HuggingFace models in TorchServe using automatic device mapping via the accelerate library. The handler is a BaseHandler subclass that loads a model with device_map="auto" and low_cpu_mem_usage=True, allowing the model to be automatically distributed across available GPUs, CPU, and optionally disk. This approach runs in a single process and does not require torchrun or any distributed handler base class.

Description

The Accelerate handler pattern consists of a single handler class that:

1. Reads configuration from a setup_config.json file in the model directory, which specifies device mapping options, memory limits, offloading settings, and data types.

2. Loads the model using HuggingFace's from_pretrained() with Accelerate-specific parameters:

  • device_map: Set to "auto" for automatic layer-to-device assignment
  • low_cpu_mem_usage=True: Minimizes CPU memory usage during loading
  • max_memory: Per-device memory limits (e.g., {"0": "10GiB", "cpu": "30GiB"})
  • offload_folder: Directory for weight offloading to disk
  • offload_state_dict: Whether to offload the state dict during loading
  • torch_dtype: Data type for model weights (float16, float32, etc.)

3. Runs inference in a single process with the model distributed across devices. Accelerate handles tensor movement between devices transparently during the forward pass.

The example uses BloomForCausalLM and BloomTokenizerFast, but the pattern works with any HuggingFace model that supports device_map.
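The loading step described above amounts to translating setup_config.json entries into from_pretrained() keyword arguments. A minimal sketch of that translation, using a hypothetical build_load_kwargs helper (not part of the example source; the real handler builds these arguments inline and additionally converts the dtype string to a torch dtype):

```python
def build_load_kwargs(setup_config: dict) -> dict:
    """Hypothetical helper: translate setup_config.json entries into
    from_pretrained() keyword arguments, as the handler does inline.

    GPU keys in max_memory arrive as JSON strings ("0", "1"), but
    Accelerate expects integer device indices, so numeric keys are cast.
    """
    return {
        "revision": setup_config["revision"],
        "max_memory": {
            int(key) if key.isnumeric() else key: value
            for key, value in setup_config["max_memory"].items()
        },
        "low_cpu_mem_usage": setup_config["low_cpu_mem_usage"],
        "device_map": setup_config["device_map"],
        "offload_folder": setup_config["offload_folder"],
        "offload_state_dict": setup_config["offload_state_dict"],
        # the real handler maps this string to a torch dtype via a
        # TORCH_DTYPES dict, e.g. "float16" -> torch.float16
        "torch_dtype": setup_config["torch_dtype"],
    }

config = {
    "revision": "main",
    "max_memory": {"0": "10GiB", "cpu": "30GiB"},
    "low_cpu_mem_usage": True,
    "device_map": "auto",
    "offload_folder": "offload",
    "offload_state_dict": True,
    "torch_dtype": "float16",
}
kwargs = build_load_kwargs(config)
# numeric GPU keys become ints (0), while "cpu" stays a string
```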

Usage

Code Reference

Source Location: examples/large_models/Huggingface_accelerate/custom_handler.py (lines 24-164)

Signature:

class TransformersSeqClassifierHandler(BaseHandler, ABC):
    """
    Transformers handler class for sequence, token classification
    and question answering.
    """

    def __init__(self):
        super(TransformersSeqClassifierHandler, self).__init__()
        self.initialized = False

    def initialize(self, ctx):
        self.manifest = ctx.manifest
        properties = ctx.system_properties
        model_dir = properties.get("model_dir")

        self.device = torch.device(
            "cuda:" + str(properties.get("gpu_id"))
            if torch.cuda.is_available() and properties.get("gpu_id") is not None
            else "cpu"
        )

        # Load setup_config.json for device_map and memory settings
        setup_config_path = os.path.join(model_dir, "setup_config.json")
        if os.path.isfile(setup_config_path):
            with open(setup_config_path) as setup_config_file:
                self.setup_config = json.load(setup_config_file)

        self.model = BloomForCausalLM.from_pretrained(
            model_dir + "/model",
            revision=self.setup_config["revision"],
            max_memory={
                int(key) if key.isnumeric() else key: value
                for key, value in self.setup_config["max_memory"].items()
            },
            low_cpu_mem_usage=self.setup_config["low_cpu_mem_usage"],
            device_map=self.setup_config["device_map"],
            offload_folder=self.setup_config["offload_folder"],
            offload_state_dict=self.setup_config["offload_state_dict"],
            torch_dtype=TORCH_DTYPES[self.setup_config["torch_dtype"]],
        )

        self.tokenizer = BloomTokenizerFast.from_pretrained(
            model_dir + "/model", return_tensors="pt"
        )
        self.model.eval()
        self.initialized = True

Import:

import json
import os
from abc import ABC

import torch
from transformers import BloomForCausalLM, BloomTokenizerFast

from ts.torch_handler.base_handler import BaseHandler

# Module-level mapping used by the torch_dtype lookup in initialize()
TORCH_DTYPES = {
    "float16": torch.float16,
    "float32": torch.float32,
    "float64": torch.float64,
}

External Dependencies:

  • accelerate (used implicitly by HuggingFace from_pretrained when device_map is specified)
  • transformers (HuggingFace model and tokenizer classes)

I/O Contract

Inputs to initialize():

  • ctx (Context): TorchServe context object containing:
    • ctx.system_properties["model_dir"] (str): Path to extracted model archive
    • ctx.system_properties["gpu_id"] (int or None): Primary GPU ID

setup_config.json parameters:

  • device_map (str): Device mapping strategy; "auto" assigns layers to devices automatically.
  • low_cpu_mem_usage (bool): Minimize CPU memory usage during model loading.
  • max_memory (dict): Per-device memory limits, e.g., {"0": "10GiB", "cpu": "30GiB"}.
  • offload_folder (str): Directory for disk offloading of weights.
  • offload_state_dict (bool): Whether to offload the state dict during loading.
  • torch_dtype (str): Data type for model weights: "float16", "float32", or "float64".
  • revision (str): Model revision/commit hash.
  • max_length (int): Maximum token length for the tokenizer.

Inputs to preprocess():

  • requests (list[dict]): List of request dictionaries with "data" or "body" field containing input text.

Output of preprocess():

  • Tuple of (input_ids_batch, attention_mask_batch) tensors on self.device.

Inputs to inference():

  • input_batch (tuple): Tuple of (input_ids_batch, attention_mask_batch) tensors.

Output of inference():

  • list[str]: List of decoded generated text strings.
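The request-unwrapping side of the preprocess() contract can be illustrated in isolation. This is a hedged sketch: extract_texts is a hypothetical helper, and the real handler passes the extracted strings on to the tokenizer rather than returning them.

```python
def extract_texts(requests: list) -> list:
    """Hypothetical helper showing how preprocess() unwraps TorchServe
    requests: each dict carries the payload under "data" or "body",
    and the payload may arrive as raw bytes or as a string."""
    texts = []
    for req in requests:
        data = req.get("data") or req.get("body")
        if isinstance(data, (bytes, bytearray)):
            data = data.decode("utf-8")
        texts.append(data)
    return texts

batch = [{"data": b"My dog is cute"}, {"body": "Hello world"}]
texts = extract_texts(batch)
# → ["My dog is cute", "Hello world"]
```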

Usage Examples

setup_config.json:

{
    "revision": "main",
    "max_memory": {
        "0": "10GiB",
        "1": "10GiB",
        "cpu": "30GiB"
    },
    "low_cpu_mem_usage": true,
    "device_map": "auto",
    "offload_folder": "offload",
    "offload_state_dict": true,
    "torch_dtype": "float16",
    "max_length": 50
}
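Because initialize() indexes the config with bare keys (self.setup_config["revision"], etc.), a missing entry raises KeyError at worker startup. A small pre-flight check along these lines can catch that before packaging; check_setup_config is a hypothetical helper, not part of the example:

```python
import json

# Keys the handler reads without a fallback default
REQUIRED_KEYS = {
    "revision", "max_memory", "low_cpu_mem_usage", "device_map",
    "offload_folder", "offload_state_dict", "torch_dtype",
}

def check_setup_config(path: str) -> set:
    """Hypothetical helper: return the required keys missing from
    setup_config.json (an empty set means the file is usable)."""
    with open(path) as f:
        config = json.load(f)
    return REQUIRED_KEYS - config.keys()
```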

Packaging the model:

torch-model-archiver --model-name bloom \
    --version 1.0 \
    --handler custom_handler.py \
    --extra-files model.zip,setup_config.json \
    -r requirements.txt \
    --config-file model-config.yaml \
    --archive-format tgz

Minimal model-config.yaml (no torchrun needed):

minWorkers: 1
maxWorkers: 1
responseTimeout: 300
deviceType: "gpu"
parallelType: "custom"

Inference call:

curl http://localhost:8080/predictions/bloom -T input.txt
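The same call can be issued from Python with only the standard library. build_request is a hypothetical helper, and the endpoint name assumes the archive was registered as bloom, as in the packaging step above:

```python
import urllib.request

def build_request(text: str,
                  url: str = "http://localhost:8080/predictions/bloom"):
    """Hypothetical helper: build the POST request TorchServe expects,
    mirroring `curl -T input.txt` (raw text body)."""
    return urllib.request.Request(url, data=text.encode("utf-8"),
                                  method="POST")

def predict(text: str) -> str:
    # Sends the request; assumes a TorchServe instance is running locally.
    with urllib.request.urlopen(build_request(text)) as resp:
        return resp.read().decode("utf-8")
```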
