
Implementation:Pytorch Serve Accelerate Handler

From Leeroopedia
Page Type: Implementation (Wrapper Doc)
Title: Accelerate Handler
Implements: Principle:Pytorch_Serve_Accelerate_Device_Mapping
Source: examples/large_models/Huggingface_accelerate/custom_handler.py
Repository: TorchServe
Last Updated: 2026-02-13 00:00 GMT

Overview

The Accelerate handler demonstrates how to serve large HuggingFace models in TorchServe using automatic device mapping via the accelerate library. The handler is a BaseHandler subclass that loads a model with device_map="auto" and low_cpu_mem_usage=True, allowing the model to be automatically distributed across available GPUs, CPU, and optionally disk. This approach runs in a single process and does not require torchrun or any distributed handler base class.

Description

The Accelerate handler pattern consists of a single handler class that:

1. Reads configuration from a setup_config.json file in the model directory, which specifies device mapping options, memory limits, offloading settings, and data types.

2. Loads the model using HuggingFace's from_pretrained() with Accelerate-specific parameters:

  • device_map: Set to "auto" for automatic layer-to-device assignment
  • low_cpu_mem_usage=True: Minimizes CPU memory usage during loading
  • max_memory: Per-device memory limits (e.g., {"0": "10GiB", "cpu": "30GiB"})
  • offload_folder: Directory for weight offloading to disk
  • offload_state_dict: Whether to offload the state dict during loading
  • torch_dtype: Data type for model weights (float16, float32, etc.)

3. Runs inference in a single process with the model distributed across devices. Accelerate handles tensor movement between devices transparently during the forward pass.

The example uses BloomForCausalLM and BloomTokenizerFast, but the pattern works with any HuggingFace model that supports device_map.
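The loading step described above amounts to translating setup_config.json entries into from_pretrained() keyword arguments. A minimal sketch of that translation, using a hypothetical build_load_kwargs helper (not part of the example source; the real handler builds these arguments inline and additionally converts the dtype string to a torch dtype):

```python
def build_load_kwargs(setup_config: dict) -> dict:
    """Hypothetical helper: translate setup_config.json entries into
    from_pretrained() keyword arguments, as the handler does inline.

    GPU keys in max_memory arrive as JSON strings ("0", "1"), but
    Accelerate expects integer device indices, so numeric keys are cast.
    """
    return {
        "revision": setup_config["revision"],
        "max_memory": {
            int(key) if key.isnumeric() else key: value
            for key, value in setup_config["max_memory"].items()
        },
        "low_cpu_mem_usage": setup_config["low_cpu_mem_usage"],
        "device_map": setup_config["device_map"],
        "offload_folder": setup_config["offload_folder"],
        "offload_state_dict": setup_config["offload_state_dict"],
        # the real handler maps this string to a torch dtype via a
        # TORCH_DTYPES dict, e.g. "float16" -> torch.float16
        "torch_dtype": setup_config["torch_dtype"],
    }

config = {
    "revision": "main",
    "max_memory": {"0": "10GiB", "cpu": "30GiB"},
    "low_cpu_mem_usage": True,
    "device_map": "auto",
    "offload_folder": "offload",
    "offload_state_dict": True,
    "torch_dtype": "float16",
}
kwargs = build_load_kwargs(config)
# numeric GPU keys become ints (0), while "cpu" stays a string
```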

Usage

Code Reference

Source Location: examples/large_models/Huggingface_accelerate/custom_handler.py (lines 24-164)

Signature:

class TransformersSeqClassifierHandler(BaseHandler, ABC):
    """
    Transformers handler class for sequence, token classification
    and question answering.
    """

    def __init__(self):
        super(TransformersSeqClassifierHandler, self).__init__()
        self.initialized = False

    def initialize(self, ctx):
        self.manifest = ctx.manifest
        properties = ctx.system_properties
        model_dir = properties.get("model_dir")

        self.device = torch.device(
            "cuda:" + str(properties.get("gpu_id"))
            if torch.cuda.is_available() and properties.get("gpu_id") is not None
            else "cpu"
        )

        # Load setup_config.json for device_map and memory settings
        setup_config_path = os.path.join(model_dir, "setup_config.json")
        if os.path.isfile(setup_config_path):
            with open(setup_config_path) as setup_config_file:
                self.setup_config = json.load(setup_config_file)

        self.model = BloomForCausalLM.from_pretrained(
            model_dir + "/model",
            revision=self.setup_config["revision"],
            max_memory={
                int(key) if key.isnumeric() else key: value
                for key, value in self.setup_config["max_memory"].items()
            },
            low_cpu_mem_usage=self.setup_config["low_cpu_mem_usage"],
            device_map=self.setup_config["device_map"],
            offload_folder=self.setup_config["offload_folder"],
            offload_state_dict=self.setup_config["offload_state_dict"],
            torch_dtype=TORCH_DTYPES[self.setup_config["torch_dtype"]],
        )

        self.tokenizer = BloomTokenizerFast.from_pretrained(
            model_dir + "/model", return_tensors="pt"
        )
        self.model.eval()
        self.initialized = True

Import:

import json
import os
from abc import ABC

import torch
from transformers import BloomForCausalLM, BloomTokenizerFast

from ts.torch_handler.base_handler import BaseHandler

# Module-level mapping used by the torch_dtype lookup in initialize()
TORCH_DTYPES = {
    "float16": torch.float16,
    "float32": torch.float32,
    "float64": torch.float64,
}

External Dependencies:

  • accelerate (used implicitly by HuggingFace from_pretrained when device_map is specified)
  • transformers (HuggingFace model and tokenizer classes)

I/O Contract

Inputs to initialize():

  • ctx (Context): TorchServe context object containing:
    • ctx.system_properties["model_dir"] (str): Path to extracted model archive
    • ctx.system_properties["gpu_id"] (int or None): Primary GPU ID

setup_config.json parameters:

  • device_map (str): Device mapping strategy; "auto" assigns layers to devices automatically.
  • low_cpu_mem_usage (bool): Minimize CPU memory usage during model loading.
  • max_memory (dict): Per-device memory limits, e.g., {"0": "10GiB", "cpu": "30GiB"}.
  • offload_folder (str): Directory for disk offloading of weights.
  • offload_state_dict (bool): Whether to offload the state dict during loading.
  • torch_dtype (str): Data type for model weights: "float16", "float32", or "float64".
  • revision (str): Model revision/commit hash.
  • max_length (int): Maximum token length for the tokenizer.

Inputs to preprocess():

  • requests (list[dict]): List of request dictionaries with "data" or "body" field containing input text.

Output of preprocess():

  • Tuple of (input_ids_batch, attention_mask_batch) tensors on self.device.

Inputs to inference():

  • input_batch (tuple): Tuple of (input_ids_batch, attention_mask_batch) tensors.

Output of inference():

  • list[str]: List of decoded generated text strings.
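The request-unwrapping side of the preprocess() contract can be illustrated in isolation. This is a hedged sketch: extract_texts is a hypothetical helper, and the real handler passes the extracted strings on to the tokenizer rather than returning them.

```python
def extract_texts(requests: list) -> list:
    """Hypothetical helper showing how preprocess() unwraps TorchServe
    requests: each dict carries the payload under "data" or "body",
    and the payload may arrive as raw bytes or as a string."""
    texts = []
    for req in requests:
        data = req.get("data") or req.get("body")
        if isinstance(data, (bytes, bytearray)):
            data = data.decode("utf-8")
        texts.append(data)
    return texts

batch = [{"data": b"My dog is cute"}, {"body": "Hello world"}]
texts = extract_texts(batch)
# → ["My dog is cute", "Hello world"]
```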

Usage Examples

setup_config.json:

{
    "revision": "main",
    "max_memory": {
        "0": "10GiB",
        "1": "10GiB",
        "cpu": "30GiB"
    },
    "low_cpu_mem_usage": true,
    "device_map": "auto",
    "offload_folder": "offload",
    "offload_state_dict": true,
    "torch_dtype": "float16",
    "max_length": 50
}
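Because initialize() indexes the config with bare keys (self.setup_config["revision"], etc.), a missing entry raises KeyError at worker startup. A small pre-flight check along these lines can catch that before packaging; check_setup_config is a hypothetical helper, not part of the example:

```python
import json

# Keys the handler reads without a fallback default
REQUIRED_KEYS = {
    "revision", "max_memory", "low_cpu_mem_usage", "device_map",
    "offload_folder", "offload_state_dict", "torch_dtype",
}

def check_setup_config(path: str) -> set:
    """Hypothetical helper: return the required keys missing from
    setup_config.json (an empty set means the file is usable)."""
    with open(path) as f:
        config = json.load(f)
    return REQUIRED_KEYS - config.keys()
```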

Packaging the model:

torch-model-archiver --model-name bloom \
    --version 1.0 \
    --handler custom_handler.py \
    --extra-files model.zip,setup_config.json \
    -r requirements.txt \
    --config-file model-config.yaml \
    --archive-format tgz

Minimal model-config.yaml (no torchrun needed):

minWorkers: 1
maxWorkers: 1
responseTimeout: 300
deviceType: "gpu"
parallelType: "custom"

Inference call:

curl http://localhost:8080/predictions/bloom -T input.txt
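The same call can be issued from Python with only the standard library. build_request is a hypothetical helper, and the endpoint name assumes the archive was registered as bloom, as in the packaging step above:

```python
import urllib.request

def build_request(text: str,
                  url: str = "http://localhost:8080/predictions/bloom"):
    """Hypothetical helper: build the POST request TorchServe expects,
    mirroring `curl -T input.txt` (raw text body)."""
    return urllib.request.Request(url, data=text.encode("utf-8"),
                                  method="POST")

def predict(text: str) -> str:
    # Sends the request; assumes a TorchServe instance is running locally.
    with urllib.request.urlopen(build_request(text)) as resp:
        return resp.read().decode("utf-8")
```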
