
Principle:Pytorch Serve Accelerate Device Mapping

From Leeroopedia
Field              Value
Page Type          Principle
Title              Accelerate Device Mapping
Domains            Distributed_Computing, Model_Serving
Knowledge Sources  TorchServe
Last Updated       2026-02-13 00:00 GMT

Overview

Automatic device mapping with HuggingFace Accelerate provides the simplest approach to serving models that exceed a single GPU's memory. Instead of requiring explicit parallelism configuration and multi-process coordination, Accelerate analyzes the model's layer structure and available device memory, then automatically distributes layers across GPUs (and optionally CPU RAM or disk) so the entire model fits. This single-process approach runs within a standard TorchServe worker without requiring torchrun.

Description

HuggingFace Accelerate's device mapping feature works by:

  1. Analyzing the model architecture: Accelerate inspects the model's module tree to determine the memory footprint of each layer or module group.
  2. Querying available memory: It checks free memory on each GPU, available CPU RAM, and optionally disk space.
  3. Computing an assignment: Using a greedy algorithm, it assigns layers to devices starting from GPU 0, filling each device until its memory limit is approached, then spilling to the next GPU, then to CPU, and finally to disk if needed.
  4. Setting up hooks: Accelerate installs forward hooks on each module that automatically move inputs to the correct device before execution and move outputs to the next device as needed.

The key advantages of this approach are:

  • No multi-process overhead: The model runs in a single process, avoiding the complexity of inter-process communication, rank management, and synchronization.
  • No torchrun required: The handler runs as a standard TorchServe worker. Set parallelType: "custom" or omit it entirely.
  • Automatic memory management: The device map is computed based on actual available memory, adapting to different hardware configurations without manual tuning.
  • CPU and disk offloading: For models that exceed total GPU memory, layers can be offloaded to CPU RAM or disk, trading latency for the ability to serve extremely large models.

The trade-offs compared to tensor and pipeline parallelism:

  • Lower throughput: Since the model runs in a single process, there is no inter-GPU parallelism within a single forward pass. Layers execute sequentially, with data moving between GPUs as needed.
  • Higher latency for multi-GPU: Each forward pass involves sequential execution across devices with explicit tensor transfers between them.
  • Simpler configuration: No DeepSpeed config, no RPC setup, no microbatching -- just device_map="auto".

Usage

To use Accelerate device mapping in TorchServe:

  1. Create a custom handler that inherits from BaseHandler (not a distributed handler).
  2. In initialize(), load the HuggingFace model with device_map="auto" and low_cpu_mem_usage=True.
  3. Optionally specify max_memory to control per-device memory limits.
  4. Optionally specify offload_folder for disk offloading.
  5. Configure model-config.yaml with parallelType: "custom" or omit the parallelType setting.
  6. Package and deploy via torch-model-archiver.
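The initialize() flow from the steps above can be sketched as follows. This is illustrative only: a real handler inherits ts.torch_handler.base_handler.BaseHandler (the plain class and the deferred transformers import here just keep the sketch self-contained), and the config keys mirror the setup_config.json described below.

```python
import json
import os

class AccelerateHandler:  # stands in for a BaseHandler subclass
    def __init__(self):
        self.model = None
        self.tokenizer = None
        self.initialized = False

    def initialize(self, ctx):
        """Load the model once per worker with an automatic device map."""
        model_dir = ctx.system_properties.get("model_dir")
        with open(os.path.join(model_dir, "setup_config.json")) as f:
            cfg = json.load(f)

        # Deferred import so the sketch stays importable without transformers.
        from transformers import AutoModelForCausalLM, AutoTokenizer

        self.model = AutoModelForCausalLM.from_pretrained(
            model_dir,
            device_map=cfg.get("device_map", "auto"),
            low_cpu_mem_usage=cfg.get("low_cpu_mem_usage", True),
            max_memory=cfg.get("max_memory"),        # e.g. {"0": "10GiB"}
            offload_folder=cfg.get("offload_folder"),
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
        self.initialized = True
```

Because there is no torchrun, no rank logic appears anywhere: the worker process loads the model once and Accelerate's hooks route tensors between devices.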

Configuration is typically provided through a setup_config.json file bundled with the model archive, containing:

  • device_map: Set to "auto" for automatic mapping.
  • low_cpu_mem_usage: Set to true to minimize CPU memory during loading.
  • max_memory: Dictionary mapping device IDs to memory limits (e.g., {"0": "10GiB", "1": "10GiB", "cpu": "30GiB"}).
  • offload_folder: Directory for disk offloading if needed.
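Putting those fields together, a minimal setup_config.json might look like the following (all values are illustrative):

```json
{
  "device_map": "auto",
  "low_cpu_mem_usage": true,
  "max_memory": {"0": "10GiB", "1": "10GiB", "cpu": "30GiB"},
  "offload_folder": "offload"
}
```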

Theoretical Basis

Automatic device mapping is a form of layer-wise model parallelism that operates at the module granularity. Unlike tensor parallelism (which shards individual weight matrices) or pipeline parallelism (which uses microbatching for overlap), device mapping assigns entire modules to devices and executes them sequentially.

The greedy bin-packing algorithm used by Accelerate treats each GPU (and CPU, and disk) as a bin with a capacity equal to available memory. Modules are placed in order of their position in the model, filling the first device until it would exceed capacity, then moving to the next device. This preserves the natural execution order and minimizes the number of cross-device transfers.
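The greedy placement described above can be sketched in a few lines of pure Python. This is a toy model of the algorithm with made-up layer sizes; the real logic lives in Accelerate's infer_auto_device_map.

```python
def greedy_device_map(layer_sizes, capacities):
    """Assign layers (in model order) to devices (in priority order)."""
    device_map = {}
    devices = list(capacities.items())  # e.g. [("gpu0", 10), ("cpu", 30)]
    idx, free = 0, devices[0][1]
    for name, size in layer_sizes:
        # Spill to the next device once the current one would overflow.
        while size > free and idx + 1 < len(devices):
            idx += 1
            free = devices[idx][1]
        device_map[name] = devices[idx][0]
        free -= size
    return device_map

layers = [("embed", 4), ("block0", 6), ("block1", 6), ("head", 4)]
print(greedy_device_map(layers, {"gpu0": 10, "gpu1": 10, "cpu": 30}))
# {'embed': 'gpu0', 'block0': 'gpu0', 'block1': 'gpu1', 'head': 'gpu1'}
```

Because layers are placed in model order, consecutive layers tend to share a device, which is what keeps cross-device transfers to a minimum.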

The forward-hook mechanism is what makes device mapping transparent to calling code. Each module assigned to a different device than its predecessor gets a forward pre-hook that:

  1. Moves all input tensors to the module's assigned device.
  2. If the module's weights are offloaded to CPU or disk, loads them to the target device.

And a forward post-hook that:

  1. Optionally offloads weights back to CPU or disk to free GPU memory.
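The hook behavior above can be simulated in pure Python. This is a deliberately simplified model: "devices" are plain strings, the computation is a placeholder, and the real implementation lives in Accelerate's AlignDevicesHook.

```python
class FakeTensor:
    """Stand-in for a tensor that tracks which device it lives on."""
    def __init__(self, device="cpu"):
        self.device = device
    def to(self, device):
        self.device = device
        return self

class HookedModule:
    """A module whose forward call is wrapped by pre- and post-hooks."""
    def __init__(self, name, exec_device, offload_weights=False):
        self.name = name
        self.exec_device = exec_device
        self.offload_weights = offload_weights
        self.weights_device = "cpu" if offload_weights else exec_device

    def forward(self, x):
        # Pre-hook: move inputs and any offloaded weights onto this device.
        x = x.to(self.exec_device)
        self.weights_device = self.exec_device
        y = FakeTensor(self.exec_device)  # placeholder for the computation
        # Post-hook: offloaded weights return to CPU to free GPU memory.
        if self.offload_weights:
            self.weights_device = "cpu"
        return y

pipeline = [HookedModule("block0", "cuda:0"),
            HookedModule("block1", "cuda:1", offload_weights=True)]
x = FakeTensor("cpu")
for module in pipeline:
    x = module.forward(x)
print(x.device)                    # cuda:1
print(pipeline[1].weights_device)  # cpu (offloaded again after the pass)
```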

This approach trades latency (due to data movement) for memory flexibility. The theoretical peak throughput is limited to what a single process can achieve, but it enables serving models that would otherwise require a more complex distributed setup.
