Principle:Pytorch Serve Accelerate Device Mapping
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Accelerate Device Mapping |
| Domains | Distributed_Computing, Model_Serving |
| Knowledge Sources | TorchServe |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Automatic device mapping with HuggingFace Accelerate provides the simplest approach to serving models that exceed single GPU memory. Instead of requiring explicit parallelism configuration and multi-process coordination, Accelerate analyzes the model's layer structure and available device memory, then automatically distributes layers across GPUs (and optionally CPU RAM or disk) to fit the entire model. This single-process approach runs within a standard TorchServe worker without requiring torchrun.
Description
HuggingFace Accelerate's device mapping feature works by:
- Analyzing the model architecture: Accelerate inspects the model's module tree to determine the memory footprint of each layer or module group.
- Querying available memory: It checks free memory on each GPU, available CPU RAM, and optionally disk space.
- Computing an assignment: Using a greedy algorithm, it assigns layers to devices starting from GPU 0, filling each device until its memory limit is approached, then spilling to the next GPU, then to CPU, and finally to disk if needed.
- Setting up hooks: Accelerate installs forward hooks on each module that automatically move inputs to the correct device before execution and move outputs to the next device as needed.
The key advantages of this approach are:
- No multi-process overhead: The model runs in a single process, avoiding the complexity of inter-process communication, rank management, and synchronization.
- No torchrun required: The handler runs as a standard TorchServe worker. Set `parallelType: "custom"` or omit it entirely.
- Automatic memory management: The device map is computed based on actual available memory, adapting to different hardware configurations without manual tuning.
- CPU and disk offloading: For models that exceed total GPU memory, layers can be offloaded to CPU RAM or disk, trading latency for the ability to serve extremely large models.
The trade-offs compared to tensor and pipeline parallelism:
- Lower throughput: Since the model runs in a single process, there is no inter-GPU parallelism within a single forward pass. Layers execute sequentially, with data moving between GPUs as needed.
- Higher latency for multi-GPU: Each forward pass involves sequential execution across devices with explicit tensor transfers between them.
- Simpler configuration: No DeepSpeed config, no RPC setup, no microbatching -- just `device_map="auto"`.
Usage
To use Accelerate device mapping in TorchServe:
- Create a custom handler that inherits from `BaseHandler` (not a distributed handler).
- In `initialize()`, load the HuggingFace model with `device_map="auto"` and `low_cpu_mem_usage=True`.
- Optionally specify `max_memory` to control per-device memory limits.
- Optionally specify `offload_folder` for disk offloading.
- Configure `model-config.yaml` with `parallelType: "custom"` or omit the parallelType setting.
- Package and deploy via `torch-model-archiver`.
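A minimal handler following the steps above might look like the sketch below. The class name is invented for illustration; a real handler would inherit from `ts.torch_handler.base_handler.BaseHandler`, and the `model_name` key is assumed to be present in the bundled `setup_config.json`. The `transformers` import is deferred into `initialize()` so the sketch stays self-contained outside a serving environment.

```python
# Sketch of a TorchServe custom handler loading a model with
# device_map="auto". In a real deployment this class would inherit
# from ts.torch_handler.base_handler.BaseHandler; it is kept as a
# plain class here, with heavy imports deferred to initialize().
import json
import os


class AccelerateHandler:
    def __init__(self):
        self.initialized = False
        self.model = None
        self.tokenizer = None

    def initialize(self, ctx):
        # model_dir is the unpacked model archive directory.
        model_dir = ctx.system_properties.get("model_dir")
        with open(os.path.join(model_dir, "setup_config.json")) as f:
            cfg = json.load(f)

        # Deferred imports: only available inside the serving environment.
        from transformers import AutoModelForCausalLM, AutoTokenizer

        self.tokenizer = AutoTokenizer.from_pretrained(cfg["model_name"])
        # device_map="auto" lets Accelerate spread layers across the
        # available GPUs (and CPU/disk, if max_memory/offload_folder allow).
        self.model = AutoModelForCausalLM.from_pretrained(
            cfg["model_name"],
            device_map=cfg.get("device_map", "auto"),
            low_cpu_mem_usage=cfg.get("low_cpu_mem_usage", True),
            max_memory=cfg.get("max_memory"),
            offload_folder=cfg.get("offload_folder"),
        )
        self.initialized = True
```

TorchServe calls `initialize()` once per worker; because the single worker owns all the GPUs listed in `max_memory`, no `torchrun` launch or rank coordination is needed.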
Configuration is typically provided through a setup_config.json file bundled with the model archive, containing:
- `device_map`: Set to `"auto"` for automatic mapping.
- `low_cpu_mem_usage`: Set to `true` to minimize CPU memory during loading.
- `max_memory`: Dictionary mapping device IDs to memory limits (e.g., `{"0": "10GiB", "1": "10GiB", "cpu": "30GiB"}`).
- `offload_folder`: Directory for disk offloading if needed.
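Putting those fields together, a `setup_config.json` for a two-GPU host might look like this (the memory limits and offload path are illustrative):

```json
{
  "device_map": "auto",
  "low_cpu_mem_usage": true,
  "max_memory": {"0": "10GiB", "1": "10GiB", "cpu": "30GiB"},
  "offload_folder": "/tmp/offload"
}
```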
Theoretical Basis
Automatic device mapping is a form of layer-wise model parallelism that operates at the module granularity. Unlike tensor parallelism (which shards individual weight matrices) or pipeline parallelism (which uses microbatching for overlap), device mapping assigns entire modules to devices and executes them sequentially.
The greedy bin-packing algorithm used by Accelerate treats each GPU (and CPU, and disk) as a bin with a capacity equal to available memory. Modules are placed in order of their position in the model, filling the first device until it would exceed capacity, then moving to the next device. This preserves the natural execution order and minimizes the number of cross-device transfers.
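The order-preserving greedy assignment can be illustrated with a small pure-Python sketch. The function name, the unit-free sizes, and the capacities are invented for illustration; this is not Accelerate's actual API (which is `accelerate.infer_auto_device_map`), only the shape of the algorithm.

```python
# Toy illustration of greedy, order-preserving device mapping: modules
# are visited in model order, devices in priority order (GPUs, then
# CPU, then disk), and we spill to the next device once the current
# one would overflow. Sizes/capacities are in arbitrary units.
def greedy_device_map(module_sizes, capacities):
    device_map = {}
    devices = list(capacities.items())  # e.g. [("cuda:0", 8), ...]
    idx = 0      # current device index
    used = 0     # memory used on the current device
    for name, size in module_sizes.items():
        # Spill to the next device while this module does not fit.
        while idx < len(devices) and used + size > devices[idx][1]:
            idx += 1
            used = 0
        if idx == len(devices):
            raise MemoryError(f"module {name!r} does not fit on any device")
        device_map[name] = devices[idx][0]
        used += size
    return device_map


sizes = {"embed": 4, "layer.0": 3, "layer.1": 3, "layer.2": 3, "head": 4}
caps = {"cuda:0": 8, "cuda:1": 8, "cpu": 30}
print(greedy_device_map(sizes, caps))
# {'embed': 'cuda:0', 'layer.0': 'cuda:0', 'layer.1': 'cuda:1',
#  'layer.2': 'cuda:1', 'head': 'cpu'}
```

Because modules are never reordered, each device holds a contiguous run of layers, so a forward pass crosses a device boundary at most once per device.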
The forward-hook mechanism is central to how device mapping works transparently. Each module assigned to a different device than its predecessor gets a forward pre-hook that:
- Moves all input tensors to the module's assigned device.
- If the module's weights are offloaded to CPU or disk, loads them to the target device.
And a forward post-hook that:
- Optionally offloads weights back to CPU or disk to free GPU memory.
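A stripped-down, pure-Python simulation of this hook behavior is sketched below, with no real tensors or GPUs: each "module" carries an assigned device, its pre-hook logic moves the input (and any offloaded weights) there, and its post-hook logic optionally offloads the weights back to CPU. The class names are invented; Accelerate implements this with real PyTorch forward hooks.

```python
# Simulated pre-/post-hook behavior of device mapping. Devices are
# plain strings; no actual GPU is involved.
class FakeTensor:
    def __init__(self, device="cpu"):
        self.device = device

    def to(self, device):
        self.device = device
        return self


class HookedModule:
    def __init__(self, name, device, offload=False):
        self.name = name
        self.device = device          # device this module executes on
        self.offload = offload        # offload weights after forward?
        self.weight_device = device

    def __call__(self, x, trace):
        # "Pre-hook": move the input and any offloaded weights to the
        # module's assigned device.
        x = x.to(self.device)
        self.weight_device = self.device
        trace.append((self.name, self.device))
        # ... the real forward computation would run here ...
        # "Post-hook": optionally offload weights to free GPU memory.
        if self.offload:
            self.weight_device = "cpu"
        return x


modules = [
    HookedModule("embed", "cuda:0"),
    HookedModule("block", "cuda:1"),
    HookedModule("head", "cuda:1", offload=True),
]
trace = []
x = FakeTensor("cpu")
for m in modules:
    x = m(x, trace)
print(trace)  # [('embed', 'cuda:0'), ('block', 'cuda:1'), ('head', 'cuda:1')]
print(modules[-1].weight_device)  # 'cpu' -- offloaded after its forward
```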
This approach trades latency (due to data movement) for memory flexibility. The theoretical peak throughput is limited to what a single process can achieve, but it enables serving models that would otherwise require a more complex distributed setup.
Related Pages
- Implementation:Pytorch_Serve_Accelerate_Handler - HuggingFace Accelerate handler implementation
- Pytorch_Serve_Parallelism_Strategy - Choosing between parallelism strategies
- Pytorch_Serve_Distributed_Configuration - Configuration of distributed serving