Heuristic: OpenGVLab InternVL Multi-GPU ViT Device Mapping
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Optimization, Deployment |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Multi-GPU device mapping strategy that treats the first GPU as having only half capacity (`vit_alpha=0.5`), because it hosts the ViT encoder, MLP projector, and embedding layers alongside LLM decoder layers.
Description
When deploying InternVL models across multiple GPUs for inference, the device mapping algorithm accounts for the fact that GPU 0 hosts not only LLM decoder layers but also the entire InternViT vision encoder, the MLP projector, token embeddings, and the output head. To avoid overloading GPU 0, it is treated as having only 50% capacity (`vit_alpha=0.5`) when LLM layers are distributed. GPU 0 therefore receives roughly half as many LLM layers as each other GPU, balancing VRAM usage across all devices.
Usage
Apply this heuristic when using `--auto` mode for multi-GPU inference. The `split_model()` function in `model/__init__.py` automatically computes the device map. This is primarily used during evaluation and inference, not during training (which uses DeepSpeed for distribution).
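To make the flow concrete, here is a minimal, runnable sketch of the mapping logic and how its output is consumed. `split_model_sketch` is a pure-Python stand-in for the real `split_model()`: the real function reads `torch.cuda.device_count()`, while this sketch takes `world_size` as an explicit parameter so it runs anywhere; the `AutoModel.from_pretrained` call shown in the trailing comment is the usual Hugging Face loading pattern, not code from this repository.

```python
import math

def split_model_sketch(num_layers, world_size, vit_alpha=0.5):
    """Pure-Python stand-in for split_model(); `world_size` replaces
    the torch.cuda.device_count() call in the real function."""
    device_map = {}
    per_gpu = math.ceil(num_layers / (world_size - vit_alpha))
    counts = [per_gpu] * world_size
    counts[0] = math.ceil(counts[0] * (1 - vit_alpha))  # GPU 0 at half capacity
    layer_cnt = 0
    for gpu, n in enumerate(counts):
        for _ in range(n):
            if layer_cnt >= num_layers:
                break
            device_map[f'language_model.model.layers.{layer_cnt}'] = gpu
            layer_cnt += 1
    # Vision tower, projector, embeddings, norm, and head stay on GPU 0.
    for key in ('vision_model', 'mlp1',
                'language_model.model.tok_embeddings',
                'language_model.model.embed_tokens',
                'language_model.output',
                'language_model.model.norm',
                'language_model.lm_head'):
        device_map[key] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
    return device_map

# e.g. a 48-layer LLM split over 4 GPUs: GPU 0 gets layers 0-6 (plus the
# pinned final layer), the remaining layers spread over GPUs 1-3.
dmap = split_model_sketch(48, world_size=4)
# The returned dict is then passed to Hugging Face loading, roughly:
#   model = AutoModel.from_pretrained(path, device_map=dmap, ...)
```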
The Insight (Rule of Thumb)
- Action: Use `split_model(num_layers, vit_alpha=0.5)` for multi-GPU inference.
- Value: GPU 0 gets ~50% fewer LLM layers than other GPUs.
- Trade-off: Slightly uneven layer distribution, but prevents GPU 0 OOM from hosting ViT + LLM layers.
Reasoning
The InternViT-6B vision encoder alone consumes significant VRAM (multiple GB). On GPU 0, it shares space with the MLP projector, token embeddings, LM head, and some LLM layers. Without the 0.5 alpha adjustment, GPU 0 would be assigned the same number of LLM layers as every other GPU and would run out of memory first. The formula `num_layers / (world_size - 0.5)` distributes layers as if there were half a GPU fewer, with GPU 0 receiving the reduced allocation. The code additionally pins the final decoder layer to GPU 0 so that it sits on the same device as the final norm and output head.
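The arithmetic behind the formula can be checked directly. The values below (48 layers, 4 GPUs) are illustrative, not taken from any specific InternVL checkpoint:

```python
import math

num_layers, world_size, vit_alpha = 48, 4, 0.5       # illustrative values

# Layers per full-capacity GPU: ceil(48 / (4 - 0.5)) = ceil(13.71) = 14
per_gpu = math.ceil(num_layers / (world_size - vit_alpha))

# GPU 0 runs at (1 - vit_alpha) capacity: ceil(14 * 0.5) = 7
gpu0 = math.ceil(per_gpu * (1 - vit_alpha))

counts = [gpu0] + [per_gpu] * (world_size - 1)        # [7, 14, 14, 14]
```

GPU 0 ends up with half the LLM layers of its peers, leaving headroom for the vision encoder and projector it also hosts.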
Code Evidence
From `model/__init__.py:14-36`:
```python
import math
import torch

def split_model(num_layers, vit_alpha=0.5):
    device_map = {}
    world_size = torch.cuda.device_count()
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - vit_alpha))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * (1 - vit_alpha))
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    # Vision tower, projector, embeddings, final norm, and output head all
    # live on GPU 0 (covering both InternLM- and LLaMA-style module names).
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.lm_head'] = 0
    # Pin the last decoder layer to GPU 0, alongside the norm and head.
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
    return device_map
```