
Heuristic:OpenGVLab InternVL Multi GPU ViT Device Mapping

From Leeroopedia



Knowledge Sources

  • Domains: Infrastructure, Optimization, Deployment
  • Last Updated: 2026-02-07 14:00 GMT

Overview

A multi-GPU device mapping strategy that treats the first GPU as having only half capacity (`vit_alpha=0.5`), because it hosts the ViT encoder, MLP projector, and embedding layers alongside LLM decoder layers.

Description

When deploying InternVL models across multiple GPUs for inference, the device mapping algorithm accounts for the fact that GPU 0 hosts not only LLM decoder layers but also the entire InternViT vision encoder, the MLP projector, token embeddings, and the output head. To avoid overloading GPU 0, it is treated as having only 50% capacity (`vit_alpha=0.5`) when distributing LLM layers. This means GPU 0 receives fewer LLM layers than the other GPUs, balancing VRAM usage across all devices.

Usage

Apply this heuristic when using `--auto` mode for multi-GPU inference. The `split_model()` function in `model/__init__.py` automatically computes the device map. This is primarily used during evaluation and inference, not during training (which uses DeepSpeed for distribution).
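In practice, the returned map is passed straight to `from_pretrained`. A minimal sketch of that call pattern follows; the checkpoint name and the `split_model` import path are illustrative assumptions, and the snippet requires multiple CUDA GPUs to actually run:

```python
import torch
from transformers import AutoModel

from model import split_model  # assumed import path for the repo's helper

path = "OpenGVLab/InternVL2-8B"            # example checkpoint name
device_map = split_model(num_layers=32)    # 32 = decoder layer count of the LLM backbone
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map=device_map,                 # per-module placement across GPUs
).eval()
```

Because `device_map` is an explicit per-module dictionary, Transformers places each named submodule on the assigned device rather than computing its own layout with `device_map="auto"`.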

The Insight (Rule of Thumb)

  • Action: Use `split_model(num_layers, vit_alpha=0.5)` for multi-GPU inference.
  • Value: GPU 0 gets ~50% fewer LLM layers than other GPUs.
  • Trade-off: Slightly uneven layer distribution, but prevents GPU 0 OOM from hosting ViT + LLM layers.

Reasoning

The InternViT-6B vision encoder alone consumes significant VRAM (multiple GB). On GPU 0, this shares space with the MLP projector, token embeddings, LM head, and some LLM layers. Without the 0.5 alpha adjustment, GPU 0 would be assigned the same number of LLM layers as other GPUs, causing it to run out of memory first. The formula `num_layers / (world_size - 0.5)` effectively distributes layers as if there were 0.5 fewer GPUs, with GPU 0 receiving the reduced allocation.
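The arithmetic above can be checked with a small sketch. `split_layers` here is a simplified re-derivation of the allocation step, with the world size passed explicitly instead of queried from `torch.cuda.device_count()`; it is not the repo's actual function:

```python
import math

def split_layers(num_layers, world_size, vit_alpha=0.5):
    """Per-GPU LLM layer counts, treating GPU 0 as (1 - vit_alpha) of a GPU."""
    per_gpu = math.ceil(num_layers / (world_size - vit_alpha))
    counts = [per_gpu] * world_size
    counts[0] = math.ceil(counts[0] * (1 - vit_alpha))  # halve GPU 0's share
    return counts

# 32 decoder layers across 4 GPUs:
# ceil(32 / 3.5) = 10 per full GPU, and GPU 0 gets ceil(10 * 0.5) = 5.
print(split_layers(32, 4))  # → [5, 10, 10, 10]
```

Note that the counts can sum to more than `num_layers` (35 here). In the actual `split_model()` loop the surplus simply generates map entries for layer indices that do not exist, which no module ever matches, so the last GPU effectively receives the remainder.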

Code Evidence

From `model/__init__.py:14-36`:

import math
import torch

def split_model(num_layers, vit_alpha=0.5):
    device_map = {}
    world_size = torch.cuda.device_count()
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - vit_alpha))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(
        num_layers_per_gpu[0] * (1 - vit_alpha))
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    # Vision encoder, projector, embeddings, and output head all live on GPU 0.
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    # Both naming schemes are mapped to cover different LLM backbones
    # (e.g. InternLM2-style vs. LLaMA-style checkpoints).
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.lm_head'] = 0
    # Pin the last decoder layer to GPU 0, next to the final norm and LM head.
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
    return device_map
