Principle: OpenGVLab InternVL Model Inference Loading
| Knowledge Sources | |
|---|---|
| Domains | Inference, Model_Deployment, Distributed_Computing |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A model loading strategy for inference that distributes model layers across multiple GPUs using an intelligent device mapping that accounts for the vision encoder's memory footprint.
Description
Large vision-language models may not fit on a single GPU for inference. The inference loading strategy addresses this by:
- Device mapping: Distributing LLM layers across available GPUs while reserving space on GPU 0 for the vision encoder, MLP projector, and embedding layers
- Quantization support: Optional 4-bit or 8-bit quantization for reduced memory usage
- Auto mode: A single-GPU mode with automatic device mapping for simpler deployments
The key insight is that GPU 0 hosts both the vision encoder and the beginning of the LLM, so it needs a reduced allocation of LLM layers compared to other GPUs. The split_model function computes this allocation based on the total number of LLM layers and a configurable ViT allocation factor.
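As a concrete illustration of the reduced GPU-0 allocation, here is the arithmetic for an assumed example configuration (32 LLM layers, 4 GPUs, and a ViT allocation factor of 0.5 are illustrative values, not fixed by the principle):

```python
import math

# Example: 32 LLM layers across 4 GPUs, with GPU 0 treated as "half" a GPU
# because the vision encoder already occupies part of its memory.
num_llm_layers, num_gpus, vit_alpha = 32, 4, 0.5

# GPU 0 receives only (1 - vit_alpha) of an even per-GPU share.
gpu0 = math.ceil(num_llm_layers / num_gpus * (1 - vit_alpha))   # 32/4 * 0.5 -> 4
# The remaining layers are spread evenly over the other GPUs.
others = math.ceil((num_llm_layers - gpu0) / (num_gpus - 1))    # 28/3 -> 10

print(gpu0, others)  # 4 10
```

With these numbers, GPU 0 holds 4 LLM layers (plus the vision encoder, projector, and embeddings), GPUs 1 and 2 hold 10 each, and GPU 3 takes the remaining 8.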
Usage
Use this principle when loading InternVL models for evaluation or inference, particularly when the model is too large for a single GPU.
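A minimal loading sketch using the HuggingFace `transformers` API, which InternVL checkpoints support via `trust_remote_code=True`; the checkpoint name is an example and the quantization line is optional (this is an illustrative sketch, not the project's canonical loading code):

```python
import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

path = "OpenGVLab/InternVL2-8B"  # example checkpoint; substitute your model

model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    # Single-GPU "auto mode"; for multi-GPU, pass the explicit device map
    # computed by the split strategy described above instead of "auto".
    device_map="auto",
    # Optional 4-bit quantization for reduced memory usage:
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
```

The choice between `device_map="auto"` and an explicit map is the trade-off named in the Description: auto mode is simpler, while the explicit map keeps the vision encoder and the first LLM layers co-located on GPU 0.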
Theoretical Basis
```python
# Pseudo-code: device mapping for multi-GPU inference (assumes num_gpus >= 2)
import math

def compute_device_map(num_llm_layers, num_gpus, vit_alpha=0.5):
    # GPU 0 hosts: ViT + MLP projector + embeddings + a reduced share of LLM layers
    # Other GPUs host: the remaining LLM layers, evenly distributed
    gpu0_allocation = math.ceil(num_llm_layers / num_gpus * (1 - vit_alpha))
    other_allocation = math.ceil(
        (num_llm_layers - gpu0_allocation) / (num_gpus - 1))
    allocations = [gpu0_allocation] + [other_allocation] * (num_gpus - 1)

    device_map = {
        'vision_model': 0,
        'mlp1': 0,
        'language_model.model.embed_tokens': 0,
    }
    current_gpu, layers_on_current = 0, 0
    for layer_idx in range(num_llm_layers):
        device_map[f'language_model.model.layers.{layer_idx}'] = current_gpu
        layers_on_current += 1
        # Advance to the next GPU once this GPU's allocation is full
        if layers_on_current >= allocations[current_gpu] and current_gpu < num_gpus - 1:
            current_gpu += 1
            layers_on_current = 0
    return device_map
```