
Principle:OpenGVLab InternVL Model Inference Loading

From Leeroopedia


Knowledge Sources
Domains: Inference, Model_Deployment, Distributed_Computing
Last Updated: 2026-02-07 00:00 GMT

Overview

A model-loading strategy for inference that distributes LLM layers across multiple GPUs, using a device map that accounts for the vision encoder's memory footprint on GPU 0.

Description

Large vision-language models may not fit on a single GPU for inference. The inference loading strategy addresses this by:

  • Device mapping: Distributing LLM layers across available GPUs while reserving space on GPU 0 for the vision encoder, MLP projector, and embedding layers
  • Quantization support: Optional 4-bit or 8-bit quantization for reduced memory usage
  • Auto mode: A single-GPU mode with automatic device mapping for simpler deployments
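A back-of-envelope estimate shows why quantization matters at this scale. This sketch counts weight memory only (it ignores activations, the KV cache, and quantization overhead), and the 8-billion-parameter model size is an illustrative assumption, not a value from this page:

```python
# Approximate bytes per parameter for common inference precisions
BYTES_PER_PARAM = {'bf16': 2.0, 'int8': 1.0, 'int4': 0.5}

def weight_memory_gb(num_params_billions, dtype):
    """Rough weight memory in GB (1e9 params at 1 byte/param ~ 1 GB)."""
    return num_params_billions * BYTES_PER_PARAM[dtype]

for dtype in ('bf16', 'int8', 'int4'):
    print(dtype, weight_memory_gb(8, dtype))  # bf16 16.0, int8 8.0, int4 4.0
```

Halving bytes per parameter roughly halves the number of GPUs (or the fraction of one GPU) needed just to hold the weights.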

The key insight is that GPU 0 hosts both the vision encoder and the beginning of the LLM, so it needs a reduced allocation of LLM layers compared to other GPUs. The split_model function computes this allocation based on the total number of LLM layers and a configurable ViT allocation factor.
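The allocation arithmetic can be illustrated with concrete (hypothetical) sizes, here a 32-layer LLM on 4 GPUs with the default ViT allocation factor of 0.5:

```python
num_llm_layers, num_gpus, vit_alpha = 32, 4, 0.5  # illustrative sizes

# GPU 0's share is scaled down by (1 - vit_alpha) to leave room for the ViT
gpu0_layers = int(num_llm_layers / num_gpus * (1 - vit_alpha))   # 4 layers
# The remaining layers are spread evenly over the other GPUs
per_other_gpu = (num_llm_layers - gpu0_layers) / (num_gpus - 1)  # ~9.3, i.e. 9-10 each

print(gpu0_layers, per_other_gpu)
```

So GPU 0 carries only 4 of the 32 LLM layers, while GPUs 1-3 carry 9-10 apiece; the freed capacity on GPU 0 absorbs the vision encoder, projector, and embeddings.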

Usage

Use this principle when loading InternVL models for evaluation or inference, particularly when the model is too large for a single GPU.
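A minimal loading sketch with Hugging Face Transformers, under stated assumptions: the checkpoint name is illustrative, the 8-bit flag is optional, and in a multi-GPU deployment `device_map` would instead be the explicit map produced by the split_model-style computation described above.

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = 'OpenGVLab/InternVL2-8B'  # illustrative checkpoint name (assumption)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,        # optional 8-bit quantization
    low_cpu_mem_usage=True,
    trust_remote_code=True,   # InternVL ships custom modeling code
    device_map='auto',        # or an explicit {module: gpu} map for multi-GPU
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
```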

Theoretical Basis

# Device mapping for multi-GPU inference (pseudo-code completed into
# runnable Python; layer names follow InternVL's HF module naming)
import math

def compute_device_map(num_llm_layers, num_gpus, vit_alpha=0.5):
    # GPU 0 hosts: ViT + MLP projector + embeddings + some LLM layers
    # Other GPUs host: remaining LLM layers evenly distributed
    if num_gpus == 1:
        return 'auto'  # single-GPU: defer to automatic device mapping

    gpu0_allocation = int(num_llm_layers / num_gpus * (1 - vit_alpha))
    other_allocation = math.ceil((num_llm_layers - gpu0_allocation) / (num_gpus - 1))

    device_map = {
        'vision_model': 0,
        'mlp1': 0,
        'language_model.model.embed_tokens': 0,
    }

    current_gpu = 0
    layers_on_current = 0
    for layer_idx in range(num_llm_layers):
        device_map[f'language_model.model.layers.{layer_idx}'] = current_gpu
        layers_on_current += 1
        # Advance to the next GPU once this GPU's allocation is full
        limit = gpu0_allocation if current_gpu == 0 else other_allocation
        if layers_on_current >= limit and current_gpu < num_gpus - 1:
            current_gpu += 1
            layers_on_current = 0

    # Final norm and output head are kept with the embeddings on GPU 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.lm_head'] = 0
    return device_map
