# Implementation: OpenGVLab InternVL Load Model And Tokenizer
| Knowledge Sources | |
|---|---|
| Domains | Inference, Model_Deployment |
| Last Updated | 2026-02-07 00:00 GMT |
## Overview
A concrete utility from the InternVL evaluation framework for loading InternVL models for evaluation and inference, with support for multi-GPU device mapping.
## Description
The load_model_and_tokenizer function loads an InternVLChatModel and its tokenizer for inference. It supports:
- Multi-GPU device mapping via split_model for distributing layers across GPUs
- 4-bit and 8-bit quantization via BitsAndBytes
- Auto device mapping for single-GPU inference
- Standard single-GPU loading on the current rank's device
The companion split_model function computes the per-GPU layer allocation.
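The allocation logic can be sketched as follows. This is a minimal, illustrative reimplementation, not the actual source: `sketch_split_model` is a hypothetical name, it takes the GPU count explicitly so it runs without CUDA, and it assumes GPU 0 keeps only a `1 - vit_alpha` share of LLM layers because the ViT occupies the rest.

```python
import math

def sketch_split_model(num_layers, world_size, vit_alpha=0.5):
    """Illustrative sketch of split_model's allocation: GPU 0 reserves a
    vit_alpha fraction of its capacity for the ViT, so it receives fewer
    LLM layers than the other GPUs."""
    # GPU 0 only contributes (1 - vit_alpha) of a full GPU's capacity,
    # so spread the layers over an effective (world_size - vit_alpha) GPUs.
    per_gpu = math.ceil(num_layers / (world_size - vit_alpha))
    counts = [per_gpu] * world_size
    counts[0] = math.ceil(per_gpu * (1 - vit_alpha))  # reduced share for GPU 0
    device_map = {}
    layer = 0
    for gpu, n in enumerate(counts):
        for _ in range(n):
            if layer == num_layers:
                break
            device_map[f'language_model.model.layers.{layer}'] = gpu
            layer += 1
    device_map['vision_model'] = 0  # the ViT is pinned to GPU 0
    return device_map
```

For example, with 32 layers on 4 GPUs and the default `vit_alpha=0.5`, GPU 0 receives 5 layers and the remaining GPUs receive up to 10 each.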
## Usage
Used by all evaluation scripts (evaluate_vqa.py, evaluate_mantis.py, etc.) to load the model before running inference.
## Code Reference
### Source Location
- Repository: InternVL
- File: internvl_chat/internvl/model/__init__.py
- Lines: L14-51
### Signature
```python
def split_model(num_layers, vit_alpha=0.5):
    """
    Compute device_map distributing LLM layers across GPUs.

    Args:
        num_layers: int - Total number of LLM transformer layers
        vit_alpha: float - Fraction of GPU 0 reserved for ViT (default 0.5)

    Returns:
        dict - Device map mapping module names to GPU indices
    """

def load_model_and_tokenizer(args):
    """
    Load InternVLChatModel and tokenizer for inference.

    Args:
        args: Namespace with attributes:
            checkpoint: str - Model path
            auto: bool - Use auto device mapping (single GPU)
            load_in_8bit: bool - 8-bit quantization
            load_in_4bit: bool - 4-bit quantization

    Returns:
        Tuple[InternVLChatModel, AutoTokenizer]
    """
```
### Import
```python
from internvl.model import load_model_and_tokenizer, split_model
```
## I/O Contract
### Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| args.checkpoint | str | Yes | Path to model checkpoint |
| args.auto | bool | No | Enable auto device mapping for single GPU |
| args.load_in_8bit | bool | No | 8-bit quantization |
| args.load_in_4bit | bool | No | 4-bit quantization |
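To make the contract concrete, here is a hypothetical dispatch helper, invented purely for illustration (`choose_strategy` does not exist in InternVL), showing how the flags above could select one of the loading paths described in this page:

```python
def choose_strategy(auto, load_in_8bit, load_in_4bit, num_gpus):
    """Hypothetical helper (not part of InternVL): map the I/O-contract
    flags to a loading path, in an assumed order of precedence."""
    if load_in_8bit or load_in_4bit:
        return 'bitsandbytes-quantized'   # 8-bit / 4-bit quantized load
    if auto:
        return 'auto-device-map'          # single-GPU auto device mapping
    if num_gpus > 1:
        return 'split-model-device-map'   # per-layer multi-GPU device map
    return 'current-rank-device'          # standard single-GPU load
```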
### Outputs
| Name | Type | Description |
|---|---|---|
| model | InternVLChatModel | Model in eval mode, distributed across GPUs |
| tokenizer | AutoTokenizer | Tokenizer loaded from checkpoint |
## Usage Examples
### Multi-GPU Inference Loading
```python
import argparse
from internvl.model import load_model_and_tokenizer

args = argparse.Namespace(
    checkpoint='OpenGVLab/InternVL2_5-8B',
    auto=False,
    load_in_8bit=False,
    load_in_4bit=False,
)
model, tokenizer = load_model_and_tokenizer(args)
# Model distributed across available GPUs in eval mode
```
### Single-GPU with Auto Mapping
```python
args = argparse.Namespace(
    checkpoint='OpenGVLab/InternVL2_5-8B',
    auto=True,
    load_in_8bit=False,
    load_in_4bit=False,
)
model, tokenizer = load_model_and_tokenizer(args)
```
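Quantized loading follows the same pattern, only the flags change. A hedged sketch, assuming the 4-bit and 8-bit flags are mutually exclusive and passed straight through to BitsAndBytes:

```python
import argparse

args = argparse.Namespace(
    checkpoint='OpenGVLab/InternVL2_5-8B',
    auto=True,                # auto device mapping on a single GPU
    load_in_8bit=False,       # assumed: do not combine 8-bit and 4-bit modes
    load_in_4bit=True,        # 4-bit quantization via BitsAndBytes
)
# model, tokenizer = load_model_and_tokenizer(args)  # requires internvl + a GPU
```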
## Related Pages
- Implements Principle
- Requires Environment
- Uses Heuristic