Implementation:OpenGVLab InternVL Multimodal Utilities

Knowledge Sources	OpenGVLab_InternVL
Domains	Image_Processing, Tokenization, Multimodal
Last Updated	2026-02-07 14:00 GMT

Overview

This module provides multimodal utility functions for image preprocessing, prompt tokenization with image token insertion, model name extraction, and keyword-based generation stopping in the LLaVA pipeline.

Description

The mm_utils.py module contains five key functions and one class that form the utility layer between raw inputs and the LLaVA model:

load_image_from_base64(image): Decodes a base64-encoded string into a PIL Image, used by MMBench evaluation.
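
The decoding step can be sketched with the standard library alone. This is a minimal illustration, not the module's code: decode_base64_payload is a hypothetical helper, and the 8-byte PNG signature merely stands in for a complete image file.

```python
import base64
from io import BytesIO


def decode_base64_payload(image_b64: str) -> BytesIO:
    """Hypothetical helper: decode a base64 string into an in-memory byte buffer."""
    return BytesIO(base64.b64decode(image_b64))


# The PNG file signature is used here as a tiny stand-in for real image bytes.
png_signature = b"\x89PNG\r\n\x1a\n"
encoded = base64.b64encode(png_signature).decode("ascii")
buffer = decode_base64_payload(encoded)
```

With Pillow installed and a complete image file as the payload, handing such a buffer to Image.open would produce the PIL Image.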

expand2square(pil_img, background_color): Pads a non-square image to a square, filling the added area with background_color (process_images passes the image processor's mean color). If the image is wider than tall, it adds vertical padding; if taller than wide, it adds horizontal padding. Returns the original image if already square.
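
The padding geometry can be sketched without an imaging library. The helper names below (square_padding_layout, mean_to_rgb) are hypothetical, and two details are assumptions rather than facts stated on this page: that the original image is centered on the square canvas, and that the fill color comes from scaling the processor's normalized channel means to the 0-255 range.

```python
from typing import Sequence, Tuple


def square_padding_layout(width: int, height: int) -> Tuple[int, Tuple[int, int]]:
    """Return (side, (left, top)): the canvas side and the paste offset of the original."""
    if width == height:
        return width, (0, 0)                      # already square, no padding
    if width > height:
        return width, (0, (width - height) // 2)  # pad above and below
    return height, ((height - width) // 2, 0)     # pad left and right


def mean_to_rgb(image_mean: Sequence[float]) -> Tuple[int, ...]:
    """Assumed conversion of normalized channel means to an integer RGB fill color."""
    return tuple(int(channel * 255) for channel in image_mean)


wide = square_padding_layout(640, 480)
tall = square_padding_layout(300, 500)
fill = mean_to_rgb([0.48145466, 0.4578275, 0.40821073])  # sample CLIP-style means
```

With Pillow, the layout maps onto Image.new(mode, (side, side), fill) followed by paste(original, (left, top)).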

process_images(images, image_processor, model_cfg): Preprocesses a list of PIL images according to the model configuration. When image_aspect_ratio == 'pad', each image is first padded to square via expand2square and then preprocessed. Otherwise, the list is passed directly to the standard image processor. In the 'pad' case, the results are stacked into a single tensor if all images have the same shape; otherwise they are returned as a list of tensors.

tokenizer_image_token(prompt, tokenizer, image_token_index, return_tensors): Splits the prompt on <image> tokens, tokenizes each text chunk, and inserts the image_token_index placeholder (default IMAGE_TOKEN_INDEX) between chunks. Because the tokenizer may prepend a BOS token to every chunk, the function keeps it only once, at the start of the sequence. Returns a list of token IDs by default, or a PyTorch tensor when return_tensors='pt'.
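
The split, tokenize, and rejoin steps can be sketched in pure Python. This is an illustration under stated assumptions, not the module's implementation: toy_tokenize and insert_image_tokens are hypothetical, BOS_ID is made up, and the value -200 for IMAGE_TOKEN_INDEX is assumed here (the real constant is defined in llava.constants).

```python
from typing import Callable, List

IMAGE_TOKEN_INDEX = -200  # assumed placeholder value for this sketch
BOS_ID = 1                # hypothetical BOS id emitted by the toy tokenizer


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical tokenizer: BOS, then one id per whitespace-separated word."""
    return [BOS_ID] + [100 + len(word) for word in text.split()]


def insert_image_tokens(
    prompt: str,
    tokenize: Callable[[str], List[int]] = toy_tokenize,
    image_token_index: int = IMAGE_TOKEN_INDEX,
    bos_id: int = BOS_ID,
) -> List[int]:
    chunks = [tokenize(chunk) for chunk in prompt.split("<image>")]
    input_ids: List[int] = []
    # Keep a single BOS at the start when the tokenizer adds one to every chunk.
    has_bos = bool(chunks[0]) and chunks[0][0] == bos_id
    if has_bos:
        input_ids.append(bos_id)
    for i, chunk in enumerate(chunks):
        if i > 0:
            input_ids.append(image_token_index)  # one placeholder per <image>
        input_ids.extend(chunk[1:] if has_bos else chunk)
    return input_ids


single = insert_image_tokens("<image>\nDescribe this image.")
double = insert_image_tokens("a <image> b <image>")
```

In this sketch, each <image> occurrence yields exactly one placeholder ID, and the BOS ID appears only at position 0 regardless of how many chunks the prompt splits into.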

get_model_name_from_path(model_path): Extracts a human-readable model name from a filesystem path. Handles checkpoint subdirectories by combining the parent directory name with the checkpoint folder name.
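
The naming rule can be sketched as follows. The function name model_name_from_path is hypothetical, and the "checkpoint-" prefix test is an assumption about how checkpoint subdirectories are recognized; the underscore joiner matches the example output shown later on this page.

```python
def model_name_from_path(model_path: str) -> str:
    """Hypothetical re-creation of the naming rule for illustration."""
    parts = model_path.strip("/").split("/")
    # Assumed rule: a trailing "checkpoint-*" folder is combined with its parent.
    if len(parts) > 1 and parts[-1].startswith("checkpoint-"):
        return parts[-2] + "_" + parts[-1]
    return parts[-1]


with_checkpoint = model_name_from_path("/models/llava-v1.5-7b/checkpoint-1000")
plain = model_name_from_path("/models/llava-v1.5-7b/")
```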

KeywordsStoppingCriteria: A custom StoppingCriteria subclass that stops text generation when any specified keyword is detected in the output. It checks both token-level matching (comparing the last N token IDs) and string-level matching (decoding recent tokens and checking for keyword presence).
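
The two checks can be sketched on plain Python lists. This is not the class itself: should_stop, VOCAB, and toy_decode are hypothetical, and the window of 3 recent tokens used for the string-level check is an assumption. The example shows why both checks are useful: a keyword such as "###" may arrive as a single token or be assembled from several.

```python
from typing import Callable, List, Sequence


def should_stop(
    output_ids: Sequence[int],
    keyword_ids: List[List[int]],
    keywords: List[str],
    decode: Callable[[Sequence[int]], str],
    prompt_len: int,
    window: int = 3,  # assumed number of recent tokens to decode
) -> bool:
    # Token-level check: does the sequence end with a keyword's token IDs?
    for ids in keyword_ids:
        if ids and list(output_ids[-len(ids):]) == list(ids):
            return True
    # String-level check: decode only the most recent generated tokens.
    n_recent = min(len(output_ids) - prompt_len, window)
    if n_recent <= 0:
        return False
    text = decode(output_ids[-n_recent:])
    return any(keyword in text for keyword in keywords)


# Toy vocabulary: "###" exists as one token (5) or as "##" + "#" (3, 4).
VOCAB = {1: "Hello", 2: " world", 3: "##", 4: "#", 5: "###"}


def toy_decode(ids: Sequence[int]) -> str:
    return "".join(VOCAB[i] for i in ids)


args = dict(keyword_ids=[[5]], keywords=["###"], decode=toy_decode, prompt_len=1)
no_stop = should_stop([1, 2], **args)          # no keyword yet
token_hit = should_stop([1, 2, 5], **args)     # ends with the keyword token
string_hit = should_stop([1, 2, 3, 4], **args) # keyword spelled by two tokens
```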

Usage

These utilities are imported throughout the LLaVA evaluation scripts. tokenizer_image_token and process_images prepare the text and image inputs for any inference pipeline, KeywordsStoppingCriteria ends generation as soon as a stop keyword appears in the output, and get_model_name_from_path provides model identification.

Code Reference

Source Location

Repository: OpenGVLab_InternVL
File: internvl_chat_llava/llava/mm_utils.py
Lines: 1-108

Signature

def load_image_from_base64(image: str) -> Image.Image: ...

def expand2square(pil_img: Image.Image, background_color: tuple) -> Image.Image: ...

def process_images(images: list, image_processor, model_cfg) -> Union[torch.Tensor, list]: ...

def tokenizer_image_token(prompt: str, tokenizer, image_token_index: int = IMAGE_TOKEN_INDEX,
                          return_tensors: str = None) -> Union[list, torch.Tensor]: ...

def get_model_name_from_path(model_path: str) -> str: ...

class KeywordsStoppingCriteria(StoppingCriteria):
    def __init__(self, keywords: list, tokenizer, input_ids: torch.Tensor): ...
    def __call__(self, output_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool: ...

Import

from llava.mm_utils import (
    load_image_from_base64,
    expand2square,
    process_images,
    tokenizer_image_token,
    get_model_name_from_path,
    KeywordsStoppingCriteria,
)

I/O Contract

Inputs

Name	Type	Required	Description
image (load_image_from_base64)	str	Yes	Base64-encoded image string
pil_img (expand2square)	PIL.Image	Yes	Input image to pad to square
background_color (expand2square)	tuple	Yes	RGB fill color for padding
images (process_images)	list of PIL.Image	Yes	List of images to preprocess
image_processor	object	Yes	HuggingFace image processor
model_cfg	object	Yes	Model config with image_aspect_ratio attribute
prompt (tokenizer_image_token)	str	Yes	Text prompt containing <image> tokens
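tokenizer	object	Yes	HuggingFace tokenizer
image_token_index (tokenizer_image_token)	int	No	Placeholder ID inserted for each <image>; defaults to IMAGE_TOKEN_INDEX
return_tensors (tokenizer_image_token)	str	No	Set to 'pt' to return a PyTorch tensor; defaults to None (plain list)
model_path (get_model_name_from_path)	str	Yes	Filesystem path to model
keywords (KeywordsStoppingCriteria)	list	Yes	Keyword strings that stop generation
input_ids (KeywordsStoppingCriteria)	torch.Tensor	Yes	Input token IDs supplied at construction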

Outputs

Name	Type	Description
load_image_from_base64 return	PIL.Image	Decoded image
expand2square return	PIL.Image	Square-padded image
process_images return	torch.Tensor or list of torch.Tensor	Preprocessed images; a single stacked tensor when all shapes match
tokenizer_image_token return	list or torch.Tensor	Token IDs with image placeholders
get_model_name_from_path return	str	Extracted model name
KeywordsStoppingCriteria.__call__ return	bool	True if any keyword detected in output

Usage Examples

Basic Usage

from llava.mm_utils import tokenizer_image_token, process_images, get_model_name_from_path
from llava.constants import IMAGE_TOKEN_INDEX

# Assumes tokenizer, image_processor, model, and pil_image are already loaded.

# Tokenize a prompt with image placeholder
prompt = "<image>\nDescribe this image."
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt')

# Preprocess images; padding to square applies only when model.config.image_aspect_ratio == 'pad'
image_tensors = process_images([pil_image], image_processor, model.config)

# Extract model name from path
name = get_model_name_from_path("/models/llava-v1.5-7b/checkpoint-1000")
# Returns: "llava-v1.5-7b_checkpoint-1000"

Related Pages

Principle:OpenGVLab_InternVL_Image_Transform_Pipeline
