Implementation:OpenGVLab InternVL Multimodal Utilities
| Knowledge Sources | |
|---|---|
| Domains | Image_Processing, Tokenization, Multimodal |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This module provides multimodal utility functions for image preprocessing, prompt tokenization with image token insertion, model name extraction, and keyword-based generation stopping in the LLaVA pipeline.
Description
The mm_utils.py module contains five key functions and one class that form the utility layer between raw inputs and the LLaVA model:
load_image_from_base64(image): Decodes a base64-encoded string into a PIL Image, used by MMBench evaluation.
expand2square(pil_img, background_color): Pads a non-square image to a square canvas filled with background_color (process_images passes the image processor's mean color) and centers the original image on it. If the image is wider than it is tall, padding is added above and below; if it is taller than it is wide, padding is added on the left and right. Returns the original image unchanged if it is already square.
process_images(images, image_processor, model_cfg): Preprocesses a list of PIL images according to the model configuration. When image_aspect_ratio == 'pad', each image is first padded to square via expand2square and then preprocessed individually; the results are stacked into a single tensor if all of them have the same shape, and are otherwise returned as a list of tensors. For any other setting, the whole list is passed to the image processor directly.
tokenizer_image_token(prompt, tokenizer, image_token_index, return_tensors): Splits the prompt on the literal <image> marker, tokenizes each text chunk, and inserts the image_token_index placeholder (IMAGE_TOKEN_INDEX by default) between chunks. The BOS token is kept once at the start of the sequence rather than once per chunk. Returns a list of token IDs, or a PyTorch tensor when return_tensors='pt'.
get_model_name_from_path(model_path): Extracts a human-readable model name from a filesystem path. Handles checkpoint subdirectories by combining the parent directory name with the checkpoint folder name.
KeywordsStoppingCriteria: A custom StoppingCriteria subclass that stops text generation when any specified keyword is detected in the output. It checks both token-level matching (comparing the last N token IDs) and string-level matching (decoding recent tokens and checking for keyword presence).
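The chunk-and-splice behavior described for tokenizer_image_token can be sketched without any model dependencies. The following is a simplified illustration under stated assumptions, not the module's own code: ToyTokenizer and splice_image_tokens are hypothetical names, the toy tokenizer always prepends a BOS token, and the placeholder value -200 is assumed to mirror IMAGE_TOKEN_INDEX in llava.constants.

```python
from typing import List

IMAGE_TOKEN_INDEX = -200  # assumption: mirrors the constant in llava.constants


class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer: BOS id 1, one new id per unseen word."""

    bos_token_id = 1

    def __init__(self) -> None:
        self.vocab = {}

    def encode(self, text: str) -> List[int]:
        ids = [self.bos_token_id]  # every encode() call prepends BOS
        for word in text.split():
            ids.append(self.vocab.setdefault(word, len(self.vocab) + 2))
        return ids


def splice_image_tokens(prompt: str, tokenizer,
                        image_token_index: int = IMAGE_TOKEN_INDEX) -> List[int]:
    # Tokenize each text chunk around the literal "<image>" marker.
    chunks = [tokenizer.encode(chunk) for chunk in prompt.split("<image>")]
    input_ids: List[int] = []
    for i, chunk in enumerate(chunks):
        has_bos = len(chunk) > 0 and chunk[0] == tokenizer.bos_token_id
        if i == 0:
            input_ids.extend(chunk)  # BOS survives only from the first chunk
        else:
            input_ids.append(image_token_index)  # one placeholder per <image>
            input_ids.extend(chunk[1:] if has_bos else chunk)
    return input_ids


ids = splice_image_tokens("<image>\nDescribe this image.", ToyTokenizer())
# ids == [1, -200, 2, 3, 4]: BOS, image placeholder, then the three text tokens
```

The negative placeholder cannot collide with a real vocabulary ID, which lets downstream code locate the image positions unambiguously.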
Usage
These utilities are imported throughout the LLaVA evaluation scripts. tokenizer_image_token and process_images are needed by any inference pipeline that feeds images to the model, KeywordsStoppingCriteria ends generation as soon as a stop keyword appears in the output, and get_model_name_from_path derives a model identifier from the checkpoint path.
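The two-stage check that KeywordsStoppingCriteria is described as performing (token-level first, then string-level) can be illustrated with plain Python lists. This is a minimal sketch under assumptions: should_stop, its window parameter, and the toy vocabulary are hypothetical and stand in for the real tensors and tokenizer.

```python
from typing import Callable, List, Sequence


def should_stop(output_ids: Sequence[int], prompt_len: int,
                keywords: Sequence[str], keyword_ids: Sequence[Sequence[int]],
                decode: Callable[[List[int]], str], window: int = 3) -> bool:
    # Only inspect tokens generated after the prompt.
    generated = list(output_ids[prompt_len:])

    # Token-level check: do the last N generated ids equal a keyword's ids?
    for ids in keyword_ids:
        if len(ids) > 0 and generated[-len(ids):] == list(ids):
            return True

    # String-level check: decode a short tail and search for the keyword text.
    # This catches keywords that the model spelled with different token splits.
    recent = generated[-window:] if generated else []
    text = decode(recent)
    return any(keyword in text for keyword in keywords)


# Toy vocabulary (hypothetical) used to decode ids back to text.
vocab = {5: "Hello", 6: " world", 7: "</s>", 8: "##", 9: "#", 10: "###"}


def toy_decode(ids: List[int]) -> str:
    return "".join(vocab[i] for i in ids)
```

The string-level pass matters because a keyword such as "###" may be emitted as two tokens ("##" followed by "#") even though the tokenizer encodes the keyword itself as a single ID.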
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: internvl_chat_llava/llava/mm_utils.py
- Lines: 1-108
Signature
def load_image_from_base64(image: str) -> Image.Image: ...
def expand2square(pil_img: Image.Image, background_color: tuple) -> Image.Image: ...
def process_images(images: list, image_processor, model_cfg) -> torch.Tensor: ...
def tokenizer_image_token(prompt: str, tokenizer, image_token_index: int = IMAGE_TOKEN_INDEX,
                          return_tensors: str = None) -> Union[list, torch.Tensor]: ...
def get_model_name_from_path(model_path: str) -> str: ...
class KeywordsStoppingCriteria(StoppingCriteria):
    def __init__(self, keywords: list, tokenizer, input_ids: torch.Tensor): ...
    def __call__(self, output_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool: ...
Import
from llava.mm_utils import (
    load_image_from_base64,
    expand2square,
    process_images,
    tokenizer_image_token,
    get_model_name_from_path,
    KeywordsStoppingCriteria,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| image (load_image_from_base64) | str | Yes | Base64-encoded image string |
| pil_img (expand2square) | PIL.Image | Yes | Input image to pad to square |
| background_color (expand2square) | tuple | Yes | RGB fill color for padding |
| images (process_images) | list of PIL.Image | Yes | List of images to preprocess |
| image_processor | object | Yes | HuggingFace image processor |
| model_cfg | object | Yes | Model config with image_aspect_ratio attribute |
| prompt (tokenizer_image_token) | str | Yes | Text prompt containing <image> tokens |
| tokenizer | object | Yes | HuggingFace tokenizer |
| model_path (get_model_name_from_path) | str | Yes | Filesystem path to model |
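For the expand2square inputs listed above, the padding geometry and the fill color can be worked out with plain arithmetic. The sketch below is illustrative only: fill_color_from_mean and square_canvas_geometry are hypothetical helper names, and the mean values are the commonly used CLIP normalization means, assumed here as an example of what an image processor might expose as image_mean.

```python
from typing import Sequence, Tuple


def fill_color_from_mean(image_mean: Sequence[float]) -> Tuple[int, ...]:
    # Scale per-channel means from [0, 1] to 0-255 integers for use as an RGB fill.
    return tuple(int(x * 255) for x in image_mean)


def square_canvas_geometry(width: int, height: int) -> Tuple[int, int, int]:
    """Return (side, paste_x, paste_y) for centering an image on a square canvas."""
    if width == height:
        return width, 0, 0  # already square, nothing to pad
    if width > height:
        # Wider than tall: pad above and below.
        return width, 0, (width - height) // 2
    # Taller than wide: pad left and right.
    return height, (height - width) // 2, 0


# Assumed example values (CLIP normalization means).
clip_mean = [0.48145466, 0.4578275, 0.40821073]
background_color = fill_color_from_mean(clip_mean)   # (122, 116, 104)
side, x, y = square_canvas_geometry(640, 480)        # (640, 0, 80)
```

Filling with the mean color means the padded border normalizes to values near zero, so it contributes little signal after the image processor's normalization step.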
Outputs
| Name | Type | Description |
|---|---|---|
| load_image_from_base64 return | PIL.Image | Decoded image |
| expand2square return | PIL.Image | Square-padded image |
| process_images return | torch.Tensor or list of torch.Tensor | Preprocessed image tensor(s); a list only when padded images differ in shape |
| tokenizer_image_token return | list or torch.Tensor | Token IDs with image placeholders |
| get_model_name_from_path return | str | Extracted model name |
| KeywordsStoppingCriteria.__call__ return | bool | True if any keyword detected in output |
Usage Examples
Basic Usage
from llava.mm_utils import tokenizer_image_token, process_images, get_model_name_from_path
from llava.constants import IMAGE_TOKEN_INDEX
# Assumes tokenizer, image_processor, model, and pil_image (a PIL.Image) are already loaded
# Tokenize a prompt with an image placeholder; returns a tensor of token IDs
prompt = "<image>\nDescribe this image."
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt')
# Preprocess images; square padding is applied only when model.config.image_aspect_ratio == 'pad'
image_tensors = process_images([pil_image], image_processor, model.config)
# Extract model name from path
name = get_model_name_from_path("/models/llava-v1.5-7b/checkpoint-1000")
# Returns: "llava-v1.5-7b_checkpoint-1000"
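The model name returned in the example above is consistent with a simple path rule: use the last path component, unless it is a checkpoint folder, in which case join it to its parent with an underscore. The sketch below is an approximation under assumptions, not the module's code: model_name_from_path is a hypothetical name, and both the "checkpoint-" prefix test and the guard for single-component paths are assumptions.

```python
def model_name_from_path(model_path: str) -> str:
    # Drop leading and trailing slashes so the last component is never empty.
    parts = model_path.strip("/").split("/")
    # Assumption: checkpoint folders are recognized by a "checkpoint-" prefix.
    if len(parts) > 1 and parts[-1].startswith("checkpoint-"):
        return parts[-2] + "_" + parts[-1]
    return parts[-1]


checkpoint_name = model_name_from_path("/models/llava-v1.5-7b/checkpoint-1000")
plain_name = model_name_from_path("/models/llava-v1.5-7b/")
# checkpoint_name == "llava-v1.5-7b_checkpoint-1000", plain_name == "llava-v1.5-7b"
```

Including the parent directory keeps names distinct when several models each contain a checkpoint folder with the same step number.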