Implementation:OpenGVLab InternVL Multimodal Utilities

From Leeroopedia
Revision as of 16:15, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/OpenGVLab_InternVL_Multimodal_Utilities.md)


Knowledge Sources
Domains Image_Processing, Tokenization, Multimodal
Last Updated 2026-02-07 14:00 GMT

Overview

This module provides multimodal utility functions for image preprocessing, prompt tokenization with image token insertion, model name extraction, and keyword-based generation stopping in the LLaVA pipeline.

Description

The mm_utils.py module contains five key functions and one class that form the utility layer between raw inputs and the LLaVA model:

load_image_from_base64(image): Decodes a base64-encoded string into a PIL Image, used by MMBench evaluation.

expand2square(pil_img, background_color): Pads a non-square image to a square, filling the added area with background_color (callers typically pass the image processor's mean color). If the image is wider than it is tall, padding is added above and below; if it is taller than it is wide, padding is added on the left and right. Returns the original image unchanged if it is already square.
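The padding geometry described above can be sketched without any imaging library. The helper below is hypothetical (square_canvas is not part of mm_utils.py); it only computes the side length of the square canvas and the offset at which the original image would be pasted, under the assumption that the original content is centered on the canvas.

```python
def square_canvas(width, height):
    """Return (side, left, top): canvas side length and paste offset.

    Hypothetical helper illustrating the geometry only; it assumes the
    original image is centered on the padded canvas.
    """
    if width == height:
        # Already square: nothing to pad.
        return width, 0, 0
    if width > height:
        # Wider than tall: pad above and below.
        return width, 0, (width - height) // 2
    # Taller than wide: pad on the left and right.
    return height, (height - width) // 2, 0


print(square_canvas(640, 480))  # (640, 0, 80)
print(square_canvas(300, 500))  # (500, 100, 0)
```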

process_images(images, image_processor, model_cfg): Preprocesses a list of PIL images according to the model configuration. When image_aspect_ratio == 'pad', each image is first padded to a square via expand2square and then passed through the image processor. Otherwise, the image processor is applied to the list directly. On the 'pad' path, the per-image results are stacked into a single tensor only if they all have the same shape; otherwise they are left as separate per-image tensors.
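Two details of the 'pad' path can be illustrated in plain Python: turning a processor's normalized per-channel mean into an 8-bit RGB fill color, and checking whether per-image results share a shape before stacking. Both helper names are hypothetical, the conversion assumes simple truncation to an integer, and the mean values are the commonly used CLIP normalization means, chosen here only for illustration.

```python
def mean_to_fill_color(image_mean):
    """Convert normalized channel means (0..1) to an 8-bit RGB tuple.

    Assumption: values are scaled by 255 and truncated to int.
    """
    return tuple(int(x * 255) for x in image_mean)


def shapes_match(shapes):
    """Return True when every shape equals the first one (safe to stack)."""
    return all(shape == shapes[0] for shape in shapes)


clip_mean = [0.48145466, 0.4578275, 0.40821073]  # illustrative values
print(mean_to_fill_color(clip_mean))                    # (122, 116, 104)
print(shapes_match([(3, 336, 336), (3, 336, 336)]))     # True
print(shapes_match([(3, 336, 336), (3, 224, 224)]))     # False
```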

tokenizer_image_token(prompt, tokenizer, image_token_index, return_tensors): Splits the prompt on <image> tokens, tokenizes each text chunk, and inserts the image_token_index placeholder (IMAGE_TOKEN_INDEX by default) between chunks. If the tokenizer prepends a BOS token to every chunk, only the first one is kept, so BOS appears once at the start of the sequence. Returns a list of token IDs by default, or a PyTorch tensor when return_tensors='pt'.
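The split, tokenize, and interleave steps can be sketched with a toy tokenizer. Everything here is illustrative: toy_tokenize and insert_image_tokens are hypothetical names, the BOS ID of 1 is invented, and -200 is assumed as the placeholder value for IMAGE_TOKEN_INDEX. The sketch is not the module's implementation; it only shows the interleaving and single-BOS behavior.

```python
IMAGE_TOKEN_INDEX = -200  # assumed placeholder value, for illustration
BOS_ID = 1                # invented BOS ID for the toy tokenizer


def toy_tokenize(text):
    """Hypothetical tokenizer: BOS, then one ID per whitespace-separated word."""
    return [BOS_ID] + [100 + len(word) for word in text.split()]


def insert_image_tokens(prompt, tokenize=toy_tokenize,
                        image_token_index=IMAGE_TOKEN_INDEX):
    # Tokenize each text chunk between <image> markers independently.
    chunks = [tokenize(chunk) for chunk in prompt.split("<image>")]

    # Keep BOS once at the start if the tokenizer adds it to every chunk.
    has_bos = bool(chunks[0]) and chunks[0][0] == BOS_ID
    input_ids = [BOS_ID] if has_bos else []
    skip = 1 if has_bos else 0

    for i, chunk in enumerate(chunks):
        if i > 0:
            # One placeholder for each <image> marker in the prompt.
            input_ids.append(image_token_index)
        input_ids.extend(chunk[skip:])  # drop the per-chunk BOS
    return input_ids


print(insert_image_tokens("<image>\nDescribe this image."))
# [1, -200, 108, 104, 106]
```

A model consuming these IDs would later replace each placeholder position with image features; that step is outside the scope of this sketch.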

get_model_name_from_path(model_path): Extracts a human-readable model name from a filesystem path. Handles checkpoint subdirectories by combining the parent directory name with the checkpoint folder name.
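The checkpoint handling can be sketched in a few lines. The function below is a hypothetical stand-in (model_name_from_path), written to match the behavior described here and the "_" separator shown in the usage example on this page; it is not copied from the module.

```python
def model_name_from_path(model_path):
    """Derive a readable model name from a filesystem path (sketch)."""
    parts = model_path.strip("/").split("/")
    # For ".../<model>/checkpoint-N", combine the parent and checkpoint names.
    if len(parts) > 1 and parts[-1].startswith("checkpoint-"):
        return parts[-2] + "_" + parts[-1]
    # Otherwise the last path component is the model name.
    return parts[-1]


print(model_name_from_path("/models/llava-v1.5-7b/checkpoint-1000"))
# llava-v1.5-7b_checkpoint-1000
print(model_name_from_path("/models/llava-v1.5-7b/"))
# llava-v1.5-7b
```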

KeywordsStoppingCriteria: A custom StoppingCriteria subclass that stops text generation when any specified keyword is detected in the output. It checks both token-level matching (comparing the last N token IDs) and string-level matching (decoding recent tokens and checking for keyword presence).
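The two checks can be sketched on plain Python lists, without the transformers StoppingCriteria base class. The function name should_stop, the window parameter, and the toy vocabulary are all hypothetical; the sketch also restricts both checks to tokens generated after the prompt, which is a simplifying assumption.

```python
def should_stop(output_ids, start_len, keywords, keyword_ids, decode, window=8):
    """Return True if any keyword appears at the end of the generated output.

    output_ids:  prompt IDs followed by generated IDs (flat list)
    start_len:   number of prompt IDs to ignore
    keyword_ids: each keyword pre-tokenized into a list of IDs
    decode:      callable mapping a list of IDs to a string
    """
    generated = output_ids[start_len:]

    # Token-level check: do the last N generated IDs equal a keyword's IDs?
    for ids in keyword_ids:
        if ids and generated[-len(ids):] == ids:
            return True

    # String-level check: decode only the most recent tokens and search them.
    recent_text = decode(generated[-window:]) if generated else ""
    return any(keyword in recent_text for keyword in keywords)


# Toy vocabulary and decoder, for illustration only.
VOCAB = {1: "<s>", 5: "Hello", 6: " wor", 7: "ld", 8: "###"}
decode = lambda ids: "".join(VOCAB[i] for i in ids)

print(should_stop([1, 5, 6, 7], 2, ["###"], [[8]], decode))      # False
print(should_stop([1, 5, 6, 7, 8], 2, ["###"], [[8]], decode))   # True
```

The string-level check matters when a keyword is split across tokens differently during generation than when it was tokenized on its own, as with "world" spanning two tokens in the toy vocabulary.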

Usage

These utilities are imported throughout the LLaVA evaluation scripts. tokenizer_image_token and process_images are needed by any inference pipeline that mixes text and images, KeywordsStoppingCriteria ends generation as soon as a stop keyword appears in the output, and get_model_name_from_path derives a model identifier from a checkpoint path.

Code Reference

Source Location

Signature

def load_image_from_base64(image: str) -> Image.Image: ...

def expand2square(pil_img: Image.Image, background_color: tuple) -> Image.Image: ...

def process_images(images: list, image_processor, model_cfg) -> torch.Tensor: ...

def tokenizer_image_token(prompt: str, tokenizer, image_token_index: int = IMAGE_TOKEN_INDEX,
                          return_tensors: str = None) -> Union[list, torch.Tensor]: ...

def get_model_name_from_path(model_path: str) -> str: ...

class KeywordsStoppingCriteria(StoppingCriteria):
    def __init__(self, keywords: list, tokenizer, input_ids: torch.Tensor): ...
    def __call__(self, output_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool: ...

Import

from llava.mm_utils import (
    load_image_from_base64,
    expand2square,
    process_images,
    tokenizer_image_token,
    get_model_name_from_path,
    KeywordsStoppingCriteria,
)

I/O Contract

Inputs

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| image (load_image_from_base64) | str | Yes | Base64-encoded image string |
| pil_img (expand2square) | PIL.Image | Yes | Input image to pad to square |
| background_color (expand2square) | tuple | Yes | RGB fill color for padding |
| images (process_images) | list of PIL.Image | Yes | List of images to preprocess |
| image_processor | object | Yes | HuggingFace image processor |
| model_cfg | object | Yes | Model config with image_aspect_ratio attribute |
| prompt (tokenizer_image_token) | str | Yes | Text prompt containing <image> tokens |
| tokenizer | object | Yes | HuggingFace tokenizer |
| model_path (get_model_name_from_path) | str | Yes | Filesystem path to model |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| load_image_from_base64 return | PIL.Image | Decoded image |
| expand2square return | PIL.Image | Square-padded image |
| process_images return | torch.Tensor | Preprocessed image tensor(s) |
| tokenizer_image_token return | list or torch.Tensor | Token IDs with image placeholders |
| get_model_name_from_path return | str | Extracted model name |
| KeywordsStoppingCriteria.__call__ return | bool | True if any keyword detected in output |

Usage Examples

Basic Usage

from llava.mm_utils import tokenizer_image_token, process_images, get_model_name_from_path
from llava.constants import IMAGE_TOKEN_INDEX

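# Assumes tokenizer, image_processor, model, and pil_image were already loaded elsewhere.

# Tokenize a prompt containing an image placeholder
prompt = "<image>\nDescribe this image."
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt')

# Preprocess images; padding to square applies only when model.config.image_aspect_ratio == 'pad'
image_tensors = process_images([pil_image], image_processor, model.config)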

# Extract model name from path
name = get_model_name_from_path("/models/llava-v1.5-7b/checkpoint-1000")
# Returns: "llava-v1.5-7b_checkpoint-1000"
