
Implementation:Haotian Liu LLaVA Process Images

From Leeroopedia

Overview

A concrete tool for preprocessing images into CLIP-compatible tensors, with configurable aspect-ratio handling.

Source

  • File: llava/mm_utils.py
  • Lines: L166-182

Signature

def process_images(
    images: List[PIL.Image.Image],
    image_processor: CLIPImageProcessor,
    model_cfg
) -> Union[torch.Tensor, List[torch.Tensor]]:
    """
    Preprocess images for the CLIP vision encoder.

    Dispatches to the appropriate preprocessing path based on
    model_cfg.image_aspect_ratio.

    Args:
        images: List of PIL images to preprocess.
        image_processor: CLIPImageProcessor instance from the vision tower.
        model_cfg: Model configuration with image_aspect_ratio attribute.

    Returns:
        Preprocessed image tensor(s).
    """

Import

from llava.mm_utils import process_images

Inputs

Parameter Type Required Description
images List[PIL.Image.Image] Yes List of PIL images to preprocess
image_processor CLIPImageProcessor Yes CLIP image preprocessor (obtained from load_pretrained_model())
model_cfg model config Yes Model configuration object with image_aspect_ratio attribute

Outputs

Mode Output Shape Description
square (default) (N, C, 336, 336) Standard CLIP-preprocessed tensor
pad (N, C, 336, 336) Padded-to-square then CLIP-preprocessed tensor
anyres (N, num_patches+1, C, 336, 336) Multi-scale patches plus a global view; returned as a list of per-image tensors when patch counts differ across images
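As a sanity check on the anyres shape, the patch count for a given grid resolution follows from the 336-pixel CLIP input size. The helper below is an illustrative sketch, not part of LLaVA:

```python
def anyres_patch_count(grid_w, grid_h, patch=336):
    """Number of tiles an anyres grid resolution yields, plus one
    global downscaled view (sketch; assumes the 336px CLIP size)."""
    return (grid_w // patch) * (grid_h // patch) + 1

print(anyres_patch_count(672, 672))   # 2x2 tiles + global view = 5
print(anyres_patch_count(336, 1008))  # 1x3 tiles + global view = 4
```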

Usage Example

import torch
from PIL import Image
from llava.mm_utils import process_images
from llava.model.builder import load_pretrained_model

# Load model
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="liuhaotian/llava-v1.5-13b",
    model_base=None,
    model_name="llava-v1.5-13b"
)

# Load and preprocess image
image = Image.open("photo.jpg").convert("RGB")
image_tensor = process_images([image], image_processor, model.config)
image_tensor = image_tensor.to(model.device, dtype=torch.float16)
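In anyres mode, process_images can return a list of per-image tensors instead of one stacked tensor, so device placement should handle both cases. A minimal sketch, where the `move` callable stands in for the torch `.to(device, dtype)` call:

```python
def to_device(batch, move):
    """Apply `move` to a stacked tensor or to each tensor in a list,
    matching the two possible return types of process_images
    (illustrative sketch; `move` stands in for the torch call)."""
    if isinstance(batch, list):
        return [move(t) for t in batch]
    return move(batch)

# Usage with real tensors would be:
#   image_tensor = to_device(image_tensor,
#                            lambda t: t.to(model.device, dtype=torch.float16))
```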

Description

process_images() dispatches to three code paths based on model_cfg.image_aspect_ratio:

Path 1: Default (square)

Calls the CLIP image_processor directly:

image_processor.preprocess(images, return_tensors='pt')['pixel_values']

Path 2: Pad

For each image:

  1. Calls expand2square(image, tuple(int(x * 255) for x in image_processor.image_mean))
  2. The expanded square image is then passed through standard CLIP preprocessing

The padding color is derived from the CLIP processor's image_mean values, scaled to 0-255 RGB range.
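The padding math can be sketched without PIL: the square canvas side is the longer edge, and the original image is pasted centered. The helper below is illustrative, not LLaVA's expand2square; the mean values shown are the standard OpenAI CLIP normalization constants:

```python
def pad_to_square_box(width, height):
    """Side of the square canvas and the (left, top) paste offset that
    centers the original image -- mirrors the expand2square idea
    (illustrative sketch, not LLaVA's code)."""
    side = max(width, height)
    return side, ((side - width) // 2, (side - height) // 2)

# Padding fill color: CLIP's per-channel image_mean scaled to 0-255.
clip_mean = (0.48145466, 0.4578275, 0.40821073)
fill = tuple(int(x * 255) for x in clip_mean)

print(pad_to_square_box(640, 480))  # (640, (0, 80)): landscape, centered vertically
print(fill)                         # (122, 116, 104)
```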

Path 3: Anyres

For each image:

  1. Calls process_anyres_image(image, image_processor, model_cfg.image_grid_pinpoints)
  2. This selects the optimal resolution grid, creates patches, and includes a global downscaled view
  3. Returns a tensor with an extra patch dimension
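The grid selection in step 1 can be sketched as choosing, from image_grid_pinpoints, the resolution that preserves the most image content while wasting the least canvas area. This is a sketch of the idea behind resolution selection in llava/mm_utils.py, not a verbatim copy:

```python
def select_best_resolution(original, candidates):
    """Pick the candidate (w, h) that keeps the largest effective
    (downscaled) image area, breaking ties by least wasted canvas
    (illustrative sketch of the anyres grid-selection idea)."""
    ow, oh = original
    best, best_fit, min_waste = None, 0, float("inf")
    for w, h in candidates:
        scale = min(w / ow, h / oh)            # fit without cropping
        eff = min(int(ow * scale) * int(oh * scale), ow * oh)
        waste = w * h - eff                    # unused canvas pixels
        if eff > best_fit or (eff == best_fit and waste < min_waste):
            best, best_fit, min_waste = (w, h), eff, waste
    return best

pinpoints = [(672, 672), (336, 672), (672, 336), (336, 1008), (1008, 336)]
print(select_best_resolution((800, 600), pinpoints))  # (672, 672)
```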

Metadata

Field Value
Knowledge Sources Paper - Improved Baselines with Visual Instruction Tuning - https://arxiv.org/abs/2310.03744
Domains Computer_Vision, Image_Processing
Last Updated 2026-02-13 14:00 GMT
