Implementation: Haotian Liu LLaVA Process Images
Overview
Concrete tool for preprocessing images into CLIP-compatible tensors with configurable aspect ratio handling.
Source
- File: llava/mm_utils.py
- Lines: L166-182
Signature
```python
def process_images(
    images: List[PIL.Image.Image],
    image_processor: CLIPImageProcessor,
    model_cfg
) -> Union[torch.Tensor, List[torch.Tensor]]:
    """
    Preprocess images for the CLIP vision encoder.

    Dispatches to the appropriate preprocessing path based on
    model_cfg.image_aspect_ratio.

    Args:
        images: List of PIL images to preprocess.
        image_processor: CLIPImageProcessor instance from the vision tower.
        model_cfg: Model configuration with image_aspect_ratio attribute.

    Returns:
        Preprocessed image tensor(s).
    """
```
Import
```python
from llava.mm_utils import process_images
```
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| images | List[PIL.Image.Image] | Yes | List of PIL images to preprocess |
| image_processor | CLIPImageProcessor | Yes | CLIP image preprocessor (obtained from load_pretrained_model()) |
| model_cfg | model config | Yes | Model configuration object with image_aspect_ratio attribute |
Outputs
| Mode | Output Shape | Description |
|---|---|---|
| square (default) | (N, C, 336, 336) | Standard CLIP-preprocessed tensor |
| pad | (N, C, 336, 336) | Padded-to-square then CLIP-preprocessed tensor |
| anyres | (N, num_patches+1, C, 336, 336) | Multi-scale patches plus global view |
Usage Example
```python
import torch
from PIL import Image
from llava.mm_utils import process_images
from llava.model.builder import load_pretrained_model

# Load model
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="liuhaotian/llava-v1.5-13b",
    model_base=None,
    model_name="llava-v1.5-13b"
)

# Load and preprocess image
image = Image.open("photo.jpg").convert("RGB")
image_tensor = process_images([image], image_processor, model.config)
image_tensor = image_tensor.to(model.device, dtype=torch.float16)
```
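Note that process_images() may return a list of per-image tensors rather than one stacked tensor when shapes differ across images (as in anyres mode). A small helper such as the following, here called to_device (our name, not a LLaVA API), handles both return forms before moving data to the model device:

```python
def to_device(image_tensor, device, dtype):
    """Move process_images() output to the model device.

    Handles both return forms: a single stacked tensor, or a list of
    per-image tensors (returned when image shapes differ, e.g. anyres).
    """
    if isinstance(image_tensor, list):
        return [t.to(device, dtype=dtype) for t in image_tensor]
    return image_tensor.to(device, dtype=dtype)
```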
Description
process_images() dispatches to three code paths based on model_cfg.image_aspect_ratio:
Path 1: Default (square)
Calls the CLIP image_processor directly:

```python
image_processor.preprocess(images, return_tensors='pt')['pixel_values']
```
Path 2: Pad
For each image:
- Calls expand2square(image, tuple(int(x * 255) for x in image_processor.image_mean)) to pad the image to a square
- Passes the expanded square image through standard CLIP preprocessing

The padding color is derived from the CLIP processor's image_mean values, scaled to the 0-255 RGB range.
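The padding step can be sketched as a standalone function. The version below is an illustrative reproduction of the expand2square helper in llava/mm_utils.py, assuming its standard center-paste behavior; consult the source file for the canonical implementation:

```python
from PIL import Image

def expand2square(pil_img, background_color):
    """Pad a PIL image to a square canvas, centering the original.

    background_color is an RGB tuple, e.g. the CLIP image_mean
    scaled to the 0-255 range.
    """
    width, height = pil_img.size
    if width == height:
        return pil_img
    if width > height:
        # Landscape: pad top and bottom equally.
        result = Image.new(pil_img.mode, (width, width), background_color)
        result.paste(pil_img, (0, (width - height) // 2))
        return result
    # Portrait: pad left and right equally.
    result = Image.new(pil_img.mode, (height, height), background_color)
    result.paste(pil_img, ((height - width) // 2, 0))
    return result
```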
Path 3: Anyres
For each image:
- Calls process_anyres_image(image, image_processor, model_cfg.image_grid_pinpoints), which selects the optimal resolution grid, creates patches, and includes a global downscaled view
- Returns a tensor with an extra patch dimension
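The grid-selection step inside the anyres path can be illustrated with a sketch of best-resolution selection: score each candidate from image_grid_pinpoints by the effective (non-wasted) image area it preserves after aspect-preserving scaling, and pick the best fit. This is a hedged reconstruction of the selection heuristic under those assumptions, not a verbatim copy of LLaVA's internals:

```python
def select_best_resolution(original_size, possible_resolutions):
    """Pick the grid resolution that best fits the original image.

    Maximizes the effective (usable) area after aspect-preserving
    scaling, breaking ties by minimizing wasted canvas area.
    """
    original_width, original_height = original_size
    best_fit = None
    max_effective = 0
    min_wasted = float("inf")
    for width, height in possible_resolutions:
        # Aspect-preserving scale factor to fit inside the candidate canvas.
        scale = min(width / original_width, height / original_height)
        downscaled_w = int(original_width * scale)
        downscaled_h = int(original_height * scale)
        # Upscaling beyond the original adds no information, so cap it.
        effective = min(downscaled_w * downscaled_h,
                        original_width * original_height)
        wasted = width * height - effective
        if effective > max_effective or (
                effective == max_effective and wasted < min_wasted):
            max_effective = effective
            min_wasted = wasted
            best_fit = (width, height)
    return best_fit
```

For a 500x300 image and candidates [(336, 672), (672, 336), (672, 672)], this picks (672, 336): it preserves the full original area with the least unused canvas.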
Metadata
| Field | Value |
|---|---|
| Knowledge Sources | Paper - Improved Baselines with Visual Instruction Tuning - https://arxiv.org/abs/2310.03744 |
| Domains | Computer_Vision, Image_Processing |
| Last Updated | 2026-02-13 14:00 GMT |