Implementation: Haotian Liu LLaVA Process Images
Overview
Concrete tool for preprocessing images into CLIP-compatible tensors with configurable aspect ratio handling.
Source
- File: llava/mm_utils.py
- Lines: L166-182
Signature
```python
def process_images(
    images: List[PIL.Image.Image],
    image_processor: CLIPImageProcessor,
    model_cfg
) -> Union[torch.Tensor, List[torch.Tensor]]:
    """
    Preprocess images for the CLIP vision encoder.

    Dispatches to the appropriate preprocessing path based on
    model_cfg.image_aspect_ratio.

    Args:
        images: List of PIL images to preprocess.
        image_processor: CLIPImageProcessor instance from the vision tower.
        model_cfg: Model configuration with image_aspect_ratio attribute.

    Returns:
        Preprocessed image tensor(s).
    """
```
Import
```python
from llava.mm_utils import process_images
```
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| images | List[PIL.Image.Image] | Yes | List of PIL images to preprocess |
| image_processor | CLIPImageProcessor | Yes | CLIP image preprocessor (obtained from load_pretrained_model()) |
| model_cfg | model config | Yes | Model configuration object with image_aspect_ratio attribute |
Outputs
| Mode | Output Shape | Description |
|---|---|---|
| square (default) | (N, C, 336, 336) | Standard CLIP-preprocessed tensor |
| pad | (N, C, 336, 336) | Padded-to-square then CLIP-preprocessed tensor |
| anyres | (N, num_patches+1, C, 336, 336) | Multi-scale patches plus global view |
Usage Example
```python
import torch
from PIL import Image
from llava.mm_utils import process_images
from llava.model.builder import load_pretrained_model

# Load model
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="liuhaotian/llava-v1.5-13b",
    model_base=None,
    model_name="llava-v1.5-13b"
)

# Load and preprocess image
image = Image.open("photo.jpg").convert("RGB")
image_tensor = process_images([image], image_processor, model.config)
image_tensor = image_tensor.to(model.device, dtype=torch.float16)
```
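Note that process_images() may return a list of per-image tensors rather than one stacked tensor when shapes differ across images (as in anyres mode). A small helper such as the following, here called to_device (our name, not a LLaVA API), handles both return forms before moving data to the model device:

```python
def to_device(image_tensor, device, dtype):
    """Move process_images() output to the model device.

    Handles both return forms: a single stacked tensor, or a list of
    per-image tensors (returned when image shapes differ, e.g. anyres).
    """
    if isinstance(image_tensor, list):
        return [t.to(device, dtype=dtype) for t in image_tensor]
    return image_tensor.to(device, dtype=dtype)
```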
Description
process_images() dispatches to three code paths based on model_cfg.image_aspect_ratio:
Path 1: Default (square)
Calls the CLIP image_processor directly:

```python
image_processor.preprocess(images, return_tensors='pt')['pixel_values']
```
Path 2: Pad
For each image:
- Calls expand2square(image, tuple(int(x * 255) for x in image_processor.image_mean)) to pad the image to a square
- Passes the expanded square image through standard CLIP preprocessing

The padding color is derived from the CLIP processor's image_mean values, scaled to the 0-255 RGB range.
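The padding step can be sketched as a standalone function. The version below is an illustrative reproduction of the expand2square helper in llava/mm_utils.py, assuming its standard center-paste behavior; consult the source file for the canonical implementation:

```python
from PIL import Image

def expand2square(pil_img, background_color):
    """Pad a PIL image to a square canvas, centering the original.

    background_color is an RGB tuple, e.g. the CLIP image_mean
    scaled to the 0-255 range.
    """
    width, height = pil_img.size
    if width == height:
        return pil_img
    if width > height:
        # Landscape: pad top and bottom equally.
        result = Image.new(pil_img.mode, (width, width), background_color)
        result.paste(pil_img, (0, (width - height) // 2))
        return result
    # Portrait: pad left and right equally.
    result = Image.new(pil_img.mode, (height, height), background_color)
    result.paste(pil_img, ((height - width) // 2, 0))
    return result
```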
Path 3: Anyres
For each image:
- Calls process_anyres_image(image, image_processor, model_cfg.image_grid_pinpoints), which selects the optimal resolution grid, creates patches, and includes a global downscaled view
- Returns a tensor with an extra patch dimension
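The grid-selection step inside the anyres path can be illustrated with a sketch of best-resolution selection: score each candidate from image_grid_pinpoints by the effective (non-wasted) image area it preserves after aspect-preserving scaling, and pick the best fit. This is a hedged reconstruction of the selection heuristic under those assumptions, not a verbatim copy of LLaVA's internals:

```python
def select_best_resolution(original_size, possible_resolutions):
    """Pick the grid resolution that best fits the original image.

    Maximizes the effective (usable) area after aspect-preserving
    scaling, breaking ties by minimizing wasted canvas area.
    """
    original_width, original_height = original_size
    best_fit = None
    max_effective = 0
    min_wasted = float("inf")
    for width, height in possible_resolutions:
        # Aspect-preserving scale factor to fit inside the candidate canvas.
        scale = min(width / original_width, height / original_height)
        downscaled_w = int(original_width * scale)
        downscaled_h = int(original_height * scale)
        # Upscaling beyond the original adds no information, so cap it.
        effective = min(downscaled_w * downscaled_h,
                        original_width * original_height)
        wasted = width * height - effective
        if effective > max_effective or (
                effective == max_effective and wasted < min_wasted):
            max_effective = effective
            min_wasted = wasted
            best_fit = (width, height)
    return best_fit
```

For a 500x300 image and candidates [(336, 672), (672, 336), (672, 672)], this picks (672, 336): it preserves the full original area with the least unused canvas.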
Metadata
| Field | Value |
|---|---|
| Knowledge Sources | Paper - Improved Baselines with Visual Instruction Tuning - https://arxiv.org/abs/2310.03744 |
| Domains | Computer_Vision, Image_Processing |
| Last Updated | 2026-02-13 14:00 GMT |