Principle: LLaVA Image Preprocessing (Haotian Liu)
Overview
Technique for transforming raw images into normalized tensor representations compatible with a CLIP vision encoder.
Description
Image preprocessing converts PIL images into CLIP-compatible tensors. LLaVA supports three aspect ratio strategies, each balancing information preservation against computational cost:
Strategy 1: Square (default)
Direct CLIP preprocessing to 336x336 pixels. The image is resized and center-cropped to a square. This is the simplest and fastest strategy but may lose information from non-square images.
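A minimal sketch of the square strategy using only Pillow and NumPy. The function name is illustrative; the normalization constants are the published OpenAI CLIP mean/std:

```python
from PIL import Image
import numpy as np

# OpenAI CLIP normalization constants (these are CLIP's own statistics)
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073])
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711])

def square_preprocess(img: Image.Image, size: int = 336) -> np.ndarray:
    """Resize the shorter side to `size`, center-crop to size x size, normalize."""
    w, h = img.size
    scale = size / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BICUBIC)
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    img = img.crop((left, top, left + size, top + size))
    arr = np.asarray(img.convert("RGB"), dtype=np.float32) / 255.0
    arr = (arr - CLIP_MEAN) / CLIP_STD
    return arr.transpose(2, 0, 1)  # HWC -> CHW layout for the vision encoder

tensor = square_preprocess(Image.new("RGB", (640, 480), "white"))
print(tensor.shape)  # (3, 336, 336)
```

Note how a 640x480 input loses its left and right margins to the center crop, which is the information loss mentioned above.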
Strategy 2: Pad
Expand the image to a square by padding with the CLIP image mean background color, then apply standard CLIP preprocessing. This preserves the full image content and aspect ratio information without distortion.
Process:
- Determine the longer dimension of the input image
- Create a new square image of that dimension, filled with the CLIP mean pixel values (approximately [0.48, 0.46, 0.41] in RGB)
- Paste the original image centered on the square canvas
- Apply standard CLIP preprocessing (resize to 336x336, normalize)
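The padding steps above can be sketched as follows; the function name is illustrative, and the fill color is the CLIP mean scaled to 0-255:

```python
from PIL import Image

# CLIP pixel mean ([0.481, 0.458, 0.408]) scaled to the 0-255 range
CLIP_MEAN_RGB = (122, 116, 104)

def expand_to_square(img: Image.Image, fill=CLIP_MEAN_RGB) -> Image.Image:
    """Pad the image to a square canvas, centering the original content."""
    w, h = img.size
    if w == h:
        return img
    side = max(w, h)
    canvas = Image.new("RGB", (side, side), fill)
    canvas.paste(img, ((side - w) // 2, (side - h) // 2))
    return canvas

padded = expand_to_square(Image.new("RGB", (640, 480), "red"))
print(padded.size)  # (640, 640)
```

The padded square then goes through the same resize-and-normalize path as the square strategy, so no content is cropped away.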
Strategy 3: Anyres (multi-resolution)
Multi-scale processing that provides high-resolution visual understanding:
- Select the best resolution from configured grid pinpoints (e.g., 672x672, 336x672, 672x336)
- Resize and pad the image to the selected resolution
- Divide the image into patches (each 336x336)
- Include a downscaled global view (the full image resized to 336x336)
- Return all patches plus the global view as a tensor
The result is a tensor (or list of tensors for anyres) ready for the vision encoder.
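A simplified sketch of the anyres flow, assuming the example pinpoints above. The resolution-selection heuristic here (closest aspect ratio) is a stand-in; the real selection also weighs effective versus wasted resolution, and the real pipeline pads before slicing:

```python
from PIL import Image

PATCH = 336
GRID_PINPOINTS = [(672, 672), (336, 672), (672, 336)]  # example config

def select_best_resolution(size, pinpoints):
    """Pick the pinpoint whose aspect ratio is closest to the image's (simplified)."""
    w, h = size
    return min(pinpoints, key=lambda p: abs(p[0] / p[1] - w / h))

def anyres_views(img: Image.Image):
    """Return the 336x336 patches of the resized image plus a global view."""
    tw, th = select_best_resolution(img.size, GRID_PINPOINTS)
    resized = img.resize((tw, th), Image.BICUBIC)  # sketch: resize without padding
    patches = [
        resized.crop((x, y, x + PATCH, y + PATCH))
        for y in range(0, th, PATCH)
        for x in range(0, tw, PATCH)
    ]
    global_view = img.resize((PATCH, PATCH), Image.BICUBIC)
    return patches + [global_view]

views = anyres_views(Image.new("RGB", (800, 600), "blue"))
print(len(views))  # 4 patches (2x2 grid at 672x672) + 1 global view = 5
```

Each returned view would then be normalized individually, exactly as in the square strategy.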
Usage
Use whenever feeding images to a LLaVA model for inference or evaluation. The appropriate mode is determined by the model's config.image_aspect_ratio attribute:
- 'square' or unset -- Direct CLIP preprocessing
- 'pad' -- Padded square preprocessing
- 'anyres' -- Multi-scale patch processing
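The dispatch on the config attribute can be sketched as below; the three helper functions are hypothetical stand-ins for the strategies described earlier:

```python
from types import SimpleNamespace

# Illustrative stand-ins for the three strategies (hypothetical names).
def square_preprocess(img): return ("square", img)
def pad_preprocess(img): return ("pad", img)
def anyres_preprocess(img): return ("anyres", img)

def preprocess(image, config):
    """Route to the strategy named by config.image_aspect_ratio."""
    mode = getattr(config, "image_aspect_ratio", "square")
    if mode == "pad":
        return pad_preprocess(image)
    if mode == "anyres":
        return anyres_preprocess(image)
    return square_preprocess(image)  # 'square' or unset falls through here

print(preprocess("img", SimpleNamespace(image_aspect_ratio="pad"))[0])  # pad
```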
Theoretical Basis
CLIP ViT-L/14-336px expects 336x336 RGB inputs normalized with CLIP's own training statistics (mean approximately [0.481, 0.458, 0.408], std approximately [0.269, 0.261, 0.276]), which differ slightly from the common ImageNet statistics.
Padding preserves aspect ratio information that would be lost with naive resizing or cropping. The padding color matches the CLIP training distribution's mean, minimizing the impact of added pixels on the vision encoder's activations.
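A one-line check of the claim above: a padded pixel equal to the CLIP mean maps exactly to zero after normalization, so the padding contributes near-zero inputs to the encoder:

```python
import numpy as np

CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073])
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711])

# Normalization is (pixel - mean) / std, so a pixel at the mean becomes zero.
pad_pixel = CLIP_MEAN
normalized = (pad_pixel - CLIP_MEAN) / CLIP_STD
print(normalized)  # [0. 0. 0.]
```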
Anyres subdivides high-resolution images into grid patches for multi-scale visual understanding. Each patch is processed independently by CLIP, and the resulting features are concatenated. This allows the model to perceive fine-grained details (via patches) while maintaining global context (via the downscaled global view). The approach is inspired by methods that process images at multiple scales to capture both local and global features.
Metadata
| Field | Value |
|---|---|
| Knowledge Sources | Paper - Improved Baselines with Visual Instruction Tuning - https://arxiv.org/abs/2310.03744 |
| Domains | Computer_Vision, Image_Processing |
| Last Updated | 2026-02-13 14:00 GMT |