Principle: LLaVA Image Preprocessing (Haotian Liu)
Overview
Technique for transforming raw images into normalized tensor representations compatible with a CLIP vision encoder.
Description
Image preprocessing converts PIL images into CLIP-compatible tensors. LLaVA supports three aspect ratio strategies, each balancing information preservation against computational cost:
Strategy 1: Square (default)
Direct CLIP preprocessing to 336x336 pixels. The image is resized and center-cropped to a square. This is the simplest and fastest strategy but may lose information from non-square images.
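A minimal sketch of the square strategy using only Pillow and NumPy. The function name is illustrative; the normalization constants are the published OpenAI CLIP mean/std:

```python
from PIL import Image
import numpy as np

# OpenAI CLIP normalization constants (these are CLIP's own statistics)
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073])
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711])

def square_preprocess(img: Image.Image, size: int = 336) -> np.ndarray:
    """Resize the shorter side to `size`, center-crop to size x size, normalize."""
    w, h = img.size
    scale = size / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BICUBIC)
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    img = img.crop((left, top, left + size, top + size))
    arr = np.asarray(img.convert("RGB"), dtype=np.float32) / 255.0
    arr = (arr - CLIP_MEAN) / CLIP_STD
    return arr.transpose(2, 0, 1)  # HWC -> CHW layout for the vision encoder

tensor = square_preprocess(Image.new("RGB", (640, 480), "white"))
print(tensor.shape)  # (3, 336, 336)
```

Note how a 640x480 input loses its left and right margins to the center crop, which is the information loss mentioned above.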
Strategy 2: Pad
Expand the image to a square by padding with the CLIP image mean background color, then apply standard CLIP preprocessing. This preserves the full image content and aspect ratio information without distortion.
Process:
- Determine the longer dimension of the input image
- Create a new square image of that dimension, filled with the CLIP mean pixel values (approximately [0.48, 0.46, 0.41] in RGB)
- Paste the original image centered on the square canvas
- Apply standard CLIP preprocessing (resize to 336x336, normalize)
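The padding steps above can be sketched as follows; the function name is illustrative, and the fill color is the CLIP mean scaled to 0-255:

```python
from PIL import Image

# CLIP pixel mean ([0.481, 0.458, 0.408]) scaled to the 0-255 range
CLIP_MEAN_RGB = (122, 116, 104)

def expand_to_square(img: Image.Image, fill=CLIP_MEAN_RGB) -> Image.Image:
    """Pad the image to a square canvas, centering the original content."""
    w, h = img.size
    if w == h:
        return img
    side = max(w, h)
    canvas = Image.new("RGB", (side, side), fill)
    canvas.paste(img, ((side - w) // 2, (side - h) // 2))
    return canvas

padded = expand_to_square(Image.new("RGB", (640, 480), "red"))
print(padded.size)  # (640, 640)
```

The padded square then goes through the same resize-and-normalize path as the square strategy, so no content is cropped away.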
Strategy 3: Anyres (multi-resolution)
Multi-scale processing that provides high-resolution visual understanding:
- Select the best resolution from configured grid pinpoints (e.g., 672x672, 336x672, 672x336)
- Resize and pad the image to the selected resolution
- Divide the image into patches (each 336x336)
- Include a downscaled global view (the full image resized to 336x336)
- Return all patches plus the global view as a tensor
The result is a tensor (or list of tensors for anyres) ready for the vision encoder.
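A simplified sketch of the anyres flow, assuming the example pinpoints above. The resolution-selection heuristic here (closest aspect ratio) is a stand-in; the real selection also weighs effective versus wasted resolution, and the real pipeline pads before slicing:

```python
from PIL import Image

PATCH = 336
GRID_PINPOINTS = [(672, 672), (336, 672), (672, 336)]  # example config

def select_best_resolution(size, pinpoints):
    """Pick the pinpoint whose aspect ratio is closest to the image's (simplified)."""
    w, h = size
    return min(pinpoints, key=lambda p: abs(p[0] / p[1] - w / h))

def anyres_views(img: Image.Image):
    """Return the 336x336 patches of the resized image plus a global view."""
    tw, th = select_best_resolution(img.size, GRID_PINPOINTS)
    resized = img.resize((tw, th), Image.BICUBIC)  # sketch: resize without padding
    patches = [
        resized.crop((x, y, x + PATCH, y + PATCH))
        for y in range(0, th, PATCH)
        for x in range(0, tw, PATCH)
    ]
    global_view = img.resize((PATCH, PATCH), Image.BICUBIC)
    return patches + [global_view]

views = anyres_views(Image.new("RGB", (800, 600), "blue"))
print(len(views))  # 4 patches (2x2 grid at 672x672) + 1 global view = 5
```

Each returned view would then be normalized individually, exactly as in the square strategy.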
Usage
Use whenever feeding images to a LLaVA model for inference or evaluation. The appropriate mode is determined by the model's config.image_aspect_ratio attribute:
- 'square' or unset -- Direct CLIP preprocessing
- 'pad' -- Padded square preprocessing
- 'anyres' -- Multi-scale patch processing
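The dispatch on the config attribute can be sketched as below; the three helper functions are hypothetical stand-ins for the strategies described earlier:

```python
from types import SimpleNamespace

# Illustrative stand-ins for the three strategies (hypothetical names).
def square_preprocess(img): return ("square", img)
def pad_preprocess(img): return ("pad", img)
def anyres_preprocess(img): return ("anyres", img)

def preprocess(image, config):
    """Route to the strategy named by config.image_aspect_ratio."""
    mode = getattr(config, "image_aspect_ratio", "square")
    if mode == "pad":
        return pad_preprocess(image)
    if mode == "anyres":
        return anyres_preprocess(image)
    return square_preprocess(image)  # 'square' or unset falls through here

print(preprocess("img", SimpleNamespace(image_aspect_ratio="pad"))[0])  # pad
```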
Theoretical Basis
CLIP ViT-L/14-336px expects 336x336 RGB inputs normalized with CLIP's own training statistics (mean approximately [0.481, 0.458, 0.408], std approximately [0.269, 0.261, 0.276]), which differ slightly from the common ImageNet statistics.
Padding preserves aspect ratio information that would be lost with naive resizing or cropping. The padding color matches the CLIP training distribution's mean, minimizing the impact of added pixels on the vision encoder's activations.
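A one-line check of the claim above: a padded pixel equal to the CLIP mean maps exactly to zero after normalization, so the padding contributes near-zero inputs to the encoder:

```python
import numpy as np

CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073])
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711])

# Normalization is (pixel - mean) / std, so a pixel at the mean becomes zero.
pad_pixel = CLIP_MEAN
normalized = (pad_pixel - CLIP_MEAN) / CLIP_STD
print(normalized)  # [0. 0. 0.]
```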
Anyres subdivides high-resolution images into grid patches for multi-scale visual understanding. Each patch is processed independently by CLIP, and the resulting features are concatenated. This allows the model to perceive fine-grained details (via patches) while maintaining global context (via the downscaled global view). The approach is inspired by methods that process images at multiple scales to capture both local and global features.
Metadata
| Field | Value |
|---|---|
| Knowledge Sources | Paper - Improved Baselines with Visual Instruction Tuning - https://arxiv.org/abs/2310.03744 |
| Domains | Computer_Vision, Image_Processing |
| Last Updated | 2026-02-13 14:00 GMT |