
Principle:Haotian Liu LLaVA Image Preprocessing

From Leeroopedia

Overview

Technique for transforming raw images into normalized tensor representations compatible with a CLIP vision encoder.

Description

Image preprocessing converts PIL images into CLIP-compatible tensors. LLaVA supports three aspect ratio strategies, each balancing information preservation against computational cost:

Strategy 1: Square (default)

Direct CLIP preprocessing to 336x336 pixels. The image is resized so its shorter side measures 336 pixels, then center-cropped to a square. This is the simplest and fastest strategy, but it crops away edge content from non-square images.
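The geometric part of this strategy can be sketched in PIL (a minimal illustration only; in practice the Hugging Face CLIP image processor performs the resize and crop, plus rescaling and normalization, which are omitted here):

```python
from PIL import Image


def square_preprocess(img: Image.Image, size: int = 336) -> Image.Image:
    """Resize the shorter side to `size`, then center-crop a square.

    Sketch of the resize-and-center-crop step of CLIP preprocessing;
    tensor conversion and normalization are not shown.
    """
    w, h = img.size
    scale = size / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BICUBIC)
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    return img.crop((left, top, left + size, top + size))
```

For a 500x300 input, the shorter side (300) is scaled up to 336 and the excess width is cropped symmetrically, which is where edge content is lost.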

Strategy 2: Pad

Expand the image to a square by padding with the CLIP image mean background color, then apply standard CLIP preprocessing. This preserves the full image content and aspect ratio information without distortion.

Process:

  1. Determine the longer dimension of the input image
  2. Create a new square image of that dimension, filled with the CLIP mean pixel values (approximately [0.48, 0.46, 0.41] in RGB)
  3. Paste the original image centered on the square canvas
  4. Apply standard CLIP preprocessing (resize to 336x336, normalize)
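The padding steps above can be sketched as follows (modeled on LLaVA's expand2square helper; the background color is assumed to be the processor's image mean scaled to 0-255):

```python
from PIL import Image


def expand2square(pil_img: Image.Image, background_color) -> Image.Image:
    """Pad `pil_img` to a square canvas filled with `background_color`.

    `background_color` should be the CLIP image mean scaled to 0-255,
    e.g. tuple(int(m * 255) for m in processor.image_mean).
    """
    width, height = pil_img.size
    if width == height:
        return pil_img
    if width > height:
        # landscape: pad top and bottom equally
        result = Image.new(pil_img.mode, (width, width), background_color)
        result.paste(pil_img, (0, (width - height) // 2))
        return result
    # portrait: pad left and right equally
    result = Image.new(pil_img.mode, (height, height), background_color)
    result.paste(pil_img, ((height - width) // 2, 0))
    return result
```

The squared image is then handed to the standard CLIP processor for the 336x336 resize and normalization.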

Strategy 3: Anyres (multi-resolution)

Multi-scale processing that provides high-resolution visual understanding:

  1. Select the best resolution from configured grid pinpoints (e.g., 672x672, 336x672, 672x336)
  2. Resize and pad the image to the selected resolution
  3. Divide the image into patches (each 336x336)
  4. Include a downscaled global view (the full image resized to 336x336)
  5. Return all patches plus the global view as a tensor
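The resolution-selection and patching steps can be sketched as below (in the spirit of LLaVA's select_best_resolution and divide_to_patches utilities; simplified, and the resize-and-pad step between them is omitted):

```python
from PIL import Image


def select_best_resolution(orig_size, possible_resolutions):
    """Pick the grid pinpoint that keeps the most image content.

    Maximizes the effective (downscaled) resolution, breaking ties by
    minimizing wasted (padded) area.
    """
    orig_w, orig_h = orig_size
    best_fit, max_effective, min_wasted = None, 0, float("inf")
    for w, h in possible_resolutions:
        scale = min(w / orig_w, h / orig_h)
        down_w, down_h = int(orig_w * scale), int(orig_h * scale)
        effective = min(down_w * down_h, orig_w * orig_h)
        wasted = w * h - effective
        if effective > max_effective or (
            effective == max_effective and wasted < min_wasted
        ):
            best_fit, max_effective, min_wasted = (w, h), effective, wasted
    return best_fit


def divide_to_patches(image: Image.Image, patch_size: int):
    """Split a resized-and-padded image into patch_size x patch_size tiles."""
    patches = []
    w, h = image.size
    for y in range(0, h, patch_size):
        for x in range(0, w, patch_size):
            patches.append(image.crop((x, y, x + patch_size, y + patch_size)))
    return patches
```

For a 1000x500 input with the example pinpoints, 672x336 wins: it matches the 2:1 aspect ratio, so no area is wasted on padding.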

The result is a tensor (or list of tensors for anyres) ready for the vision encoder.

Usage

Use whenever feeding images to a LLaVA model for inference or evaluation. The appropriate mode is determined by the model's config.image_aspect_ratio attribute:

  • 'square' or unset -- Direct CLIP preprocessing
  • 'pad' -- Padded square preprocessing
  • 'anyres' -- Multi-scale patch processing
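Reading this attribute might look like the following sketch (the resolver function is illustrative, not the actual LLaVA API; only the attribute name and the three mode strings come from the source):

```python
from types import SimpleNamespace

VALID_MODES = ("square", "pad", "anyres")


def resolve_aspect_mode(model_config) -> str:
    """Read image_aspect_ratio from a model config; default to 'square'."""
    mode = getattr(model_config, "image_aspect_ratio", None) or "square"
    if mode not in VALID_MODES:
        raise ValueError(f"unknown image_aspect_ratio: {mode!r}")
    return mode
```

The unset case falls back to 'square', matching the default behavior described above.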

Theoretical Basis

CLIP ViT-L/14 at 336px resolution expects 336x336 RGB inputs normalized with CLIP's own per-channel dataset statistics (mean approximately [0.481, 0.458, 0.408], std approximately [0.269, 0.261, 0.276]).

Padding preserves aspect ratio information that would be lost with naive resizing or cropping. The padding color matches the CLIP training distribution's mean, minimizing the impact of added pixels on the vision encoder's activations.
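This can be checked numerically: under the normalization (x - mean) / std, a pixel exactly equal to the mean maps to zero in every channel, so mean-colored padding contributes (near-)zero values to the encoder input. The constants below are the OpenAI CLIP normalization statistics:

```python
# OpenAI CLIP normalization constants (per RGB channel)
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(rgb):
    """Apply CLIP's (x - mean) / std normalization to one RGB pixel in [0, 1]."""
    return tuple((c - m) / s for c, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))
```

A padding pixel set to the CLIP mean normalizes to (0.0, 0.0, 0.0); in practice the mean is quantized to 8-bit values first, so the result is merely near zero.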

Anyres subdivides high-resolution images into grid patches for multi-scale visual understanding. Each patch is processed independently by CLIP, and the resulting features are concatenated. This allows the model to perceive fine-grained details (via patches) while maintaining global context (via the downscaled global view). The approach is inspired by methods that process images at multiple scales to capture both local and global features.

Metadata

Field Value
Knowledge Sources Paper - Improved Baselines with Visual Instruction Tuning - https://arxiv.org/abs/2310.03744
Domains Computer_Vision, Image_Processing
Last Updated 2026-02-13 14:00 GMT

Related Pages

  • Principle
  • Implementation
  • Heuristic
  • Environment