Heuristic: Haotian Liu LLaVA Image Aspect Ratio Padding Strategy
| Knowledge Sources | |
|---|---|
| Domains | Computer_Vision, Optimization |
| Last Updated | 2026-02-13 23:00 GMT |
Overview
Use `--image_aspect_ratio pad` during finetuning to pad non-square images to squares using the CLIP image mean as background color, preserving aspect ratio information for better model performance.
Description
LLaVA supports several image aspect ratio handling strategies: `square` (default crop/resize), `pad` (pad to square with the mean color), and `anyres` (variable resolution with grid patching). The `pad` strategy is used in V1.5 finetuning to preserve the original image content without cropping: non-square images are padded to a square canvas using the CLIP image processor's mean pixel values as the background color, so the padded region matches the normalization statistics the vision encoder expects. The `anyres` strategy (used in later versions) divides images into patches at their best-fit resolution for higher-detail processing. The `group_by_modality_length` flag is used alongside `pad` to batch samples by modality (image vs. text-only) for more efficient training.
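As a minimal standalone illustration (not the repo's code), this sketch pads a 640x480 landscape image to a square canvas filled with CLIP's mean color, the same geometry the `pad` strategy produces. The mean values are OpenAI CLIP's normalization constants:

```python
from PIL import Image

# CLIP's normalization mean (OPENAI_CLIP_MEAN), scaled to 0-255 RGB
clip_mean = (0.48145466, 0.4578275, 0.40821073)
background = tuple(int(x * 255) for x in clip_mean)  # (122, 116, 104)

img = Image.new('RGB', (640, 480))            # stand-in for a landscape input
canvas = Image.new('RGB', (640, 640), background)
canvas.paste(img, (0, (640 - 480) // 2))      # center the content vertically
print(canvas.size)  # (640, 640)
```

The content keeps its aspect ratio; only the two mean-colored bands above and below are new pixels.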
Usage
Use `--image_aspect_ratio pad` with `--group_by_modality_length True` during Stage 2 finetuning (as done in V1.5 scripts). For pretraining (Stage 1), the default square aspect ratio is sufficient since it uses simpler image-caption pairs. Do not use `pad` for pretraining unless specifically needed.
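For context, a Stage 2 launch would carry these flags roughly as follows (hypothetical excerpt; flag names match `scripts/v1_5/finetune.sh`, all other arguments are omitted here):

```shell
# Hypothetical excerpt of a LLaVA-1.5 finetune launch; only the two
# aspect-ratio-related flags are shown, everything else is elided.
deepspeed llava/train/train_mem.py \
    --image_aspect_ratio pad \
    --group_by_modality_length True
```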
The Insight (Rule of Thumb)
- Action: Set `--image_aspect_ratio pad` and `--group_by_modality_length True` for finetuning. Use default (square) for pretraining.
- Value: Preserves full image content without cropping, improving model understanding of non-square images.
- Trade-off: Padding shrinks the effective resolution of the image content within the fixed square input, but no information is lost to cropping.
- Batch efficiency: `group_by_modality_length` groups image samples and text-only samples into separate batches, preventing inefficient padding when mixing modalities.
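The modality split can be sketched as follows. In LLaVA's dataset, the `modality_lengths` property reportedly marks text-only samples with negative lengths so the sampler can separate modalities before forming length-sorted batches (a simplified sketch, not the repo's sampler):

```python
# +length: sample contains an image; -length: text-only sample
lengths = [120, -45, 200, -60, 150, -30]

mm_indices = [i for i, l in enumerate(lengths) if l > 0]    # image batches
lang_indices = [i for i, l in enumerate(lengths) if l < 0]  # text-only batches
print(mm_indices, lang_indices)  # [0, 2, 4] [1, 3, 5]
```

Each group is then batched independently, so a batch never mixes image and text-only samples.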
Reasoning
Visual question answering and instruction following require the model to see the full image, including edges that would be lost to center-crop resizing. Using the CLIP image mean as the padding color makes the padding nearly "invisible" to the CLIP vision encoder: after CLIP's normalization, mean-colored pixels map to approximately zero. The modality-grouped sampling ensures that batches containing images have similar sequence lengths (text + image tokens), while text-only batches avoid the overhead of dummy image tensors.
When an image sample does not actually contain an image but the model is multimodal, the code creates a zero tensor of the correct crop size as a placeholder — this is the fallback behavior to maintain batch consistency.
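A sketch of that fallback, assuming the 336x336 crop size of CLIP ViT-L/14-336 (the encoder used by LLaVA-1.5):

```python
import torch

# Text-only sample in a multimodal batch: substitute an all-zero image
# tensor with the processor's crop size to keep batch shapes consistent.
crop_size = {'height': 336, 'width': 336}
dummy_image = torch.zeros(3, crop_size['height'], crop_size['width'])
print(tuple(dummy_image.shape))  # (3, 336, 336)
```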
Code Evidence
Padding implementation from `train.py:702-716`:
```python
if self.data_args.image_aspect_ratio == 'pad':
    def expand2square(pil_img, background_color):
        width, height = pil_img.size
        if width == height:
            return pil_img
        elif width > height:
            result = Image.new(pil_img.mode, (width, width), background_color)
            result.paste(pil_img, (0, (width - height) // 2))
            return result
        else:
            result = Image.new(pil_img.mode, (height, height), background_color)
            result.paste(pil_img, ((height - width) // 2, 0))
            return result
    image = expand2square(image, tuple(int(x*255) for x in processor.image_mean))
    image = processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
```
Dummy image fallback for text-only samples from `train.py:735-738`:
```python
elif self.data_args.is_multimodal:
    # image does not exist in the data, but the model is multimodal
    crop_size = self.data_args.image_processor.crop_size
    data_dict['image'] = torch.zeros(3, crop_size['height'], crop_size['width'])
```
Modality-grouped sampling from `llava_trainer.py:139-146`:
```python
if self.args.group_by_modality_length:
    lengths = self.train_dataset.modality_lengths
    return LengthGroupedSampler(
        self.args.train_batch_size,
        world_size=self.args.world_size * self.args.gradient_accumulation_steps,
        lengths=lengths,
        group_by_modality=True,
    )
```
V1.5 finetune script using pad from `scripts/v1_5/finetune.sh:15-16`:
```shell
--image_aspect_ratio pad \
--group_by_modality_length True \
```