Heuristic: Haotian Liu LLaVA Image Aspect Ratio Padding Strategy
| Knowledge Sources | |
|---|---|
| Domains | Computer_Vision, Optimization |
| Last Updated | 2026-02-13 23:00 GMT |
Overview
Use `--image_aspect_ratio pad` during finetuning to pad non-square images to squares using the CLIP image mean as background color, preserving aspect ratio information for better model performance.
Description
LLaVA supports several image aspect ratio handling strategies: `square` (default crop/resize), `pad` (pad to square with the mean color), and `anyres` (variable resolution with grid patching). The `pad` strategy is used in V1.5 finetuning to preserve the original image content without cropping: non-square images are padded to a square canvas using the CLIP image processor's mean pixel values as the background color, so the padded region matches the normalization statistics the vision encoder expects. The `anyres` strategy (used in later versions) divides images into patches at their best-fit resolution for higher-detail processing. The `group_by_modality_length` flag is used alongside `pad` to batch samples by modality (image vs. text-only) for more efficient training.
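As a minimal standalone illustration (not the repo's code), this sketch pads a 640x480 landscape image to a square canvas filled with CLIP's mean color, the same geometry the `pad` strategy produces. The mean values are OpenAI CLIP's normalization constants:

```python
from PIL import Image

# CLIP's normalization mean (OPENAI_CLIP_MEAN), scaled to 0-255 RGB
clip_mean = (0.48145466, 0.4578275, 0.40821073)
background = tuple(int(x * 255) for x in clip_mean)  # (122, 116, 104)

img = Image.new('RGB', (640, 480))            # stand-in for a landscape input
canvas = Image.new('RGB', (640, 640), background)
canvas.paste(img, (0, (640 - 480) // 2))      # center the content vertically
print(canvas.size)  # (640, 640)
```

The content keeps its aspect ratio; only the two mean-colored bands above and below are new pixels.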
Usage
Use `--image_aspect_ratio pad` with `--group_by_modality_length True` during Stage 2 finetuning (as done in V1.5 scripts). For pretraining (Stage 1), the default square aspect ratio is sufficient since it uses simpler image-caption pairs. Do not use `pad` for pretraining unless specifically needed.
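For context, a Stage 2 launch would carry these flags roughly as follows (hypothetical excerpt; flag names match `scripts/v1_5/finetune.sh`, all other arguments are omitted here):

```shell
# Hypothetical excerpt of a LLaVA-1.5 finetune launch; only the two
# aspect-ratio-related flags are shown, everything else is elided.
deepspeed llava/train/train_mem.py \
    --image_aspect_ratio pad \
    --group_by_modality_length True
```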
The Insight (Rule of Thumb)
- Action: Set `--image_aspect_ratio pad` and `--group_by_modality_length True` for finetuning. Use default (square) for pretraining.
- Value: Preserves full image content without cropping, improving model understanding of non-square images.
- Trade-off: Padding shrinks the effective resolution of the image content within the fixed square input, but no information is lost to cropping.
- Batch efficiency: `group_by_modality_length` groups image samples and text-only samples into separate batches, preventing inefficient padding when mixing modalities.
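The modality split can be sketched as follows. In LLaVA's dataset, the `modality_lengths` property reportedly marks text-only samples with negative lengths so the sampler can separate modalities before forming length-sorted batches (a simplified sketch, not the repo's sampler):

```python
# +length: sample contains an image; -length: text-only sample
lengths = [120, -45, 200, -60, 150, -30]

mm_indices = [i for i, l in enumerate(lengths) if l > 0]    # image batches
lang_indices = [i for i, l in enumerate(lengths) if l < 0]  # text-only batches
print(mm_indices, lang_indices)  # [0, 2, 4] [1, 3, 5]
```

Each group is then batched independently, so a batch never mixes image and text-only samples.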
Reasoning
Visual question answering and instruction following require the model to see the full image, including edges that would be lost to center-crop resizing. Using the CLIP image mean as the padding color makes the padding nearly "invisible" to the CLIP vision encoder: after CLIP's normalization, mean-colored pixels map to approximately zero. The modality-grouped sampling ensures that batches containing images have similar sequence lengths (text + image tokens), while text-only batches avoid the overhead of dummy image tensors.
When an image sample does not actually contain an image but the model is multimodal, the code creates a zero tensor of the correct crop size as a placeholder — this is the fallback behavior to maintain batch consistency.
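A sketch of that fallback, assuming the 336x336 crop size of CLIP ViT-L/14-336 (the encoder used by LLaVA-1.5):

```python
import torch

# Text-only sample in a multimodal batch: substitute an all-zero image
# tensor with the processor's crop size to keep batch shapes consistent.
crop_size = {'height': 336, 'width': 336}
dummy_image = torch.zeros(3, crop_size['height'], crop_size['width'])
print(tuple(dummy_image.shape))  # (3, 336, 336)
```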
Code Evidence
Padding implementation from `train.py:702-716`:
```python
if self.data_args.image_aspect_ratio == 'pad':
    def expand2square(pil_img, background_color):
        width, height = pil_img.size
        if width == height:
            return pil_img
        elif width > height:
            result = Image.new(pil_img.mode, (width, width), background_color)
            result.paste(pil_img, (0, (width - height) // 2))
            return result
        else:
            result = Image.new(pil_img.mode, (height, height), background_color)
            result.paste(pil_img, ((height - width) // 2, 0))
            return result
    image = expand2square(image, tuple(int(x*255) for x in processor.image_mean))
    image = processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
```
Dummy image fallback for text-only samples from `train.py:735-738`:
```python
elif self.data_args.is_multimodal:
    # image does not exist in the data, but the model is multimodal
    crop_size = self.data_args.image_processor.crop_size
    data_dict['image'] = torch.zeros(3, crop_size['height'], crop_size['width'])
```
Modality-grouped sampling from `llava_trainer.py:139-146`:
```python
if self.args.group_by_modality_length:
    lengths = self.train_dataset.modality_lengths
    return LengthGroupedSampler(
        self.args.train_batch_size,
        world_size=self.args.world_size * self.args.gradient_accumulation_steps,
        lengths=lengths,
        group_by_modality=True,
    )
```
V1.5 finetune script using pad from `scripts/v1_5/finetune.sh:15-16`:
```shell
--image_aspect_ratio pad \
--group_by_modality_length True \
```