Heuristic:Mlfoundations Open flamingo Hungarian MMC4 Image Text Matching

Knowledge Sources	OpenFlamingo MMC4 Dataset
Domains	Data_Loading, Computer_Vision, Vision_Language
Last Updated	2026-02-08 03:30 GMT

Overview

Optimal one-to-one image-sentence matching in MMC4 interleaved documents using the Hungarian algorithm, with similarity threshold filtering to reject low-quality matches.

Description

MMC4 documents contain multiple images and sentences, along with a pre-computed similarity matrix (images x sentences). To create training sequences, OpenFlamingo uses the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`) to find the one-to-one assignment between images and sentences that maximizes total similarity. After matching, pairs with similarity below a configurable threshold (`--mmc4_textsim_threshold`; the default is 30, while the reference training script passes 0.24) are discarded. Each image is therefore paired with at most one sentence, each sentence with at most one image, and low-quality matches are filtered out. Because the assignment maximizes the total, an individual image is not guaranteed its own highest-scoring sentence.
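The matching-then-thresholding step described above can be sketched as follows. This is a minimal illustration, not the OpenFlamingo code itself: the function name `match_images_to_sentences` and the toy similarity values are assumptions made for the example, while `linear_sum_assignment` is the real SciPy call.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_images_to_sentences(sim_matrix, sim_threshold):
    """Hypothetical helper: return (image_ix, sentence_ix, score) for kept pairs."""
    sim = np.asarray(sim_matrix, dtype=float)  # shape: images x sentences
    # linear_sum_assignment minimizes cost, so negate similarities.
    image_indices, sentence_indices = linear_sum_assignment(-sim)
    pairs = []
    for i, j in zip(image_indices, sentence_indices):
        score = sim[i, j]
        if score < sim_threshold:
            continue  # drop matches below the threshold
        pairs.append((int(i), int(j), float(score)))
    return pairs


# Toy matrix (made-up values): 3 images, 4 sentences.
toy_sim = [
    [0.31, 0.10, 0.05, 0.12],
    [0.08, 0.27, 0.11, 0.09],
    [0.15, 0.12, 0.18, 0.14],
]
pairs = match_images_to_sentences(toy_sim, sim_threshold=0.24)
print(pairs)  # image 2 is matched to sentence 2 (0.18) but filtered out
```

With more sentences than images, every image receives a candidate sentence and the threshold alone decides which pairs survive.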

Usage

Apply this heuristic when preprocessing MMC4 interleaved image-text data for training. It is applied automatically in the `preprocess_interleaved` function during data loading. The threshold should be tuned to the scale of the similarity matrix, which varies with the CLIP model used for pre-computation.
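One quick way to check that a threshold fits the scale of a given similarity matrix is to measure how many entries could pass it at all. The helper below is a hypothetical sketch (`fraction_passing` is not part of OpenFlamingo), and the toy values assume a matrix on a cosine-like scale.

```python
import numpy as np


def fraction_passing(sim_matrix, sim_threshold):
    """Hypothetical check: share of matrix entries at or above the threshold."""
    sim = np.asarray(sim_matrix, dtype=float)
    return float(np.mean(sim >= sim_threshold))


# Made-up values on a cosine-like scale.
toy_sim = [
    [0.31, 0.10],
    [0.08, 0.27],
]
share_at_024 = fraction_passing(toy_sim, 0.24)  # 0.5
share_at_30 = fraction_passing(toy_sim, 30)     # 0.0: nothing can pass
```

A share of zero means every matched pair would be discarded, which suggests the threshold and the matrix are on different scales.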

The Insight (Rule of Thumb)

Action: Use `linear_sum_assignment(-sim_matrix)` for optimal image-sentence pairing. Filter pairs with `sim_score < sim_threshold`.
Value: `--mmc4_textsim_threshold 0.24` (reference training script). Related flags: `--mmc4_min_num_images 1`, `--mmc4_max_num_images 6`. Single-image samples are randomly rejected with 50% probability.
Trade-off: Higher thresholds mean better quality matches but fewer training samples. Lower thresholds include more data but with noisier correspondences.
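Additional filters: Small images are discarded (the check is `len(rawbytes) // 1000 <= MIN_KB`, so an image exactly at the `MIN_KB` boundary is dropped as well, although the code comment reads "images >= 10KB"). Samples with no valid images after filtering are skipped. Samples whose only `<image>` token falls at the end of the sequence are rejected, because all labels would be -100.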
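The boundary behavior of the size filter listed above is easy to check in isolation. The sketch below mirrors the quoted condition `len(rawbytes) // 1000 <= MIN_KB`; the value `MIN_KB = 10` is an assumption inferred from the "10KB" wording, and `keep_image` is a hypothetical name.

```python
MIN_KB = 10  # assumed value, inferred from the "10KB" code comment


def keep_image(rawbytes: bytes, min_kb: int = MIN_KB) -> bool:
    """Hypothetical helper: True if the image survives the size filter."""
    # The source skips an image when len(rawbytes) // 1000 <= MIN_KB,
    # so an image is kept only when the integer quotient exceeds MIN_KB.
    return len(rawbytes) // 1000 > min_kb


small = b"\x00" * 9_000      # 9_000 // 1000 == 9   -> dropped
boundary = b"\x00" * 10_999  # 10_999 // 1000 == 10 -> dropped
large = b"\x00" * 11_000     # 11_000 // 1000 == 11 -> kept

print(keep_image(small), keep_image(boundary), keep_image(large))
```

Under this assumption the effective cutoff is 11,000 bytes rather than 10,000, because of the integer division combined with `<=`.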

Reasoning

MMC4 documents are web pages with multiple images and text paragraphs. Not every image is relevant to every sentence, so placing images at arbitrary positions would create noisy training signals. The Hungarian algorithm provides the globally optimal one-to-one assignment, the one that maximizes total image-text similarity. The threshold filter then removes matches that are too weak, so the model trains mainly on pairs with a meaningful similarity score.
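To see why a one-to-one assignment matters, compare it with a naive per-image argmax on a small case. The matrix below is made up for illustration; only `linear_sum_assignment` is taken from the source.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Made-up similarities: both images score highest on sentence 0.
sim = np.array([
    [0.90, 0.80],
    [0.85, 0.10],
])

# Naive approach: each image independently takes its best sentence.
greedy_sentences = sim.argmax(axis=1).tolist()  # [0, 0]: both images collide

# Hungarian matching: one-to-one assignment with maximal total similarity.
rows, cols = linear_sum_assignment(-sim)
hungarian_sentences = cols.tolist()             # [1, 0]
hungarian_total = float(sim[rows, cols].sum())  # 0.80 + 0.85 = 1.65
```

The argmax approach attaches both images to sentence 0. The Hungarian assignment moves image 0 to its second-best sentence because the total (1.65) beats the alternative pairing (0.90 + 0.10 = 1.00).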

The 50% random rejection of single-image samples (data.py:248-251) prevents the training data from being dominated by simple single-image cases, encouraging the model to learn from multi-image interleaved examples.
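The effect of the rejection on the data mix can be estimated with simple arithmetic. The sketch below assumes that multi-image samples pass untouched and ignores every other filter; the 60% starting share is a made-up figure, and `single_image_share_after_rejection` is a hypothetical helper.

```python
def single_image_share_after_rejection(share_before, keep_prob=0.5):
    """Hypothetical estimate of the expected single-image share after rejection.

    Assumes share_before < 1 or keep_prob > 0, so some samples remain.
    """
    kept_single = share_before * keep_prob  # single-image samples that survive
    kept_multi = 1.0 - share_before         # multi-image samples, assumed untouched
    return kept_single / (kept_single + kept_multi)


# Made-up example: 60% of incoming samples contain exactly one image.
share_after = single_image_share_after_rejection(0.6)
print(round(share_after, 4))  # 0.3 / 0.7, about 0.4286
```

In this example the expected single-image share falls from 60% to roughly 43%, which shifts the mix toward multi-image documents.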

Code Evidence

Hungarian matching from `open_flamingo/train/data.py:179-195`:

sim_matrix = np.array(sim_matrix)  # of shape images x sentences
sim_matrix = sim_matrix[valid_image_indices]

# negate the similarities to turn them into costs
cost_matrix = -sim_matrix
# find one to one assignments
image_indices, sentence_indices = linear_sum_assignment(cost_matrix)

images, sentence_ixs = [], []
for i, sim_ix in zip(image_indices, sentence_indices):
    sim_score = sim_matrix[i][sim_ix]
    if sim_score < sim_threshold:
        continue
    images.append(valid_images[i])
    sentence_ixs.append(sim_ix)

Minimum image size filter from `open_flamingo/train/data.py:168-170`:

# filter to images >= 10KB
if len(rawbytes) // 1000 <= MIN_KB:
    continue

Single-image random rejection from `open_flamingo/train/data.py:248-251`:

elif (
    num_images == 1 and random.random() <= 0.5
):  # 50% chance of keeping single image samples
    raise ValueError("Only one image in sample")

Edge case: single image at end from `open_flamingo/train/data.py:254-263`:

# avoid the situation where there's one <image> token and it's at the end
if (
    num_images == 1
    and text_tensor["input_ids"][:, -1]
    == tokenizer.additional_special_tokens_ids[...]
):
    raise ValueError(
        "Only one image at the end of sample, so labels will all be -100"
    )
