Heuristic:Mlfoundations Open flamingo Hungarian MMC4 Image Text Matching
| Knowledge Sources | |
|---|---|
| Domains | Data_Loading, Computer_Vision, Vision_Language |
| Last Updated | 2026-02-08 03:30 GMT |
Overview
Optimal one-to-one image-sentence matching in MMC4 interleaved documents using the Hungarian algorithm, with similarity threshold filtering to reject low-quality matches.
Description
MMC4 documents contain multiple images and sentences, along with a pre-computed similarity matrix of shape images x sentences. To create training sequences, OpenFlamingo uses the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`) to find the one-to-one assignment between images and sentences that maximizes total similarity. After matching, pairs whose similarity falls below a configurable threshold are discarded (`--mmc4_textsim_threshold`; the default is 30, while the reference training script passes 0.24). Because the objective is the document-wide total, an image is not guaranteed its individually best-scoring sentence, but each image receives at most one sentence, each sentence at most one image, and weak matches are filtered out.
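The matching and filtering steps described above can be sketched on a toy matrix. This is a minimal illustration, not the OpenFlamingo code: the helper name `match_images_to_sentences` and the similarity values are made up for the example, and only `scipy.optimize.linear_sum_assignment` is the real call named in the text.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_images_to_sentences(sim_matrix, sim_threshold):
    """Return (image_idx, sentence_idx, score) for matched pairs above the threshold.

    Hypothetical helper for illustration only.
    """
    sim = np.asarray(sim_matrix, dtype=float)  # shape: images x sentences
    # linear_sum_assignment minimizes cost, so negate to maximize similarity.
    image_idx, sentence_idx = linear_sum_assignment(-sim)
    pairs = []
    for i, j in zip(image_idx, sentence_idx):
        if sim[i, j] < sim_threshold:
            continue  # drop weak matches
        pairs.append((int(i), int(j), float(sim[i, j])))
    return pairs


# Made-up similarities: 2 images x 3 sentences.
toy_sim = [
    [0.30, 0.28, 0.05],
    [0.31, 0.10, 0.07],
]
pairs = match_images_to_sentences(toy_sim, sim_threshold=0.24)
strict_pairs = match_images_to_sentences(toy_sim, sim_threshold=0.30)
print(pairs)         # [(0, 1, 0.28), (1, 0, 0.31)]
print(strict_pairs)  # [(1, 0, 0.31)]
```

In this toy case image 0 scores highest with sentence 0, yet it is assigned sentence 1, because giving sentence 0 to image 1 yields the larger total (0.59 versus 0.40). Raising the threshold to 0.30 then drops image 0's pair entirely.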
Usage
Apply this heuristic when preprocessing MMC4 interleaved image-text data for training. It is automatically applied in the `preprocess_interleaved` function during data loading. The threshold should be tuned based on the similarity matrix scale (varies by CLIP model used for pre-computation).
The Insight (Rule of Thumb)
- Action: Use `linear_sum_assignment(-sim_matrix)` for optimal image-sentence pairing. Filter pairs with `sim_score < sim_threshold`.
- Value: `--mmc4_textsim_threshold 0.24` (reference training script). Related flags: `--mmc4_min_num_images 1`, `--mmc4_max_num_images 6`. Single-image samples are randomly rejected with 50% probability.
- Trade-off: Higher thresholds mean better quality matches but fewer training samples. Lower thresholds include more data but with noisier correspondences.
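- Additional filters: Small images are discarded (nominally under 10KB; the check is `len(rawbytes) // 1000 <= MIN_KB`, so an image whose size in whole thousands of bytes equals `MIN_KB` is dropped too). Samples with no valid images after filtering are skipped. Samples whose only `<image>` token falls at the end of the sequence are rejected, since all labels would be -100.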
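To make the threshold trade-off concrete, the sketch below counts how many matched pairs survive at several thresholds. The helper `count_kept_pairs` and the similarity values are hypothetical; the matrix is assumed to be on a 0-1 scale, which is why a threshold of 30 keeps nothing here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def count_kept_pairs(sim_matrix, thresholds):
    """Map each threshold to the number of matched pairs that survive it."""
    sim = np.asarray(sim_matrix, dtype=float)
    rows, cols = linear_sum_assignment(-sim)  # maximize total similarity
    matched_scores = sim[rows, cols]
    return {t: int((matched_scores >= t).sum()) for t in thresholds}


# Hypothetical similarities on a 0-1 scale (3 images x 4 sentences).
toy_sim = [
    [0.35, 0.10, 0.05, 0.02],
    [0.12, 0.26, 0.08, 0.03],
    [0.04, 0.06, 0.15, 0.11],
]
kept = count_kept_pairs(toy_sim, thresholds=[0.10, 0.24, 30])
print(kept)  # {0.1: 3, 0.24: 2, 30: 0}
```

Running a sweep like this on a sample of real documents is one way to check that the chosen threshold matches the scale of the pre-computed similarities before training.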
Reasoning
MMC4 documents are web pages with multiple images and text paragraphs. Not every image is relevant to every sentence, so placing images at arbitrary positions would produce a noisy training signal. The Hungarian algorithm provides the globally optimal one-to-one assignment, maximizing total image-text similarity. The threshold filter then removes matches that are too weak, so the model trains only on pairs that clear a minimum similarity bar.
The 50% random rejection of single-image samples (data.py:248-251) prevents the training data from being dominated by simple single-image cases, encouraging the model to learn from multi-image interleaved examples.
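The effect of the rejection rule on the data mix can be sketched with a small simulation. Everything here is illustrative: `keep_sample` is a hypothetical stand-in for the checks in `preprocess_interleaved` (which raise `ValueError` rather than returning a boolean), and the 60/40 document mix is invented.

```python
import random


def keep_sample(num_images, rng, min_num_images=1):
    """Return True if a sample survives the image-count checks (illustrative)."""
    if num_images < min_num_images:
        return False
    # Mirrors the quoted condition: a single-image sample is rejected
    # when the random draw is <= 0.5.
    if num_images == 1 and rng.random() <= 0.5:
        return False
    return True


rng = random.Random(0)  # seeded only so the sketch is repeatable
# Hypothetical mix: 6,000 single-image and 4,000 three-image documents.
counts = [1] * 6000 + [3] * 4000
kept = [n for n in counts if keep_sample(n, rng)]

single_share_before = counts.count(1) / len(counts)
single_share_after = kept.count(1) / len(kept)
print(f"single-image share: {single_share_before:.2f} -> {single_share_after:.2f}")
```

With this mix the single-image share falls from 0.60 to roughly 3000 / 7000, or about 0.43, while every multi-image document is retained.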
Code Evidence
Hungarian matching from `open_flamingo/train/data.py:179-195`:
sim_matrix = np.array(sim_matrix)  # of shape images x sentences
sim_matrix = sim_matrix[valid_image_indices]
# negate the similarities to turn them into costs
cost_matrix = -sim_matrix
# find one to one assignments
image_indices, sentence_indices = linear_sum_assignment(cost_matrix)
images, sentence_ixs = [], []
for i, sim_ix in zip(image_indices, sentence_indices):
    sim_score = sim_matrix[i][sim_ix]
    if sim_score < sim_threshold:
        continue
    images.append(valid_images[i])
    sentence_ixs.append(sim_ix)
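One consequence of the excerpt above is worth spelling out: `linear_sum_assignment` accepts rectangular matrices, so when a document has more images than sentences, some images receive no sentence at all, before any threshold is applied. A minimal sketch with made-up values:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Made-up similarities: 3 images x 2 sentences.
sim = np.array([
    [0.30, 0.10],
    [0.29, 0.27],
    [0.05, 0.26],
])

# Negate to turn similarity maximization into cost minimization.
image_indices, sentence_indices = linear_sum_assignment(-sim)
matched = [(int(i), int(j)) for i, j in zip(image_indices, sentence_indices)]
unmatched_images = sorted(set(range(sim.shape[0])) - set(int(i) for i in image_indices))

print(matched)           # [(0, 0), (1, 1)]
print(unmatched_images)  # [2]
```

Image 2 is left out even though its score with sentence 1 (0.26) would pass a 0.24 threshold, because pairing images 0 and 1 gives the higher total.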
Minimum image size filter from `open_flamingo/train/data.py:168-170`:
# filter to images >= 10KB
if len(rawbytes) // 1000 <= MIN_KB:
    continue
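The boundary behaviour of this check is easy to misread, so here is a small sketch of which byte sizes pass. The value `MIN_KB = 10` is an assumption taken from the "10KB" in the code comment, and `passes_size_filter` is a hypothetical helper that inverts the quoted condition.

```python
MIN_KB = 10  # assumption: matches the "10KB" mentioned in the source comment


def passes_size_filter(num_bytes, min_kb=MIN_KB):
    """Return True if an image of num_bytes would be kept by the quoted check."""
    # The quoted code skips the image when len(rawbytes) // 1000 <= MIN_KB.
    return num_bytes // 1000 > min_kb


sizes = [9_999, 10_000, 10_999, 11_000]
results = {n: passes_size_filter(n) for n in sizes}
print(results)  # {9999: False, 10000: False, 10999: False, 11000: True}
```

Under that assumption, the integer division and the `<=` comparison together mean an image must be at least 11,000 bytes to be kept, slightly stricter than the comment suggests.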
Single-image random rejection from `open_flamingo/train/data.py:248-251`:
elif (
    num_images == 1 and random.random() <= 0.5
):  # 50% chance of keeping single image samples
    raise ValueError("Only one image in sample")
Edge case: single image at end from `open_flamingo/train/data.py:254-263`:
# avoid the situation where there's one <image> token and it's at the end
if (
    num_images == 1
    and text_tensor["input_ids"][:, -1]
    == tokenizer.additional_special_tokens_ids[...]
):
    raise ValueError(
        "Only one image at the end of sample, so labels will all be -100"
    )