Principle: mlfoundations/open_flamingo RICES (Retrieval In-Context Example Selection)
Overview
Retrieval-based strategy for selecting in-context demonstration examples that are visually similar to the query image using CLIP embedding similarity.
Description
RICES (Retrieval In-Context Example Selection) improves few-shot evaluation by selecting demonstration examples that are visually similar to the test query, rather than random examples. It pre-computes CLIP image features for all training examples, then at inference time computes the CLIP feature for the query image and retrieves the top-k most similar training examples via cosine similarity. This provides more relevant demonstrations that help the model understand the task better through visual analogy.
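The retrieval step can be sketched in a few lines of NumPy, assuming CLIP image features have already been precomputed; the function name, array shapes, and toy data below are illustrative, not OpenFlamingo's actual API:

```python
import numpy as np

def rices_select(train_features: np.ndarray, query_feature: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k training examples most similar to the query.

    train_features: (N, D) precomputed CLIP image features for the training set.
    query_feature:  (D,) CLIP image feature of the test query.
    """
    # Normalize rows so dot products equal cosine similarities.
    train = train_features / np.linalg.norm(train_features, axis=1, keepdims=True)
    query = query_feature / np.linalg.norm(query_feature)
    sims = train @ query              # (N,) cosine similarity to the query
    return np.argsort(-sims)[:k]     # indices of the k most similar examples

# Toy demo: four 2-D "features"; the query matches row 2 exactly.
feats = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [-1.0, 0.0]])
q = np.array([0.6, 0.8])
print(rices_select(feats, q, k=2))  # → [2 1], most similar first
```

In practice the (N, D) feature matrix is computed once over the whole training set and cached, so each query costs only one CLIP forward pass plus a matrix-vector product.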
Usage
Use when evaluating a vision-language model in few-shot settings and higher-quality in-context examples than random selection are desired.
Theoretical Basis
In-context learning performance depends on the relevance of demonstration examples. CLIP's joint vision-language embedding space enables measuring visual similarity between images. By selecting demonstrations with high cosine similarity to the query in CLIP space, the model receives examples that share visual features with the query, improving prediction quality. The retrieved examples are sorted so that the most similar one appears last in the prompt (closest to the query), following the recency bias of language models.
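The prompt ordering described above amounts to sorting the retrieved demonstrations by ascending similarity, so the most similar example sits immediately before the query. A minimal sketch (the function name and toy scores are illustrative):

```python
import numpy as np

def order_for_prompt(topk_indices, sims):
    """Order retrieved demos so the most similar appears last, i.e. closest
    to the query in the assembled prompt (ascending similarity)."""
    return sorted(topk_indices, key=lambda i: sims[i])

# Toy similarity scores; indices 2 and 1 were retrieved as top-2.
sims = np.array([0.6, 0.8, 1.0, -0.6])
print(order_for_prompt([2, 1], sims))  # → [1, 2]: index 2 (most similar) last
```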