Implementation:Vibrantlabsai Ragas MultiModalRelevance
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Metrics |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
MultiModalRelevance is a metric that evaluates whether an AI response to a user query is relevant and consistent with both the visual (image) and textual context information provided.
Description
This metric uses an LLM with multi-modal capabilities to assess whether the response to a query aligns with the retrieved contexts, which may contain both images and text. Unlike MultiModalFaithfulness, this metric also considers the user_input (the original question) when making its relevance determination, checking that the answer is actually responsive to what was asked.
The algorithm works as follows:
- The user input, response text, and retrieved contexts are packaged into a RelevanceInput Pydantic model.
- The input is sent to the configured LLM via a MultiModalRelevancePrompt that includes few-shot examples: one demonstrating a relevant answer (about Margherita pizza) and one showing an irrelevant answer (incorrect Oscar winner).
- The LLM evaluates whether the response is "in line with the images and textual context information" and returns a RelevanceOutput with a boolean relevance field.
- The boolean is cast to a float: 1.0 for relevant, 0.0 for irrelevant. If the LLM returns no response, the score is NaN.
The prompt instructs the model: "Your task is to evaluate if the response for the query is in line with the images and textual context information provided. You have two options to answer. Either True / False."
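The steps above can be sketched roughly as follows. This is a minimal illustration, not the Ragas source: the two models are written as standard-library dataclasses standing in for the Pydantic models, and `Judge` and `score_relevance` are hypothetical names introduced only for this example. The judge callable takes the place of the multi-modal LLM call.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RelevanceInput:
    """Stand-in for the Pydantic input model described above."""
    user_input: str
    response: str
    retrieved_contexts: List[str]


@dataclass
class RelevanceOutput:
    """Stand-in for the Pydantic output model: a single boolean verdict."""
    relevance: bool


# Hypothetical judge type: any callable that maps the input to a verdict,
# or returns None when the LLM produces no response.
Judge = Callable[[RelevanceInput], Optional[RelevanceOutput]]


def score_relevance(judge: Judge, prompt_input: RelevanceInput) -> float:
    """Cast the judge's boolean verdict to a float, or NaN if there is none."""
    output = judge(prompt_input)
    if output is None:
        return math.nan
    return float(output.relevance)
```

Passing a stub judge that always returns `RelevanceOutput(True)` yields 1.0, one that returns `RelevanceOutput(False)` yields 0.0, and one that returns `None` yields NaN, matching the three outcomes listed above.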
Usage
Use this metric when evaluating multi-modal RAG systems where both the relevance to the user question and the alignment with multi-modal context matter. It is suitable for visual question answering, document understanding, and any application that retrieves both images and text to generate answers. A pre-instantiated convenience instance is available as multimodal_relevance.
Code Reference
Source Location
- Repository: Vibrantlabsai_Ragas
- File: src/ragas/metrics/_multi_modal_relevance.py
Signature
@dataclass
class MultiModalRelevance(MetricWithLLM, SingleTurnMetric):
    name: str = "relevance_rate"
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {
            MetricType.SINGLE_TURN: {
                "user_input",
                "response",
                "retrieved_contexts",
            }
        }
    )
    output_type: t.Optional[MetricOutputType] = MetricOutputType.CONTINUOUS
    relevance_prompt: ImageTextPrompt = MultiModalRelevancePrompt()
Import
from ragas.metrics import MultiModalRelevance
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| user_input | str | Yes | The original user query or question |
| response | str | Yes | The AI-generated response to evaluate for relevance |
| retrieved_contexts | list[str] | Yes | The contexts retrieved from the knowledge base, which may contain both text and images (image entries are handled by the ImageTextPrompt) |
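Because all three inputs are required, it can help to check a sample before scoring it. The sketch below assumes the sample is available as a plain dictionary; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names for illustration and are not part of the Ragas API.

```python
from typing import Any, Dict, Set

# The three required single-turn columns from the table above.
REQUIRED_COLUMNS: Set[str] = {"user_input", "response", "retrieved_contexts"}


def missing_columns(sample: Dict[str, Any]) -> Set[str]:
    """Return the required columns that are absent or None in the sample."""
    return {name for name in REQUIRED_COLUMNS if sample.get(name) is None}
```

An empty result means the sample carries every field the metric needs; a non-empty result names the fields to fill in before evaluation.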
Outputs
| Name | Type | Description |
|---|---|---|
| score | float | 1.0 if the response is relevant to the query and contexts, 0.0 if not, or NaN if the LLM fails to respond |
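Since each score is 1.0, 0.0, or NaN, averaging over a dataset needs a decision about the NaN entries. One possible approach, shown here as an assumption rather than as Ragas's own aggregation, is to skip NaN scores and report the fraction of relevant responses among the rest. The function name `mean_relevance` is hypothetical.

```python
import math
from typing import Iterable


def mean_relevance(scores: Iterable[float]) -> float:
    """Average the per-sample scores, ignoring NaN entries.

    Returns NaN when no valid score remains.
    """
    valid = [s for s in scores if not math.isnan(s)]
    if not valid:
        return math.nan
    return sum(valid) / len(valid)
```

Skipping NaN keeps failed LLM calls from counting as irrelevant answers, but it also hides how many samples failed, so reporting the number of NaN scores alongside the mean may be worthwhile.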
Usage Examples
Basic Usage
from ragas.metrics import MultiModalRelevance
from ragas.dataset_schema import SingleTurnSample
metric = MultiModalRelevance()
# Set up the LLM (must support multi-modal inputs)
# metric.llm = your_multimodal_llm
sample = SingleTurnSample(
    user_input="What is the primary ingredient in a traditional Margherita pizza?",
    response="The primary ingredients in a Margherita pizza are tomatoes, mozzarella cheese, and fresh basil.",
    retrieved_contexts=[
        "A traditional Margherita pizza consists of a thin crust.",
        "The main toppings include tomatoes, mozzarella cheese, fresh basil, salt, and olive oil.",
    ],
)
# score = await metric.single_turn_ascore(sample)
# score will be 1.0 (relevant) or 0.0 (irrelevant)
Using the Pre-instantiated Instance
from ragas.metrics._multi_modal_relevance import multimodal_relevance
# multimodal_relevance is a ready-to-use MultiModalRelevance instance
# multimodal_relevance.llm = your_multimodal_llm