Implementation:Vibrantlabsai Ragas InstanceRubrics
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Metrics |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
InstanceRubrics evaluates LLM responses using per-sample scoring rubrics that are provided as part of each evaluation instance, enabling fine-grained, instance-specific evaluation criteria.
Description
Unlike RubricsScore, which applies a single set of rubrics across all samples, InstanceRubrics expects each evaluation sample to carry its own rubrics dictionary. This allows different scoring criteria for different questions or interaction types within the same evaluation run.
The metric works by:
- Extracting the rubrics from the sample's data (the "rubrics" field), raising a ValueError if rubrics are not provided for a sample.
- Constructing a prompt input that includes the rubrics alongside the user input, response, reference, and optionally retrieved contexts. When retrieved contexts are present, they are concatenated and appended to the user input.
- Generating a score using an LLM judge via a PydanticPrompt that maps the input (with rubrics) to a ScoreFeedback output containing both a feedback string and an integer score.
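The first two steps above can be sketched without any Ragas dependency. This is a minimal illustration, not the library's code: `Sample` and `build_prompt_input` are hypothetical stand-ins, and the exact text used to join the contexts onto the user input is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Sample:
    """Hypothetical stand-in for a single-turn evaluation sample."""
    user_input: Optional[str] = None
    response: Optional[str] = None
    retrieved_contexts: Optional[List[str]] = None
    reference: Optional[str] = None
    rubrics: Optional[Dict[str, str]] = None


def build_prompt_input(sample: Sample) -> dict:
    """Assemble the judge's input from one sample (illustrative only)."""
    # Step 1: every sample must carry its own rubrics.
    if sample.rubrics is None:
        raise ValueError("Rubrics are not set for this sample.")

    # Step 2: when contexts are present, concatenate them and append
    # them to the user input. The separator text is an assumption.
    user_input = sample.user_input or ""
    if sample.retrieved_contexts:
        contexts = "\n".join(sample.retrieved_contexts)
        user_input = f"{user_input}\n\nContext:\n{contexts}"

    # Step 3 (not shown): this payload would be handed to the LLM judge.
    return {
        "user_input": user_input,
        "response": sample.response,
        "reference": sample.reference,
        "rubrics": sample.rubrics,
    }
```

Raising early on missing rubrics means a badly prepared sample fails before any LLM call is made.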
The metric supports both single-turn and multi-turn evaluation:
- Single-turn: Uses SingleTurnInputWithRubric, which extends SingleTurnInputWithoutRubric (the input model from the domain-specific rubrics module) by adding a required rubrics field.
- Multi-turn: Uses MultiTurnInputWithRubric, which extends MultiTurnInputWithoutRubric in the same way. The full conversation is formatted using sample.pretty_repr() and passed along with the reference and rubrics.
The key distinction from RubricsScore is that rubrics are passed as part of the prompt input data rather than being embedded in the prompt instruction. This means the LLM sees different rubrics for each sample.
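To make the distinction concrete, the sketch below renders two prompts that share one fixed instruction but carry different rubrics in their input data. The `render_prompt` helper and its JSON layout are hypothetical; the way PydanticPrompt actually serializes its input may differ.

```python
import json
from typing import Dict

# The instruction is fixed; it never mentions specific criteria.
INSTRUCTION = (
    "Your task is to assign an appropriate score and provide feedback to the "
    "inputs based solely on the scoring criteria passed in the input."
)


def render_prompt(user_input: str, response: str, rubrics: Dict[str, str]) -> str:
    """Hypothetical rendering: fixed instruction, then the input as JSON."""
    payload = {"user_input": user_input, "response": response, "rubrics": rubrics}
    return f"{INSTRUCTION}\n\nInput:\n{json.dumps(payload, indent=2)}"


# Two samples from the same run, each with its own criteria
# (rubrics abbreviated to two levels for brevity).
factual = render_prompt(
    "What is the boiling point of water at sea level?",
    "100 degrees Celsius.",
    {
        "score1_description": "Incorrect value or unit.",
        "score5_description": "Correct value with the correct unit.",
    },
)
creative = render_prompt(
    "Write a one-line slogan for a bakery.",
    "Fresh bread, warm hearts.",
    {
        "score1_description": "Generic or off-topic.",
        "score5_description": "Memorable, on-topic, and concise.",
    },
)
```

Both strings begin with the same instruction; only the rubrics inside the input payload change from sample to sample.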
Usage
Use this metric when evaluation criteria vary across samples. For example, in a dataset where some questions require factual precision while others require creative writing quality, each sample can specify its own rubric. It is also useful when subject-matter experts define per-question grading criteria.
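Because the metric raises a ValueError for any sample without rubrics, it can help to check a mixed dataset before starting a run. The following is a sketch under stated assumptions: samples are represented as plain dictionaries, `find_samples_missing_rubrics` is a hypothetical helper rather than part of Ragas, and the rubrics are abbreviated to two levels for brevity.

```python
from typing import Dict, List, Optional


def find_samples_missing_rubrics(samples: List[dict]) -> List[int]:
    """Return the indices of samples with no usable 'rubrics' mapping."""
    missing = []
    for index, sample in enumerate(samples):
        rubrics: Optional[Dict[str, str]] = sample.get("rubrics")
        if not rubrics:  # None or an empty dict
            missing.append(index)
    return missing


dataset = [
    {   # factual question: the rubric rewards precision
        "user_input": "When did the Berlin Wall fall?",
        "response": "It fell in 1989.",
        "rubrics": {
            "score1_description": "The date is wrong or missing.",
            "score5_description": "The date is correct and stated precisely.",
        },
    },
    {   # creative task: the rubric rewards writing quality
        "user_input": "Write a short line about autumn.",
        "response": "Leaves drift down in amber light.",
        "rubrics": {
            "score1_description": "Unrelated to autumn.",
            "score5_description": "Vivid, original imagery.",
        },
    },
    {   # this sample would fail at scoring time: no rubrics
        "user_input": "Summarize the report.",
        "response": "The report covers third-quarter sales.",
    },
]

print(find_samples_missing_rubrics(dataset))  # prints [2]
```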
Code Reference
Source Location
- Repository: Vibrantlabsai_Ragas
- File: src/ragas/metrics/_instance_specific_rubrics.py
Signature
class InstanceRubrics(MetricWithLLM, SingleTurnMetric, MultiTurnMetric):
    def __init__(
        self,
        name: str = "instance_rubrics",
        llm: t.Optional[BaseRagasLLM] = None,
        required_columns: t.Optional[t.Dict[MetricType, t.Set[str]]] = None,
        output_type: t.Optional[MetricOutputType] = MetricOutputType.DISCRETE,
        single_turn_prompt: t.Optional[PydanticPrompt] = None,
        multi_turn_prompt: t.Optional[PydanticPrompt] = None,
        max_retries: int = 1,
    ):
Import
from ragas.metrics import InstanceRubrics
I/O Contract
Inputs (Single-Turn)
| Name | Type | Required | Description |
|---|---|---|---|
| rubrics | Dict[str, str] | Yes | The per-instance scoring rubric mapping score keys to descriptions |
| user_input | str | No | The user's question or query |
| response | str | No | The LLM-generated response to evaluate |
| retrieved_contexts | List[str] | No | The retrieved contexts; when present, concatenated and appended to user_input |
| reference | str | No | The ground truth reference answer |
| reference_contexts | List[str] | No | The reference contexts for evaluation |
Inputs (Multi-Turn)
| Name | Type | Required | Description |
|---|---|---|---|
| rubrics | Dict[str, str] | Yes | The per-instance scoring rubric |
| user_input | str | No | The full multi-turn interaction (formatted via pretty_repr) |
| reference | str | Yes | The reference answer for evaluation (asserted not None) |
Outputs
| Name | Type | Description |
|---|---|---|
| score | int | A discrete integer score based on the instance-specific rubric criteria |
Key Components
Input Models
| Class | Parent | Description |
|---|---|---|
| SingleTurnInputWithRubric | SingleTurnInputWithoutRubric | Extends the domain-specific rubrics input model by adding a required rubrics dictionary field |
| MultiTurnInputWithRubric | MultiTurnInputWithoutRubric | Extends the multi-turn input model by adding a required rubrics dictionary field |
Prompt Classes
| Class | Description |
|---|---|
| SingleTurnPrompt | PydanticPrompt mapping SingleTurnInputWithRubric to ScoreFeedback; instruction directs the LLM to score based on the rubric passed in the input |
| MultiTurnPrompt | PydanticPrompt mapping MultiTurnInputWithRubric to ScoreFeedback; uses the same instruction pattern |
Both prompt classes use the instruction: "Your task is to assign an appropriate score and provide feedback to the inputs based solely on the scoring criteria passed in the input." This distinguishes them from the domain-specific rubrics prompts where criteria are embedded in the instruction itself.
Reused Components
The module imports and extends models from _domain_specific_rubrics:
- SingleTurnInputWithoutRubric: Base input model for single-turn evaluation
- MultiTurnInputWithoutRubric: Base input model for multi-turn evaluation
- ScoreFeedback: Output model containing feedback text and integer score
Usage Examples
Basic Usage with Per-Instance Rubrics
from ragas.metrics import InstanceRubrics
from ragas.dataset_schema import SingleTurnSample

metric = InstanceRubrics()
# metric.llm = your_llm_instance

sample = SingleTurnSample(
    user_input="Explain quantum entanglement in simple terms.",
    response="Quantum entanglement is when two particles become linked and instantly affect each other regardless of distance.",
    rubrics={
        "score1_description": "Explanation is incorrect or incomprehensible.",
        "score2_description": "Explanation has major inaccuracies.",
        "score3_description": "Explanation is roughly correct but unclear.",
        "score4_description": "Explanation is correct and mostly clear.",
        "score5_description": "Explanation is correct, clear, and uses effective analogies.",
    },
)
# score = await metric.single_turn_ascore(sample)
Multi-Turn Evaluation
from ragas.metrics import InstanceRubrics
from ragas.dataset_schema import MultiTurnSample

metric = InstanceRubrics()
# metric.llm = your_llm_instance

# multi_turn_sample = MultiTurnSample(
#     ...,
#     reference="expected outcome",
#     rubrics={
#         "score1_description": "Agent failed to complete the task.",
#         "score2_description": "Agent partially completed the task with errors.",
#         "score3_description": "Agent completed the task but inefficiently.",
#         "score4_description": "Agent completed the task well.",
#         "score5_description": "Agent completed the task perfectly and efficiently.",
#     }
# )
# score = await metric.multi_turn_ascore(multi_turn_sample)