Implementation:Vibrantlabsai Ragas NoiseSensitivityV2
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Metrics |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
Measures how often an LLM system makes incorrect claims when it uses relevant or irrelevant retrieved documents. The metric relies on statement decomposition and natural language inference (NLI).
Description
The NoiseSensitivity metric (V2 collections implementation) evaluates how susceptible an LLM system is to producing incorrect answers when given noisy or irrelevant context. The metric operates in two modes: relevant (default) and irrelevant, each measuring a different aspect of noise sensitivity.
The evaluation follows a multi-step process:
Step 1 - Statement Decomposition: Both the reference (ground truth) and the response are decomposed into atomic statements using the StatementGeneratorPrompt. The LLM breaks each text into individual factual claims.
Step 2 - Faithfulness Evaluation: Each atomic statement from both the reference and the response is evaluated against each retrieved context using natural language inference (NLI) via the StatementFaithfulnessPrompt. This produces a verdict (1 for faithful, 0 for not faithful) for each statement-context pair. The response statements are also checked against the reference itself, which yields the correctness verdicts used in Step 3.
Step 3 - Matrix Construction: The results are organized into boolean matrices:
- retrieved2ground_truth: Which ground truth statements are supported by each retrieved context
- retrieved2answer: Which answer statements are supported by each retrieved context
- ground_truth2answer: Which answer statements are supported by the ground truth reference
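The matrix layout from Steps 2 and 3 can be sketched with plain Python lists. This is a minimal illustration, not the library's code: `toy_supports` is a hypothetical word-overlap stand-in for the LLM-based NLI verdict, `build_matrix` is an illustrative helper, and the orientation (statements as rows, premises as columns) is an assumption chosen for readability.

```python
from typing import Callable, List

Matrix = List[List[bool]]


def toy_supports(statement: str, premise: str) -> bool:
    """Hypothetical stand-in for the NLI verdict: True when every word of
    the statement also appears in the premise. A real judge is an LLM."""
    premise_words = set(premise.lower().split())
    return all(word in premise_words for word in statement.lower().split())


def build_matrix(
    statements: List[str],
    premises: List[str],
    supports: Callable[[str, str], bool] = toy_supports,
) -> Matrix:
    """Rows are statements, columns are premises (contexts or the reference)."""
    return [[supports(s, p) for p in premises] for s in statements]


# Toy inputs, already decomposed into atomic statements (Step 1).
contexts = [
    "paris is the capital of france and hosts the louvre",
    "lyon is known for its cuisine",
]
reference = "paris is the capital of france"
gt_statements = ["paris is the capital of france"]
answer_statements = [
    "paris is the capital of france",
    "paris hosts the louvre",
    "lyon is known for its cuisine",
]

retrieved2ground_truth = build_matrix(gt_statements, contexts)
retrieved2answer = build_matrix(answer_statements, contexts)
# Single premise (the reference), flattened to one verdict per answer statement.
ground_truth2answer = [
    row[0] for row in build_matrix(answer_statements, [reference])
]
```

With this toy judge, only the first context supports the ground truth statement, and only the first answer statement is supported by the reference.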
Step 4 - Score Computation: The final score depends on the mode:
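- Relevant mode: Measures incorrect claims that are supported by relevant retrieved contexts. A retrieved context is "relevant" if it supports at least one ground truth statement. An answer statement is relevant_faithful if at least one relevant context supports it, and incorrect if the ground truth reference does not support it. The score is the mean of (relevant_faithful AND incorrect) over all answer statements.
- Irrelevant mode: Measures incorrect claims that are supported only by irrelevant retrieved contexts, i.e. contexts that do not support any ground truth statement. An answer statement is irrelevant_faithful if at least one irrelevant context supports it. The score is the mean of (irrelevant_faithful AND NOT relevant_faithful AND incorrect) over all answer statements.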
A lower score is better for both modes, as a high score indicates the system is making more errors from noisy contexts.
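The score computation in Step 4 can be worked through on a small example. The sketch below is a plain-Python illustration under stated assumptions, not the library's implementation: the function name `noise_sensitivity` is hypothetical, matrices are nested lists with answer or ground truth statements as rows and retrieved contexts as columns, and all inputs are assumed non-empty.

```python
from typing import List

Matrix = List[List[bool]]  # rows: statements, columns: retrieved contexts


def noise_sensitivity(
    retrieved2ground_truth: Matrix,
    retrieved2answer: Matrix,
    ground_truth2answer: List[bool],
    mode: str = "relevant",
) -> float:
    """Hypothetical helper: fraction of answer statements flagged by the mode."""
    n_contexts = len(retrieved2answer[0])
    # A context is relevant if it supports at least one ground truth statement.
    relevant = [
        any(row[j] for row in retrieved2ground_truth) for j in range(n_contexts)
    ]
    flagged = []
    for row, correct in zip(retrieved2answer, ground_truth2answer):
        relevant_faithful = any(s and r for s, r in zip(row, relevant))
        irrelevant_faithful = any(s and not r for s, r in zip(row, relevant))
        if mode == "relevant":
            hit = relevant_faithful and not correct
        else:
            hit = irrelevant_faithful and not relevant_faithful and not correct
        flagged.append(hit)
    return sum(flagged) / len(flagged)


# One ground truth statement, three answer statements, two contexts.
r2gt = [[True, False]]                              # only context 0 is relevant
r2a = [[True, False], [True, False], [False, True]]
gt2a = [True, False, False]                         # only statement 0 is correct

relevant_score = noise_sensitivity(r2gt, r2a, gt2a, mode="relevant")
irrelevant_score = noise_sensitivity(r2gt, r2a, gt2a, mode="irrelevant")
print(relevant_score, irrelevant_score)
```

In this example each mode flags exactly one of the three answer statements: the second statement is incorrect but supported by the relevant context, and the third is incorrect and supported only by the irrelevant context.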
Usage
Use this metric to evaluate the robustness of a RAG system against noisy or irrelevant retrieved contexts. In relevant mode, it measures how often the system generates incorrect statements from relevant contexts (perhaps due to misinterpretation). In irrelevant mode, it measures how often the system is misled by irrelevant contexts into generating incorrect statements.
This is the V2 collections version which uses modern instructor LLMs with structured output for statement decomposition and NLI evaluation, replacing the legacy V1 implementation.
Code Reference
Source Location
- Repository: Vibrantlabsai_Ragas
- File: src/ragas/metrics/collections/noise_sensitivity/metric.py
Signature
class NoiseSensitivity(BaseMetric):
    def __init__(
        self,
        llm: "InstructorBaseRagasLLM",
        name: str = "noise_sensitivity",
        mode: Literal["relevant", "irrelevant"] = "relevant",
        **kwargs,
    ): ...

    async def ascore(
        self,
        user_input: str,
        response: str,
        reference: str,
        retrieved_contexts: List[str],
    ) -> MetricResult: ...
Import
from ragas.metrics.collections import NoiseSensitivity
I/O Contract
Constructor Parameters
| Name | Type | Required | Description |
|---|---|---|---|
| llm | InstructorBaseRagasLLM | Yes | Modern instructor-based LLM used for statement generation and NLI evaluation |
| name | str | No | Metric name (default: "noise_sensitivity") |
| mode | Literal["relevant", "irrelevant"] | No | Evaluation mode (default: "relevant"). "relevant" measures errors from relevant contexts; "irrelevant" measures errors from irrelevant contexts |
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| user_input | str | Yes | The original question posed by the user. Must be non-empty |
| response | str | Yes | The generated response to evaluate. Must be non-empty |
| reference | str | Yes | The ground truth reference answer. Must be non-empty |
| retrieved_contexts | List[str] | Yes | List of retrieved context strings used to generate the response. Must be non-empty |
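The non-empty requirements in the table can be checked before the metric is called, which avoids spending LLM calls on a malformed sample. The following is a minimal sketch, assuming a hypothetical helper `check_noise_sensitivity_inputs` that is not part of Ragas; how the library itself reports invalid inputs is not covered here.

```python
from typing import List


def check_noise_sensitivity_inputs(
    user_input: str,
    response: str,
    reference: str,
    retrieved_contexts: List[str],
) -> None:
    """Hypothetical pre-flight check (not part of Ragas): raise ValueError
    if any input that the table marks as required is empty."""
    text_fields = [
        ("user_input", user_input),
        ("response", response),
        ("reference", reference),
    ]
    for name, value in text_fields:
        if not value or not value.strip():
            raise ValueError(f"{name} must be a non-empty string")
    if not retrieved_contexts:
        raise ValueError("retrieved_contexts must be a non-empty list")
```

Calling the helper with an empty `retrieved_contexts` list raises `ValueError`; valid inputs return `None`.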
Outputs
| Name | Type | Description |
|---|---|---|
| score | MetricResult (float value) | Noise sensitivity score between 0.0 and 1.0. Lower is better. Indicates the proportion of answer statements that are incorrect and supported by relevant contexts (relevant mode) or only by irrelevant contexts (irrelevant mode) |
Usage Examples
Basic Usage (Relevant Mode)
from openai import AsyncOpenAI
from ragas.llms.base import llm_factory
from ragas.metrics.collections import NoiseSensitivity
# Setup dependencies
client = AsyncOpenAI()
llm = llm_factory("gpt-4o-mini", provider="openai", client=client)
# Create metric instance (default: relevant mode)
metric = NoiseSensitivity(llm=llm)
# Single evaluation (top-level await works in a notebook; in a script, call this from an async function)
result = await metric.ascore(
    user_input="What is LIC known for?",
    response="LIC is the largest insurance company in India, known for its wide range of policies.",
    reference="LIC is known for managing large-scale investments and providing insurance.",
    retrieved_contexts=[
        "LIC was established in 1956 by the Government of India.",
        "LIC offers a variety of insurance products including life, health, and pension plans.",
        "The stock market in India is regulated by SEBI.",
    ],
)
print(f"Noise Sensitivity (relevant): {result.value}")
Irrelevant Mode
from ragas.metrics.collections import NoiseSensitivity
# Measure sensitivity to irrelevant contexts (reuses `llm` from the previous example)
metric = NoiseSensitivity(llm=llm, mode="irrelevant")
result = await metric.ascore(
    user_input="What is LIC known for?",
    response="LIC is the largest insurance company in India, also involved in stock trading.",
    reference="LIC is known for managing large-scale investments and providing insurance.",
    retrieved_contexts=[
        "LIC was established in 1956 by the Government of India.",
        "LIC offers a variety of insurance products.",
        "The stock market in India is regulated by SEBI.",
    ],
)
print(f"Noise Sensitivity (irrelevant): {result.value}")