Implementation:Vibrantlabsai Ragas NoiseSensitivity
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Metrics |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
NoiseSensitivity is a metric that measures how susceptible an LLM system is to noise in retrieved contexts, detecting whether incorrect answer statements originate from relevant or irrelevant retrieved passages.
Description
This metric quantifies the degree to which noisy or misleading information in retrieved contexts causes the LLM to produce incorrect responses. It operates in two modes: relevant (default) and irrelevant, each measuring a different aspect of noise sensitivity.
The algorithm works through the following steps:
- Statement Decomposition: Both the reference answer and the generated response are decomposed into individual statements using a StatementGeneratorPrompt (reused from the Faithfulness metric).
- Faithfulness Evaluation: For each retrieved context, the metric uses an NLIStatementPrompt to determine which statements from both the reference and the response are supported by that context, producing binary verdict arrays.
- Cross-reference Matrix Construction: Three boolean matrices are built:
  - retrieved2ground_truth: which reference statements are supported by each retrieved context
  - retrieved2answer: which response statements are supported by each retrieved context
  - ground_truth2answer: which response statements are supported by the reference answer
- Score Computation: Incorrect statements (those not supported by the reference) are identified. Then, depending on the mode:
  - Relevant mode: Computes the proportion of incorrect statements that are faithful to relevant retrieved contexts (contexts that support at least one ground truth statement).
  - Irrelevant mode: Computes the proportion of incorrect statements that are faithful to irrelevant retrieved contexts (contexts that do not support any ground truth statement), excluding those also explained by relevant contexts.
A lower score indicates better performance, meaning the system is less sensitive to noise.
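The score computation step can be illustrated with a minimal pure-Python sketch. It is not the library's code: the function name `noise_sensitivity_score`, the list-of-lists layout (one row per retrieved context), and the empty-input fallback are hypothetical. Dividing by the total number of response statements is also an assumption of this sketch; the verdict matrices are taken as given rather than produced by an LLM.

```python
from typing import List, Literal


def noise_sensitivity_score(
    retrieved2ground_truth: List[List[bool]],  # [context][reference statement]
    retrieved2answer: List[List[bool]],        # [context][response statement]
    ground_truth2answer: List[bool],           # [response statement]
    mode: Literal["relevant", "irrelevant"] = "relevant",
) -> float:
    """Illustrative scoring over precomputed boolean verdicts (hypothetical helper)."""
    n_statements = len(ground_truth2answer)
    if n_statements == 0:
        return 0.0  # assumption: nothing to score means no noise sensitivity

    # A context is relevant if it supports at least one reference statement.
    relevant = [any(row) for row in retrieved2ground_truth]

    flagged = 0
    for j in range(n_statements):
        # A response statement is incorrect if the reference does not support it.
        incorrect = not ground_truth2answer[j]
        from_relevant = any(
            rel and row[j] for rel, row in zip(relevant, retrieved2answer)
        )
        from_irrelevant = any(
            not rel and row[j] for rel, row in zip(relevant, retrieved2answer)
        )
        # Keep the two attributions exclusive: relevant contexts take priority.
        from_irrelevant = from_irrelevant and not from_relevant

        attributed = from_relevant if mode == "relevant" else from_irrelevant
        if incorrect and attributed:
            flagged += 1

    # Assumption: normalize by all response statements.
    return flagged / n_statements


# Toy verdicts: context 0 is relevant, context 1 is irrelevant.
r2gt = [[True], [False]]
gt2a = [True, False]  # the second response statement is incorrect

# The incorrect statement is supported only by the irrelevant context.
r2a_irrelevant = [[True, False], [False, True]]
print(noise_sensitivity_score(r2gt, r2a_irrelevant, gt2a, "relevant"))
print(noise_sensitivity_score(r2gt, r2a_irrelevant, gt2a, "irrelevant"))
```

In this toy case the single incorrect statement is charged to the irrelevant mode only, since no relevant context supports it.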
Usage
Use this metric when you want to measure how robust a RAG system is against noisy retrieval. The "relevant" mode identifies cases where relevant contexts mislead the model into producing incorrect answers, while the "irrelevant" mode identifies cases where irrelevant contexts introduce errors. This requires a reference answer for comparison.
Code Reference
Source Location
- Repository: Vibrantlabsai_Ragas
- File: src/ragas/metrics/_noise_sensitivity.py
Signature
@dataclass
class NoiseSensitivity(MetricWithLLM, SingleTurnMetric):
    name: str = "noise_sensitivity"
    mode: t.Literal["relevant", "irrelevant"] = "relevant"
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {
            MetricType.SINGLE_TURN: {
                "user_input",
                "response",
                "reference",
                "retrieved_contexts",
            }
        }
    )
    output_type: t.Optional[MetricOutputType] = MetricOutputType.CONTINUOUS
    nli_statements_prompt: PydanticPrompt = field(default_factory=NLIStatementPrompt)
    statement_generator_prompt: PydanticPrompt = field(
        default_factory=StatementGeneratorPrompt
    )
    max_retries: int = 1
Import
from ragas.metrics import NoiseSensitivity
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| user_input | str | Yes | The original user query or question |
| response | str | Yes | The AI-generated response to evaluate |
| reference | str | Yes | The ground truth reference answer |
| retrieved_contexts | list[str] | Yes | The list of retrieved contexts (may contain both relevant and irrelevant passages) |
Configuration
| Name | Type | Default | Description |
|---|---|---|---|
| mode | Literal["relevant", "irrelevant"] | "relevant" | Whether to measure noise sensitivity from relevant or irrelevant contexts |
| max_retries | int | 1 | Maximum number of retries for LLM calls |
| nli_statements_prompt | PydanticPrompt | NLIStatementPrompt() | The prompt used for natural language inference evaluation |
| statement_generator_prompt | PydanticPrompt | StatementGeneratorPrompt() | The prompt used for decomposing text into statements |
Outputs
| Name | Type | Description |
|---|---|---|
| score | float | A value between 0.0 and 1.0 representing the proportion of incorrect statements attributable to noise; lower is better |
Usage Examples
Basic Usage (Relevant Mode)
from ragas.metrics import NoiseSensitivity
from ragas.dataset_schema import SingleTurnSample

metric = NoiseSensitivity(mode="relevant")
# metric.llm = your_llm

sample = SingleTurnSample(
    user_input="What is the capital of France?",
    response="The capital of France is Paris, and it was founded in 1000 BC.",
    reference="The capital of France is Paris.",
    retrieved_contexts=[
        "Paris is the capital and largest city of France.",
        "France is known for its wine and cheese production.",
    ],
)

# score = await metric.single_turn_ascore(sample)
Irrelevant Mode
from ragas.metrics import NoiseSensitivity
# Measure sensitivity to irrelevant context noise
metric = NoiseSensitivity(mode="irrelevant")
# metric.llm = your_llm
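Because `single_turn_ascore` is awaited in the examples above, both modes can be scored for the same sample from synchronous code. The following is a hedged sketch: `score_both_modes` and `_StubMetric` are hypothetical names introduced only for illustration, and the stub stands in for a configured metric so the snippet runs without an LLM.

```python
import asyncio
from typing import Any, Callable, Dict


async def score_both_modes(
    make_metric: Callable[[str], Any], sample: Any
) -> Dict[str, float]:
    """Score one sample in both modes concurrently (hypothetical helper)."""
    modes = ("relevant", "irrelevant")
    metrics = [make_metric(mode) for mode in modes]
    scores = await asyncio.gather(
        *(metric.single_turn_ascore(sample) for metric in metrics)
    )
    return dict(zip(modes, scores))


class _StubMetric:
    """Hypothetical stand-in for a configured metric; returns fixed scores."""

    def __init__(self, mode: str) -> None:
        self.mode = mode

    async def single_turn_ascore(self, sample: Any) -> float:
        return 0.5 if self.mode == "relevant" else 0.0


# With the real metric, the factory might look like this (your_llm is assumed):
# def make_metric(mode):
#     metric = NoiseSensitivity(mode=mode)
#     metric.llm = your_llm
#     return metric

demo_scores = asyncio.run(score_both_modes(_StubMetric, sample=None))
print(demo_scores)
```

Comparing the two numbers side by side shows whether errors trace mainly to relevant passages or to irrelevant ones.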