Implementation:Vibrantlabsai Ragas NoiseSensitivity

Knowledge Sources	Vibrantlabsai_Ragas
Domains	Evaluation, Metrics
Last Updated	2026-02-12 00:00 GMT

Overview

NoiseSensitivity is a metric that measures how susceptible an LLM system is to noise in retrieved contexts, detecting whether incorrect answer statements originate from relevant or irrelevant retrieved passages.

Description

This metric quantifies the degree to which noisy or misleading information in retrieved contexts causes the LLM to produce incorrect responses. It operates in two modes: relevant (default) and irrelevant, each measuring a different aspect of noise sensitivity.

The algorithm works through the following steps:

Statement Decomposition: Both the reference answer and the generated response are decomposed into individual statements using a StatementGeneratorPrompt (reused from the Faithfulness metric).
Faithfulness Evaluation: For each retrieved context, the metric uses an NLIStatementPrompt to determine which statements from both the reference and the response are supported by that context, producing binary verdict arrays.
Cross-reference Matrix Construction: Three boolean matrices are built:
- retrieved2ground_truth - which reference statements are supported by each retrieved context
- retrieved2answer - which response statements are supported by each retrieved context
- ground_truth2answer - which response statements are supported by the reference answer
Score Computation: Response statements that are not supported by the reference are marked incorrect. Then, depending on the mode:
- Relevant mode: Computes the fraction of response statements that are both incorrect and faithful to at least one relevant retrieved context (a context that supports at least one ground truth statement).
- Irrelevant mode: Computes the fraction of response statements that are both incorrect and faithful to at least one irrelevant retrieved context (a context that supports no ground truth statement), excluding statements that are also faithful to a relevant context.
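
The Faithfulness Evaluation and Cross-reference Matrix Construction steps can be sketched without an LLM by swapping the NLI prompt for a stand-in. The snippet below is a minimal illustration, not the library's code: `build_matrices` and the substring-based `toy_supports` are hypothetical names, and the orientation (one row per statement, one column per retrieved context) is an assumption made for readability.

```python
from typing import Callable, Dict, List


def toy_supports(statement: str, text: str) -> bool:
    """Hypothetical stand-in for the NLI verdict: plain substring match."""
    return statement.lower() in text.lower()


def build_matrices(
    reference_statements: List[str],
    response_statements: List[str],
    contexts: List[str],
    reference: str,
    supports: Callable[[str, str], bool] = toy_supports,
) -> Dict[str, list]:
    # Collect one verdict list per retrieved context.
    per_ctx_ref = [[supports(s, ctx) for s in reference_statements] for ctx in contexts]
    per_ctx_ans = [[supports(s, ctx) for s in response_statements] for ctx in contexts]
    return {
        # Transpose so that rows are statements and columns are contexts.
        "retrieved2ground_truth": [list(col) for col in zip(*per_ctx_ref)],
        "retrieved2answer": [list(col) for col in zip(*per_ctx_ans)],
        # One verdict per response statement, checked against the reference.
        "ground_truth2answer": [supports(s, reference) for s in response_statements],
    }


matrices = build_matrices(
    reference_statements=["Paris is the capital of France"],
    response_statements=[
        "Paris is the capital of France",
        "France is known for its wine",
    ],
    contexts=[
        "Paris is the capital of France and its largest city.",
        "France is known for its wine and cheese.",
    ],
    reference="Paris is the capital of France.",
)
```

In this toy case the first context counts as relevant because it supports the only reference statement, while the second context supports a response statement that the reference does not.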

A lower score indicates better performance, meaning the system is less sensitive to noise.
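
The score computation can be illustrated in plain Python. This is a sketch under stated assumptions rather than the Ragas implementation: the function name `noise_sensitivity_score` is hypothetical, the matrices are nested lists with one row per statement and one column per context, the result is averaged over all response statements, and returning 0.0 for an empty response is a choice made here.

```python
from typing import List, Literal


def noise_sensitivity_score(
    retrieved2ground_truth: List[List[bool]],  # [reference statement][context]
    retrieved2answer: List[List[bool]],        # [response statement][context]
    ground_truth2answer: List[bool],           # [response statement]
    mode: Literal["relevant", "irrelevant"] = "relevant",
) -> float:
    if not ground_truth2answer:
        return 0.0  # sketch-level choice for an empty response
    n_contexts = len(retrieved2answer[0])
    # A context is relevant if it supports at least one reference statement.
    relevant = [
        any(row[c] for row in retrieved2ground_truth) for c in range(n_contexts)
    ]
    flagged = 0
    for supported, correct in zip(retrieved2answer, ground_truth2answer):
        faithful_relevant = any(s and r for s, r in zip(supported, relevant))
        faithful_irrelevant = any(s and not r for s, r in zip(supported, relevant))
        if mode == "relevant":
            hit = faithful_relevant
        else:
            # Keep the two modes exclusive: drop statements that a
            # relevant context also explains.
            hit = faithful_irrelevant and not faithful_relevant
        if hit and not correct:
            flagged += 1
    return flagged / len(ground_truth2answer)


# Two contexts: the first is relevant, the second is irrelevant.
r2g = [[True, False]]
r2a = [
    [True, False],   # correct statement
    [True, False],   # incorrect, backed by the relevant context
    [False, True],   # incorrect, backed by the irrelevant context
    [True, True],    # incorrect, backed by both contexts
]
g2a = [True, False, False, False]

relevant_score = noise_sensitivity_score(r2g, r2a, g2a, mode="relevant")
irrelevant_score = noise_sensitivity_score(r2g, r2a, g2a, mode="irrelevant")
```

The fourth statement counts only toward the relevant score, which shows the exclusion rule for irrelevant mode.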

Usage

Use this metric to measure how robust a RAG system is against noisy retrieval. The "relevant" mode identifies cases where relevant contexts mislead the model into producing incorrect statements, while the "irrelevant" mode identifies cases where irrelevant contexts introduce errors. The metric requires a reference answer, because incorrect statements are identified by comparison against it.

Code Reference

Source Location

Repository: Vibrantlabsai_Ragas
File: src/ragas/metrics/_noise_sensitivity.py

Signature

@dataclass
class NoiseSensitivity(MetricWithLLM, SingleTurnMetric):
    name: str = "noise_sensitivity"
    mode: t.Literal["relevant", "irrelevant"] = "relevant"
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {
            MetricType.SINGLE_TURN: {
                "user_input",
                "response",
                "reference",
                "retrieved_contexts",
            }
        }
    )
    output_type: t.Optional[MetricOutputType] = MetricOutputType.CONTINUOUS
    nli_statements_prompt: PydanticPrompt = field(default_factory=NLIStatementPrompt)
    statement_generator_prompt: PydanticPrompt = field(
        default_factory=StatementGeneratorPrompt
    )
    max_retries: int = 1

Import

from ragas.metrics import NoiseSensitivity

I/O Contract

Inputs

Name	Type	Required	Description
user_input	str	Yes	The original user query or question
response	str	Yes	The AI-generated response to evaluate
reference	str	Yes	The ground truth reference answer
retrieved_contexts	list[str]	Yes	The list of retrieved contexts (may contain both relevant and irrelevant passages)
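
Since all four fields are required, it can help to check rows before scoring. The following is a minimal sketch, assuming samples arrive as plain dictionaries; `REQUIRED_FIELDS` mirrors the table above, and `missing_fields` is a hypothetical helper that is not part of Ragas. Treating an empty string or empty list as missing is a choice made for this sketch.

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = ("user_input", "response", "reference", "retrieved_contexts")


def missing_fields(row: Dict[str, Any]) -> List[str]:
    """Return the required field names that are absent or empty in `row`."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = row.get(name)
        # None, "" and [] are all treated as missing in this sketch.
        if value is None or (isinstance(value, (str, list)) and len(value) == 0):
            missing.append(name)
    return missing


row = {
    "user_input": "What is the capital of France?",
    "response": "The capital of France is Paris.",
    "retrieved_contexts": ["Paris is the capital and largest city of France."],
}
problems = missing_fields(row)
```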

Configuration

Name	Type	Default	Description
mode	Literal["relevant", "irrelevant"]	"relevant"	Whether to measure noise sensitivity from relevant or irrelevant contexts
max_retries	int	1	Maximum number of retries for LLM calls
nli_statements_prompt	PydanticPrompt	NLIStatementPrompt()	The prompt used for natural language inference evaluation
statement_generator_prompt	PydanticPrompt	StatementGeneratorPrompt()	The prompt used for decomposing text into statements

Outputs

Name	Type	Description
score	float	A value between 0.0 and 1.0 representing the fraction of response statements that are incorrect and attributable to noisy context; lower is better

Usage Examples

Basic Usage (Relevant Mode)

from ragas.metrics import NoiseSensitivity
from ragas.dataset_schema import SingleTurnSample

metric = NoiseSensitivity(mode="relevant")
# An evaluator LLM must be assigned before scoring:
# metric.llm = your_llm

sample = SingleTurnSample(
    user_input="What is the capital of France?",
    response="The capital of France is Paris, and it was founded in 1000 BC.",
    reference="The capital of France is Paris.",
    retrieved_contexts=[
        "Paris is the capital and largest city of France.",
        "France is known for its wine and cheese production.",
    ],
)

# Run inside an async context (a coroutine or a notebook cell):
# score = await metric.single_turn_ascore(sample)

Irrelevant Mode

from ragas.metrics import NoiseSensitivity

# Measure sensitivity to irrelevant context noise
metric = NoiseSensitivity(mode="irrelevant")
# metric.llm = your_llm

Related Pages

Environment:Vibrantlabsai_Ragas_Python_3_9_Core_Environment
