Implementation:Vibrantlabsai Ragas NoiseSensitivityV2

From Leeroopedia
Knowledge Sources
Domains Evaluation, Metrics
Last Updated 2026-02-12 00:00 GMT

Overview

Measures how often an LLM system produces incorrect claims when it draws on relevant or irrelevant retrieved documents. The metric relies on statement decomposition and natural language inference (NLI).

Description

The NoiseSensitivity metric (V2 collections implementation) evaluates how susceptible an LLM system is to producing incorrect answers when given noisy or irrelevant context. The metric operates in two modes: relevant (default) and irrelevant, each measuring a different aspect of noise sensitivity.

The evaluation follows a multi-step process:

Step 1 - Statement Decomposition: Both the reference (ground truth) and the response are decomposed into atomic statements using the StatementGeneratorPrompt. The LLM breaks each text into individual factual claims.

Step 2 - Faithfulness Evaluation: Each atomic statement from both the reference and the response is evaluated against each retrieved context using natural language inference (NLI) via the StatementFaithfulnessPrompt. The response statements are also checked against the reference itself, which is what the ground_truth2answer matrix in Step 3 records. Each check produces a verdict (1 for faithful, 0 for not faithful) for the statement-premise pair.

Step 3 - Matrix Construction: The results are organized into boolean matrices:

  • retrieved2ground_truth: Which ground truth statements are supported by each retrieved context
  • retrieved2answer: Which answer statements are supported by each retrieved context
  • ground_truth2answer: Which answer statements are supported by the ground truth reference
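The matrix construction can be sketched in plain Python. This is an illustration only, not the library's code: `toy_verdict` is a hypothetical word-overlap stand-in for the LLM-based NLI verdict, `build_matrix` is a made-up helper name, the sample sentences are invented, and the orientation (rows are statements, columns are contexts) is an assumption that the page does not state.

```python
from typing import Callable, List, Set

# (statement, premise) -> 1 if the premise supports the statement, else 0
Verdict = Callable[[str, str], int]


def _words(text: str) -> Set[str]:
    """Lowercase the text, drop basic punctuation, and return its set of words."""
    return set(text.lower().replace(".", " ").replace(",", " ").split())


def toy_verdict(statement: str, premise: str) -> int:
    """Hypothetical stand-in for the NLI call: 1 if every word of the
    statement also occurs in the premise, else 0."""
    return int(_words(statement) <= _words(premise))


def build_matrix(
    statements: List[str], premises: List[str], verdict: Verdict
) -> List[List[bool]]:
    """One row per statement, one column per premise (assumed orientation)."""
    return [[bool(verdict(s, p)) for p in premises] for s in statements]


# Invented toy data, already decomposed into atomic statements.
ground_truth_statements = ["LIC provides insurance."]
answer_statements = ["LIC provides insurance.", "LIC trades stocks."]
retrieved_contexts = [
    "LIC provides insurance and pension plans.",
    "Brokers say LIC trades stocks daily.",
]
reference = "LIC provides insurance and manages investments."

retrieved2ground_truth = build_matrix(
    ground_truth_statements, retrieved_contexts, toy_verdict
)
retrieved2answer = build_matrix(answer_statements, retrieved_contexts, toy_verdict)
# The reference is a single premise, so keep only that one column.
ground_truth2answer = [
    row[0] for row in build_matrix(answer_statements, [reference], toy_verdict)
]

print(retrieved2ground_truth)
print(retrieved2answer)
print(ground_truth2answer)
```

In this toy case the first context supports the only ground truth statement, while the second context supports an answer statement that the reference does not back.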

Step 4 - Score Computation: The final score depends on the mode:

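  • Relevant mode: Measures incorrect claims that come from relevant retrieved contexts. A retrieved context is considered "relevant" if it supports at least one ground truth statement. For each answer statement, relevant_faithful is true when at least one relevant context supports it, and incorrect is true when the ground truth reference does not support it. The score is the mean of (relevant_faithful AND incorrect) over all answer statements.
  • Irrelevant mode: Measures incorrect claims that come from irrelevant retrieved contexts. Irrelevant contexts are those that do not support any ground truth statement, and irrelevant_faithful is true when at least one irrelevant context supports the answer statement. The score is the mean of (irrelevant_faithful AND NOT relevant_faithful AND incorrect) over all answer statements, so a statement that is also supported by a relevant context is not counted.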
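The two scoring rules can be written out as a short, dependency-free sketch. It is an illustration under stated assumptions rather than the library's implementation: the function name `noise_sensitivity_score` is hypothetical, matrices are indexed as [statement][context], and returning NaN when there are no answer statements is a choice made only for this example.

```python
from typing import List


def noise_sensitivity_score(
    retrieved2ground_truth: List[List[bool]],  # [ground truth statement][context]
    retrieved2answer: List[List[bool]],        # [answer statement][context]
    ground_truth2answer: List[bool],           # [answer statement]
    mode: str = "relevant",
) -> float:
    if mode not in ("relevant", "irrelevant"):
        raise ValueError(f"unknown mode: {mode!r}")
    if not retrieved2answer:
        return float("nan")  # assumption: no answer statements, nothing to average

    n_contexts = len(retrieved2answer[0])
    # A context is relevant if it supports at least one ground truth statement.
    relevant_context = [
        any(row[c] for row in retrieved2ground_truth) for c in range(n_contexts)
    ]

    flags = []
    for supported_by, is_correct in zip(retrieved2answer, ground_truth2answer):
        incorrect = not is_correct
        relevant_faithful = any(
            s and r for s, r in zip(supported_by, relevant_context)
        )
        irrelevant_faithful = any(
            s and not r for s, r in zip(supported_by, relevant_context)
        )
        if mode == "relevant":
            flags.append(relevant_faithful and incorrect)
        else:
            flags.append(irrelevant_faithful and not relevant_faithful and incorrect)
    return sum(flags) / len(flags)


# Hand-written verdicts: 1 ground truth statement, 4 answer statements, 3 contexts.
r2gt = [[True, False, False]]  # only context 0 is relevant
r2a = [
    [True, False, False],   # correct, supported by the relevant context
    [True, False, False],   # incorrect, supported by the relevant context
    [False, False, True],   # incorrect, supported only by an irrelevant context
    [False, False, False],  # incorrect, supported by no context
]
gt2a = [True, False, False, False]

relevant_score = noise_sensitivity_score(r2gt, r2a, gt2a, mode="relevant")
irrelevant_score = noise_sensitivity_score(r2gt, r2a, gt2a, mode="irrelevant")
print(relevant_score, irrelevant_score)  # 0.25 0.25
```

Each mode flags exactly one of the four answer statements here. The fourth statement is incorrect but unsupported by any context, so neither mode counts it.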

A lower score is better in both modes: a high score means that a larger share of the response's statements are incorrect yet supported by the retrieved contexts that the mode examines.

Usage

Use this metric to evaluate the robustness of a RAG system against noisy or irrelevant retrieved contexts. In relevant mode, it measures how often the system generates incorrect statements from relevant contexts (perhaps due to misinterpretation). In irrelevant mode, it measures how often the system is misled by irrelevant contexts into generating incorrect statements.

This is the V2 collections version, which uses modern instructor LLMs with structured output for statement decomposition and NLI evaluation and replaces the legacy V1 implementation.

Code Reference

Source Location

  • Repository: Vibrantlabsai_Ragas
  • File: src/ragas/metrics/collections/noise_sensitivity/metric.py

Signature

class NoiseSensitivity(BaseMetric):
    def __init__(
        self,
        llm: "InstructorBaseRagasLLM",
        name: str = "noise_sensitivity",
        mode: Literal["relevant", "irrelevant"] = "relevant",
        **kwargs,
    ): ...

    async def ascore(
        self,
        user_input: str,
        response: str,
        reference: str,
        retrieved_contexts: List[str],
    ) -> MetricResult: ...

Import

from ragas.metrics.collections import NoiseSensitivity

I/O Contract

Constructor Parameters

Name | Type | Required | Description
---- | ---- | -------- | -----------
llm | InstructorBaseRagasLLM | Yes | Modern instructor-based LLM used for statement generation and NLI evaluation
name | str | No | Metric name (default: "noise_sensitivity")
mode | Literal["relevant", "irrelevant"] | No | Evaluation mode (default: "relevant"). "relevant" measures errors from relevant contexts; "irrelevant" measures errors from irrelevant contexts

Inputs

Name | Type | Required | Description
---- | ---- | -------- | -----------
user_input | str | Yes | The original question posed by the user. Must be non-empty
response | str | Yes | The generated response to evaluate. Must be non-empty
reference | str | Yes | The ground truth reference answer. Must be non-empty
retrieved_contexts | List[str] | Yes | List of retrieved context strings used to generate the response. Must be non-empty
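The non-empty requirements in the Inputs table can be checked before calling the metric. The sketch below is a hypothetical pre-flight helper, not part of the library: the name `validate_inputs`, the use of ValueError, and treating whitespace-only strings as empty are all assumptions.

```python
from typing import List


def validate_inputs(
    user_input: str,
    response: str,
    reference: str,
    retrieved_contexts: List[str],
) -> None:
    """Raise ValueError if any required input is empty (hypothetical helper)."""
    text_fields = (
        ("user_input", user_input),
        ("response", response),
        ("reference", reference),
    )
    for name, value in text_fields:
        # Assumption: whitespace-only strings count as empty.
        if not value or not value.strip():
            raise ValueError(f"{name} must be a non-empty string")
    if not retrieved_contexts:
        raise ValueError("retrieved_contexts must be a non-empty list")


# Passes silently when every field is populated.
validate_inputs(
    user_input="What is LIC known for?",
    response="LIC is the largest insurance company in India.",
    reference="LIC is known for providing insurance.",
    retrieved_contexts=["LIC was established in 1956."],
)
```

Failing fast this way avoids spending LLM calls on a sample that cannot be scored.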

Outputs

Name | Type | Description
---- | ---- | -----------
score | MetricResult (float value) | Noise sensitivity score between 0.0 and 1.0. Lower is better. It is the fraction of answer statements that are incorrect yet supported by retrieved contexts of the kind selected by mode (relevant or irrelevant)

Usage Examples

Basic Usage (Relevant Mode)

import asyncio

from openai import AsyncOpenAI
from ragas.llms.base import llm_factory
from ragas.metrics.collections import NoiseSensitivity

# Set up dependencies (AsyncOpenAI reads OPENAI_API_KEY from the environment)
client = AsyncOpenAI()
# The llm_factory argument order may differ between ragas releases;
# check the signature of the installed version.
llm = llm_factory("openai", client=client, model="gpt-4o-mini")


async def main() -> None:
    # Create metric instance (default: relevant mode)
    metric = NoiseSensitivity(llm=llm)

    # Single evaluation
    result = await metric.ascore(
        user_input="What is LIC known for?",
        response="LIC is the largest insurance company in India, known for its wide range of policies.",
        reference="LIC is known for managing large-scale investments and providing insurance.",
        retrieved_contexts=[
            "LIC was established in 1956 by the Government of India.",
            "LIC offers a variety of insurance products including life, health, and pension plans.",
            "The stock market in India is regulated by SEBI.",
        ],
    )
    print(f"Noise Sensitivity (relevant): {result.value}")


# ascore is a coroutine, so it needs an event loop.
# In a notebook, call `await main()` instead.
asyncio.run(main())

Irrelevant Mode

import asyncio

from ragas.metrics.collections import NoiseSensitivity

# Reuses the `llm` instance created in the basic usage example.


async def main() -> None:
    # Measure sensitivity to irrelevant contexts
    metric = NoiseSensitivity(llm=llm, mode="irrelevant")

    result = await metric.ascore(
        user_input="What is LIC known for?",
        response="LIC is the largest insurance company in India, also involved in stock trading.",
        reference="LIC is known for managing large-scale investments and providing insurance.",
        retrieved_contexts=[
            "LIC was established in 1956 by the Government of India.",
            "LIC offers a variety of insurance products.",
            "The stock market in India is regulated by SEBI.",
        ],
    )
    print(f"Noise Sensitivity (irrelevant): {result.value}")


# In a notebook, call `await main()` instead.
asyncio.run(main())
