Implementation:Explodinggradients Ragas TopicAdherenceScore Metric

From Leeroopedia


TopicAdherenceScore Metric

TopicAdherenceScore is a multi-turn evaluation metric in the Ragas library that measures whether an AI agent's responses stay within designated topic boundaries. It uses an LLM-based pipeline to extract discussed topics, detect refusals, classify topic scope, and compute a precision, recall, or F1 score.

Source Location

Import

from ragas.metrics import TopicAdherenceScore

Class Definition

@dataclass
class TopicAdherenceScore(MetricWithLLM, MultiTurnMetric):
    name: str = "topic_adherence"
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {
            MetricType.MULTI_TURN: {
                "user_input",
                "reference_topics",
            }
        }
    )
    output_type: t.Optional[MetricOutputType] = MetricOutputType.CONTINUOUS
    mode: t.Literal["precision", "recall", "f1"] = "f1"
    topic_extraction_prompt: PydanticPrompt = TopicExtractionPrompt()
    topic_classification_prompt: PydanticPrompt = TopicClassificationPrompt()
    topic_refused_prompt: PydanticPrompt = TopicRefusedPrompt()

Constructor Parameters

Parameter | Type | Default | Description
mode | Literal["precision", "recall", "f1"] | "f1" | The scoring mode: precision penalizes off-topic answers, recall penalizes refusing in-scope topics, F1 balances both.
topic_extraction_prompt | PydanticPrompt | TopicExtractionPrompt() | The prompt used to extract discussed topics from the conversation.
topic_classification_prompt | PydanticPrompt | TopicClassificationPrompt() | The prompt used to classify extracted topics against reference topics.
topic_refused_prompt | PydanticPrompt | TopicRefusedPrompt() | The prompt used to detect whether the agent refused to answer on each topic.

Required Columns

The metric requires a MultiTurnSample with the following fields:

  • user_input -- list of conversation messages
  • reference_topics -- list of allowed topic strings

Key Method: _multi_turn_ascore

async def _multi_turn_ascore(
    self, sample: MultiTurnSample, callbacks: Callbacks
) -> float

The primary scoring method follows a three-phase pipeline:

Phase 1: Topic Extraction

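# Serialize the whole conversation into one string for the prompts
user_input = sample.pretty_repr()

# Ask the LLM which topics were discussed
prompt_input = TopicExtractionInput(user_input=user_input)
response = await self.topic_extraction_prompt.generate(
    data=prompt_input, llm=self.llm, callbacks=callbacks
)
topics = response.topics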

Serializes the conversation via sample.pretty_repr() and uses the LLM to extract a list of discussed topics.

Phase 2: Refusal Detection

# One verdict per extracted topic; at this point True means "refused"
topic_answered_verdict = []
for topic in topics:
    prompt_input = TopicRefusedInput(user_input=user_input, topic=topic)
    response = await self.topic_refused_prompt.generate(
        data=prompt_input, llm=self.llm, callbacks=callbacks
    )
    topic_answered_verdict.append(response.refused_to_answer)

For each extracted topic, the metric asks the LLM whether the agent refused to answer. The collected verdicts are then inverted, so that in topic_answered_verdict True means the agent answered and False means it refused.
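
The inversion step is not shown in the excerpt above. A minimal sketch of what it could look like, assuming the raw verdicts are held in a plain list of booleans (the sample values are made up for illustration):

```python
import numpy as np

# Hypothetical raw verdicts from the refusal prompt, one per extracted topic.
# Here True means the agent refused to answer on that topic.
refused_to_answer = [False, True, False]

# Invert so that True means "the agent answered", and store the result as a
# NumPy boolean array so it can be combined element-wise in the next phase.
topic_answered_verdict = np.array([not refused for refused in refused_to_answer])

print(topic_answered_verdict.tolist())
```

Holding the verdicts in a boolean array rather than a list is what makes the `&` and `~` operators in the scoring phase work element-wise.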

Phase 3: Topic Classification and Scoring

prompt_input = TopicClassificationInput(
    reference_topics=sample.reference_topics, topics=topics
)
topic_classifications_response = await self.topic_classification_prompt.generate(
    data=prompt_input, llm=self.llm, callbacks=callbacks
)

Classifies each extracted topic against the reference topics. The classifications are then combined with the refusal verdicts using element-wise boolean operations:

true_positives = sum(topic_answered_verdict & topic_classifications)
false_positives = sum(topic_answered_verdict & ~topic_classifications)
false_negatives = sum(~topic_answered_verdict & topic_classifications)
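
To make the boolean combination concrete, here is a small worked sketch with four extracted topics. The array values are illustrative and do not come from a real evaluation run.

```python
import numpy as np

# Illustrative verdicts for four extracted topics.
# topic_answered_verdict[i]: the agent answered topic i (did not refuse).
# topic_classifications[i]: topic i matches one of the reference topics.
topic_answered_verdict = np.array([True, True, False, False])
topic_classifications = np.array([True, False, True, False])

# Answered and in scope.
true_positives = int(np.sum(topic_answered_verdict & topic_classifications))
# Answered but out of scope.
false_positives = int(np.sum(topic_answered_verdict & ~topic_classifications))
# Refused although in scope.
false_negatives = int(np.sum(~topic_answered_verdict & topic_classifications))

print(true_positives, false_positives, false_negatives)
```

Each of the first three topics lands in a different bucket, and the fourth (refused and out of scope) counts toward none of them. Note that `~` is a logical NOT only on boolean arrays; on an integer array it is a bitwise complement (`~1 == -2`), so both inputs need a boolean dtype.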

The final score is computed based on the selected mode:

  • precision: TP / (TP + FP + 1e-10)
  • recall: TP / (TP + FN + 1e-10)
  • f1: 2 * precision * recall / (precision + recall + 1e-10)

A small epsilon (1e-10) is added to denominators to avoid division by zero.
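
The three formulas can be sketched as a single helper. The function name `adherence_score` and its signature are hypothetical, chosen only for illustration; they are not part of the Ragas API.

```python
EPS = 1e-10  # same epsilon as in the formulas above


def adherence_score(tp: int, fp: int, fn: int, mode: str = "f1") -> float:
    """Hypothetical helper: turn TP/FP/FN counts into a score for one mode."""
    precision = tp / (tp + fp + EPS)
    recall = tp / (tp + fn + EPS)
    if mode == "precision":
        return precision
    if mode == "recall":
        return recall
    if mode == "f1":
        return 2 * precision * recall / (precision + recall + EPS)
    raise ValueError(f"Unknown mode: {mode!r}")


# Two in-scope topics answered, one out-of-scope topic answered, none refused.
for mode in ("precision", "recall", "f1"):
    print(mode, round(adherence_score(2, 1, 0, mode=mode), 4))

# No topics at all: every denominator is just the epsilon, so the score is 0.0.
print(adherence_score(0, 0, 0, mode="f1"))
```

Because the epsilon is always added, a perfect result evaluates to a value marginally below 1.0 rather than exactly 1.0, which matters if scores are compared with strict equality.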

Supporting Prompt Classes

Prompt Class | Input Model | Output Model | Purpose
TopicExtractionPrompt | TopicExtractionInput | TopicExtractionOutput | Extracts topics from conversation text
TopicRefusedPrompt | TopicRefusedInput | TopicRefusedOutput | Detects whether the agent refused to answer a topic
TopicClassificationPrompt | TopicClassificationInput | TopicClassificationOutput | Classifies topics against reference topics

Safe Boolean Conversion

The method includes an internal safe_bool_conversion function (lines 189-222) that handles edge cases in the classification output. LLM responses may return boolean values as actual booleans, integers, or strings ("true", "1", "yes"). The function normalizes all these representations to a consistent numpy boolean array, preventing TypeError during bitwise operations.
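
The kind of normalization described above can be sketched as follows. This is not the library's safe_bool_conversion; the name `to_bool_array` and its exact rules are assumptions made for illustration, covering only the representations listed in the paragraph.

```python
import numpy as np

# String forms treated as True, as listed above; anything else is False.
TRUTHY_STRINGS = {"true", "1", "yes"}


def to_bool_array(values) -> np.ndarray:
    """Hypothetical normalizer: map mixed LLM outputs to a boolean array."""

    def to_bool(value) -> bool:
        if isinstance(value, (bool, np.bool_)):
            return bool(value)
        if isinstance(value, (int, float)):
            return value != 0
        if isinstance(value, str):
            return value.strip().lower() in TRUTHY_STRINGS
        return bool(value)

    return np.array([to_bool(v) for v in values], dtype=bool)


mixed = [True, 0, "yes", "False", 1, "no"]
flags = to_bool_array(mixed)
print(flags.tolist())
print((~flags).tolist())  # logical NOT works because the dtype is bool
```

Forcing `dtype=bool` is the important part of the sketch: with a boolean dtype, `&` and `~` behave as element-wise logical operators, whereas a string or object array would raise a TypeError.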

Usage Example

from ragas.metrics import TopicAdherenceScore
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolCall, ToolMessage
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

# Configure LLM
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4"))

# Create metric with precision mode
metric = TopicAdherenceScore(mode="precision")
metric.llm = evaluator_llm

# Create sample with reference topics
sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="Tell me about Einstein's theory of relativity"),
        AIMessage(
            content="Let me look that up.",
            tool_calls=[ToolCall(name="document_search", args={"query": "relativity"})]
        ),
        ToolMessage(content="Found documents on relativity."),
        AIMessage(content="Einstein's theory describes how gravity warps spacetime.")
    ],
    reference_topics=["Physics", "Mathematics"]
)

# Evaluate (must run inside an async context, e.g. a notebook cell or a coroutine)
# score = await metric._multi_turn_ascore(sample, callbacks=[])

Score Interpretation

Score | Meaning
1.0 | Perfect adherence -- all answered topics are in-scope and all in-scope topics were addressed
0.0 | No adherence -- the agent only discussed out-of-scope topics or refused all in-scope topics
0.0 < score < 1.0 | Partial adherence -- some combination of off-topic discussion or missed in-scope topics

Internal Dependencies

  • ragas.metrics.base.MetricWithLLM -- provides LLM integration
  • ragas.metrics.base.MultiTurnMetric -- base class for multi-turn metrics
  • ragas.prompt.PydanticPrompt -- structured prompt framework
  • ragas.dataset_schema.MultiTurnSample -- input sample schema
  • numpy -- used for boolean array operations
