Implementation:Explodinggradients Ragas TopicAdherenceScore Metric
TopicAdherenceScore Metric
TopicAdherenceScore is a multi-turn evaluation metric in the Ragas library that measures whether an AI agent's responses stay within designated topic boundaries. It uses an LLM-based pipeline to extract discussed topics, detect refusals, classify topic scope, and compute a precision, recall, or F1 score.
Source Location
- File: src/ragas/metrics/_topic_adherence.py (lines 135-251)
- Repository: explodinggradients/ragas
Import
from ragas.metrics import TopicAdherenceScore
Class Definition
@dataclass
class TopicAdherenceScore(MetricWithLLM, MultiTurnMetric):
    name: str = "topic_adherence"
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {
            MetricType.MULTI_TURN: {
                "user_input",
                "reference_topics",
            }
        }
    )
    output_type: t.Optional[MetricOutputType] = MetricOutputType.CONTINUOUS
    mode: t.Literal["precision", "recall", "f1"] = "f1"
    topic_extraction_prompt: PydanticPrompt = TopicExtractionPrompt()
    topic_classification_prompt: PydanticPrompt = TopicClassificationPrompt()
    topic_refused_prompt: PydanticPrompt = TopicRefusedPrompt()
Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| mode | Literal["precision", "recall", "f1"] | "f1" | The scoring mode: precision penalizes off-topic answers, recall penalizes refusing in-scope topics, F1 balances both. |
| topic_extraction_prompt | PydanticPrompt | TopicExtractionPrompt() | The prompt used to extract discussed topics from the conversation. |
| topic_classification_prompt | PydanticPrompt | TopicClassificationPrompt() | The prompt used to classify extracted topics against reference topics. |
| topic_refused_prompt | PydanticPrompt | TopicRefusedPrompt() | The prompt used to detect whether the agent refused to answer on each topic. |
Required Columns
The metric requires a MultiTurnSample with the following fields:
- user_input -- list of conversation messages
- reference_topics -- list of allowed topic strings
Key Method: _multi_turn_ascore
async def _multi_turn_ascore(
    self, sample: MultiTurnSample, callbacks: Callbacks
) -> float
The primary scoring method follows a three-phase pipeline:
Phase 1: Topic Extraction
prompt_input = TopicExtractionInput(user_input=user_input)
response = await self.topic_extraction_prompt.generate(
    data=prompt_input, llm=self.llm, callbacks=callbacks
)
topics = response.topics
The conversation is first serialized with sample.pretty_repr() (the user_input value passed to the prompt above), and the LLM then extracts a list of discussed topics from that text.
Phase 2: Refusal Detection
for topic in topics:
    prompt_input = TopicRefusedInput(user_input=user_input, topic=topic)
    response = await self.topic_refused_prompt.generate(
        data=prompt_input, llm=self.llm, callbacks=callbacks
    )
    topic_answered_verdict.append(response.refused_to_answer)
For each extracted topic, the LLM is asked whether the agent refused to answer. The collected refusal flags are then inverted, so that in the final topic_answered_verdict array True means the agent answered the topic and False means it refused.
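The inversion step can be illustrated with a minimal sketch. It uses plain Python lists and made-up flags rather than real LLM output, so the values below are assumptions for illustration only.

```python
# Hypothetical refusal flags, one per extracted topic, in the shape the
# refusal-detection prompt is described as returning (True = agent refused).
refused_to_answer = [False, True, False]

# Invert each flag so that True now means "the agent answered this topic".
topic_answered_verdict = [not refused for refused in refused_to_answer]

print(topic_answered_verdict)  # [True, False, True]
```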
Phase 3: Topic Classification and Scoring
prompt_input = TopicClassificationInput(
    reference_topics=sample.reference_topics, topics=topics
)
topic_classifications_response = await self.topic_classification_prompt.generate(
    data=prompt_input, llm=self.llm, callbacks=callbacks
)
Classifies each extracted topic against the reference topics. The classifications are then combined with the refusal verdicts using element-wise boolean operations:
true_positives = sum(topic_answered_verdict & topic_classifications)
false_positives = sum(topic_answered_verdict & ~topic_classifications)
false_negatives = sum(~topic_answered_verdict & topic_classifications)
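As a rough illustration of the three counts above, the following sketch reproduces the same logic with plain Python lists instead of numpy arrays. The function name confusion_counts and the sample verdicts are hypothetical and are not part of Ragas.

```python
def confusion_counts(answered, in_scope):
    """Count (TP, FP, FN) from two parallel lists of booleans.

    answered[i] -- True if the agent answered topic i
    in_scope[i] -- True if topic i matches a reference topic
    """
    if len(answered) != len(in_scope):
        raise ValueError("both lists need exactly one entry per topic")
    pairs = list(zip(answered, in_scope))
    true_positives = sum(a and c for a, c in pairs)         # answered, in scope
    false_positives = sum(a and not c for a, c in pairs)    # answered, off topic
    false_negatives = sum((not a) and c for a, c in pairs)  # refused, in scope
    return true_positives, false_positives, false_negatives


# Hypothetical verdicts for four extracted topics.
answered = [True, True, False, False]
in_scope = [True, False, True, False]
print(confusion_counts(answered, in_scope))  # (1, 1, 1)
```

The fourth topic (refused and out of scope) is a true negative and does not enter any of the three counts.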
The final score is computed based on the selected mode:
- precision: TP / (TP + FP + 1e-10)
- recall: TP / (TP + FN + 1e-10)
- f1: 2 * precision * recall / (precision + recall + 1e-10)
A small epsilon (1e-10) is added to denominators to avoid division by zero.
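A worked example may make the formulas concrete. The sketch below is a standalone helper written for illustration (the name adherence_score and the input counts are assumptions, not Ragas API); it applies the three formulas with the same epsilon.

```python
EPS = 1e-10  # same small constant as in the formulas above


def adherence_score(tp, fp, fn, mode="f1"):
    """Compute precision, recall, or F1 from raw counts."""
    precision = tp / (tp + fp + EPS)
    recall = tp / (tp + fn + EPS)
    if mode == "precision":
        return precision
    if mode == "recall":
        return recall
    if mode == "f1":
        return 2 * precision * recall / (precision + recall + EPS)
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical counts: two in-scope topics answered, one off-topic answer,
# no in-scope topic refused.
for mode in ("precision", "recall", "f1"):
    print(mode, round(adherence_score(2, 1, 0, mode), 4))
# precision 0.6667
# recall 1.0
# f1 0.8

# With no topics at all, the epsilon keeps the result at 0.0 instead of
# raising ZeroDivisionError.
print(adherence_score(0, 0, 0, "f1"))  # 0.0
```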
Supporting Prompt Classes
| Prompt Class | Input Model | Output Model | Purpose |
|---|---|---|---|
| TopicExtractionPrompt | TopicExtractionInput | TopicExtractionOutput | Extracts topics from conversation text |
| TopicRefusedPrompt | TopicRefusedInput | TopicRefusedOutput | Detects whether the agent refused to answer a topic |
| TopicClassificationPrompt | TopicClassificationInput | TopicClassificationOutput | Classifies topics against reference topics |
Safe Boolean Conversion
The method includes an internal safe_bool_conversion function (lines 189-222) that handles edge cases in the classification output. LLM responses may return boolean values as actual booleans, integers, or strings ("true", "1", "yes"). The function normalizes all these representations to a consistent numpy boolean array, preventing TypeError during bitwise operations.
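The kind of normalization described here can be sketched as follows. This is not the library's safe_bool_conversion: the helper names are hypothetical, the result is a plain Python list rather than a numpy array, and treating any unrecognized string as False is an assumption of this sketch.

```python
TRUTHY_STRINGS = {"true", "1", "yes"}


def to_bool(value):
    """Normalize one LLM-returned value to a Python bool."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return value
    if isinstance(value, (int, float)):
        return value != 0
    if isinstance(value, str):
        # Assumption: any string outside TRUTHY_STRINGS counts as False.
        return value.strip().lower() in TRUTHY_STRINGS
    return bool(value)


def to_bool_list(values):
    """Normalize a sequence of mixed representations to a list of bools."""
    return [to_bool(v) for v in values]


mixed = [True, 0, "yes", "False", 1, "no"]
print(to_bool_list(mixed))  # [True, False, True, False, True, False]
```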
Usage Example
from ragas.metrics import TopicAdherenceScore
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolCall, ToolMessage
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

# Configure LLM
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4"))

# Create metric with precision mode
metric = TopicAdherenceScore(mode="precision")
metric.llm = evaluator_llm

# Create sample with reference topics
sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="Tell me about Einstein's theory of relativity"),
        AIMessage(
            content="Let me look that up.",
            tool_calls=[ToolCall(name="document_search", args={"query": "relativity"})]
        ),
        ToolMessage(content="Found documents on relativity."),
        AIMessage(content="Einstein's theory describes how gravity warps spacetime.")
    ],
    reference_topics=["Physics", "Mathematics"]
)

# Evaluate
# score = await metric._multi_turn_ascore(sample, callbacks=[])
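The evaluation call is a coroutine, so in a regular script it has to be driven by an event loop. The sketch below shows one way to do that with asyncio.run. StubMetric is a hypothetical stand-in that returns a fixed score so the snippet runs without Ragas or an LLM; with the real metric and sample, only the body of main would change.

```python
import asyncio


class StubMetric:
    """Hypothetical stand-in for TopicAdherenceScore; no LLM is called."""

    async def _multi_turn_ascore(self, sample, callbacks):
        return 0.5  # fixed placeholder score, not a real evaluation result


async def main():
    metric = StubMetric()
    # With Ragas installed, this is where the configured metric and
    # MultiTurnSample would be used instead of the stub and None.
    return await metric._multi_turn_ascore(sample=None, callbacks=[])


score = asyncio.run(main())
print(score)  # 0.5
```

In environments that already run an event loop, such as Jupyter notebooks, asyncio.run raises a RuntimeError, and the coroutine can typically be awaited directly instead.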
Score Interpretation
| Score | Meaning |
|---|---|
| 1.0 | Perfect adherence -- all answered topics are in-scope and all in-scope topics were addressed |
| 0.0 | No adherence -- the agent only discussed out-of-scope topics or refused all in-scope topics |
| 0.0 < score < 1.0 | Partial adherence -- some combination of off-topic discussion or missed in-scope topics |
Internal Dependencies
- ragas.metrics.base.MetricWithLLM -- provides LLM integration
- ragas.metrics.base.MultiTurnMetric -- base class for multi-turn metrics
- ragas.prompt.PydanticPrompt -- structured prompt framework
- ragas.dataset_schema.MultiTurnSample -- input sample schema
- numpy -- used for boolean array operations
See Also
- AgentGoalAccuracy Metric -- evaluating goal achievement
- MultiTurnSample Class -- the data schema for multi-turn evaluation samples