Implementation:Datajuicer Data juicer DialogTopicDetectionMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for detecting and labeling discussion topics in multi-turn dialogs.
Description
DialogTopicDetectionMapper is a mapper operator that detects and labels the discussion topic of each user query turn in a multi-turn dialog using an API-based language model (default: gpt-4o). It reconstructs the dialog from the history, query, and response keys, builds a prompt around a Chinese few-shot system prompt that demonstrates topic detection (with example topics such as history and geography), sends the prompt to the API model, and parses the response with regular expressions to extract a topic analysis and a topic category label for each turn. Results are stored in the sample metadata under dialog_topic_labels and dialog_topic_labels_analysis. The operator supports an optional list of candidate topic categories and a configurable number of retry attempts. It extends the Mapper base class.
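The dialog reconstruction and prompt-building steps described above can be sketched as follows. This is a minimal illustration, not the operator's actual code: the helper names `rebuild_dialog` and `build_prompt`, the English "User:"/"Assistant:" and "Candidate topics:" wording, and the assumption that `history` holds a list of (query, response) pairs are all hypothetical. The real operator uses its own (Chinese) templates and may assemble the prompt turn by turn.

```python
from typing import Dict, List, Optional, Tuple


def rebuild_dialog(sample: Dict) -> List[Tuple[str, str]]:
    """Collect (query, response) turns from the history plus the current turn.

    Assumes sample["history"] is a list of [query, response] pairs.
    """
    dialog = [tuple(turn) for turn in sample.get("history", [])]
    query = sample.get("query")
    if query:
        # A missing response is treated as an empty string.
        dialog.append((query, sample.get("response") or ""))
    return dialog


def build_prompt(
    dialog: List[Tuple[str, str]],
    max_round: int = 10,
    topic_candidates: Optional[List[str]] = None,
) -> str:
    """Render the most recent `max_round` turns as a plain-text prompt."""
    lines: List[str] = []
    if topic_candidates:
        # Hypothetical wording; the operator has its own candidate_template.
        lines.append("Candidate topics: " + ", ".join(topic_candidates))
    # dialog[-0:] would return the whole list, so handle 0 explicitly.
    recent = dialog[-max_round:] if max_round > 0 else []
    for query, response in recent:
        lines.append(f"User: {query}")
        lines.append(f"Assistant: {response}")
    return "\n".join(lines)
```

For example, with two history turns plus a current turn and `max_round=2`, only the last two turns appear in the prompt, preceded by the candidate line when `topic_candidates` is given.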
Usage
Use this operator when you need to enrich conversational datasets with per-turn topic annotations, for example to filter or categorize dialogs by topic.
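As an illustration of topic-based filtering downstream of this operator, the sketch below keeps only samples whose per-turn labels include a wanted topic. It is not part of Data-Juicer: `has_topic` and `filter_by_topic` are hypothetical helpers, and the metadata field name `__dj__meta__` is an assumption that should be checked against the installed version.

```python
from typing import Dict, Iterable, List, Set

META_KEY = "__dj__meta__"           # assumed name of the metadata field
LABELS_KEY = "dialog_topic_labels"  # default labels_key of the operator


def has_topic(sample: Dict, wanted: Set[str]) -> bool:
    """Return True if any turn of the dialog carries one of the wanted labels."""
    labels = sample.get(META_KEY, {}).get(LABELS_KEY, [])
    return any(label in wanted for label in labels)


def filter_by_topic(samples: Iterable[Dict], wanted: Iterable[str]) -> List[Dict]:
    """Keep the samples that touch at least one wanted topic."""
    wanted_set = set(wanted)
    return [s for s in samples if has_topic(s, wanted_set)]
```

Samples without any stored labels are dropped by this filter, which is a design choice of the sketch rather than behavior of the operator.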
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/dialog_topic_detection_mapper.py
Signature
@OPERATORS.register_module("dialog_topic_detection_mapper")
class DialogTopicDetectionMapper(Mapper):
    def __init__(self,
                 api_model: str = "gpt-4o",
                 topic_candidates: Optional[List[str]] = None,
                 max_round: NonNegativeInt = 10,
                 *,
                 labels_key: str = MetaKeys.dialog_topic_labels,
                 analysis_key: str = MetaKeys.dialog_topic_labels_analysis,
                 api_endpoint: Optional[str] = None,
                 response_path: Optional[str] = None,
                 system_prompt: Optional[str] = None,
                 query_template: Optional[str] = None,
                 response_template: Optional[str] = None,
                 candidate_template: Optional[str] = None,
                 analysis_template: Optional[str] = None,
                 labels_template: Optional[str] = None,
                 analysis_pattern: Optional[str] = None,
                 labels_pattern: Optional[str] = None,
                 try_num: PositiveInt = 3,
                 model_params: Dict = {},
                 sampling_params: Dict = {},
                 **kwargs):
Import
from data_juicer.ops.mapper.dialog_topic_detection_mapper import DialogTopicDetectionMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| api_model | str | No | API model name. Default: "gpt-4o" |
| topic_candidates | Optional[List[str]] | No | Output topic candidates. Uses open-domain topic labels if None |
| max_round | NonNegativeInt | No | Maximum number of dialog rounds to include in the prompt. Default: 10 |
| labels_key | str | No | Key name in meta field to store output labels. Default: "dialog_topic_labels" |
| analysis_key | str | No | Key name in meta field to store analysis. Default: "dialog_topic_labels_analysis" |
| api_endpoint | Optional[str] | No | URL endpoint for the API |
| response_path | Optional[str] | No | Path to extract content from the API response |
| system_prompt | Optional[str] | No | System prompt for the task |
| try_num | PositiveInt | No | Number of retry attempts on API call error. Default: 3 |
| model_params | Dict | No | Parameters for initializing the API model |
| sampling_params | Dict | No | Extra parameters passed to the API call (e.g. temperature, top_p) |
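The role of try_num can be illustrated with a small retry loop. This is a sketch under stated assumptions, not the operator's implementation: `call_with_retries` and the `call_api` callable are hypothetical stand-ins for the real API client, and returning an empty string after the last failed attempt is a choice made for this example.

```python
import logging
from typing import Callable


def call_with_retries(call_api: Callable[[str], str], prompt: str, try_num: int = 3) -> str:
    """Call `call_api` up to `try_num` times; return "" if every attempt fails."""
    for attempt in range(1, try_num + 1):
        try:
            return call_api(prompt)
        except Exception as exc:  # broad on purpose: any API error triggers a retry
            logging.warning("API call failed (attempt %d/%d): %s", attempt, try_num, exc)
    return ""
```

A production client would likely add a delay between attempts; the sketch omits it to stay minimal.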
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with dialog_topic_labels and dialog_topic_labels_analysis added to metadata |
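To make the output shape concrete, the sketch below parses one model reply with regular expressions and appends the result to the two metadata lists. The English markers "Topic analysis:" and "Topic category:" and the helper names `parse_output` and `store_turn` are hypothetical; the operator's default templates are Chinese and its actual patterns are set through analysis_pattern and labels_pattern.

```python
import re
from typing import Dict, Tuple

# Hypothetical patterns for illustration only.
ANALYSIS_PATTERN = r"Topic analysis:(.*?)\n"
LABELS_PATTERN = r"Topic category:(.*?)(?:\n|$)"


def parse_output(text: str) -> Tuple[str, str]:
    """Extract (analysis, label) from one reply; missing parts become ""."""
    analysis_match = re.search(ANALYSIS_PATTERN, text, re.DOTALL)
    labels_match = re.search(LABELS_PATTERN, text, re.DOTALL)
    analysis = analysis_match.group(1).strip() if analysis_match else ""
    label = labels_match.group(1).strip() if labels_match else ""
    return analysis, label


def store_turn(meta: Dict, analysis: str, label: str) -> Dict:
    """Append one turn's result to the per-dialog lists in the metadata dict."""
    meta.setdefault("dialog_topic_labels_analysis", []).append(analysis)
    meta.setdefault("dialog_topic_labels", []).append(label)
    return meta
```

After all turns are processed this way, each metadata list holds one entry per analyzed turn, in dialog order.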
Usage Examples
YAML Configuration
process:
  - dialog_topic_detection_mapper:
      api_model: gpt-4o
      max_round: 10
      try_num: 3