Implementation:Datajuicer Data juicer ExtractKeywordMapper

From Leeroopedia
Revision as of 12:20, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Datajuicer_Data_juicer_ExtractKeywordMapper.md)
Knowledge Sources
Domains Data_Processing, Mapping
Last Updated 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for extracting keywords from text using an API-based language model.

Description

ExtractKeywordMapper is a mapper operator that generates high-level keywords summarizing the main concepts, themes, or topics of the input text. It uses a prompt template adapted from LightRAG, with multilingual few-shot examples (English and Chinese), to guide keyword extraction. The model output is parsed with a regular expression (output_pattern) to extract keywords from a structured format, and the results are stored in the sample's meta field under the configured keyword_key. The operator retries on error up to try_num times and can optionally drop the original text from the output (drop_text).
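
The parsing step described above can be sketched as follows. This is a minimal illustration, not the operator's actual code: the structured format, the regular expression in `ASSUMED_OUTPUT_PATTERN`, the marker in `ASSUMED_COMPLETION_DELIMITER`, and the helper name `parse_keywords` are all assumptions made for the example. The real defaults for output_pattern and completion_delimiter are defined in the Data-Juicer source and may differ.

```python
import re
from typing import List

# Both constants are assumptions for this sketch, not the operator's defaults.
ASSUMED_COMPLETION_DELIMITER = "<|COMPLETE|>"
ASSUMED_OUTPUT_PATTERN = r'\("content_keywords"\s*(.*?)\)'


def parse_keywords(
    raw_output: str,
    output_pattern: str = ASSUMED_OUTPUT_PATTERN,
    completion_delimiter: str = ASSUMED_COMPLETION_DELIMITER,
) -> List[str]:
    """Extract a de-duplicated keyword list from structured model output."""
    body = raw_output
    if completion_delimiter:
        # Discard anything after the end-of-output marker, if present.
        body = raw_output.split(completion_delimiter, 1)[0]

    keywords: List[str] = []
    for group in re.findall(output_pattern, body, flags=re.DOTALL):
        for candidate in group.split(","):
            cleaned = candidate.strip().strip('"').strip()
            if cleaned and cleaned not in keywords:
                keywords.append(cleaned)
    return keywords


raw = '("content_keywords" "climate change, policy, carbon emissions")<|COMPLETE|>'
print(parse_keywords(raw))  # ['climate change', 'policy', 'carbon emissions']
```

Output that does not match the pattern yields an empty list, which a caller can treat as a failed attempt.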

Usage

Use this operator when you need automatic keyword tagging for documents, for example to support topic-based organization, search indexing, or content categorization of training datasets.

Code Reference

Source Location

Signature

@OPERATORS.register_module("extract_keyword_mapper")
class ExtractKeywordMapper(Mapper):
    def __init__(self,
                 api_model: str = "gpt-4o",
                 *,
                 keyword_key: str = MetaKeys.keyword,
                 api_endpoint: Optional[str] = None,
                 response_path: Optional[str] = None,
                 prompt_template: Optional[str] = None,
                 completion_delimiter: Optional[str] = None,
                 output_pattern: Optional[str] = None,
                 try_num: PositiveInt = 3,
                 drop_text: bool = False,
                 model_params: Dict = {},
                 sampling_params: Dict = {},
                 **kwargs):

Import

from data_juicer.ops.mapper.extract_keyword_mapper import ExtractKeywordMapper

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| api_model | str | No | API model name, defaults to "gpt-4o" |
| keyword_key | str | No | Key name to store keywords in meta field, defaults to MetaKeys.keyword |
| api_endpoint | Optional[str] | No | URL endpoint for the API |
| response_path | Optional[str] | No | Path to extract content from API response |
| prompt_template | Optional[str] | No | Template of input prompt for keyword extraction |
| completion_delimiter | Optional[str] | No | Marker for end of output |
| output_pattern | Optional[str] | No | Regular expression for parsing keywords |
| try_num | PositiveInt | No | Number of retry attempts on error, defaults to 3 |
| drop_text | bool | No | Whether to drop text from output, defaults to False |
| model_params | Dict | No | Parameters for initializing the API model |
| sampling_params | Dict | No | Extra parameters passed to API call (e.g. temperature, top_p) |
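
How try_num and sampling_params might interact can be sketched as a simple retry loop. This is a hedged illustration only: `extract_with_retries`, `call_model`, and `parse` are hypothetical names, and the choice to retry on an empty parse result as well as on an exception is an assumption rather than confirmed operator behavior.

```python
import logging
from typing import Callable, Dict, List, Optional


def extract_with_retries(
    call_model: Callable[..., str],     # hypothetical: sends the prompt, returns raw text
    parse: Callable[[str], List[str]],  # hypothetical: turns raw text into keywords
    prompt: str,
    try_num: int = 3,
    sampling_params: Optional[Dict] = None,
) -> List[str]:
    """Try up to `try_num` times; return [] if every attempt fails."""
    params = dict(sampling_params or {})  # copy so the caller's dict is never mutated
    keywords: List[str] = []
    for attempt in range(1, try_num + 1):
        try:
            raw = call_model(prompt, **params)  # e.g. temperature, top_p
            keywords = parse(raw)
            if keywords:
                break  # stop at the first non-empty result
        except Exception as exc:  # broad on purpose in this sketch
            logging.warning("attempt %d failed: %s", attempt, exc)
    return keywords
```

Passing the model call and the parser as callables keeps the sketch independent of any particular API client.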

Outputs

| Name | Type | Description |
|------|------|-------------|
| samples | Dict | Transformed samples with keywords stored in meta field under keyword_key |
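
The shape of a transformed sample can be illustrated with a small stand-alone sketch. The field names below (`"meta"`, `"keyword"`, `"text"`) and the helper `attach_keywords` are placeholders chosen for the example; the actual meta field name and the value of MetaKeys.keyword come from Data-Juicer's constants. Removing the text key when drop_text is set is an assumption about what "drop text" means here.

```python
from typing import Dict, List

# Placeholder names for this sketch; Data-Juicer defines the real ones.
ASSUMED_META_FIELD = "meta"
ASSUMED_KEYWORD_KEY = "keyword"
ASSUMED_TEXT_KEY = "text"


def attach_keywords(
    sample: Dict,
    keywords: List[str],
    keyword_key: str = ASSUMED_KEYWORD_KEY,
    drop_text: bool = False,
) -> Dict:
    """Return a copy of `sample` with keywords stored under meta[keyword_key]."""
    result = dict(sample)
    meta = dict(result.get(ASSUMED_META_FIELD, {}))  # keep existing meta entries
    meta[keyword_key] = list(keywords)
    result[ASSUMED_META_FIELD] = meta
    if drop_text:
        # Assumed behavior: remove the original text from the output sample.
        result.pop(ASSUMED_TEXT_KEY, None)
    return result


sample = {"text": "Rising seas threaten coastal cities.", "meta": {"source": "news"}}
print(attach_keywords(sample, ["sea level rise", "coastal cities"]))
print(attach_keywords(sample, ["sea level rise", "coastal cities"], drop_text=True))
```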

Usage Examples

process:
  - extract_keyword_mapper:
      api_model: "gpt-4o"
      try_num: 3
      drop_text: false

Related Pages
