Implementation:Datajuicer Data juicer ExtractKeywordMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for extracting keywords from text using an API-based language model.
Description
ExtractKeywordMapper is a mapper operator that generates high-level keywords summarizing the main concepts, themes, or topics of input text. It uses a prompt template adapted from LightRAG with multilingual few-shot examples (English and Chinese) to guide keyword extraction. The model output is parsed via regex to extract keywords from a structured format, and results are stored in metadata under the configured keyword key. The operator retries on error up to try_num times and can optionally drop the original text from the output (drop_text).
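The regex parsing step described above can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: the delimiter string, the pattern, and the helper name `parse_keywords` are hypothetical stand-ins, and the real defaults for completion_delimiter and output_pattern may differ.

```python
import re

# Hypothetical stand-ins for completion_delimiter and output_pattern;
# the operator's real defaults may differ.
COMPLETION_DELIMITER = "<|COMPLETE|>"
OUTPUT_PATTERN = r'"high_level_keywords"\s*:\s*\[(.*?)\]'


def parse_keywords(raw_output, pattern=OUTPUT_PATTERN,
                   delimiter=COMPLETION_DELIMITER):
    """Extract a flat list of keywords from a structured model response."""
    # Keep only the text before the completion delimiter, if it is present.
    text = raw_output.split(delimiter, 1)[0]
    keywords = []
    # Each match is the comma-separated content of one bracketed list.
    for match in re.findall(pattern, text, flags=re.DOTALL):
        for item in match.split(","):
            item = item.strip().strip("\"'").strip()
            if item:
                keywords.append(item)
    return keywords


raw = ('{"high_level_keywords": ["International trade", "Economic growth"]}'
       '<|COMPLETE|> trailing text')
print(parse_keywords(raw))
```

A response that does not match the pattern yields an empty list in this sketch, which is the kind of case a retry could cover.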
Usage
Use this operator when you need automatic keyword tagging for documents, for example to support topic-based organization, search indexing, or content categorization of training datasets.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/extract_keyword_mapper.py
Signature
@OPERATORS.register_module("extract_keyword_mapper")
class ExtractKeywordMapper(Mapper):
    def __init__(self,
                 api_model: str = "gpt-4o",
                 *,
                 keyword_key: str = MetaKeys.keyword,
                 api_endpoint: Optional[str] = None,
                 response_path: Optional[str] = None,
                 prompt_template: Optional[str] = None,
                 completion_delimiter: Optional[str] = None,
                 output_pattern: Optional[str] = None,
                 try_num: PositiveInt = 3,
                 drop_text: bool = False,
                 model_params: Dict = {},
                 sampling_params: Dict = {},
                 **kwargs):
Import
from data_juicer.ops.mapper.extract_keyword_mapper import ExtractKeywordMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| api_model | str | No | API model name, defaults to "gpt-4o" |
| keyword_key | str | No | Key name to store keywords in meta field, defaults to MetaKeys.keyword |
| api_endpoint | Optional[str] | No | URL endpoint for the API |
| response_path | Optional[str] | No | Path to extract content from API response |
| prompt_template | Optional[str] | No | Template of input prompt for keyword extraction |
| completion_delimiter | Optional[str] | No | Marker for end of output |
| output_pattern | Optional[str] | No | Regular expression for parsing keywords |
| try_num | PositiveInt | No | Number of retry attempts on error, defaults to 3 |
| drop_text | bool | No | Whether to drop text from output, defaults to False |
| model_params | Dict | No | Parameters for initializing the API model |
| sampling_params | Dict | No | Extra parameters passed to API call (e.g. temperature, top_p) |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with keywords stored in meta field under keyword_key |
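How try_num, keyword_key, and drop_text might interact can be sketched as below. This is an assumption-laden illustration rather than the operator's implementation: the function `extract_with_retry`, the `"text"` and `"meta"` field names, and the choice to also retry on an empty result are all hypothetical.

```python
def extract_with_retry(sample, call_model, parse, try_num=3,
                       keyword_key="keyword", drop_text=False,
                       text_key="text", meta_key="meta"):
    """Call the model up to try_num times and store keywords in the meta field."""
    keywords = []
    for _ in range(try_num):
        try:
            keywords = parse(call_model(sample[text_key]))
            if keywords:
                break  # stop as soon as a non-empty result is parsed
        except Exception:
            continue  # treat any failure as a reason to retry
    result = dict(sample)
    # Merge into existing metadata instead of overwriting it.
    result[meta_key] = {**sample.get(meta_key, {}), keyword_key: keywords}
    if drop_text:
        result.pop(text_key, None)
    return result


# Hypothetical model stub that fails once, then answers.
calls = {"n": 0}


def flaky_model(text):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("simulated API failure")
    return "trade, growth"


out = extract_with_retry(
    {"text": "some document"},
    flaky_model,
    lambda s: [k.strip() for k in s.split(",")],
    drop_text=True,
)
print(out)
```

With drop_text enabled, the returned sample in this sketch carries only the metadata; with the default of False the original text is kept alongside the keywords.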
Usage Examples
process:
  - extract_keyword_mapper:
      api_model: "gpt-4o"
      try_num: 3
      drop_text: false