Implementation:Datajuicer Data juicer ExtractEntityAttributeMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for extracting entity attributes and their supporting text from unstructured text.
Description
ExtractEntityAttributeMapper is a mapper operator that extracts the specified attributes of the given entities from text by calling an API-based language model (default: gpt-4o). It builds prompts from configurable templates (Chinese-language by default) that instruct the model to summarize each entity attribute and quote representative supporting excerpts from the text. The model's markdown-formatted response is parsed with regex patterns to recover the attribute description and the supporting examples. Results are stored in the sample metadata under separate keys for entities, attributes, attribute descriptions, and support texts. The operator retries failed API calls up to try_num times and can drop the original text after extraction when drop_text is set. It extends the Mapper base class.
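The prompt-then-parse flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the operator's actual code: the English templates, the response layout, and the names INPUT_TEMPLATE, ATTR_PATTERN_TEMPLATE, DEMO_PATTERN, and parse_output are all hypothetical stand-ins chosen for the example.

```python
import re

# Hypothetical English stand-ins for the operator's configurable templates.
INPUT_TEMPLATE = "# Text\n{text}\n\n# Entity\n{entity}\n\n# Attribute\n{attribute}\n"
# Captures the summary under a heading such as "## Attribute: personality".
ATTR_PATTERN_TEMPLATE = r"##\s*Attribute:\s*{attribute}\s*\n(.*?)(?=\n##|\Z)"
# Captures each bulleted supporting excerpt (one per line).
DEMO_PATTERN = r"^-\s*(.+)$"


def parse_output(response: str, attribute: str):
    """Return (description, supporting excerpts) parsed from a markdown reply."""
    attr_pattern = ATTR_PATTERN_TEMPLATE.format(attribute=re.escape(attribute))
    match = re.search(attr_pattern, response, re.DOTALL)
    description = match.group(1).strip() if match else ""
    demos = [d.strip() for d in re.findall(DEMO_PATTERN, response, re.MULTILINE)]
    return description, demos


# Build the user prompt for one (entity, attribute) pair.
prompt = INPUT_TEMPLATE.format(
    text="She refused to turn back. Nothing frightened her.",
    entity="protagonist",
    attribute="personality",
)

# A made-up model reply in the assumed markdown layout.
response = (
    "## Attribute: personality\n"
    "Brave and stubborn.\n"
    "## Supporting text\n"
    "- She refused to turn back.\n"
    "- Nothing frightened her.\n"
)
description, demos = parse_output(response, "personality")
```

If the reply does not follow the expected layout, this sketch returns an empty description and an empty list, which a caller could treat as a signal to retry.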
Usage
Import when you need to build structured entity-attribute information from unstructured text for knowledge graph construction or entity-centric data enrichment.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/extract_entity_attribute_mapper.py
Signature
@OPERATORS.register_module("extract_entity_attribute_mapper")
class ExtractEntityAttributeMapper(Mapper):
    def __init__(self,
                 api_model: str = "gpt-4o",
                 query_entities: List[str] = [],
                 query_attributes: List[str] = [],
                 *,
                 entity_key: str = MetaKeys.main_entities,
                 attribute_key: str = MetaKeys.attributes,
                 attribute_desc_key: str = MetaKeys.attribute_descriptions,
                 support_text_key: str = MetaKeys.attribute_support_texts,
                 api_endpoint: Optional[str] = None,
                 response_path: Optional[str] = None,
                 system_prompt_template: Optional[str] = None,
                 input_template: Optional[str] = None,
                 attr_pattern_template: Optional[str] = None,
                 demo_pattern: Optional[str] = None,
                 try_num: PositiveInt = 3,
                 drop_text: bool = False,
                 model_params: Dict = {},
                 sampling_params: Dict = {},
                 **kwargs):
Import
from data_juicer.ops.mapper.extract_entity_attribute_mapper import ExtractEntityAttributeMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| api_model | str | No | API model name. Default: "gpt-4o" |
| query_entities | List[str] | No | List of entities to query attributes for. Default: [] |
| query_attributes | List[str] | No | List of attributes to extract for each entity. Default: [] |
| entity_key | str | No | Key name in meta field to store entity names. Default: MetaKeys.main_entities |
| attribute_key | str | No | Key name in meta field to store attribute names. Default: MetaKeys.attributes |
| attribute_desc_key | str | No | Key name in meta field to store attribute descriptions. Default: MetaKeys.attribute_descriptions |
| support_text_key | str | No | Key name in meta field to store supporting text excerpts. Default: MetaKeys.attribute_support_texts |
| api_endpoint | Optional[str] | No | URL endpoint for the API |
| response_path | Optional[str] | No | Path to extract content from the API response |
| system_prompt_template | Optional[str] | No | System prompt template for the task. Default: None (uses the built-in template) |
| input_template | Optional[str] | No | Template for building the model input. Default: None (uses the built-in template) |
| attr_pattern_template | Optional[str] | No | Regex pattern template for parsing the attribute description from the response. Default: None (uses the built-in pattern) |
| demo_pattern | Optional[str] | No | Regex pattern for parsing the supporting examples from the response. Default: None (uses the built-in pattern) |
| try_num | PositiveInt | No | Number of retry attempts on API call error. Default: 3 |
| drop_text | bool | No | Whether to drop the original text after processing. Default: False |
| model_params | Dict | No | Parameters for initializing the API model |
| sampling_params | Dict | No | Extra parameters passed to the API call (e.g. temperature, top_p) |
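The try_num behavior can be pictured as a bounded retry loop around the API call. The sketch below is an assumption about the general shape, not the operator's source: call_with_retries, flaky_api, and the choice to return None after the last failed attempt are all hypothetical.

```python
from typing import Callable, Optional


def call_with_retries(call_api: Callable[[], str], try_num: int = 3) -> Optional[str]:
    """Invoke call_api up to try_num times; return None if every attempt fails."""
    for attempt in range(1, try_num + 1):
        try:
            return call_api()
        except Exception as exc:  # a real client would catch narrower errors
            print(f"attempt {attempt}/{try_num} failed: {exc}")
    return None


# Hypothetical stand-in for an API client that fails twice, then succeeds.
calls = {"count": 0}


def flaky_api() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "## Attribute: personality\nBrave and stubborn."


result = call_with_retries(flaky_api, try_num=3)
```

With try_num set to 3, the third attempt succeeds here; a smaller value would exhaust the attempts and yield None.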
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples whose metadata gains four lists (entities, attributes, attribute descriptions, and support texts), stored under entity_key, attribute_key, attribute_desc_key, and support_text_key |
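To make the output shape concrete, the sketch below fills the four metadata lists for every entity/attribute pair and optionally drops the text. It rests on several assumptions: the lists are parallel (one entry per pair), metadata lives under a field named "meta", and the key names are placeholders. attach_attributes and fake_extract are hypothetical helpers, with fake_extract standing in for the prompt, API call, and regex parsing.

```python
from typing import Callable, Dict, List, Tuple

# Placeholder key names; the operator takes them from entity_key, attribute_key, etc.
ENTITY_KEY, ATTR_KEY = "entity", "attribute"
DESC_KEY, SUPPORT_KEY = "attribute_description", "support_text"


def attach_attributes(
    sample: Dict,
    query_entities: List[str],
    query_attributes: List[str],
    extract: Callable[[str, str, str], Tuple[str, List[str]]],
    drop_text: bool = False,
) -> Dict:
    """Store one entry per (entity, attribute) pair in four parallel lists."""
    meta = sample.setdefault("meta", {})  # "meta" is an assumed field name
    for key in (ENTITY_KEY, ATTR_KEY, DESC_KEY, SUPPORT_KEY):
        meta[key] = []
    for entity in query_entities:
        for attribute in query_attributes:
            description, excerpts = extract(sample["text"], entity, attribute)
            meta[ENTITY_KEY].append(entity)
            meta[ATTR_KEY].append(attribute)
            meta[DESC_KEY].append(description)
            meta[SUPPORT_KEY].append(excerpts)
    if drop_text:
        sample.pop("text", None)  # mirrors the drop_text option
    return sample


def fake_extract(text: str, entity: str, attribute: str) -> Tuple[str, List[str]]:
    """Hypothetical stand-in for the model call and response parsing."""
    return f"{entity} {attribute} summary", [text[:20]]


sample = attach_attributes(
    {"text": "She refused to turn back, whatever the cost."},
    ["protagonist"],
    ["personality", "appearance"],
    fake_extract,
    drop_text=True,
)
```

Under the parallel-list assumption, zipping the entity, attribute, and description lists yields (entity, attribute, description) triples that could feed a knowledge graph.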
Usage Examples
YAML Configuration
process:
  - extract_entity_attribute_mapper:
      api_model: gpt-4o
      query_entities:
        - "protagonist"
      query_attributes:
        - "personality"
        - "appearance"
      try_num: 3