Implementation:Datajuicer Data juicer ExtractEntityRelationMapper

Knowledge Sources	Datajuicer_Data_juicer
Domains	NLP, Knowledge Graph, Entity Extraction
Last Updated	2026-02-14 16:00 GMT

Overview

Extracts entities and their relationships from text to build knowledge graph structures, using an API-based language model with a structured prompt template adapted from the LightRAG project.

Description

ExtractEntityRelationMapper is a core knowledge graph construction operator that transforms unstructured text into structured entity-relation triples. It uses a detailed prompt template (adapted from LightRAG) that guides an API model through a structured extraction process:

Entity Identification -- Identifies entities with configurable types (default: organization, person, geo, event) and extracts their names, types, and descriptions
Relationship Extraction -- Identifies related entity pairs with relationship descriptions, keywords, and numeric strength scores
Gleaning -- Supports multiple extraction rounds (configurable via max_gleaning) to ensure comprehensive extraction, with an if-loop prompt to determine when to stop
Structured Parsing -- Uses configurable delimiters (tuple, record, completion) and regex patterns to parse the structured output

The prompt template includes detailed instructions with multi-language examples (English and Chinese). Output entities and relations are cached in the sample's metadata under configurable keys (default: entity and relation).

The operator uses an API model (default: gpt-4o) and supports configurable endpoints, response paths, and sampling parameters.

Usage

Use this operator when building knowledge graphs from text data, or when structured entity-relation information is needed for downstream tasks such as retrieval-augmented generation, graph-based analysis, or knowledge-intensive data processing.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/mapper/extract_entity_relation_mapper.py
Lines: 1-359

Signature

class ExtractEntityRelationMapper(Mapper):
    def __init__(
        self,
        api_model: str = "gpt-4o",
        entity_types: List[str] = None,
        *,
        entity_key: str = MetaKeys.entity,
        relation_key: str = MetaKeys.relation,
        api_endpoint: Optional[str] = None,
        response_path: Optional[str] = None,
        prompt_template: Optional[str] = None,
        tuple_delimiter: Optional[str] = None,
        record_delimiter: Optional[str] = None,
        completion_delimiter: Optional[str] = None,
        max_gleaning: NonNegativeInt = 1,
        continue_prompt: Optional[str] = None,
        if_loop_prompt: Optional[str] = None,
        entity_pattern: Optional[str] = None,
        relation_pattern: Optional[str] = None,
        try_num: PositiveInt = 3,
        drop_text: bool = False,
        model_params: Dict = {},
        sampling_params: Dict = {},
        **kwargs,
    ):

Import

from data_juicer.ops.mapper.extract_entity_relation_mapper import ExtractEntityRelationMapper

I/O Contract

Inputs

Name	Type	Required	Description
api_model	str	No	API model name. Default: "gpt-4o"
entity_types	List[str]	No	Pre-defined entity types. Default: ["organization", "person", "geo", "event"]
entity_key	str	No	Metadata key for storing entities. Default: MetaKeys.entity
relation_key	str	No	Metadata key for storing relations. Default: MetaKeys.relation
api_endpoint	str	No	URL endpoint for the API
response_path	str	No	Path to extract content from API response
prompt_template	str	No	Custom input prompt template
max_gleaning	int	No	Extra max iterations to glean entities and relations. Default: 1
try_num	int	No	Number of retry attempts on error. Default: 3
drop_text	bool	No	Whether to drop the text in output. Default: False
model_params	Dict	No	Parameters for initializing the API model
sampling_params	Dict	No	Extra parameters for API calls (e.g., temperature, top_p)

Outputs

Name	Type	Description
sample[Fields.meta][entity_key]	list[dict]	List of entity dicts with keys: entity_name, entity_type, entity_description
sample[Fields.meta][relation_key]	list[dict]	List of relation dicts with keys: source_entity, target_entity, relation_description, relation_keywords, relation_strength

Usage Examples

# Basic usage with default model
mapper = ExtractEntityRelationMapper(
    api_model="gpt-4o",
    entity_types=["organization", "person", "geo", "event"],
)

# With custom entity types and more gleaning rounds
mapper = ExtractEntityRelationMapper(
    api_model="gpt-4o",
    entity_types=["person", "technology", "organization", "location", "concept"],
    max_gleaning=3,
    try_num=5,
    sampling_params={"temperature": 0.7},
)

# Process a sample
sample = {"text": "Alex works at OpenAI in San Francisco.", Fields.meta: {}}
result = mapper.process_single(sample)
# result[Fields.meta]["entity"] contains extracted entities
# result[Fields.meta]["relation"] contains extracted relations

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment